Bug #63430 [Opn->Csd]: xml data parsing bug
Edit report at https://bugs.php.net/bug.php?id=63430&edit=1

 ID:                 63430
 User updated by:    lussenburg_rm at hotmail dot com
 Reported by:        lussenburg_rm at hotmail dot com
 Summary:            xml data parsing bug
-Status:             Open
+Status:             Closed
 Type:               Bug
 Package:            XML Reader
 Operating System:   windows 7
 PHP Version:        Irrelevant
 Block user comment: N
 Private report:     N

New Comment:

.

Previous Comments:
----------------
[2012-11-21 11:32:16] lussenburg_rm at hotmail dot com

That does work indeed, thanks. I guess I misunderstood the explanation of next(). I didn't expect it to skip over the beginning of a new element. I thought it would only skip over all subtrees of the current element, and that the read at the top of the loop would start at the element. Compliments on the 'super fast' reply also!

----------------
[2012-11-20 21:44:29] mail+php at requinix dot net

Hate to burst your bubble but there's a flaw in your code. The problem occurs when
* There is a node before an <item> with no whitespace (i.e., a #text) in between
* Said node has children
* Said node has an entry in $siblings

The last two cause a line of code near the bottom

    if ( $node->hasChildNodes() && ($mode == 1 || $siblings[$node->nodeName]) ) $xml->next();

to fire. next() will skip over the rest of the node and, in lieu of a subsequent #text, advance to the <item>. But at the top of your loop you have a read(). That will skip over the <item> tag and into the following #text (between the <item> and its first child element). You can confirm this by outputting the node name at the beginning of the loop - before the switch that would skip over it: the node before the <item>, then #text, then the first child of the <item>.

It works for me if I change the while loop into a do/while:
* $xml->read() before the loop to initialize
* flag=false at the start of the loop
* the aforementioned line sets flag=$xml->next()
* do/while ( flag || $xml->read() )

If you'd like to know more you can email me at this address.
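
The do/while change described in the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reporter's actual script: the feed string, the $skip list (standing in for $siblings) and both function names are made up for the example. It compares a plain while loop with the suggested do/while on a feed where <item> directly follows <image>.

```php
<?php
// Hypothetical feed: <item> directly follows </image>, with no whitespace between.
$feed = '<rss><channel><title>feed</title>'
      . '<image><url>u</url></image><item><title>first</title></item>'
      . '</channel></rss>';

// Hypothetical stand-in for $siblings: elements whose subtree is skipped with next().
$skip = array('image' => true);

// Plain while loop: after next() the read() at the top of the loop advances again,
// so the element that next() landed on is never examined.
function elementNamesWhile($feed, $skip)
{
    $xml = new XMLReader();
    $xml->xml($feed);
    $seen = array();
    while ($xml->read()) {
        if ($xml->nodeType === XMLReader::ELEMENT) {
            $seen[] = $xml->name;
            if (isset($skip[$xml->name])) {
                $xml->next();
            }
        }
    }
    $xml->close();
    return $seen;
}

// do/while variant: read() only when next() has not already moved the cursor.
function elementNamesDoWhile($feed, $skip)
{
    $xml = new XMLReader();
    $xml->xml($feed);
    $seen = array();
    $xml->read();                          // initialize before the loop
    do {
        $flag = false;                     // reset at the start of each pass
        if ($xml->nodeType === XMLReader::ELEMENT) {
            $seen[] = $xml->name;
            if (isset($skip[$xml->name])) {
                $flag = $xml->next();      // true if next() moved to another node
            }
        }
    } while ($flag || $xml->read());
    $xml->close();
    return $seen;
}

echo implode(', ', elementNamesWhile($feed, $skip)), "\n";
// rss, channel, title, image, title          (<item> is missed)
echo implode(', ', elementNamesDoWhile($feed, $skip)), "\n";
// rss, channel, title, image, item, title
```

The only difference between the two functions is whether read() is called after a successful next(); everything else is kept the same so the effect is easy to see.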
Bug #63430 [Opn]: xml data parsing bug
Edit report at https://bugs.php.net/bug.php?id=63430&edit=1

 ID:                 63430
 User updated by:    lussenburg_rm at hotmail dot com
 Reported by:        lussenburg_rm at hotmail dot com
 Summary:            xml data parsing bug
 Status:             Open
 Type:               Bug
 Package:            XML Reader
 Operating System:   windows 7
 PHP Version:        Irrelevant
 Block user comment: N
 Private report:     N

New Comment:

That does work indeed, thanks. I guess I misunderstood the explanation of next(). I didn't expect it to skip over the beginning of a new element. I thought it would only skip over all subtrees of the current element, and that the read at the top of the loop would start at the element. Compliments on the 'super fast' reply also!
Bug #63430 [Opn]: xml data parsing bug
Edit report at https://bugs.php.net/bug.php?id=63430&edit=1

 ID:                 63430
 User updated by:    lussenburg_rm at hotmail dot com
 Reported by:        lussenburg_rm at hotmail dot com
 Summary:            xml data parsing bug
 Status:             Open
 Type:               Bug
 Package:            XML Reader
 Operating System:   windows 7
 PHP Version:        Irrelevant
 Block user comment: N
 Private report:     N

New Comment:

Hi there,

This code is for testing purposes so I could learn how XMLReader() works before incorporating it in a RssWebfeed class I've written. In this code the only thing I replace, to work around the bug I got, is the bit that is commented out in this example. 'nosnieuwsalgemeen.xml' is the file I have saved on my PC so I don't have to read it from the internet every time. It is the contents of http://feeds.nos.nl/nosnieuwsalgemeen. Another example is http://www.nasa.gov/rss/breaking_news.rss, but this one doesn't give the bug.

In the implementation, I need to get the data that comes before the first <item> into a feed database which identifies different feed IDs and their titles and descriptions. When I encounter the first <item>, these are records that go into a 2nd database which defines items for a particular feed.

Here's the code:

    /*
    $find = array ( '', '>' );
    $repl = array ( '', '','>\r\n' );
    */
    $file = 'nasa_breaking_news.xml';
    $cont = file_get_contents($file);
    //$cont = str_ireplace($find, $repl, $cont);

    $nodes = array (
        'rss'            => array( 'version' => 'rss_version' ),
        'guid'           => true,
        'link'           => true,
        'title'          => true,
        'description'    => true,
        'pubDate'        => true,
        'lastBuildDate'  => true,
        'language'       => true,
        'image'          => true,
        'enclosure'      => array( 'url' => 'enclosure', 'type' => 'type', 'width' => 'imgwidth' ),
        'managingEditor' => true,
        'related'        => true,
    );

    $siblings = array (
        'image' => array( 'url' => 'image', 'title' => 'alt', 'link' => 'link', 'description' => 'title' ),
    );

    $xml = new XMLReader();
    if ( $xml ) {
        // "gelukt" is Dutch for "succeeded"
        echo ' xml = new XMLReader() gelukt ';
    }

    if ( $xml->xml($cont, THIS_CHARSET, LIBXML_NOERROR|LIBXML_NOWARNING) === true ) {
        printf( ' xml->open() %s ', $file );
        echo ' ';
        $mode = 0;
        $element = '';
        $itemcount = 0;
        while ( $xml->read() ) {
            if ( $xml->name == 'item' ) {
                switch ( $xml->nodeType ) {
                    case XMLReader::ELEMENT:
                        $itemcount++;
                        $mode = 1;
                        break;
                    case XMLReader::END_ELEMENT:
                        $mode = 0;
                        break;
                }
            }
            $element = '';
            switch ( $xml->nodeType ) {
                case XMLReader::END_ELEMENT:
                case XMLReader::SIGNIFICANT_WHITESPACE:
                case XMLReader::WHITESPACE:
                case XMLReader::TEXT:
                case XMLReader::CDATA:
                    continue 2;
            }
            printf( ' xml->read(): xml->name: %s%s xml->nodeType: %d xml->isEmpty: %s xml->hasvalue: %s xml->attr: %s xml->depth: %d',
                $mode+1,
                $xml->name,
                $xml->name=='item' ? sprintf(' (rec#: %u)', $itemcount) : '',
                $xml->nodeType,
                $xml->isEmptyElement ? "yes" : "no",
                $xml->hasValue ? "yes" : "no",
                $xml->hasAttributes ? $xml->attributeCount : "no",
                $xml->depth );
            if ( !$nodes[$xml->name] ) {
                echo ' ';
                continue;
            }
            switch ( $xml->nodeType ) {
[PHP-BUG] Bug #63430 [NEW]: xml data parsing bug
From:             lussenburg_rm at hotmail dot com
Operating system: windows 7
PHP version:      Irrelevant
Package:          XML Reader
Bug Type:         Bug
Bug description:  xml data parsing bug

Description:
---
From manual page: http://www.php.net/xmlreader.read#refsect1-xmlreader.read-description
---

The bug isn't really in the code so I'm not including any script here, but it is related to the XML input. For example, I'm reading some RSS feeds (note that I neither compose them nor am responsible for their layout) that look like this:

    feed title
    feed description
    Mon, 29 Oct 2012 13:30:00 +0100

    item title
    item description
    http://itemlink

    item title
    item description
    http://bla
    ...

Everything was working perfectly fine until I kept getting values from the first 'item title' and 'item description' in the 'feed title' and 'feed description' node values. When I examined the XML data I found out that it only happens when the first <item> tag directly follows the last of the feed-level nodes without a carriage return/newline.

To work around this, before passing the data to XMLReader::xml(), I replace all occurrences of ">" with ">\r\n", which works fine, but maybe it could be resolved so this workaround isn't necessary anymore.
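
A small sketch may help show why the newline workaround appears to help. It is only an illustration: the two input strings and the helper nodeNameAfterNext() are hypothetical and not taken from the report. It assumes the script calls next() on the element preceding <item>, and prints which node the cursor rests on afterwards, with and without a line break in between.

```php
<?php
// Hypothetical helper: returns the name of the node the cursor rests on
// right after next() is called on the first element named $name.
function nodeNameAfterNext($xmlString, $name)
{
    $xml = new XMLReader();
    $xml->xml($xmlString);
    $result = null;
    while ($xml->read()) {
        if ($xml->nodeType === XMLReader::ELEMENT && $xml->name === $name) {
            $xml->next();          // skip the subtree of $name
            $result = $xml->name;  // node reached by next()
            break;
        }
    }
    $xml->close();
    return $result;
}

// Made-up inputs: identical except for the line break before <item>.
$tight  = '<channel><image><url>u</url></image><item><title>t</title></item></channel>';
$spaced = "<channel><image><url>u</url></image>\r\n<item><title>t</title></item></channel>";

var_dump(nodeNameAfterNext($tight, 'image'));   // string(4) "item"
var_dump(nodeNameAfterNext($spaced, 'image'));  // string(5) "#text"
```

With the line break, next() lands on the whitespace node, so a following read() reaches <item>. Without it, next() lands on <item> itself and a following read() moves past it.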