Edit report at https://bugs.php.net/bug.php?id=63430&edit=1

 ID:                 63430
 User updated by:    lussenburg_rm at hotmail dot com
 Reported by:        lussenburg_rm at hotmail dot com
 Summary:            xml data parsing bug
-Status:             Open
+Status:             Closed
 Type:               Bug
 Package:            XML Reader
 Operating System:   windows 7
 PHP Version:        Irrelevant
 Block user comment: N
 Private report:     N

 New Comment:

.


Previous Comments:
------------------------------------------------------------------------
[2012-11-21 11:32:16] lussenburg_rm at hotmail dot com

That does work indeed, thanks. I guess I misunderstood the explanation of 
next(). I didn't expect it to skip over the opening <tag> of a new element. I 
thought it would only skip over all subtrees of the current element, and that 
the read() at the top of the loop would start at the <item> element.

Compliments on the 'super fast' reply, too!

------------------------------------------------------------------------
[2012-11-20 21:44:29] mail+php at requinix dot net

Hate to burst your bubble but there's a flaw in your code. The problem occurs when
* There is a node before an <item> with no whitespace (i.e., a #text) in between
* Said node has children
* Said node has an entry in $siblings

The last two cause a line of code near the bottom

if ( $node->hasChildNodes() && ($mode == 1 || $siblings[$node->nodeName]) )
  $xml->next();

to fire. next() will skip over the rest of the node and, in the absence of a 
subsequent #text, advance to the <item>. But at the top of your loop you have a 
read(). That will skip over the <item> tag and into the following #text (between 
the <item> and the <title>). You can confirm this by outputting the node name at 
the beginning of the loop, before the switch that would skip over it: <image>, 
then #text, then <title>.
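
For illustration only, here is a minimal sketch of that skip. The XML string 
and the $seen array are made up for this example (they are not from the 
reported script), and the input has no whitespace at all, so the node after 
<item> is <title> rather than a #text:

```php
<?php
// Hypothetical input: <image> has children and is immediately followed by
// <item>, with no whitespace (#text) in between.
$xml = '<channel><image><url>u</url></image><item><title>t</title></item></channel>';

$reader = new XMLReader();
$reader->XML($xml);

$seen = array();
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    $seen[] = $reader->name;
    if ($reader->name === 'image') {
        // next() skips the <image> subtree and leaves the reader on <item>.
        // The read() in the loop condition then moves past <item> before
        // the loop body ever gets to look at it.
        $reader->next();
    }
}

echo implode(', ', $seen), "\n";
```

In this sketch <item> should never show up in $seen, even though it is in the 
input.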

It works for me if I change the while loop into a do/while:
* $xml->read() before the loop to initialize
* flag=false at the start of the loop
* the aforementioned line sets flag=$xml->next()
* do/while ( flag || $xml->read() )
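
A rough sketch of that do/while shape, under the same assumptions as any 
stripped-down example: the XML string, $seen and $advanced (the "flag") are 
hypothetical names, and the condition for calling next() is reduced to a 
simple name check instead of the $mode/$siblings test in the posted script.

```php
<?php
// Hypothetical input with <image> directly followed by <item>.
$xml = '<channel><image><url>u</url></image><item><title>t</title></item></channel>';

$reader = new XMLReader();
$reader->XML($xml);

$seen = array();
if ($reader->read()) {              // initialise: move to the first node
    do {
        $advanced = false;          // reset the flag on every iteration
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $seen[] = $reader->name;
            if ($reader->name === 'image') {
                // next() already moved the reader to the following node,
                // so remember that and skip the read() below.
                $advanced = $reader->next();
            }
        }
        // Only call read() when next() has not already advanced the reader.
    } while ($advanced || $reader->read());
}

echo implode(', ', $seen), "\n";
```

Because the flag is reset at the top of the body, a continue inside the loop 
still falls through to a normal read() in the condition.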

If you'd like to know more you can email me at this address.

------------------------------------------------------------------------
[2012-11-20 20:30:51] lussenburg_rm at hotmail dot com

Hi there,

This code is for testing purposes, so I could learn how XMLReader works before 
incorporating it in an RssWebfeed class I've written.
In this code the only thing I replace, to work around the bug I got, is the bit 
that is commented out in this example. 'nosnieuwsalgemeen.xml' is the file I 
have saved on my PC so I don't have to read it from the internet every time. It 
is the contents of http://feeds.nos.nl/nosnieuwsalgemeen. Another example is 
http://www.nasa.gov/rss/breaking_news.rss, but this one doesn't trigger the bug.
In the implementation, I need to get the data that comes before the first 
<item> into a feed database, which identifies different feed IDs and their 
titles and descriptions. The <item> elements, from the first one onward, are 
records that go into a second database, which defines the items for a 
particular feed.


Here's the code:


/*
$find = array (
        '<![CDATA[', ']]>', '><item>'
);
$repl = array (
        '',          '',    '>\r\n<item>'
);
*/

$file = 'nasa_breaking_news.xml';

$cont = file_get_contents($file);
//$cont = str_ireplace($find, $repl, $cont);

$nodes = array (
        'rss'            => array( 'version' => 'rss_version' ),
        'guid'           => true,
        'link'           => true,
        'title'          => true,
        'description'    => true,
        'pubDate'        => true,
        'lastBuildDate'  => true,
        'language'       => true,
        'image'          => true,
        'enclosure'      => array( 'url' => 'enclosure', 'type' => 'type', 'width' => 'imgwidth' ),
        'managingEditor' => true,
        'related'        => true,
);

$siblings = array (
        'image' => array( 'url' => 'image', 'title' => 'alt', 'link' => 'link', 'description' => 'title' ),
);

$xml = new XMLReader();

if ( $xml ) {
        echo '
        <div class="e large">xml = new XMLReader()</div>
        <div>gelukt</div>
        <br>';
}

if ( $xml->xml($cont, THIS_CHARSET, LIBXML_NOERROR|LIBXML_NOWARNING) === true ) {
        printf( '
        <div class="e large">xml->open()</div>
        <div>%s</div>
        <br>',
        $file
        );

        echo '
        <br>';

        $mode        = 0;
        $element     = '';
        $itemcount   = 0;

        while ( $xml->read() ) {

                if ( $xml->name == 'item' ) {
                        switch ( $xml->nodeType ) {
                        case XMLReader::ELEMENT:
                                $itemcount++;
                                $mode = 1;
                                break;
                        case XMLReader::END_ELEMENT:
                                $mode = 0;
                                break;
                        }
                }

                $element = '';

                switch ( $xml->nodeType ) {
                case XMLReader::END_ELEMENT:
                case XMLReader::SIGNIFICANT_WHITESPACE:
                case XMLReader::WHITESPACE:
                case XMLReader::TEXT:
                case XMLReader::CDATA:
                        continue 2;
                }

                printf( '
                <br>
                <div style="padding-left:%uem;">
                <div class="e large">xml->read():</div>
                <div>xml->name: %s%s</div>
                <div>xml->nodeType: %d</div>
                <div>xml->isEmpty: %s</div>
                <div>xml->hasvalue: %s</div>
                <div>xml->attr: %s</div>
                <div>xml->depth: %d</div>',
                $mode+1,
                $xml->name,
                $xml->name=='item' ? sprintf(' (rec#: %u)', $itemcount) : '',
                $xml->nodeType,
                $xml->isEmptyElement ? "yes" : "no",
                $xml->hasValue ? "yes" : "no",
                $xml->hasAttributes ? $xml->attributeCount : "no",
                $xml->depth
                );

                if ( !$nodes[$xml->name] ) {
                        echo '
                        </div>';
                        continue;
                }

                switch ( $xml->nodeType ) {
                case XMLReader::ELEMENT:
                        $element = $xml->name;
                        printf( '
                        <div%s>',
                        $nodes[$xml->name] ? ' class="grey"' : ''
                        );
                        if ( $nodes[$xml->name] === true ) {
                                printf( '
                                <div>INNER: %s</div>',
                                $xml->readInnerXML()
                                );
                        }
                        if ( $node = $xml->expand() ) {
                                printf( '
                                <div>node->name: %s</div>',
                                $node->nodeName
                                );
                                printf( '
                                <div>node->childs: %s</div>',
                                $node->hasChildNodes() ? "".$node->childNodes->length : "no"
                                );
                                if ( $xml->hasAttributes && $node->attributes !== null ) {
                                        echo '
                                        <div>node->attr: ';
                                        for ( $i = 0; $i < $xml->attributeCount; $i++ ) {
                                                $item = $node->attributes->item($i);
                                                if ( $nodes[$xml->name][$item->nodeName] ) printf('[%s=%s]', $nodes[$xml->name][$item->nodeName], $item->nodeValue);
                                        }
                                        echo '
                                        </div>';
                                }
                                if ( $node->hasChildNodes() && $siblings[$node->nodeName] ) {
                                        echo '<div>node->items:';
                                        for ( $i = 0; $i < $node->childNodes->length - 1; $i++ ) {
                                                $item = $node->childNodes->item($i);
                                                if ( $item->nodeType == XMLReader::ELEMENT && $siblings[$node->nodeName][$item->nodeName]) {
                                                        echo '['.$siblings[$node->nodeName][$item->nodeName].'='.$item->nodeValue.']';
                                                }
                                        }
                                        echo '</div>';
                                }
                                if ( $node->hasChildNodes() && ($mode == 1 || $siblings[$node->nodeName]) ) $xml->next();
                        }
                        echo '
                        </div>';
                        break;
                }
                echo '
                </div>';
        }

        $ret = $xml->close();

        printf( '
        <br>
        <div class="bordertop">
        <div class="e large">xml->close():</div>
        <div>%sgelukt</div>
        </div>',
        $ret===false ? 'niet ' : ''
        );

}

------------------------------------------------------------------------
[2012-11-07 19:50:22] mail+php at requinix dot net

Even if the input is "faulty", example code is still important. For all we know, 
it's a complex problem you're triggering because of something subtle in your 
code.

I can't reproduce it with

<?php
$xml = <<<XML
<rss>
 <channel>
  <title>feed title</title>
  <description>feed description</description>
  <pubDate>Mon, 29 Oct 2012 13:30:00 +0100</pubDate><item>
    <title>item title</title>
    <description>item description</description>
    <link>itemlink</link>
  </item>
 </channel>
</rss>
XML;

$reader = new XMLReader();
$reader->xml($xml);

// http://www.php.net/manual/en/class.xmlreader.php#88264
function xml2assoc($xml) { removed for brevity }

print_r(xml2assoc($reader));
?>

PHP 5.4.3 and libxml 2.7.7

------------------------------------------------------------------------
[2012-11-03 17:23:12] lussenburg_rm at hotmail dot com

Description:
------------
---
From manual page: http://www.php.net/xmlreader.read#refsect1-xmlreader.read-description
---
The bug isn't really in the code, so I'm not including any script here; it is 
related to the XML input. For example, I'm reading some RSS feeds (note that I 
neither compose them nor am responsible for their layout) that look like this:

<rss>
 <channel>
  <title>feed title</title>
  <description>feed description</description>
  <pubDate>Mon, 29 Oct 2012 13:30:00 +0100</pubDate>
  <item>
    <title>item title</title>
    <description>item description</description>
    <link>http://itemlink</link>
  </item>
  <item>
    <title>item title</title>
    <description>item description</description>
    <link>http://bla</link>
  </item>
  ...
 </channel>
</rss>

Everything was working perfectly fine until I kept getting values from the 
first 'item title' and 'item description' in the 'feed title' and 'feed 
description' node values. When I examined the XML data I found that this only 
happens when the first <item> tag directly follows the last of the <channel> 
nodes (<title>, <description>, <pubDate> etc.) without a carriage return/newline.
To work around this, before passing the data to XMLReader::xml(), I replace all 
occurrences of "><item>" with ">\r\n<item>", which works fine, but maybe it 
could be resolved so this workaround isn't necessary anymore.
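
As a minimal sketch, the workaround described above amounts to something like 
the following. The $cont string is a made-up fragment, not a real feed, and 
$patched is a hypothetical variable name:

```php
<?php
// Hypothetical feed fragment where <item> directly follows </pubDate>.
$cont = '<channel><pubDate>Mon, 29 Oct 2012 13:30:00 +0100</pubDate>'
      . '<item><title>item title</title></item></channel>';

// The replacement must be double-quoted: in a single-quoted PHP string,
// \r and \n stay as literal backslash sequences instead of CR/LF.
$patched = str_replace('><item>', ">\r\n<item>", $cont);

$reader = new XMLReader();
$reader->XML($patched);   // parse the patched data instead of the raw feed
```

Either way, the replacement puts a #text node between the preceding element 
and the <item>.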




------------------------------------------------------------------------


