Edit report at http://bugs.php.net/bug.php?id=47502&edit=1

 ID:               47502
 Updated by:       fel...@php.net
 Reported by:      grodny at oneclick dot sk
 Summary:          xml_get_current_byte_index inside character data
                   handler returns wrong offset
-Status:           Open
+Status:           Assigned
 Type:             Bug
 Package:          XML related
 Operating System: Windows
 PHP Version:      5.3CVS-2009-02-25 (snap)
-Assigned To:      
+Assigned To:      rrichards



Previous Comments:
------------------------------------------------------------------------
[2009-02-26 09:13:07] grodny at oneclick dot sk

A possible solution would be to introduce a second, optional argument,
enhancing the current functionality while keeping backward compatibility.

xml_get_current_byte_index (resource $parser [, int $position = 0])

position:

  If 0, keep the current behaviour for backward compatibility.

  If -1, return the index of the first byte of the node
    (for a start element or PI this is '<'; for a text node it is the
    first byte of the $data string passed to the handler, etc.)

  If 1, return the index after the last byte of the node
    (for a start element or PI, the byte index after '>'; for a text
    node, after the last byte of the $data string passed to the handler)
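
The three values of the proposed $position argument could be illustrated
roughly as follows. This is only a sketch of the intended semantics, not
an implementation: proposed_byte_index() is a hypothetical userland
function, and $current, $start and $length stand for values that only the
parser knows internally (ext/xml does not expose the node start or length
to PHP code).

```php
<?php
// Hypothetical illustration of the proposed $position semantics.
// $current: the index the parser reports today (current behaviour)
// $start:   index of the first byte of the node (assumed known)
// $length:  length of the node in bytes (assumed known)
function proposed_byte_index($current, $start, $length, $position = 0)
{
    if ($position < 0) {
        // -1: index of the first byte of the node
        return $start;
    }
    if ($position > 0) {
        // 1: index after the last byte of the node
        return $start + $length;
    }
    // 0: unchanged behaviour, for backward compatibility
    return $current;
}
```

For a start element '<N>' at the very beginning of a document ($start 0,
$length 3), -1 would give 0 and 1 would give 3.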

------------------------------------------------------------------------
[2009-02-25 15:56:11] grodny at oneclick dot sk

Description:
------------
The byte index returned by xml_get_current_byte_index(), when called
inside a character data handler, points to different locations in the
XML source depending on the character data being parsed.

If the string passed as the second argument to the handler starts with
an ASCII non-whitespace character, the byte index points to the location
before the parsed string.

If the parsed string starts with whitespace or with a non-ASCII
(multi-byte UTF-8) character, the byte index points to the location
after the parsed string.

For consistency with the other handlers, it should return the offset of
the location after the parsed string in all cases.



Reproduce code:
---------------
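// Four text nodes: ASCII first character, UTF-8 first character
// (section sign), leading whitespace, and trailing whitespace.
$xml = '<R><N>before</N><N>'
        .html_entity_decode('&sect;', ENT_COMPAT, 'UTF-8')
        .'after</N><N> after</N>before </R>';

// Character data handler: prints the text it received and the part of
// the source that follows the reported byte index.
function cdata ($p, $cdata) {
  global $xml;

  $off = xml_get_current_byte_index($p);

  echo 'CDATA: "',
    htmlentities($cdata, ENT_COMPAT, 'UTF-8'), '"', PHP_EOL,
    'AFTER-INDEX: "',
    htmlentities(substr($xml, $off), ENT_COMPAT, 'UTF-8'), '"',
    PHP_EOL;
}

$p = xml_parser_create('UTF-8');
xml_set_character_data_handler($p, 'cdata');
xml_parse($p, $xml, true);
xml_parser_free($p);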



Expected result:
----------------
CDATA: "before"
AFTER-INDEX: "</N><N>§after</N><N> after</N>before </R>"
CDATA: "§after"
AFTER-INDEX: "</N><N> after</N>before </R>"
CDATA: " after"
AFTER-INDEX: "</N>before </R>"
CDATA: "before "
AFTER-INDEX: "</R>"



Actual result:
--------------
CDATA: "before"
AFTER-INDEX: "before</N><N>§after</N><N> after</N>before </R>"
CDATA: "§after"
AFTER-INDEX: "</N><N> after</N>before </R>"
CDATA: " after"
AFTER-INDEX: "</N>before </R>"
CDATA: "before "
AFTER-INDEX: "before </R>"
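
Until the behaviour is made consistent, the two cases shown above could
be told apart in userland with a check like the one below. This is a
heuristic sketch, not a fix: offset_after_cdata() is a hypothetical
helper, and it assumes the whole source is available in $xml and that
$data appears byte-for-byte in the source (no entity references, no
encoding conversion). It can also misjudge a text node that the parser
delivers in several identical chunks.

```php
<?php
// Hypothetical helper: normalise the index reported inside a character
// data handler so that it always points after the text in $data.
function offset_after_cdata($xml, $off, $data)
{
    $len = strlen($data);

    // If the source at $off starts with $data, the reported index points
    // before the text, so move it past the text.
    if ($len > 0 && substr($xml, $off, $len) === $data) {
        return $off + $len;
    }

    // Otherwise assume the index already points after the text.
    return $off;
}
```

Inside the handler from the reproduce code it would be called as
offset_after_cdata($xml, xml_get_current_byte_index($p), $cdata).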


------------------------------------------------------------------------


