ID:               49984
User updated by:  ppass at hotmail dot fr
Reported By:      ppass at hotmail dot fr
Status:           Bogus
Bug Type:         DOM XML related
Operating System: Linux ns1 2.6.28.4-rsbac
PHP Version:      5.2.11
New Comment:

This is still an open topic for me, since there seems to be no easy way to implement the libxml2 team's suggestion in PHP (adding the HTML_PARSE_RECOVER option when creating the parsing context). Can this be done in PHP, and if so, how? Please advise; otherwise the subject remains open.


Previous Comments:
------------------------------------------------------------------------

[2009-11-02 17:45:51] ppass at hotmail dot fr

The reply from the libxml2 team is to try adding the HTML_PARSE_RECOVER option when creating the parsing context. I have no idea what that means. Does anybody know how this can be done from PHP code?

------------------------------------------------------------------------

[2009-11-02 13:46:20] ppass at hotmail dot fr

Thank you for the details; I just filed a bug in their system.

------------------------------------------------------------------------

[2009-11-02 06:53:03] ras...@php.net

We didn't write the DOM implementation. We are simply using libxml2. Information on how to file a bug against libxml2 is here:

http://xmlsoft.org/bugs.html

But I suspect they won't consider this a bug. Their relaxed HTML parser isn't a full HTML parser that knows about embedded script objects. This would only be a PHP bug if we were somehow calling libxml2 incorrectly and causing this, but that doesn't appear to be the case here.

------------------------------------------------------------------------

[2009-11-02 05:42:09] ppass at hotmail dot fr

Still no reaction to this bug. Maybe my previous title was too specific. More generally speaking, it means that the DOM model is broken in PHP whenever a script tag contains other tags in its text.

This is a serious bug that must be corrected ASAP; otherwise it is not possible to make reliable use of DOM.

------------------------------------------------------------------------

[2009-10-24 04:27:57] ppass at hotmail dot fr

Description:
------------
The script node's parent is a div. The script node has the text '</div>' inside its script.
The DOM node returns only part of the script node's contents, as if the node had been mistakenly truncated on reaching the '</div>' text.

Reproduce code:
---------------
$html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"><title>Title</title></head><body><div><script type="text/javascript" id="script1">function dummy { object.innerHTML="<div>text</div>"; } function dummy2 { alert("hello"); } </script> </div> </body> </html>';

$dom = new DOMDocument('1.0', 'UTF-8');
// Suppress the warnings the HTML parser emits for this markup
@$dom->loadHTML($html);

$script_node = $dom->getElementById('script1');
echo "<![CDATA[$script_node->nodeValue]]>";

Expected result:
----------------
function dummy { object.innerHTML="<div>text</div>"; } function dummy2 { alert("hello"); }

I expect to see the whole content of the script node.

Actual result:
--------------
function dummy { object.innerHTML="<div>text

The script node has been truncated.

------------------------------------------------------------------------
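
One possible workaround, sketched here under stated assumptions, does not set HTML_PARSE_RECOVER at all. It sidesteps the problem by rewriting "</" as "<\/" inside each script body before the markup reaches loadHTML(), which is the escaping that HTML 4.01 (Appendix B.3.2) recommends for script data. The helper names escape_script_body, escape_script_end_tags and load_html_escaped are hypothetical and are not part of PHP or libxml2. The sketch assumes that "</" occurs only inside JavaScript string literals (where "\/" still means "/") and that script blocks can be located with a simple regular expression.

```php
<?php
// Hypothetical callback: receives one matched <script> block and escapes
// every "</" in its body so the HTML parser no longer sees an end tag there.
function escape_script_body($m)
{
    // $m[1] = opening tag, $m[2] = script body, $m[3] = closing tag
    return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
}

// Hypothetical helper: applies the escaping to every <script>...</script>
// block in the markup and leaves everything else untouched.
function escape_script_end_tags($html)
{
    return preg_replace_callback(
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        'escape_script_body',
        $html
    );
}

// Hypothetical wrapper: loads the pre-escaped markup the same way the
// report's reproduce code does.
function load_html_escaped($html)
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    @$dom->loadHTML(escape_script_end_tags($html));
    return $dom;
}
```

With the $html from the report, load_html_escaped($html)->getElementById('script1')->nodeValue would be expected to hold the whole script, with "<\/div>" in place of "</div>". That result has not been checked against particular libxml2 versions. A named callback is used instead of a closure so the sketch also runs on PHP 5.2.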