Edit report at https://bugs.php.net/bug.php?id=49705&edit=1

 ID:                 49705
 Comment by:         glen_scott at yahoo dot co dot uk
 Reported by:        lyngvi at gmail dot com
 Summary:            DOMDocument::loadHTML should have a way to override
                     charset
 Status:             Open
 Type:               Feature/Change Request
 Package:            DOM XML related
 Operating System:   linux
 PHP Version:        5.3.0
 Block user comment: N
 Private report:     N

 New Comment:

To work around this issue, you may want to use this extended DOMDocument class, 
which allows you to specify the character encoding when loading HTML documents:

https://github.com/glenscott/dom-document-charset

Please let me know if it is of use.
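
For readers who cannot add a dependency, here is a minimal sketch of one way to force a charset with stock PHP. It is not the implementation used by the library linked above, and loadHtmlWithCharset() is a hypothetical helper name, not a PHP function. It assumes the mbstring extension is available. The idea is to convert the input to UTF-8 using the charset the caller already knows, then rewrite every non-ASCII character as a numeric character reference, so the bytes handed to loadHTML are plain ASCII and a misleading meta charset has nothing left to misinterpret.

```php
<?php
// Hypothetical helper: not part of PHP and not taken from the linked library.
function loadHtmlWithCharset(string $html, string $charset): DOMDocument
{
    // Normalise to UTF-8 using the charset the caller knows to be correct.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $charset);

    // Rewrite every code point from U+0080 upwards as a numeric character
    // reference (&#NNNN;). The result is pure ASCII, so it decodes the same
    // way whichever charset the document's meta tag claims.
    $ascii = mb_encode_numericentity($utf8, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

// Sample input: UTF-8 bytes, but the meta tag wrongly claims iso-8859-1.
$html = "<html><head><title>t</title>"
      . "<meta http-equiv=\"content-type\" content=\"text/html; charset=iso-8859-1\" />"
      . "</head><body>this is a utf8 apostrophe: \u{2019}</body></html>";

$doc  = loadHtmlWithCharset($html, 'UTF-8');
$text = $doc->getElementsByTagName('body')->item(0)->textContent;
echo $text, "\n";
```

One caveat with this approach: character references are not decoded inside script and style elements, so non-ASCII text there would come back as literal "&#NNNN;" sequences.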


Previous Comments:
------------------------------------------------------------------------
[2009-09-29 04:09:26] lyngvi at gmail dot com

Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to 
DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use 
the same feature, though fixing the XML charset would be easier than HTML's.

Requiring the charset to be specified as a meta http-equiv content-type inside 
the raw HTML data is clumsy, especially since HTML is often so poorly formed. 
Generally I try to know my charset a priori, a good practice usually, but, in 
this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site 
serving proper UTF-8 data, with *HTTP* content-type text/html; charset=utf-8, 
but the redundant meta http-equiv reporting charset iso-8859-1. See the repro 
code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, 
there would be no famine and no war.

Thanks!

Reproduce code:
---------------
<?php

header("Content-Type: text/html; charset=utf-8");

$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;

// loadHTML is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;

?>



Expected result:
----------------
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my 
text)

Actual result:
--------------
this is a utf8 apostrophe: â&#128;&#153;
(I get an a with a ^ on top, and the illegal characters \u0080 and \u0099 - that 
is, loadHTML re-encoded \u2019 (e2 80 99) to get \u00e2 \u0080 \u0099 (c3 a2 c2 
80 c2 99))
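
The byte arithmetic can be checked without involving DOMDocument at all. The following small sketch (assuming the mbstring extension) takes the three UTF-8 bytes of \u2019, deliberately misreads them as ISO-8859-1, and converts the result to UTF-8, which is the same double encoding described above. The variable names are illustrative only.

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK, encoded as UTF-8.
$apostrophe = "\u{2019}";
echo bin2hex($apostrophe), "\n";   // e28099

// Treat each of those three bytes as an ISO-8859-1 character
// (U+00E2, U+0080, U+0099) and encode the result as UTF-8.
$misread = mb_convert_encoding($apostrophe, 'UTF-8', 'ISO-8859-1');
echo bin2hex($misread), "\n";      // c3a2c280c299
```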


------------------------------------------------------------------------


