Edit report at https://bugs.php.net/bug.php?id=49705&edit=1
ID: 49705
Comment by: glen_scott at yahoo dot co dot uk
Reported by: lyngvi at gmail dot com
Summary: DOMDocument::loadHTML should have a way to override charset
Status: Open
Type: Feature/Change Request
Package: DOM XML related
Operating System: linux
PHP Version: 5.3.0
Block user comment: N
Private report: N
New Comment:
To work around this issue, you may want to use this extended DOMDocument, which
allows you to specify the character encoding when loading HTML documents:
https://github.com/glenscott/dom-document-charset
Please let me know if it is of use.
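For readers who cannot add a dependency, here is a minimal sketch of one
possible workaround. It is not taken from the library linked above, and the
function name loadHtmlWithCharset is hypothetical. It assumes the mbstring
extension is available and that the caller already knows the real charset. The
idea is to turn every non-ASCII character into a numeric entity before parsing,
so that whatever charset the meta tag declares no longer matters:

```php
<?php
// Hypothetical helper (not part of PHP or of the linked library).
// $charset is the encoding the caller knows the bytes are really in.
function loadHtmlWithCharset(string $html, string $charset = 'UTF-8'): DOMDocument
{
    // Map every code point from U+0080 upwards to a numeric entity,
    // leaving plain ASCII (including the markup itself) untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, $charset);

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// UTF-8 data whose meta tag wrongly claims iso-8859-1, as in the report.
$html = '<html><head><title>t</title>'
    . '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />'
    . "</head><body>this is a utf8 apostrophe: \u{2019}</body></html>";

$doc = loadHtmlWithCharset($html, 'UTF-8');
$body = $doc->getElementsByTagName('body')->item(0)->textContent;
echo bin2hex(substr($body, -3)), "\n";
```

The last three bytes of the body text should come out as the UTF-8 apostrophe
rather than the re-encoded form. One caveat: text inside script or style
elements is not entity-decoded by an HTML parser, so non-ASCII characters there
would probably remain as literal entity text.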
Previous Comments:
------------------------------------------------------------------------
[2009-09-29 04:09:26] lyngvi at gmail dot com
Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to
DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use
the same feature, though fixing the XML charset would be easier than HTML's.
Requiring the charset to be specified as a meta http-equiv content-type inside
the raw HTML data is clumsy, especially since HTML is often so poorly formed.
Generally I try to know my charset a priori, a good practice usually, but, in
this case, one that I am being punished for.
The situation I most recently came across was in loading data from a site
serving proper UTF-8 data, with *HTTP* content-type text/html, charset utf-8,
but with the redundant meta http-equiv reporting charset iso-8859-1. See the
repro code below.
Ideally I could fix the serving site, I know. I can't in this case. Ideally,
there would be no famine and no war.
Thanks!
Reproduce code:
---------------
<?php
header("Content-Type: text/html; charset=utf-8");
$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’</body>
</html>
HTMLDATA;
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;
?>
Expected result:
----------------
this is a utf8 apostrophe: ’
(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my
text)
Actual result:
--------------
this is a utf8 apostrophe: â
(I get an "a" with a ^ on top, followed by the illegal characters \u0080 and
\u0099 - that is, loadHTML re-encoded \u2019 (e2 80 99) to get \u00e2 \u0080
\u0099 (c3 a2 c2 80 c2 99))
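The byte arithmetic above can be checked with a short sketch. It assumes the
mbstring extension and simply treats the three UTF-8 bytes of \u2019 as if they
were iso-8859-1 text, which is the misreading described here; it does not call
loadHTML itself:

```php
<?php
// The apostrophe U+2019 as correct UTF-8.
$original = "\u{2019}";
echo bin2hex($original), "\n"; // e28099

// Misread those three bytes as iso-8859-1 and convert them to UTF-8:
// each byte becomes its own code point (U+00E2, U+0080, U+0099).
$mangled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo bin2hex($mangled), "\n"; // c3a2c280c299
```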
------------------------------------------------------------------------