Edit report at https://bugs.php.net/bug.php?id=49705&edit=1
ID:                 49705
Comment by:         glen_scott at yahoo dot co dot uk
Reported by:        lyngvi at gmail dot com
Summary:            DOMDocument::loadHTML should have a way to override charset
Status:             Open
Type:               Feature/Change Request
Package:            DOM XML related
Operating System:   linux
PHP Version:        5.3.0
Block user comment: N
Private report:     N

New Comment:

To work around this issue, you may want to use this extended DOMDocument, which allows you to specify the character encoding when loading HTML documents:

https://github.com/glenscott/dom-document-charset

Please let me know if it is of use.

Previous Comments:
------------------------------------------------------------------------
[2009-09-29 04:09:26] lyngvi at gmail dot com

Description:
------------
I propose that DOMDocument::loadHTML($data) be extended to DOMDocument::loadHTML($data, $forceCharset=null); loadXML might be able to use the same feature, though fixing the XML charset would be easier than HTML's.

Requiring the charset to be specified as a meta http-equiv content-type inside the raw HTML data is clumsy, especially since HTML is often so poorly formed. Generally I try to know my charset a priori, a good practice usually, but, in this case, one that I am being punished for.

The situation I most recently came across was in loading data off a site serving proper UTF-8 data, with *HTTP* content-type text/html charset utf-8, but the redundant meta http-equiv reporting charset iso-8859-1. See the repro code below.

Ideally I could fix the serving site, I know. I can't in this case. Ideally, there would be no famine and no war.

Thanks!
Reproduce code:
---------------
<?php
header("Content-Type: text/html; charset=utf-8");
$htmldata = <<<HTMLDATA
<HTMl><head><title>i our pooryl writn web page
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1;" />
</head >
<body>this is a utf8 apostrophe: ’
</body>
</html>
HTMLDATA;
// Instance call: invoking loadHTML() statically is an error on PHP 8.
$doc = new DOMDocument();
$doc->loadHTML($htmldata);
echo $doc->getElementsByTagName("body")->item(0)->textContent;
?>

Expected result:
----------------
this is a utf8 apostrophe: ’

(the apostrophe shows up correctly - I don't want DOMDocument to mutilate my text)

Actual result:
--------------
this is a utf8 apostrophe: â

(I get an a with a ^ on top, followed by the illegal characters \u0080 and \u0099 - that is, loadHTML read the bytes of \u2019 (e2 80 99) as three iso-8859-1 characters and re-encoded them as \u00e2 \u0080 \u0099 (c3 a2 c2 80 c2 99))

------------------------------------------------------------------------
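
For anyone hitting the same problem, here is a minimal sketch of one possible workaround, not necessarily what the library linked above does. It assumes the dom and mbstring extensions are loaded and that the charset you pass in is ASCII-compatible (UTF-8, ISO-8859-x and so on). The helper name loadHtmlWithCharset() is hypothetical. The idea is to rewrite every non-ASCII character as a numeric character reference before parsing, so that whatever charset the meta tag claims no longer matters:

```php
<?php
// Hypothetical helper, not part of PHP or of the library linked above.
// Assumes ext/dom and ext/mbstring, and an ASCII-compatible $charset.
function loadHtmlWithCharset(string $html, string $charset = 'UTF-8'): DOMDocument
{
    // Decode the input using the charset the caller trusts and turn every
    // code point from U+0080 upwards into a numeric reference (&#NNNN;).
    // The result is pure ASCII, so the parser's charset guess is irrelevant.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($html, $map, $charset);

    $doc = new DOMDocument();
    // Sloppy markup triggers libxml warnings; buffer them instead of printing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// UTF-8 payload whose meta tag wrongly claims iso-8859-1.
$html = "<html><head><title>t</title>"
      . "<meta http-equiv=\"content-type\" content=\"text/html; charset=iso-8859-1\" />"
      . "</head><body>this is a utf8 apostrophe: \xE2\x80\x99</body></html>";

$doc  = loadHtmlWithCharset($html, 'UTF-8');
$text = $doc->getElementsByTagName('body')->item(0)->textContent;
echo $text, "\n";
```

With this approach the text read back from the DOM should contain the apostrophe as the original three UTF-8 bytes rather than the mangled sequence described above. The trade-off is that the whole input is rewritten before parsing, so it costs an extra pass over the data.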