ID:               39269
Updated by:       [EMAIL PROTECTED]
Reported By:      arturm at union dot com dot pl
-Status:           Open
+Status:           Bogus
Bug Type:         DOM XML related
Operating System: Windows
PHP Version:      5.1.6
New Comment:

The answer is in the very first user note of DOMDocument->loadHTML():
http://php.net/manual/en/function.dom-domdocument-loadhtml.php

You must specify the character set in the <HEAD> tag for libxml2 to
use it. We can't change this behaviour, as this is how libxml2 works.


Previous Comments:
------------------------------------------------------------------------

[2006-10-26 17:23:27] arturm at union dot com dot pl

Sorry, the charset on bugs.php.net is not UTF-8. Please see the
original thread on pl.comp.lang.php for the source code:
http://groups.google.pl/group/pl.comp.lang.php/browse_frm/thread/e0de8a41d687aef3/d2c602e5ac1d40cb?hl=pl#d2c602e5ac1d40cb

------------------------------------------------------------------------

[2006-10-26 17:17:56] arturm at union dot com dot pl

Description:
------------
If you load HTML using DOM::loadHTML(), the wrong charset is used when
non-US-ASCII characters appear in the source before the charset
declaration in the meta tag.

Reproduce code:
---------------
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
$doc->loadHTML('<title>ą</title>'
    .'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    .'<p>ąęółść</p>');
echo $doc->encoding;
echo $doc->textContent;
?>

Expected result:
----------------
UTF-8ąęółść

Actual result:
--------------
UTF-8ąąęółść

------------------------------------------------------------------------
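
For reference, the workaround described in the new comment (declare the
character set in the head, before any non-US-ASCII text) might look
roughly like the sketch below. It is a variant of the reproduce code
with the meta tag moved ahead of the title. The explicit
<html>/<head>/<body> wrapper and the variable names are assumptions for
illustration, and the exact behaviour may depend on the libxml2 version
PHP is built against.

```php
<?php
// Sketch only: put the charset declaration first, so libxml2 has seen it
// before it has to decode the non-ASCII title text.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '<title>ą</title>'
      . '</head><body><p>ąęółść</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Read the two text nodes separately to check that each was decoded as UTF-8.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
$para  = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $title, "\n", $para, "\n";
```

The source file itself has to be saved as UTF-8 for the string literals
to carry the intended bytes.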