From:             thomas dot koch at ymc dot ch
Operating system: Debian Lenny
PHP version:      5.2.9
PHP Bug Type:     XML related
Bug description:  no option to set HTML input encoding

Description:
------------
Enhancement request.

I need a way to specify the HTML input encoding (as parsed from the HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using loadHTMLFile is not always an option. libxml2 honors the content-type meta tag, but this tag may not always be present.

How should the input encoding be indicated? In DOMDocument::__construct() or in DOMDocument::encoding, or are the two the same?

One could look at libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding)
There the encoding is set by first building a "charset=$encoding" string and passing it to htmlCheckEncoding, which in turn parses the encoding out of the string again. This may be worth cleaning up together with upstream.

Reproduce code:
---------------
<?php
$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;

-- 
Edit bug report at http://bugs.php.net/?id=47875&edit=1
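
Possible workaround until such an option exists: since libxml2 honors the content-type meta tag, the encoding taken from the HTTP headers can be injected into the string before it is handed to loadHTML. The following is only a sketch. The helper name withCharsetMeta is made up for illustration and is not part of the DOM extension, and the sample body is written as explicit UTF-8 bytes so the script does not depend on how the file itself is saved.

```php
<?php
// Hypothetical helper (not part of ext/dom): add a content-type meta tag
// carrying the charset that was learned elsewhere, e.g. from HTTP headers.
function withCharsetMeta($html, $charset)
{
    $meta = '<meta http-equiv="content-type" content="text/html; charset='
          . htmlspecialchars($charset, ENT_QUOTES) . '">';

    // Prefer placing the tag directly after the opening <head> tag.
    if (preg_match('/<head\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        $pos = $m[0][1] + strlen($m[0][0]);
        return substr($html, 0, $pos) . $meta . substr($html, $pos);
    }

    // No <head> found: fall back to prepending the tag to the input.
    return $meta . $html;
}

// "s\xC3\xBC\xC3\x9F" is the UTF-8 byte sequence for the sample text.
$html = "<html><head></head><body id=\"umlaut\">s\xC3\xBC\xC3\x9F</body></html>";

$dom = new DOMDocument;
$dom->loadHTML(withCharsetMeta($html, 'utf-8'));

$elem = $dom->getElementById('umlaut');
if ($elem !== null) {
    echo $elem->textContent, "\n";
}
```

This only covers documents that do not already declare a charset themselves. What happens when the injected tag conflicts with a declaration already present in the markup has not been checked here, which is one more reason a proper encoding parameter would be preferable.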