From:             xlex0x835 at rambler dot ru
Operating system: Mac OS X 10.3, FreeBSD 5.3
PHP version:      5.0.3
PHP Bug Type:     DOM XML related
Bug description:  DOMDocument->loadHTML() seems to break the (UTF-8 Russian) encoding

Description:
------------
If I call the DOMDocument->loadHTML() method on UTF-8 HTML that contains Russian characters, those characters come out garbled (please see 'Actual result'). Nothing changes if I specify the encoding by hand, i.e. with the call "$domDoc = new DOMDocument('1.0', 'utf-8');". Everything works fine, however, if I use the DOMDocument->loadXML() method (that is why the input contains an XML declaration). Nothing changes if I remove all the $domDoc options, nor if I remove the "<?xml ... ?>" line (it exists only so that a single source can be passed to both loadHTML() and loadXML() when testing the error).

The problem was discovered on "real-world" HTML; the code was stripped to the minimum for ease of use.

Host info.
===================================
[PHP Modules (on FreeBSD 5.3 host)]
bcmath bz2 calendar ctype curl dom exif ftp gd gettext gmp iconv imap libxml mbstring mcrypt mcve mhash mysql ncurses odbc openssl pcntl pcre pgsql posix pspell readline session shmop SimpleXML snmp soap sockets SPL SQLite standard sysvmsg sysvsem sysvshm tidy tokenizer wddx xml xmlrpc xsl yaz yp zip zlib

No Zend modules.

FreeBSD 5.3-RELEASE
libxml2-2.6.13
gcc (GCC) 3.4.2 [FreeBSD] 20040728

Reproduce code:
---------------
<?php
// Read the UTF-8 test document (see input_test below).
$xmlContent = file_get_contents('input_test');

$domDoc = new DOMDocument();
$domDoc->formatOutput = true;
$domDoc->preserveWhiteSpace = false;
$domDoc->recover = true;

// loadHTML() is the call that garbles the Russian text;
// per the description, loadXML() on the same input works fine.
$domDoc->loadHTML($xmlContent);

file_put_contents('output_test', $domDoc->saveXML());
?>

input_test:
===========
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title> - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
</html>

Expected result:
----------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title> - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
</html>

Actual result:
--------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>ТеÑÑ - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
</html>

-- 
Edit bug report at http://bugs.php.net/?id=32547&edit=1
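
A possible way to narrow the report down is to parse one and the same string with both loaders and compare the <title> bytes directly, without going through files. The following is only a sketch under assumptions: the Cyrillic word in $source is an assumed stand-in for the original sample text, titleOf() is a hypothetical helper, and libxml_use_internal_errors() requires PHP 5.1 or later (the nowdoc syntax requires PHP 5.3 or later). What the loadHTML() line prints depends on the installed libxml2 version, so no expected output is given.

```php
<?php
// Assumed sample input: the Cyrillic word is a stand-in, not the reporter's data.
$source = <<<'HTML'
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title>Тест - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
</html>
HTML;

// Hypothetical helper: text of the first <title> element, or null if none exists.
function titleOf(DOMDocument $doc)
{
    $node = $doc->getElementsByTagName('title')->item(0);
    return $node === null ? null : $node->nodeValue;
}

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$viaXml = new DOMDocument();
$viaXml->loadXML($source);

$viaHtml = new DOMDocument();
$viaHtml->loadHTML($source);

$titleXml  = titleOf($viaXml);
$titleHtml = titleOf($viaHtml);

// Hex dumps expose byte-level differences regardless of terminal encoding.
echo 'loadXML : ', bin2hex((string) $titleXml), "\n";
echo 'loadHTML: ', bin2hex((string) $titleHtml), "\n";
echo $titleXml === $titleHtml ? "titles match\n" : "titles differ\n";
```

If the two hex dumps differ, the loadHTML() bytes show how the Cyrillic characters were reinterpreted, which helps separate a parsing problem from a serialization problem in saveXML().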