From:             bugs at nikmakepeace dot com
Operating system: FC3
PHP version:      5.1.1
PHP Bug Type:     Unknown/Other Function
Bug description:  tidy does not produce valid utf8 when the encoding is specified in the config

Description:
------------
If you specify utf8 encoding using the config options 'char-encoding', 'input-encoding' and 'output-encoding', tidy converts HTML entities into their single-byte latin1 equivalents rather than the correct multi-byte UTF-8 sequences (or just leaving them as entities).

The result is that &nbsp; is converted into 0xA0, &eacute; is converted into 0xE9, and so on. This is not valid UTF-8, so well-behaved XML parsers, including PHP's DOM, fail.

Specifying 'utf8' as the third parameter works correctly.

Reproduce code:
---------------
<?php
$dirty='<a href="http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">B&eacute;atrice Dalle t&eacute;moigne au proc&egrave;s de son mari accus&eacute; de viol</a><br/>
<small><nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">人ã¨å·®ãã¤ãå°±è·æ´»åãããã</a></nobr> - <nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">ãã¤ã³ã5åã®ã¯ãªã¹ãã¹ã®ããã¯12æã¾ã§ï¼</a></nobr></small>';

// Encoding requested through the config array only (no third parameter)
$config['char-encoding']='utf8';
$config['input-encoding']='utf8';
$config['output-encoding']='utf8';
$config['output-xhtml']=true;

echo tidy_repair_string($dirty, $config);
?>

Expected result:
----------------
Note well the correct unicode e-acute and e-grave in the French text.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
Béatrice Dalle témoigne au procès de son mari accusé de viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人ã¨å·®ãã¤ãå°±è·æ´»åãããã</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ãã¤ã³ã5åã®ã¯ãªã¹ãã¹ã®ããã¯12æã¾ã§ï¼</a></nobr></small>
</body>
</html>

Actual result:
--------------
Note how each e-acute and e-grave has been replaced with a single byte that is not valid UTF-8.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
B�atrice Dalle t�moigne au proc�s de son mari accus� de viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人ã¨å·®ãã¤ãå°±è·æ´»åãããã</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ãã¤ã³ã5åã®ã¯ãªã¹ãã¹ã®ããã¯12æã¾ã§ï¼</a></nobr></small>
</body>
</html>

--
Edit bug report at http://bugs.php.net/?id=35647&edit=1
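
Supplementary sketch (not part of the original report): the snippet below illustrates why the bytes named in the description are rejected by UTF-8 consumers, and how the workaround mentioned there might be checked. The helper is_valid_utf8() is a hypothetical name introduced here. html_entity_decode() is used only to stand in for the two decodings the report contrasts; it is not what tidy does internally. The tidy call is guarded because the extension may not be installed.

```php
<?php
// Hypothetical helper, not from the report: true when $bytes is well-formed UTF-8.
// preg_match() with the /u modifier fails (returns false) on malformed UTF-8.
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

// What the report observes: entities turned into single latin1 bytes.
$latin1 = html_entity_decode('&eacute;&nbsp;', ENT_QUOTES, 'ISO-8859-1');

// What the report expects: the same entities as multi-byte UTF-8 sequences.
$utf8 = html_entity_decode('&eacute;&nbsp;', ENT_QUOTES, 'UTF-8');

// Bytes e9 a0: not valid UTF-8.
echo bin2hex($latin1), ' valid UTF-8: ', var_export(is_valid_utf8($latin1), true), "\n";
// Bytes c3 a9 c2 a0: valid UTF-8.
echo bin2hex($utf8), ' valid UTF-8: ', var_export(is_valid_utf8($utf8), true), "\n";

// Workaround stated in the report: pass the encoding as the third parameter.
// Skipped when the tidy extension is not loaded.
if (function_exists('tidy_repair_string')) {
    $config = array('output-xhtml' => true);
    $clean = tidy_repair_string('<p>t&eacute;moigne</p>', $config, 'utf8');
    if ($clean !== false) {
        echo 'tidy output valid UTF-8: ', var_export(is_valid_utf8($clean), true), "\n";
    }
}
```

Running the same check on the output of the reproduce code above would be one way to confirm whether a given PHP build is affected.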