 ID:               35647
 User updated by:  bugs at nikmakepeace dot com
 Reported By:      bugs at nikmakepeace dot com
-Status:           No Feedback
+Status:           Open
 Bug Type:         XML related
 Operating System: FC3
 PHP Version:      5.1.1
 New Comment:

The source is available at
http://www.nikmakepeace.com/testcases/tidy-utf8.phps

Be sure to force your browser's character encoding to utf-8 before
copying it.

Note also that changing the last line to

echo tidy_repair_string($dirty, $config, 'utf8');

produces the desired results, but should not be necessary.


Previous Comments:
------------------------------------------------------------------------

[2005-12-20 01:00:05] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

------------------------------------------------------------------------

[2005-12-12 22:05:50] [EMAIL PROTECTED]

Put the data somewhere on the Net and paste the link here, please.

------------------------------------------------------------------------

[2005-12-12 18:44:35] bugs at nikmakepeace dot com

Description:
------------
If you specify utf8 encoding using the config options 'char-encoding',
'input-encoding' and 'output-encoding', tidy converts HTML entities
into their latin1, single-byte equivalents rather than the correct
multi-byte utf-8 encodings (or just leaving them as entities).

The result is that &nbsp; is converted into 0xA0, &eacute; is converted
into 0xE9, and so on. This is not valid UTF-8, so well-behaved XML
parsers, including PHP's DOM, fail.

Specifying 'utf8' as the third parameter works correctly.

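For illustration, the byte-level difference described above can be shown without tidy at all. The following is only a sketch using core PHP functions; entity_bytes() and is_valid_utf8() are hypothetical helper names made up for this example, and the preg_match() '//u' check is just one common way to test for valid UTF-8, not something tidy itself does.

```php
<?php
// Illustration only: uses core PHP string functions, not tidy.
// entity_bytes() and is_valid_utf8() are hypothetical helpers for this sketch.

// Decode an entity into the given charset and return the raw bytes as hex.
function entity_bytes($entity, $charset)
{
    return bin2hex(html_entity_decode($entity, ENT_QUOTES, $charset));
}

// preg_match() with the /u modifier returns false on invalid UTF-8 input.
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

foreach (array('&nbsp;', '&eacute;') as $entity) {
    $latin1 = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');
    $utf8   = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
    printf(
        "%s: latin1=0x%s valid=%s, utf8=0x%s valid=%s\n",
        $entity,
        entity_bytes($entity, 'ISO-8859-1'),
        is_valid_utf8($latin1) ? 'yes' : 'no',
        entity_bytes($entity, 'UTF-8'),
        is_valid_utf8($utf8) ? 'yes' : 'no'
    );
}
```

The latin1 column shows the single bytes (a0, e9) that appear in tidy's output in this report, and the utf8 column shows the two-byte sequences (c2a0, c3a9) that an XML parser would accept.
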
Reproduce code:
---------------
<?php
$dirty='<a href="http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">B&eacute;atrice Dalle t&eacute;moigne au proc&egrave;s de son mari accus&eacute; de viol</a><br/>
<small><nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">人ã¨å·®ãã¤ãå°±è·æ´»åãããã</a></nobr>
 - <nobr><a href="http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">ãã¤ã³ã5åã®ã¯ãªã¹ãã¹ã®ããã¯12æã¾ã§ï¼</a></nobr></small>';

$config['char-encoding']='utf8';
$config['input-encoding']='utf8';
$config['output-encoding']='utf8';
$config['output-xhtml']=true;

echo tidy_repair_string($dirty, $config);
?>

Expected result:
----------------
Note the correct Unicode e-acute and e-grave in the French text.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
Béatrice Dalle témoigne au procès de son mari accusé de viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人ã¨å·®ãã¤ãå°±è·æ´»åãããã</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ãã¤ã³ã5åã®ã¯ãªã¹ãã¹ã®ããã¯12æã¾ã§ï¼</a></nobr></small>
</body>
</html>

Actual result:
--------------
Note how the e-acute and e-grave have each been replaced with a
single byte that is not valid UTF-8.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
B�atrice Dalle t�moigne au proc�s de son mari accus� de viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人ã¨å·®ãã¤ãå°±è·æ´»åãããã</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ãã¤ã³ã5åã®ã¯ãªã¹ãã¹ã®ããã¯12æã¾ã§ï¼</a></nobr></small>
</body>
</html>

------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=35647&edit=1