From:             bugs at nikmakepeace dot com
Operating system: FC3
PHP version:      5.1.1
PHP Bug Type:     Unknown/Other Function
Bug description:  tidy does not produce valid utf8 when the encoding is 
specified in the config

Description:
------------
If you specify utf8 encoding through the tidy config options 'char-encoding',
'input-encoding' and 'output-encoding', tidy converts HTML entities
into their latin1, single-byte equivalents rather than the correct,
multi-byte utf-8 encodings (or just leaving them as entities).

The result is that &nbsp; is converted into 0xA0, &eacute; is converted
into 0xE9 and so on. These lone bytes are not valid UTF-8, so well-behaved
XML parsers, including PHP's DOM, fail.
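
To make the byte-level problem concrete, here is a minimal sketch that checks
both encodings of e-acute by hand. The helper name is_valid_utf8() is
hypothetical and not part of tidy or of this report; it relies on preg_match()
returning false when the /u modifier is given a subject that is not valid
UTF-8.

```php
<?php
// Hypothetical helper (an assumption for illustration only):
// with the /u modifier, preg_match() returns false if the subject
// string is not valid UTF-8, so an empty pattern acts as a validator.
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

$latin1_eacute = "\xE9";      // single latin1 byte, as emitted in this bug
$utf8_eacute   = "\xC3\xA9";  // two-byte UTF-8 encoding of U+00E9

var_dump(is_valid_utf8($latin1_eacute)); // bool(false)
var_dump(is_valid_utf8($utf8_eacute));   // bool(true)
```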

Specifying 'utf8' as the third parameter of tidy_repair_string() (the
encoding argument) works correctly.
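
A sketch of that workaround, assuming the tidy extension is loaded. The short
sample input and the variable names are illustrative only and are not taken
from the reproduce script.

```php
<?php
// Assumes ext/tidy is available; does nothing otherwise.
if (function_exists('tidy_repair_string')) {
    $sample = '<p>B&eacute;atrice</p>';      // illustrative input
    $config = array('output-xhtml' => true);

    // Pass the encoding as the third argument instead of setting
    // 'char-encoding' / 'input-encoding' / 'output-encoding' in $config.
    $clean = tidy_repair_string($sample, $config, 'utf8');

    // With this call the e-acute should come out as the two bytes
    // 0xC3 0xA9 rather than the single latin1 byte 0xE9.
    var_dump(strpos($clean, "B\xC3\xA9atrice") !== false);
}
```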

Reproduce code:
---------------
<?php
$dirty='<a
href="http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">B&eacute;atrice
Dalle t&eacute;moigne au proc&egrave;s de son mari accus&eacute; de
viol</a><br/>
<small><nobr><a
href="http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">人と差がつく就職活動をしよう</a></nobr>
- <nobr><a
href="http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>';

$config['char-encoding']='utf8';
$config['input-encoding']='utf8';
$config['output-encoding']='utf8';
$config['output-xhtml']=true;

echo tidy_repair_string($dirty, $config);
?>


Expected result:
----------------
Note the correct UTF-8 e-acute and e-grave in the French text.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
Béatrice Dalle témoigne au procès de son mari accusé de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


Actual result:
--------------
Note how the e-acute and e-grave have been replaced with single latin1
bytes that are not valid UTF-8.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
B�atrice Dalle t�moigne au proc�s de son mari accus� de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> -
<nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


-- 
Edit bug report at http://bugs.php.net/?id=35647&edit=1
