ID:               35647
 Updated by:       [EMAIL PROTECTED]
 Reported By:      bugs at nikmakepeace dot com
 Status:           Open
 Bug Type:         XML related
 Operating System: FC3
 PHP Version:      5.1.1
 New Comment:

Yes, this is a known problem.
But from what I can see from the code, this seems to be a tidylib
problem, rather than PHP's.


Previous Comments:
------------------------------------------------------------------------

[2006-01-27 10:35:35] bugs at nikmakepeace dot com

The source is available at
http://www.nikmakepeace.com/testcases/tidy-utf8.phps

Be sure to force your browser's character encoding to utf-8 before
copying it.

Note also that changing the last line to
echo tidy_repair_string($dirty, $config, 'utf8');
produces the desired results, but that should not be necessary.
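
The workaround above can be sketched as follows. This is a minimal illustration, not a fix for the underlying problem: it assumes the tidy extension is loaded, and repairAsUtf8() is a hypothetical helper name that does not appear in the test case.

```php
<?php
// Hypothetical wrapper around the workaround; requires the tidy extension.
function repairAsUtf8($html)
{
    $config = array('output-xhtml' => true);

    // Pass the encoding as the third argument instead of relying on the
    // 'char-encoding' / 'input-encoding' / 'output-encoding' config keys.
    $clean = tidy_repair_string($html, $config, 'utf8');
    if ($clean === false) {
        throw new RuntimeException('tidy_repair_string() failed');
    }
    return $clean;
}

// Only run the demonstration when tidy is actually available.
if (extension_loaded('tidy')) {
    echo repairAsUtf8('<p>B&eacute;atrice Dalle t&eacute;moigne</p>');
}
```

With the encoding passed explicitly, the entities should come out as multi-byte UTF-8 sequences rather than single latin1 bytes.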

------------------------------------------------------------------------

[2005-12-12 22:05:50] [EMAIL PROTECTED]

Put the data somewhere on the Net and paste the link here, please.


------------------------------------------------------------------------

[2005-12-12 18:44:35] bugs at nikmakepeace dot com

Description:
------------
If you specify utf8 encoding using the config options 'char-encoding',
'input-encoding' and 'output-encoding' with tidy, it converts HTML
entities into their latin1, single-byte equivalents rather than the
correct, multi-byte utf-8 encodings (or just leaving them as entities).


The result is that &nbsp; is converted into 0xA0, &eacute; is converted
into 0xE9, and so on. This is not valid UTF-8, so well-behaved XML
parsers, including PHP's DOM, fail.
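
To illustrate why the lone byte is a problem, here is a small sketch that does not need tidy at all. isValidUtf8() is a hypothetical helper introduced only for this illustration; it relies on PCRE's /u modifier, which rejects malformed UTF-8 subjects.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
function isValidUtf8($bytes)
{
    // With the /u modifier PCRE validates the subject as UTF-8, and
    // preg_match() returns false (not 1) when the subject is malformed.
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "B\xE9atrice";      // e-acute as the single latin1 byte 0xE9
$utf8   = "B\xC3\xA9atrice";  // e-acute as the UTF-8 sequence 0xC3 0xA9

var_dump(isValidUtf8($latin1)); // bool(false)
var_dump(isValidUtf8($utf8));   // bool(true)
```

A strict XML parser applies the same kind of check to a document declared as UTF-8, which is why the latin1 byte causes it to reject the input.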

Specifying 'utf8' as the third parameter works correctly.

Reproduce code:
---------------
<?php
// Mixed input: French text with HTML entities, Japanese text as raw UTF-8.
$dirty='<a
href="http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">B&eacute;atrice
Dalle t&eacute;moigne au proc&egrave;s de son mari accus&eacute; de
viol</a><br/>
<small><nobr><a
href="http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">人と差がつく就職活動をしよう</a></nobr>
- <nobr><a
href="http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>';

// Request UTF-8 everywhere through the tidy config options.
$config['char-encoding']='utf8';
$config['input-encoding']='utf8';
$config['output-encoding']='utf8';
$config['output-xhtml']=true;

// No third (encoding) argument is passed; this is the call that misbehaves.
echo tidy_repair_string($dirty, $config);
?>


Expected result:
----------------
Note the correct Unicode e-acute and e-grave in the French text.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
Béatrice Dalle témoigne au procès de son mari accusé de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> - <nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>


Actual result:
--------------
Note how the e-acute and e-grave have been replaced with single bytes
that are not valid UTF-8.

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<a href=
"http://fr.yahoo.com/r/n/fd/7/*http://fr.news.yahoo.com/12122005/202/beatrice-dalle-temoigne-au-proces-de-son-mari-accuse-de.html">
B�atrice Dalle t�moigne au proc�s de son mari accus� de
viol</a><br />
<small><nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/rikunabi/051213/?http://katsuyou.rikunabi-shinsotsu.yahoo.co.jp/2007/">
人と差がつく就職活動をしよう</a></nobr> -
<nobr><a href=
"http://rd.yahoo.co.jp/toppage/topinfo/event_xmas/051125/?http://xmas.yahoo.co.jp/">
ポイント5倍のクリスマスギフトは12時まで!</a></nobr></small>
</body>
</html>



------------------------------------------------------------------------

