From:             xlex0x835 at rambler dot ru
Operating system: Mac OS X 10.3, FreeBSD 5.3
PHP version:      5.0.3
PHP Bug Type:     DOM XML related
Bug description:  DOMDocument->loadHTML() seems to break (utf-8 russian) codepage

Description:
------------
If I call the DOMDocument->loadHTML() method on UTF-8 
HTML that contains Russian characters, those characters 
come out garbled (please see 'Actual result').

Nothing changes if I specify the encoding "by hand" (I 
mean the following call: "$domDoc = new DOMDocument('1.0', 
'utf-8');").
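
As far as I understand, the encoding passed to the constructor describes 
the empty document that is created there, and loading markup replaces that 
document, so it would not be expected to influence the parser. A minimal 
sketch of one possible workaround follows, assuming the mbstring extension 
is available: every non-ASCII character is turned into a numeric character 
reference before the string reaches loadHTML(), so the parser never has to 
guess the encoding. The helper name loadHtmlUtf8() and the sample markup 
are hypothetical and are not part of the DOM extension or of this report.

```php
<?php
// Hypothetical helper (not part of the DOM extension): parse a UTF-8 HTML
// string after replacing every non-ASCII character with a numeric
// character reference, so the input handed to the parser is pure ASCII.
function loadHtmlUtf8($html)
{
    // Convert code points U+0080..U+10FFFF into &#NNNN; references.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    $ascii = mb_encode_numericentity($html, $convmap, 'UTF-8');

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Assumed sample markup, modeled on the title in this report.
$doc = loadHtmlUtf8('<html><head><title>Тест - Test</title></head></html>');
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
```

The DOM extension hands strings back as UTF-8, so $title can be compared 
directly against a UTF-8 literal.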

But everything works just fine if I use the 
DOMDocument->loadXML() method (that is why there is an XML 
declaration in the input).

Nothing changes if I remove all of the $domDoc options, 
or if I remove the "<?xml ... ?>" line (it is there only 
so that one source can be used for both the loadHTML() and 
loadXML() calls when testing the error).

The problem was discovered on "real-world" HTML; the 
code was stripped down to the minimum for ease of use.
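
The comparison described above can be sketched in a self-contained form, 
without the external input file. The helper titleVia() and the inline 
markup are assumptions made for illustration; the markup is only modeled 
on input_test. The script prints a hex dump of the <title> text as seen by 
each loader, so that a byte-level difference is visible even on a terminal 
that mis-renders the characters. What loadHTML() returns may well depend 
on the libxml2 version in use, so no particular output is claimed for it.

```php
<?php
// Hypothetical helper: parse $markup with the named DOMDocument method
// ('loadHTML' or 'loadXML') and return the <title> text, or null if no
// <title> element was found.
function titleVia($method, $markup)
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->$method($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    return $node === null ? null : $node->textContent;
}

// Assumed sample input, modeled on input_test.
$markup = '<?xml version="1.0" encoding="utf-8"?>' . "\n"
    . '<html><head><title>Тест - Test</title>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    . '</head></html>';

$titles = array(
    'loadXML'  => titleVia('loadXML', $markup),
    'loadHTML' => titleVia('loadHTML', $markup),
);

// Print one hex dump per loader for a byte-level comparison.
foreach ($titles as $method => $title) {
    echo $method, ': ', bin2hex((string) $title), "\n";
}
```

If the two hex dumps differ, the difference comes from the loader alone, 
since both calls receive the same string.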


Host info.
===================================

[PHP Modules (on FreeBSD 5.3 host)]
bcmath
bz2
calendar
ctype
curl
dom
exif
ftp
gd
gettext
gmp
iconv
imap
libxml
mbstring
mcrypt
mcve
mhash
mysql
ncurses
odbc
openssl
pcntl
pcre
pgsql
posix
pspell
readline
session
shmop
SimpleXML
snmp
soap
sockets
SPL
SQLite
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
wddx
xml
xmlrpc
xsl
yaz
yp
zip
zlib

No Zend modules.


FreeBSD 5.3-RELEASE
libxml2-2.6.13
gcc (GCC) 3.4.2 [FreeBSD] 20040728

Reproduce code:
---------------
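<?php

// The same source file is used for both the loadHTML() and loadXML() tests.
$xmlContent = file_get_contents('input_test');

$domDoc = new DOMDocument();
$domDoc->formatOutput = true;
$domDoc->preserveWhiteSpace = false;
$domDoc->recover = true;
// loadHTML() garbles the Russian characters; replacing this call with
// loadXML($xmlContent) gives the expected result.
$domDoc->loadHTML($xmlContent);

file_put_contents('output_test', $domDoc->saveXML());
?>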



input_test:
===========
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title> - Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
</html>

Expected result:
----------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title> - Test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
</html>

Actual result:
--------------
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>Тест - Test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
</html>
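
The description calls the Russian characters "messed". One frequent way 
for UTF-8 text to end up like that is for its bytes to be read as a 
single-byte encoding such as ISO-8859-1 and then written out as UTF-8 
again. Whether this is what happens inside loadHTML() here is an 
assumption; the report does not confirm it. The sketch below only 
demonstrates the effect on an assumed sample string, using mbstring.

```php
<?php
// Assumed sample text (Russian for "Test").
$original = 'Тест';

// Treat every UTF-8 byte as one ISO-8859-1 character and re-encode the
// result as UTF-8. Each original byte becomes a two-byte sequence.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reversing the step restores the original bytes: the data was only
// mislabeled, not lost.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

// Hex dumps avoid printing the control characters contained in $garbled.
echo 'original: ', bin2hex($original), "\n";
echo 'garbled:  ', bin2hex($garbled), "\n";
echo 'restored: ', bin2hex($restored), "\n";
```

A byte count that doubles for the Cyrillic part of the output, while the 
ASCII part stays unchanged, would be consistent with this kind of 
double encoding.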

-- 
Edit bug report at http://bugs.php.net/?id=32547&edit=1