ID:               41980
User updated by:  borys dot forytarz at gmail dot com
Reported By:      borys dot forytarz at gmail dot com
-Status:          Feedback
+Status:          Open
Bug Type:         DOM XML related
Operating System: Linux
PHP Version:      5.2.0
New Comment:

I have checked the files' encodings. mb_detect_encoding() reports that
they are ASCII-encoded (!?). So I wrote a simple script to convert them
to UTF-8:

<?php
$cont = file_get_contents('login.php.tpl');
$f = fopen('login.php.tpl','w');
echo "\n".mb_detect_encoding('login.php.tpl').' > ';
fwrite($f,mb_convert_encoding($cont,'utf-8'));
echo mb_detect_encoding('login.php.tpl')."\n";
fclose($f);
?>

and the output is:

ASCII > ASCII

(I expected ASCII > UTF-8.)

Using iconv instead of mb_convert_encoding gives the same result.

What's going on?


Previous Comments:
------------------------------------------------------------------------

[2007-07-12 20:38:33] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php5.2-latest.tar.gz

For Windows (zip):

  http://snaps.php.net/win32/php5.2-win32-latest.zip

For Windows (installer):

  http://snaps.php.net/win32/php5.2-win32-installer-latest.msi

------------------------------------------------------------------------

[2007-07-12 19:58:58] borys dot forytarz at gmail dot com

there should be:

...
foreach($content->childNodes as $child) {
...

sorry

------------------------------------------------------------------------

[2007-07-12 19:55:58] borys dot forytarz at gmail dot com

Here is an example:

First, the source files (both encoded in UTF-8).

First file (main.tpl):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Some title</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
Some Polish letters: ę ó ą ś ć ż ź ń - they are encoded correctly and display correctly.
</body>
</html>

Second file (contents.tpl):

<content>
<h1>some Polish letters, like: ę ó ł ą ś ć ź ń ż - they are not encoded correctly and do not display correctly.</h1>
</content>

PHP file:

<?php
$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHtmlFile('main.tpl');

$dom2 = new DOMDocument('1.0','UTF-8');
$dom2->loadHTMLFile('contents.tpl');

$contents = $dom2->getElementsByTagName('content');
$body = $dom->getElementsByTagName('body')->item(0);

foreach($contents as $content) {
    foreach($content as $child) {
        $imp = $dom->importNode($child,true);
        $body->appendChild($imp);
    }
}

$dom->saveXML();
?>

It is something like the above. I wrote it from memory because the real
script is huge, but it demonstrates the idea and what is going wrong.

------------------------------------------------------------------------

[2007-07-12 19:24:45] borys dot forytarz at gmail dot com

Description:
------------
There is a problem with DOM and encoding. I have two separate files. The
first is a full XHTML document (DTD, head, meta, body and more content)
saved in UTF-8. Its meta declaration is UTF-8, and the server sends the
page as UTF-8 too. The second is a simple file without any DTD, head,
meta or body, also saved in UTF-8.

The problem is that when I import nodes from the second file using
importNode(), the output contains incorrectly encoded characters (those
that came from the second file). This is strange because, as far as I
have read, DOM works in UTF-8, so there should be no such problem. What
is more, I inspected properties such as actualEncoding while debugging,
and they showed UTF-8...

If it is not a bug (but I think it is), how can I fix it? I can't
declare the DTD, head and body elements in the second file.

Reproduce code:
---------------
$this->dom = new DOMDocument('1.0','UTF-8');
$this->dom->encoding = 'UTF-8';
$this->dom->formatOutput = self::$formatOutput;
$this->dom->preserveWhiteSpace = self::$preserveWhiteSpace;

@$this->dom->loadHtmlFile($html);

...

echo $this->dom->saveXML();

The above works well for the complete XHTML file. But when I load an
incomplete file (encoded in UTF-8), the characters are not properly
encoded after I import nodes from the second document into the first
one. I tried converting the whole output with iconv() and
mb_convert_encoding(), but it makes no difference at all.

Expected result:
----------------
Properly encoded characters from both the complete XHTML file and the
second "poor" file. The second file looks like this:

<content id="something">
<h1>some string</h1>
</content>

Actual result:
--------------
Incorrectly encoded characters inside the <content> tag.


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=41980&edit=1
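
A note on the conversion script in the newest comment: it passes the file name 'login.php.tpl' to mb_detect_encoding(), not the file's contents, and a plain ASCII file name is reported as ASCII both before and after the conversion. A file containing only ASCII bytes is also already valid UTF-8, so converting it would not change it. The sketch below checks the bytes instead and shows one possible way to load a UTF-8 fragment that has no charset declaration. Both helper names are hypothetical. The idea that the HTML loader guesses a non-UTF-8 encoding for such fragments is an assumption and is not confirmed anywhere in this report.

```php
<?php
// Sketch only: both helper names below are hypothetical, not from the report.

// Detect the encoding of a byte string. For a file, pass
// file_get_contents($path), not the path itself.
function detect_bytes_encoding(string $bytes): string
{
    // Strict mode with an explicit candidate list.
    $enc = mb_detect_encoding($bytes, ['ASCII', 'UTF-8'], true);
    return $enc === false ? 'unknown' : $enc;
}

// Assumption: $html is UTF-8. Every non-ASCII character is rewritten as a
// numeric entity, so the HTML parser never has to guess the input encoding.
function load_utf8_fragment(string $html): DOMDocument
{
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true); // <content> is not an HTML tag
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$name = detect_bytes_encoding('login.php.tpl');      // the file name, as in the script
$text = detect_bytes_encoding('ę ó ł ą ś ć ź ń ż');  // bytes like those in contents.tpl

$doc = load_utf8_fragment('<content><h1>ę ó ł</h1></content>');
$h1  = $doc->getElementsByTagName('h1')->item(0);

echo $name, ' / ', $text, ' / ', $h1->textContent, "\n";
```

Nodes loaded this way can then be passed to importNode() as in the example script. Whether this removes the problem described in the report has not been verified here.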