ID: 34776
Comment by: jdephix at polenord dot com
Reported By: narzeczony at zabuchy dot net
Status: No Feedback
Bug Type: mbstring related
Operating System: Linux, Windows
PHP Version: 5.0.5
New Comment:

UTF-16LE and UTF-16BE seem mixed up when using mb_convert_encoding.

I want to read the content of a file in UTF-16BE (starts with \xFE\xFF) and convert it into UTF-8:

$s = file_get_contents($fileUTF16BE);
$s = mb_convert_encoding($s, 'UTF-8', "UTF-16BE");
//some operations on $s
file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16BE', "UTF-8"));

The second file is in Little Endian (starts with \xFF\xFE)!!!
I have to specify LE if I want BE:

file_put_contents($anotherUTF16BEfile, mb_convert_encoding($s, 'UTF-16LE', "UTF-8"));

How come it's reversed?


Previous Comments:
------------------------------------------------------------------------

[2006-06-23 16:11:32] markl at lindenlab dot com

There are two problems when mb_convert_encoding is converting from UTF-16:

1) It is including the (transcoded) BOM in the result, rather than stripping it.
2) If the source UTF-16 string was little endian, then the second character of the conversion will be wrong; it is converted as if the character code had 0xFF00 or'd into it.

Problem 1 occurs with any UTF-16 variant (though it is arguably correct behavior for UTF-16LE and UTF-16BE). Problem 2 only occurs when converting from UTF-16.

This PHP program demonstrates this all clearly:

function dump($s) {
    for ($i = 0; $i < strlen($s); ++$i) {
        echo substr(dechex(256+ord(substr($s, $i, 1))), 1, 2), ' ';
    }
    var_dump($s);
}

$utf16le = "\xFF\xFE\x41\x00\x42\x00\x43\x00";
$utf16be = "\xFE\xFF\x00\x41\x00\x42\x00\x43";
// these strings are both valid UTF-16, the BOM at the start indicates
// the endianness. We don't expect the BOM to be included in a conversion

echo "The UTF-16LE and UTF-16BE sequences:\n";
dump($utf16le);
dump($utf16be);
echo "\n";

$encodings = array("ascii", "iso-8859-1", "utf-8", "utf-16", "utf-16le", "utf-16be");

foreach ($encodings as $enc) {
    echo "Converting to $enc:\n";
    dump(mb_convert_encoding($utf16le, $enc, "utf-16"));
    dump(mb_convert_encoding($utf16be, $enc, "utf-16"));
    echo "\n";
}

------------------------------------------------------------------------

[2005-10-15 01:00:03] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".

------------------------------------------------------------------------

[2005-10-07 21:58:46] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz

For Windows:

  http://snaps.php.net/win32/php5-win32-latest.zip

------------------------------------------------------------------------

[2005-10-07 16:36:23] narzeczony at zabuchy dot net

The same example with iconv instead of mb_convert_encoding works perfectly, but I guess that doesn't close the bug in mb_convert_encoding :).

Another problem exists when converting to 'UTF-16' with mb_convert_encoding: the BOM is not added. Again, iconv works well in this case.

------------------------------------------------------------------------

[2005-10-07 12:43:32] [EMAIL PROTECTED]

ah, mbstring has a weird parameter order (dest, src) instead of (src, dest)... did you try to use iconv perhaps?

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at
    http://bugs.php.net/34776

-- 
Edit this bug report at http://bugs.php.net/?id=34776&edit=1
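
As an illustration of the BOM handling discussed in the comments above, here is a minimal sketch of one possible workaround: inspect the BOM by hand, strip it, and always pass an explicit byte order to mb_convert_encoding instead of the generic "UTF-16". The helper names utf16_to_utf8() and utf8_to_utf16be_with_bom() are hypothetical (they are not part of mbstring), the big-endian fallback for input without a BOM is an assumption, and the sketch has not been checked against the affected PHP 5.0.5 build.

```php
<?php
// Hypothetical helper, not an mbstring function: convert a UTF-16 string
// to UTF-8, using a leading BOM (if present) to pick the byte order and
// removing it before the conversion.
function utf16_to_utf8($s, $default = 'UTF-16BE')
{
    $bom = substr($s, 0, 2);
    if ($bom === "\xFE\xFF") {
        $enc = 'UTF-16BE';
        $s = (string) substr($s, 2);   // drop the big-endian BOM
    } elseif ($bom === "\xFF\xFE") {
        $enc = 'UTF-16LE';
        $s = (string) substr($s, 2);   // drop the little-endian BOM
    } else {
        $enc = $default;               // assumption: no BOM means $default
    }
    // Note the mbstring argument order: (string, to, from).
    return mb_convert_encoding($s, 'UTF-8', $enc);
}

// Hypothetical helper: convert UTF-8 to UTF-16BE and prepend the BOM
// explicitly rather than relying on the converter to emit one.
function utf8_to_utf16be_with_bom($s)
{
    return "\xFE\xFF" . mb_convert_encoding($s, 'UTF-16BE', 'UTF-8');
}
```

With these helpers, the read/modify/write cycle from the newest comment would become $s = utf16_to_utf8(file_get_contents($fileUTF16BE)); followed by file_put_contents($anotherUTF16BEfile, utf8_to_utf16be_with_bom($s)); after the operations on $s.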