ID:               34776
 Comment by:       jdephix at polenord dot com
 Reported By:      narzeczony at zabuchy dot net
 Status:           No Feedback
 Bug Type:         mbstring related
 Operating System: Linux, Windows
 PHP Version:      5.0.5
 New Comment:

UTF-16LE and UTF-16BE seem mixed up when using mb_convert_encoding.

I want to read the content of a file in UTF-16BE (starts with \xFE\xFF)
and convert it into UTF-8:

$s = file_get_contents($fileUTF16BE);
$s = mb_convert_encoding($s, 'UTF-8', 'UTF-16BE');
// some operations on $s
file_put_contents($anotherUTF16BEfile,
    mb_convert_encoding($s, 'UTF-16BE', 'UTF-8'));

The second file is in Little Endian (starts with \xFF\xFE)!!!

I have to specify LE if I want BE.
file_put_contents($anotherUTF16BEfile,
    mb_convert_encoding($s, 'UTF-16LE', 'UTF-8'));

How come it's reversed?
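
One way to confirm which byte order a file really has is to look at its
first two bytes directly rather than relying on the encoding name. The
following is only a minimal sketch; the helper name utf16_bom_order is
made up for illustration and is not part of mbstring.

```php
<?php
// Hypothetical helper: report the byte order implied by a leading
// UTF-16 BOM, or null when the string does not start with one.
function utf16_bom_order($bytes)
{
    $head = substr($bytes, 0, 2);
    if ($head === "\xFE\xFF") {
        return 'BE';   // big endian: U+FEFF stored as FE FF
    }
    if ($head === "\xFF\xFE") {
        return 'LE';   // little endian: U+FEFF stored as FF FE
    }
    return null;       // no UTF-16 BOM at the start
}

// Example input: BOM FE FF followed by "A" in big endian order.
var_dump(utf16_bom_order("\xFE\xFF\x00\x41"));
```

Running it on the input file and on the written file should show whether
the byte order actually changed between the two.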


Previous Comments:
------------------------------------------------------------------------

[2006-06-23 16:11:32] markl at lindenlab dot com

There are two problems when mb_convert_encoding is 
converting from UTF-16:

1) It is including the (transcoded) BOM in the result, 
rather than stripping it

2) If the source UTF-16 string was little endian, then the
second character of the conversion will be wrong; it is
converted as if 0xFF00 had been OR'd into its character code.

Problem 1 occurs with any UTF-16 variant (though it is 
arguably correct behavior for UTF-16LE and UTF-16BE).  
Problem 2 only occurs when converting from UTF-16.

This PHP program demonstrates both problems:

<?php

function dump($s)
{
        for ($i = 0; $i < strlen($s); ++$i) {
                echo substr(dechex(256+ord(substr($s, $i, 1))), 1, 2), ' ';
        }
        var_dump($s);
}

$utf16le = "\xFF\xFE\x41\x00\x42\x00\x43\x00";
$utf16be = "\xFE\xFF\x00\x41\x00\x42\x00\x43";
        // these strings are both valid UTF-16, the BOM at the start
        // indicates the endianness.  We don't expect the BOM to be
        // included in a conversion

echo "The UTF-16LE and UTF-16BE sequences:\n";
dump($utf16le);
dump($utf16be);
echo "\n";

$encodings = array("ascii", "iso-8859-1", "utf-8", "utf-16", 
"utf-16le", "utf-16be");

foreach ($encodings as $enc) {
        echo "Converting to $enc:\n";
        dump(mb_convert_encoding($utf16le, $enc, "utf-16"));
        dump(mb_convert_encoding($utf16be, $enc, "utf-16"));
        echo "\n";
}
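
A possible workaround for both problems described above is to read the
BOM by hand, strip it, and then convert from the explicit "UTF-16LE" or
"UTF-16BE" encoding instead of the generic "utf-16". This is only a
sketch: the function name utf16_to_utf8 and its $default parameter are
assumptions made for illustration, and falling back to big endian when
there is no BOM is a choice, not something mbstring prescribes.

```php
<?php
// Hypothetical workaround: decode UTF-16 to UTF-8 without passing the
// BOM through mb_convert_encoding.
function utf16_to_utf8($bytes, $default = 'UTF-16BE')
{
    $head = substr($bytes, 0, 2);
    if ($head === "\xFE\xFF") {
        $enc = 'UTF-16BE';
        $bytes = (string) substr($bytes, 2);   // drop the BOM
    } elseif ($head === "\xFF\xFE") {
        $enc = 'UTF-16LE';
        $bytes = (string) substr($bytes, 2);   // drop the BOM
    } else {
        $enc = $default;   // assumption: no BOM means $default
    }
    return mb_convert_encoding($bytes, 'UTF-8', $enc);
}

// The two test strings from the program above.
var_dump(utf16_to_utf8("\xFF\xFE\x41\x00\x42\x00\x43\x00") === "ABC");
var_dump(utf16_to_utf8("\xFE\xFF\x00\x41\x00\x42\x00\x43") === "ABC");
```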

------------------------------------------------------------------------

[2005-10-15 01:00:03] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

------------------------------------------------------------------------

[2005-10-07 21:58:46] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php5-win32-latest.zip



------------------------------------------------------------------------

[2005-10-07 16:36:23] narzeczony at zabuchy dot net

The same example with iconv instead of mb_convert_encoding works
perfectly, but I guess that doesn't close the bug in
mb_convert_encoding :).

Another problem exists when converting to 'UTF-16' (using
mb_convert_encoding): no BOM is added. Again, iconv works well
in this case.
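
If a BOM is wanted in the output, one option is to prepend it by hand to
an explicit UTF-16BE or UTF-16LE conversion. A minimal sketch; the helper
name utf8_to_utf16_with_bom and its $order parameter are hypothetical:

```php
<?php
// Hypothetical helper: encode UTF-8 text as UTF-16 and prepend the BOM
// that matches the chosen byte order.
function utf8_to_utf16_with_bom($utf8, $order = 'BE')
{
    if ($order === 'LE') {
        return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
    }
    return "\xFE\xFF" . mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
}

// Show the resulting bytes as hex for both byte orders.
echo bin2hex(utf8_to_utf16_with_bom("AB", 'BE')), "\n";
echo bin2hex(utf8_to_utf16_with_bom("AB", 'LE')), "\n";
```

Note that if the UTF-8 input already begins with U+FEFF, the output
would carry two BOMs, so such input should be stripped first.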

------------------------------------------------------------------------

[2005-10-07 12:43:32] [EMAIL PROTECTED]

ah, mbstring has a weird parameter order (dest, src) instead of (src,
dest)... did you try to use iconv perhaps?
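
For comparison, the two calls side by side. This is just an illustrative
sketch with a made-up input string; it assumes both the mbstring and
iconv extensions are loaded.

```php
<?php
$utf16be = "\x00\x41\x00\x42";   // "AB" in UTF-16BE, no BOM

// mbstring: string first, then destination encoding, then source.
$viaMb = mb_convert_encoding($utf16be, 'UTF-8', 'UTF-16BE');

// iconv: source encoding first, then destination, then the string.
$viaIconv = iconv('UTF-16BE', 'UTF-8', $utf16be);

// Both calls should yield the same UTF-8 string.
var_dump($viaMb === $viaIconv);
```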

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/34776

-- 
Edit this bug report at http://bugs.php.net/?id=34776&edit=1
