 ID:               34776
 Comment by:       markl at lindenlab dot com
 Reported By:      narzeczony at zabuchy dot net
 Status:           No Feedback
 Bug Type:         mbstring related
 Operating System: Linux, Windows
 PHP Version:      5.0.5
 New Comment:

There are two problems when mb_convert_encoding is 
converting from UTF-16:

1) It is including the (transcoded) BOM in the result, 
rather than stripping it

2) If the source UTF-16 string is little endian, the second 
character of the conversion is wrong: it is converted as if 
0xFF00 had been ORed into its character code.

Problem 1 occurs with any UTF-16 variant (though it is 
arguably correct behavior for UTF-16LE and UTF-16BE).  
Problem 2 only occurs when converting from UTF-16.

This PHP program demonstrates both problems:



<?php

function dump($s)
{
        for ($i = 0; $i < strlen($s); ++$i) {
                // print each byte as two lowercase hex digits
                echo substr(dechex(256 + ord(substr($s, $i, 1))), 1, 2), ' ';
        }
        var_dump($s);
}

$utf16le = "\xFF\xFE\x41\x00\x42\x00\x43\x00";
$utf16be = "\xFE\xFF\x00\x41\x00\x42\x00\x43";
        // These strings are both valid UTF-16; the BOM at the start
        // indicates the endianness. We don't expect the BOM to be
        // included in a conversion.

echo "The UTF-16LE and UTF-16BE sequences:\n";
dump($utf16le);
dump($utf16be);
echo "\n";

$encodings = array("ascii", "iso-8859-1", "utf-8", "utf-16", 
"utf-16le", "utf-16be");

foreach ($encodings as $enc) {
        echo "Converting to $enc:\n";
        dump(mb_convert_encoding($utf16le, $enc, "utf-16"));
        dump(mb_convert_encoding($utf16be, $enc, "utf-16"));
        echo "\n";
}
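
Until this is fixed, both problems described above can probably be
sidestepped by inspecting the BOM by hand, stripping it, and passing an
explicit UTF-16LE or UTF-16BE source encoding. A minimal sketch follows.
The helper name split_utf16_bom is made up for illustration (it is not
part of mbstring), and treating BOM-less input as big endian is an
assumption that follows the RFC 2781 recommendation:

```php
<?php
// Hypothetical helper (not part of mbstring): look at the first two bytes,
// work out the byte order, and return the explicit encoding name together
// with the string minus its BOM.
function split_utf16_bom($s)
{
    $bom = substr($s, 0, 2);
    if ($bom === "\xFF\xFE") {
        return array('UTF-16LE', (string) substr($s, 2));
    }
    if ($bom === "\xFE\xFF") {
        return array('UTF-16BE', (string) substr($s, 2));
    }
    // Assumption: no BOM means big endian, as RFC 2781 recommends.
    return array('UTF-16BE', $s);
}

list($from, $body) = split_utf16_bom("\xFF\xFE\x41\x00\x42\x00\x43\x00");

// Guarded so the sketch still loads where mbstring is not compiled in.
if (function_exists('mb_convert_encoding')) {
    // Convert with the explicit byte order and print the result as hex.
    echo bin2hex(mb_convert_encoding($body, 'UTF-8', $from)), "\n";
}
```

Because the source encoding is never the bare "UTF-16", the conversion
does not depend on how mb_convert_encoding handles the BOM itself.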


Previous Comments:
------------------------------------------------------------------------

[2005-10-15 01:00:03] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

------------------------------------------------------------------------

[2005-10-07 21:58:46] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php5-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php5-win32-latest.zip



------------------------------------------------------------------------

[2005-10-07 16:36:23] narzeczony at zabuchy dot net

The same example with iconv instead of mb_convert_encoding works
perfectly, but I guess that doesn't close the bug in
mb_convert_encoding :).

Another problem exists when converting to 'UTF-16' (using
mb_convert_encoding): the BOM is not added. Again, iconv works well
in this case.
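
If a BOM is wanted in the output, one possible workaround is to convert
to an explicit byte order and prepend the matching BOM by hand. A rough
sketch; add_utf16_bom is a hypothetical helper, not an mbstring
function, and the "ABC" input is just sample data:

```php
<?php
// Hypothetical helper: prepend the BOM that matches an explicit byte order.
// Returns false when the encoding name is not UTF-16LE or UTF-16BE.
function add_utf16_bom($utf16, $encoding)
{
    $boms = array(
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    );
    $key = strtoupper($encoding);
    if (!isset($boms[$key])) {
        return false; // unknown byte order, nothing sensible to prepend
    }
    return $boms[$key] . $utf16;
}

// Guarded so the sketch still loads where mbstring is not compiled in.
if (function_exists('mb_convert_encoding')) {
    $utf16 = mb_convert_encoding("ABC", 'UTF-16BE', 'ISO-8859-1');
    echo bin2hex(add_utf16_bom($utf16, 'UTF-16BE')), "\n";
}
```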

------------------------------------------------------------------------

[2005-10-07 12:43:32] [EMAIL PROTECTED]

ah, mbstring has a weird parameter order (dest, src) instead of (src,
dest)... did you try to use iconv perhaps?
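
For reference, the two argument orders side by side. This is only an
illustration with a made-up Latin-1 input string; both calls are guarded
because either extension may be missing from a given build:

```php
<?php
$latin1 = "caf\xE9"; // "cafe" with e-acute in ISO-8859-1 (sample input, not from the report)

if (function_exists('mb_convert_encoding')) {
    // mbstring: the string first, then the destination, then the source.
    echo bin2hex(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1')), "\n";
}

if (function_exists('iconv')) {
    // iconv: the source first, then the destination, then the string.
    echo bin2hex(iconv('ISO-8859-1', 'UTF-8', $latin1)), "\n";
}
```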

------------------------------------------------------------------------

[2005-10-07 12:33:45] narzeczony at zabuchy dot net

I'm not specifying which endianness mb_convert_encoding should use to
convert to ISO. Look:
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');

I'm converting from UTF-16 (LE or BE) to ISO-8859-1. It looks like
mb_convert_encoding checks the BOM and chooses the right encoding
(if you remove the BOM, the string won't be converted properly for one
endianness). The only problem is that the BOM is not ignored.

The first two lines, with the endianness specified:
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');
are just for convenient UTF-16 string creation; please ignore them.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/34776

-- 
Edit this bug report at http://bugs.php.net/?id=34776&edit=1
