#34776 [Opn]: mb_convert_encoding - wrong convertion from UTF-16 (problem with BOM)

narzeczony at zabuchy dot net Fri, 07 Oct 2005 03:34:07 -0700

 ID:               34776
 User updated by:  narzeczony at zabuchy dot net
 Reported By:      narzeczony at zabuchy dot net
 Status:           Open
 Bug Type:         mbstring related
 Operating System: Linux, Windows
 PHP Version:      5.0.5
 New Comment:


I'm not specifying which endianness mb_convert_encoding should use to
convert to ISO. Look:
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');

I'm converting from UTF-16 (LE or BE) to ISO-8859-1. It looks like
mb_convert_encoding checks the BOM and chooses the right byte order
(if you remove the BOM, the string won't be converted properly for one
endianness). The only problem is that the BOM itself is not ignored: it
is converted and ends up in the output.

The first two lines, with endianness specified:
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');
are just for convenient UTF-16 string creation; please ignore them.
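
Until this is resolved, a possible workaround is to strip the BOM by hand
and pass the matching explicit endianness. This is only a minimal sketch:
utf16_to_iso_8859_1() is a hypothetical helper, not part of mbstring, and
the big-endian fallback for input without a BOM is an assumption.

```php
<?php
// Hypothetical helper (not part of mbstring): detect a UTF-16 BOM,
// strip it, and convert with the matching explicit endianness.
function utf16_to_iso_8859_1($s)
{
    $bom  = substr($s, 0, 2);
    // Cast to string because substr() can return false on short input
    // in older PHP versions.
    $rest = (string) substr($s, 2);

    if ($bom === "\xFF\xFE") {
        // Little-endian BOM
        return mb_convert_encoding($rest, 'ISO-8859-1', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        // Big-endian BOM
        return mb_convert_encoding($rest, 'ISO-8859-1', 'UTF-16BE');
    }
    // No BOM found: assume big endian (an assumption of this sketch).
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-16BE');
}

var_dump(utf16_to_iso_8859_1("\xFF\xFE" . "N\0e\0x\0o\0r\0")); // string(5) "Nexor"
var_dump(utf16_to_iso_8859_1("\xFE\xFF" . "\0N\0e\0x\0o\0r")); // string(5) "Nexor"
```

With this the BOM never reaches mb_convert_encoding, so it cannot show up
as "?" characters in the result.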


Previous Comments:
------------------------------------------------------------------------

[2005-10-07 11:57:10] [EMAIL PROTECTED]

I think this is correct as you are not supposed to supply a BOM if you
specify which endianness your UTF16 stream is in.

------------------------------------------------------------------------

[2005-10-07 11:52:16] narzeczony at zabuchy dot net

There is also a small typo in the documentation, but I don't want to open
another bug.
On http://ie.php.net/mbstring this section is repeated twice:

Name in the IANA character set registry: UTF-16BE
Underlying character set: Unicode
Description: See above.
Additional note: In contrast to UTF-16, strings are always assumed to
be in big endian form. 

One of them should be about UTF-16BE and the other about UTF-16LE.

------------------------------------------------------------------------

[2005-10-07 11:47:13] narzeczony at zabuchy dot net

Description:
------------
When converting from UTF-16 (to ISO-8859-1, for example), the BOM (the
first 2 bytes of the UTF-16 text) should be removed, but the
mb_convert_encoding function tries to convert it.
The problem is similar to bug #22108, but maybe this one can be fixed.

Reproduce code:
---------------
$iso_8859_1 = 'Nexor';
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');

// let's turn both into UTF-16
// the only difference is the 2-byte BOM added at the beginning
// \xFF\xFE for little endian
$utf16LE = "\xFF\xFE".$utf16LE;
foreach (str_split($utf16LE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');
var_dump($utf16LE2iso);

echo '<br/>';

// \xFE\xFF for big endian
$utf16BE = "\xFE\xFF".$utf16BE;
foreach (str_split($utf16BE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16BE2iso = mb_convert_encoding($utf16BE,'ISO-8859-1','UTF-16');
var_dump($utf16BE2iso);


Expected result:
----------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(5) "Nexor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(5) "Nexor"


Actual result:
--------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(6) "??exor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(6) "?Nexor"


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=34776&edit=1
