ID:               34776
User updated by:  narzeczony at zabuchy dot net
Reported By:      narzeczony at zabuchy dot net
Status:           Open
Bug Type:         mbstring related
Operating System: Linux, Windows
PHP Version:      5.0.5
New Comment:

I'm not specifying which endianness mb_convert_encoding should use when
converting to ISO. Look:

$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');

I'm converting from UTF-16 (LE or BE) to ISO-8859-1. It looks like
mb_convert_encoding checks the BOM and chooses the right byte order
(if you remove the BOM, the string won't be converted properly for one
of the two endiannesses). The only problem is that the BOM is not
ignored.

The first two lines, which do specify the endianness:

$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');

are just for convenient UTF-16 string creation; please ignore them.


Previous Comments:
------------------------------------------------------------------------

[2005-10-07 11:57:10] [EMAIL PROTECTED]

I think this is correct as you are not supposed to supply a BOM if you
specify which endianness your UTF16 stream is in.

------------------------------------------------------------------------

[2005-10-07 11:52:16] narzeczony at zabuchy dot net

There is also a small typo in the documentation, but I don't want to
open another bug.
On http://ie.php.net/mbstring this section is repeated twice:

Name in the IANA character set registry: UTF-16BE
Underlying character set: Unicode
Description: See above.
Additional note: In contrast to UTF-16, strings are always assumed to
be in big endian form.

One of them should be about UTF-16BE and the other about UTF-16LE.

------------------------------------------------------------------------

[2005-10-07 11:47:13] narzeczony at zabuchy dot net

Description:
------------
When converting from UTF-16 (to ISO-8859-1, for example), the BOM (the
first 2 bytes of the UTF-16 text) should be removed, but
mb_convert_encoding tries to convert it as if it were text.
The problem is similar to bug #22108, but maybe this one can be fixed.

Reproduce code:
---------------
$iso_8859_1 = 'Nexor';
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');

// let's turn both into UTF-16
// the only difference is the 2-byte BOM added at the beginning

// \xFF\xFE for little endian
$utf16LE = "\xFF\xFE".$utf16LE;
foreach (str_split($utf16LE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');
var_dump($utf16LE2iso);

echo '<br/>';

// \xFE\xFF for big endian
$utf16BE = "\xFE\xFF".$utf16BE;
foreach (str_split($utf16BE) as $l) {echo ord($l).' ';}
echo ' --> ';
$utf16BE2iso = mb_convert_encoding($utf16BE,'ISO-8859-1','UTF-16');
var_dump($utf16BE2iso);

Expected result:
----------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(5) "Nexor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(5) "Nexor"

Actual result:
--------------
255 254 78 0 101 0 120 0 111 0 114 0 --> string(6) "??exor"
254 255 0 78 0 101 0 120 0 111 0 114 --> string(6) "?Nexor"


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=34776&edit=1
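
Possible workaround (a sketch only, not a fix for mbstring itself): until
the BOM is handled by the generic 'UTF-16' source encoding, the caller can
inspect the first two bytes, strip them, and pass the explicit byte order
to mb_convert_encoding. The helper name utf16_to_iso_8859_1 is
hypothetical and not part of mbstring. The fallback to big endian when no
BOM is present is an assumption made for this example.

```php
<?php
// Hypothetical helper, not part of mbstring: strips a leading UTF-16 BOM
// and converts using the byte order that the BOM indicates.
function utf16_to_iso_8859_1($s)
{
    $bom = substr($s, 0, 2);

    if ($bom === "\xFF\xFE") {
        // Little endian BOM: drop it and convert the rest as UTF-16LE.
        return mb_convert_encoding((string) substr($s, 2), 'ISO-8859-1', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        // Big endian BOM: drop it and convert the rest as UTF-16BE.
        return mb_convert_encoding((string) substr($s, 2), 'ISO-8859-1', 'UTF-16BE');
    }

    // No BOM found. Assumption for this sketch: treat the input as big endian.
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-16BE');
}
```

With the strings from the reproduce code, both
utf16_to_iso_8859_1($utf16LE) and utf16_to_iso_8859_1($utf16BE) should
give string(5) "Nexor", since the BOM never reaches the converter.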