#34776 [Opn]: mb_convert_encoding - wrong convertion from UTF-16 (problem with BOM)

2005-10-07 Thread narzeczony at zabuchy dot net
 ID:   34776
 User updated by:  narzeczony at zabuchy dot net
 Reported By:  narzeczony at zabuchy dot net
 Status:   Open
 Bug Type: mbstring related
 Operating System: Linux, Windows
 PHP Version:  5.0.5
 New Comment:

There is also a small typo in the documentation, but I don't want to
open another bug for it.
On http://ie.php.net/mbstring this section appears twice:

Name in the IANA character set registry: UTF-16BE
Underlying character set: Unicode
Description: See above.
Additional note: In contrast to UTF-16, strings are always assumed to
be in big endian form. 

One of them should describe UTF-16BE and the other UTF-16LE.


Previous Comments:


[2005-10-07 11:47:13] narzeczony at zabuchy dot net

Description:

When converting from UTF-16 (to ISO-8859-1, for example), the BOM (the
first 2 bytes of the UTF-16 text) should be removed, but
mb_convert_encoding tries to convert it as if it were text.
The problem is similar to bug #22108, but maybe this one can be fixed.

Reproduce code:
---
$iso_8859_1 = 'Nexor';
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');

// Turn both strings into generic UTF-16 by prepending a BOM.
// The only difference is the 2-byte BOM field added at the beginning.
// "\xFF\xFE" for little endian
$utf16LE = "\xFF\xFE".$utf16LE;
foreach (str_split($utf16LE) as $l) {echo ord($l).' ';}
echo ' -- ';
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');
var_dump($utf16LE2iso);

echo '<br/>';

// "\xFE\xFF" for big endian
$utf16BE = "\xFE\xFF".$utf16BE;
foreach (str_split($utf16BE) as $l) {echo ord($l).' ';}
echo ' -- ';
$utf16BE2iso = mb_convert_encoding($utf16BE,'ISO-8859-1','UTF-16');
var_dump($utf16BE2iso);


Expected result:

255 254 78 0 101 0 120 0 111 0 114 0 -- string(5) Nexor
254 255 0 78 0 101 0 120 0 111 0 114 -- string(5) Nexor


Actual result:
--
255 254 78 0 101 0 120 0 111 0 114 0 -- string(6) ??exor
254 255 0 78 0 101 0 120 0 111 0 114 -- string(6) ?Nexor







#34776 [Opn]: mb_convert_encoding - wrong convertion from UTF-16 (problem with BOM)

2005-10-07 Thread derick
 ID:   34776
 Updated by:   [EMAIL PROTECTED]
 Reported By:  narzeczony at zabuchy dot net
 Status:   Open
 Bug Type: mbstring related
 Operating System: Linux, Windows
 PHP Version:  5.0.5
 New Comment:

I think this is correct as you are not supposed to supply a BOM if you
specify which endianness your UTF16 stream is in.




#34776 [Opn]: mb_convert_encoding - wrong convertion from UTF-16 (problem with BOM)

2005-10-07 Thread narzeczony at zabuchy dot net
 ID:   34776
 User updated by:  narzeczony at zabuchy dot net
 Reported By:  narzeczony at zabuchy dot net
 Status:   Open
 Bug Type: mbstring related
 Operating System: Linux, Windows
 PHP Version:  5.0.5
 New Comment:

I'm not specifying which endianness mb_convert_encoding should use to
convert to ISO. Look:
$utf16LE2iso = mb_convert_encoding($utf16LE,'ISO-8859-1','UTF-16');

I'm converting from UTF-16 (LE or BE) to ISO-8859-1. It looks like
mb_convert_encoding checks the BOM field and chooses the right byte
order (if you remove the BOM, the string is not converted properly for
one endianness). The only problem is that the BOM itself is not ignored.

The first two lines, which do specify the endianness:
$utf16LE = mb_convert_encoding($iso_8859_1,'UTF-16LE','ISO-8859-1');
$utf16BE = mb_convert_encoding($iso_8859_1,'UTF-16BE','ISO-8859-1');
are just a convenient way to create the UTF-16 strings; please ignore them.
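
Until the BOM is ignored by mb_convert_encoding itself, a possible
workaround is to strip it by hand and pass the matching explicit
encoding. This is only a minimal sketch: the helper name
utf16_to_latin1 is hypothetical (it is not part of mbstring), and
falling back to big endian when no BOM is present is an assumption.

```php
<?php
// Hypothetical helper, not part of mbstring: strip a leading UTF-16 BOM
// and convert using the byte order that the BOM announced.
function utf16_to_latin1($s)
{
    $bom = substr($s, 0, 2);
    if ($bom === "\xFF\xFE") {
        // Little-endian BOM: drop it and convert as UTF-16LE.
        return mb_convert_encoding(substr($s, 2), 'ISO-8859-1', 'UTF-16LE');
    }
    if ($bom === "\xFE\xFF") {
        // Big-endian BOM: drop it and convert as UTF-16BE.
        return mb_convert_encoding(substr($s, 2), 'ISO-8859-1', 'UTF-16BE');
    }
    // No BOM at all: assume big endian (an assumption of this sketch).
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-16BE');
}

$le = "\xFF\xFE" . mb_convert_encoding('Nexor', 'UTF-16LE', 'ISO-8859-1');
$be = "\xFE\xFF" . mb_convert_encoding('Nexor', 'UTF-16BE', 'ISO-8859-1');

var_dump(utf16_to_latin1($le));
var_dump(utf16_to_latin1($be));
```

Because the BOM is removed before mb_convert_encoding sees the data,
both calls should give the 5-character string from the expected result.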




#34776 [Opn]: mb_convert_encoding - wrong convertion from UTF-16 (problem with BOM)

2005-10-07 Thread derick
 ID:   34776
 Updated by:   [EMAIL PROTECTED]
 Reported By:  narzeczony at zabuchy dot net
 Status:   Open
 Bug Type: mbstring related
 Operating System: Linux, Windows
 PHP Version:  5.0.5
 New Comment:

ah, mbstring has a weird parameter order (dest, src) instead of (src,
dest)... did you try to use iconv perhaps?
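
For comparison, the same conversion with iconv could look like the
sketch below. It assumes the iconv extension is loaded and that the
underlying iconv library treats a leading BOM in 'UTF-16' input as a
byte-order signature; that behaviour depends on the library, so the
output is worth checking on your own system.

```php
<?php
// Assumption: the iconv extension is available, and its backend consumes
// a leading BOM when the source encoding is the generic 'UTF-16'.
$utf16LE = "\xFF\xFE" . "N\x00e\x00x\x00o\x00r\x00"; // BOM + "Nexor", little endian
$utf16BE = "\xFE\xFF" . "\x00N\x00e\x00x\x00o\x00r"; // BOM + "Nexor", big endian

// iconv() takes (from, to, string); mb_convert_encoding() takes
// (string, to, from), so the encoding arguments are swapped.
$fromLE = iconv('UTF-16', 'ISO-8859-1', $utf16LE);
$fromBE = iconv('UTF-16', 'ISO-8859-1', $utf16BE);

var_dump($fromLE);
var_dump($fromBE);
```

If both dumps show a 5-character string, the BOM was consumed rather
than converted.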

