Re: iconv output utf-8 -> utf-16, which one is wrong?

Bruno Haible Thu, 02 Nov 2000 09:00:32 -0800
[EMAIL PROTECTED] writes:
> 
> With the very same file, the iconv output is different under FreeBSD and
> Linux (both on Intel PIII).

> Linux:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
> 0000010 1e20 1c20 7e01 0a00                    
> 0000018

This is big-endian UTF-16, without byte order mark. (Yes, Kent.
The "hexdump" utility on x86 systems displays 16-bit little-endian
words. Next time, please use
    hexdump -e '"%06.6_ax  " 16/1 "%02X "' -e '"  " 16/1 "%_p" "\n"'
instead of "hexdump".)
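
To make the word-swapping concrete, here is a small Python sketch that undoes the 16-bit little-endian word display and recovers the bytes as they sit in the file. The helper name words_to_bytes is made up for illustration, and the input is the first line of the Linux dump quoted above.

```python
def words_to_bytes(words):
    """Turn 16-bit words as printed by plain hexdump on x86 back into file bytes.

    Each 4-digit group is a little-endian 16-bit value, so its low byte
    (the last two hex digits) is the one that comes first in the file.
    """
    out = bytearray()
    for w in words.split():
        value = int(w, 16)
        out.append(value & 0xFF)  # low byte: first in the file
        out.append(value >> 8)    # high byte: second in the file
    return bytes(out)


linux_line = "0501 0d01 1901 1701 2f01 6101 7301 6b01"
data = words_to_bytes(linux_line)

# Bytes in file order: 01 05 01 0d 01 19 ...
print(data.hex(" "))

# Read as big-endian UTF-16, these are U+0105 U+010D U+0119 ...
print(" ".join(f"U+{ord(c):04X}" for c in data.decode("utf-16-be")))
```

So the word printed as 0501 is really the byte pair 01 05, i.e. U+0105 in big-endian order.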

> FreeBSD:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
> 0000010 6b01 1e20 1c20 7e01 0a00               
> 000001a

This is big-endian UTF-16, with byte order mark.

And with glibc 2.1.96 you get:

$ cat t.txt | /glibc22/bin/iconv -f utf-8 -t utf-16 | hexdump
0000000 0105 010d 0119 0117 012f 0161 0173 016b
0000010 201e 201c 017e 000a                    
0000018

which is little-endian UTF-16, without byte order mark.
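
For comparison, the three serializations can be reproduced in Python. This is only a sketch: the text is reconstructed from the dumps (t.txt itself was not posted), and the variable names are mine.

```python
import codecs

# Assumption: the content of t.txt, reconstructed from the code units in the dumps.
text = "\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u201e\u201c\u017e\n"

# Big-endian, no byte order mark (the Linux output).
linux_bytes = text.encode("utf-16-be")
# Big-endian, preceded by the byte order mark FE FF (the FreeBSD output).
freebsd_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
# Little-endian, no byte order mark (the glibc 2.1.96 output).
glibc22_bytes = text.encode("utf-16-le")

for name, data in [
    ("Linux", linux_bytes),
    ("FreeBSD", freebsd_bytes),
    ("glibc 2.1.96", glibc22_bytes),
]:
    # Show the total length and the first six bytes in file order.
    print(f"{name:13} {len(data):#04x} bytes  {data[:6].hex(' ')} ...")
```

The lengths come out as 0x18, 0x1a and 0x18 bytes, matching the final offsets in the three dumps.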

> I want to know which output is the right one according to specs
> or common sense or even both :).

The spec is RFC 2781. It says:

3.3: "Any labelling application that uses UTF-16 character encoding, and
   puts an explicit charset label on the text, and does not know the
   serialization order of the characters in text, MUST label the text as
   "UTF-16", and SHOULD make sure the text starts with 0xFEFF."

4.3: "If the first two octets of the text is not 0xFE followed by
   0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
   interpreted as being big-endian."
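
Read from the receiving side, the rule in 4.3 can be sketched in Python as follows. The function name decode_utf16_rfc2781 is hypothetical, and this is an illustration of the quoted rule, not what any particular iconv does.

```python
def decode_utf16_rfc2781(data):
    """Decode bytes labelled "UTF-16" using the rule quoted from RFC 2781, 4.3."""
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")  # BOM present: big-endian
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")  # BOM present: little-endian
    return data.decode("utf-16-be")          # no BOM: SHOULD be read as big-endian


# Big-endian with BOM and big-endian without BOM both decode to U+0105.
print(ascii(decode_utf16_rfc2781(b"\xfe\xff\x01\x05")))
print(ascii(decode_utf16_rfc2781(b"\x01\x05")))

# Little-endian without BOM is misread as U+0501 by a reader following 4.3.
print(ascii(decode_utf16_rfc2781(b"\x05\x01")))

# Little-endian with BOM is fine again.
print(ascii(decode_utf16_rfc2781(b"\xff\xfe\x05\x01")))
```

The third case is the glibc 2.1.96 output: a reader that follows 4.3 gets the wrong characters from it.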

I think glibc is wrong here and ought to prefix the output with
0xFEFF, like it does when converting to "UNICODE" instead of "UTF-16".

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/