[EMAIL PROTECTED] writes:
>
> With the very same file, the iconv output is different under FreeBSD and
> Linux (both on Intel PIII).
> Linux:
>
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
> 0000010 1e20 1c20 7e01 0a00
> 0000018
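This is big-endian UTF-16, without a byte order mark. (Yes, Kent.
The "hexdump" utility on x86 systems displays 16-bit little-endian
words, so the two bytes of each word appear swapped: "0501" stands
for the bytes 01 05. Next time, please use
  hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" " 16/1 "%_p" "\n"'
instead of plain "hexdump", so that the bytes are shown in file order.)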
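To make the byte swapping concrete, here is a minimal sketch in Python (my choice of language, not something from the original report). The helper name hexdump_words_to_bytes is hypothetical; it only assumes that each word in the dump is a 16-bit value read in little-endian order, as described above.

```python
import struct

def hexdump_words_to_bytes(words: str) -> bytes:
    # Each word was read as a little-endian 16-bit value, so packing it
    # back with "<H" restores the original two bytes in file order.
    return b"".join(struct.pack("<H", int(w, 16)) for w in words.split())

# The words from the Linux dump quoted above.
linux_words = ("0501 0d01 1901 1701 2f01 6101 7301 6b01 "
               "1e20 1c20 7e01 0a00")
linux_bytes = hexdump_words_to_bytes(linux_words)
print(linux_bytes[:4].hex(" "))   # 01 05 01 0d

# Read as big-endian UTF-16, the bytes give sensible code points.
text = linux_bytes.decode("utf-16-be")
print([f"U+{ord(c):04X}" for c in text[:3]])   # ['U+0105', 'U+010D', 'U+0119']
```

The file starts with the bytes 01 05, that is U+0105 in big-endian order, which is why the Linux output counts as big-endian.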
> FreeBSD:
>
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
> 0000010 6b01 1e20 1c20 7e01 0a00
> 000001a
This is big-endian UTF-16, with a byte order mark (the leading word
"fffe" stands for the bytes FE FF).
And with glibc 2.1.96 you get:
$ cat t.txt | /glibc22/bin/iconv -f utf-8 -t utf-16 | hexdump
0000000 0105 010d 0119 0117 012f 0161 0173 016b
0000010 201e 201c 017e 000a
0000018
which is little-endian UTF-16, without byte order mark.
> I want to know which output is the right one according to specs
> or common sense or even both :).
The spec is RFC 2781. It says:
3.3: "Any labelling application that uses UTF-16 character encoding, and
puts an explicit charset label on the text, and does not know the
serialization order of the characters in text, MUST label the text as
"UTF-16", and SHOULD make sure the text starts with 0xFEFF."
4.3: "If the first two octets of the text is not 0xFE followed by
0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
interpreted as being big-endian."
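The rule in 4.3 can be sketched as a small reader. This is only an illustration in Python, not code from any iconv implementation; the function name decode_utf16 is hypothetical, and the three byte strings are reconstructed from the dumps quoted in this thread.

```python
def decode_utf16(data: bytes) -> str:
    # A byte order mark, if present, decides the byte order.
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")
    # No byte order mark: RFC 2781 section 4.3 says to assume big-endian.
    return data.decode("utf-16-be")

# Byte sequences reconstructed from the three dumps.
linux = bytes.fromhex("0105010d01190117012f01610173016b201e201c017e000a")
freebsd = b"\xfe\xff" + linux
expected = linux.decode("utf-16-be")
glibc22 = expected.encode("utf-16-le")   # little-endian, no byte order mark

print(decode_utf16(linux) == expected)     # True
print(decode_utf16(freebsd) == expected)   # True
print(decode_utf16(glibc22) == expected)   # False
```

A reader that follows 4.3 recovers the text from the Linux and FreeBSD outputs, but misreads the glibc 2.1.96 output, because it has to assume big-endian when there is no byte order mark.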
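I think glibc is wrong here. Without a byte order mark, a reader that
follows section 4.3 will treat the output as big-endian, which garbles
the little-endian output of glibc 2.1.96. glibc ought to prefix the
output with 0xFEFF, like it does when converting to "UNICODE" instead
of "UTF-16".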
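For comparison, a writer that follows the SHOULD in 3.3 could look like the sketch below. It is written in Python purely as an illustration and is not how glibc or FreeBSD implement iconv; the function name encode_utf16_with_bom is hypothetical, and big-endian is chosen here only to match the FreeBSD dump.

```python
def encode_utf16_with_bom(text: str) -> bytes:
    # Emit U+FEFF first (as the bytes FE FF), then the text in the
    # same big-endian byte order.
    return b"\xfe\xff" + text.encode("utf-16-be")

# The same characters as in t.txt, written as escapes.
sample = ("\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b"
          "\u201e\u201c\u017e\n")
out = encode_utf16_with_bom(sample)
print(out[:4].hex(" "))   # fe ff 01 05
print(len(out))           # 26
```

The 26 bytes agree with the final offset 000001a in the FreeBSD dump.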
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/