> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Linux:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 0501 0d01 1901 1701 2f01 6101 7301 6b01
> 0000010 1e20 1c20 7e01 0a00                    
> 0000018

This is UTF-16LE (little-endian serialisation of UTF-16).
It does *not* conform to 10646 (which only allows for
big-endian serialisations) but does conform to Unicode.

An initial U+FEFF in UTF-16LE (or UTF-16BE) is interpreted
as a character (ZWNBSP) and must be kept.
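
The difference between a labelled and an unlabelled decode can be sketched in Python. This is only an illustration: the sample bytes are hypothetical (they are not taken from the dumps in this thread), and Python's codecs stand in for iconv here.

```python
# Hypothetical sample: U+FEFF followed by U+0105, serialised little-endian.
data = b"\xff\xfe\x05\x01"

# Explicit label: the initial U+FEFF is a character and stays in the text.
as_le = data.decode("utf-16-le")

# Unlabelled "utf-16": the codec reads FF FE as a byte-order mark and drops it.
as_bom = data.decode("utf-16")

print([hex(ord(c)) for c in as_le])
print([hex(ord(c)) for c in as_bom])
```

With the explicit label the decoded text has two characters, with the plain "utf-16" label only one.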

> FreeBSD:
> 
> $ cat t.txt | iconv -f utf-8 -t utf-16 | hexdump
> 0000000 fffe 0501 0d01 1901 1701 2f01 6101 7301
> 0000010 6b01 1e20 1c20 7e01 0a00               
> 000001a

This is UTF-16[with-byte-order-mark; little-endian],
assuming that there was no U+FEFF in the beginning of
the source file (if there was, this would be UTF-16LE).

The (optional if big-endian) byte-order-mark is to be
removed after detecting the byte order.
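
A minimal sketch of that detect-then-remove step, assuming the input is labelled plain "UTF-16". The function name decode_utf16 is made up for this example, and the fallback to big-endian when no mark is present follows my reading of RFC 2781, which says such text SHOULD be interpreted as big-endian.

```python
def decode_utf16(data: bytes) -> str:
    """Decode text labelled plain "UTF-16" (hypothetical helper)."""
    if data[:2] == b"\xfe\xff":
        # Big-endian byte-order mark: strip it, decode the rest as big-endian.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        # Little-endian byte-order mark: strip it, decode as little-endian.
        return data[2:].decode("utf-16-le")
    # No mark at all: assume big-endian (assumption, per RFC 2781).
    return data.decode("utf-16-be")


# The same character (U+0105) in all three serialisations.
print(decode_utf16(b"\xfe\xff\x01\x05") == "\u0105")
print(decode_utf16(b"\xff\xfe\x05\x01") == "\u0105")
print(decode_utf16(b"\x01\x05") == "\u0105")
```

Only the first mark is removed; a U+FEFF later in the text is an ordinary character and is left alone.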

Whether byte-order marks (or more generally: "signatures")
are a good or a bad idea is a matter of opinion.  E.g.
Microsoft these days puts a "signature" even in UTF-8 encoded
files. However, XML specifies that a byte-order mark is
to be used for UTF-16 encoded XML files, though it is not
strictly necessary (encoding declarations are
good though; THOSE should have been required).

See also IETF RFC 2781.

Further, a "WORD JOINER" is on its way into 10646 and
Unicode. WORD JOINER is really ZWNBSP, and only that,
never a "signature".

Back to the question at hand:

My opinion is that iconv should accept the label UTF-16BE,
and act according to IETF RFC 2781 for that label. Thus,

iconv -f utf-8 -t utf-16be

should give the same signatureless, big-endian UTF-16
encoding on every platform that has iconv.
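
The behaviour asked for here can be illustrated with Python's codecs, which already distinguish the two labels. This is a sketch of the expected result, not a statement about what any particular iconv does; the sample string is made up.

```python
text = "\u0105\u010d"  # hypothetical sample: two Latin Extended-A letters

# Explicit big-endian label: no signature, same bytes on every platform.
be = text.encode("utf-16-be")

# Plain "utf-16" label: Python prepends a byte-order mark and uses the
# byte order of the machine it runs on, so the bytes can differ by platform.
plain = text.encode("utf-16")

print(be.hex())
print(plain[:2] in (b"\xfe\xff", b"\xff\xfe"))
```

The explicit label gives four bytes here, the plain label six (mark plus two characters).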


                /kent k
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
