Re: MS/Unix BOM FAQ again (small fix)

Mark Davis Wed, 10 Apr 2002 13:00:58 -0700

> So the original statement was correct. If the file starts with FF FE,
> it must be a little-endian encoding; but you can't tell whether it's
> UTF-16 or UTF-32.


The original statement was:

> > A Unicode text file beginning with FEFF is
> > big-endian, and a file beginning with FFFE (not a legal Unicode
> > character for any other purpose) is little-endian.

If a file starts with FF FE it could be little endian in different
formats (but that FF FE *could* represent a BOM that needs to be
removed, or *could* represent a real character), or it could indicate
a corrupted (or non-Unicode) file. So the original statement is
inaccurate.

Repeating my other message:

>
> This is incorrect. Here is a summary of the meaning of those bytes at
> the start of text files with different Unicode encoding forms.
>
> beginning with bytes FE FF:
> - UTF-16 => big endian, omitted from contents
> - UTF-16BE => ZWNBSP
> - UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE => malformed, file
> corrupted
>
> beginning with bytes FF FE:
> - UTF-16 => little endian, omitted from contents
> - UTF-16LE => ZWNBSP
> - UTF-32 => little endian (if followed by bytes 00 00), omitted from
> contents
> - UTF-32LE => different code points, depending on following bytes
> - UTF-16BE, UTF-8, UTF-32BE => malformed, file corrupted
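
[Illustration] The summary above can be read as a lookup keyed on the declared encoding form. The following is only a minimal sketch of that reading; the function name `interpret_leading_bytes` and the returned strings are hypothetical, and the case of declared UTF-32 with FF FE not followed by 00 00 is reported as not covered, because the summary does not say what it means.

```python
def interpret_leading_bytes(data: bytes, declared: str) -> str:
    """Classify the first bytes of `data` under a declared encoding label.

    Follows the summary in the message above; names and result strings
    are illustrative, not part of any standard API.
    """
    head2, head4 = data[:2], data[:4]

    if head2 == b"\xfe\xff":
        if declared == "UTF-16":
            return "BOM: big endian, omitted from contents"
        if declared == "UTF-16BE":
            return "ZWNBSP"
        # UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE
        return "malformed"

    if head2 == b"\xff\xfe":
        if declared == "UTF-16":
            return "BOM: little endian, omitted from contents"
        if declared == "UTF-16LE":
            return "ZWNBSP"
        if declared == "UTF-32":
            if head4 == b"\xff\xfe\x00\x00":
                return "BOM: little endian, omitted from contents"
            # Assumption: the summary is silent on this case.
            return "not covered by the summary"
        if declared == "UTF-32LE":
            return "code point depends on following bytes"
        # UTF-16BE, UTF-8, UTF-32BE
        return "malformed"

    return "no FE FF / FF FE prefix"


if __name__ == "__main__":
    sample = b"\xff\xfe\x00\x00"
    for label in ("UTF-16", "UTF-16LE", "UTF-32", "UTF-32LE", "UTF-8"):
        print(label, "=>", interpret_leading_bytes(sample, label))
```

The point of the sketch is that the same two bytes yield different answers depending on the declared form, which is why the bytes alone do not settle the question.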

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Rick Cameron" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Kenneth Whistler"
<[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, April 10, 2002 09:45
Subject: RE: MS/Unix BOM FAQ again (small fix)


So the original statement was correct. If the file starts with FF FE,
it must be a little-endian encoding; but you can't tell whether it's
UTF-16 or UTF-32.

- rick cameron

-----Original Message-----
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 9 April 2002 20:36
To: Kenneth Whistler
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: MS/Unix BOM FAQ again (small fix)


Sorry, I meant to write "since UTF-32LE, for example, could start with
bytes FF FE."

It would be the start of a character like U+1FEFF, which would be

FF FE 01 00
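
[Illustration] The byte sequence above can be checked mechanically. This is a small sketch using Python's built-in `utf-32-le` codec (which writes no BOM) and plain integer serialization; the helper name `utf32le_bytes` is hypothetical.

```python
def utf32le_bytes(code_point: int) -> bytes:
    """Serialize one code point as a 4-byte little-endian UTF-32 code unit."""
    return code_point.to_bytes(4, "little")


if __name__ == "__main__":
    cp = 0x1FEFF
    raw = utf32le_bytes(cp)
    # The codec and the manual serialization should agree.
    assert raw == chr(cp).encode("utf-32-le")
    print(raw.hex(" ").upper())
```

Both routes give the same four bytes, so a UTF-32LE stream whose first character is U+1FEFF does begin with FF FE without containing a BOM.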

Mark

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, April 09, 2002 19:23
Subject: Re: MS/Unix BOM FAQ again (small fix)


> > I agree, there are different ways to look at it. But the statement
> >
> > > > > A Unicode text file beginning with FEFF is big-endian, and a
> > > > > file beginning with FFFE (not a legal Unicode
> > > > > character for any other purpose) is little-endian
> >
> > is just plain wrong, since UTF-32, for example, could start with
> > bytes FE FF.
>
> Um, not legally in open interchange.
>
> Either you have big-endian UTF-32 <FE FF nn mm ..> which would
> correspond to U-FEFFnnmm ... -- and that is out-of-range for both
> Unicode and 10646.
>
> Or you have little-endian UTF-32 <FE FF nn 00 ..> which would
> correspond to U-00nnFFFE ..., where nn could be 00..10, but all such
> values are noncharacters, and cannot be used in open interchange.
>
> So if serialized "Unicode text" starts off <FE FF ...> and purports
> to be legal, it cannot be UTF-32, it cannot be UTF-8, and it cannot be
> little-endian.
>
> --Ken
>
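
[Illustration] Ken's two cases are simple arithmetic on a 4-byte unit that starts FE FF. The following sketch spells them out; the names `utf32_readings` and `is_noncharacter` are hypothetical helpers, and the noncharacter test covers U+FDD0..U+FDEF plus the last two code points of each plane.

```python
MAX_CODE_POINT = 0x10FFFF


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def utf32_readings(nn: int, mm: int = 0x00) -> tuple[int, int]:
    """Read the bytes FE FF nn mm as one UTF-32 unit, big- then little-endian."""
    unit = bytes([0xFE, 0xFF, nn, mm])
    return int.from_bytes(unit, "big"), int.from_bytes(unit, "little")


if __name__ == "__main__":
    # Big-endian: even the smallest possible value is far out of range.
    smallest_big, _ = utf32_readings(0x00, 0x00)
    print(hex(smallest_big), smallest_big > MAX_CODE_POINT)

    # Little-endian with mm == 00: in range only for nn in 00..10,
    # and every in-range value ends in FFFE, so it is a noncharacter.
    for nn in range(0x100):
        _, little = utf32_readings(nn)
        if little <= MAX_CODE_POINT:
            assert nn <= 0x10 and is_noncharacter(little)
```

Under those assumptions, no reading of FE FF as the start of a UTF-32 unit produces a code point that is both in range and usable in open interchange.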
