> So the original statement was correct. If the file starts with FF FE,
> it must be a little-endian encoding; but you can't tell whether it's
> UTF-16 or UTF-32.

The original statement was:

> > A Unicode text file beginning with FEFF is
> > big-endian, and a file beginning with FFFE (not a legal Unicode
> > character for any other purpose) is little-endian.

If a file starts with FF FE it could be little endian in different formats (but that FF FE *could* represent a BOM that needs to be removed, or *could* represent a real character), or it could indicate a corrupted (or non-Unicode) file. So the original statement is inaccurate.

Repeating my other message:

> This is incorrect. Here is a summary of the meaning of those bytes at
> the start of text files with different Unicode encoding forms.
>
> beginning with bytes FE FF:
> - UTF-16 => big endian, omitted from contents
> - UTF-16BE => ZWNBSP
> - UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE => malformed, file
>   corrupted
>
> beginning with bytes FF FE:
> - UTF-16 => little endian, omitted from contents
> - UTF-16LE => ZWNBSP
> - UTF-32 => little endian (if followed by bytes 00 00), omitted from
>   contents
> - UTF-32LE => different code points, depending on following bytes
> - UTF-16BE, UTF-8, UTF-32BE => malformed, file corrupted

Mark
—————
Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Rick Cameron" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Kenneth Whistler" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, April 10, 2002 09:45
Subject: RE: MS/Unix BOM FAQ again (small fix)

So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32.

- rick cameron

-----Original Message-----
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 9 April 2002 20:36
To: Kenneth Whistler
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: MS/Unix BOM FAQ again (small fix)

Sorry, I meant to write "since UTF-32LE, for example, could start with bytes FF FE."
It would be the start of a character like U+1FEFF, which would be FF FE 01 00

Mark
—————
Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, April 09, 2002 19:23
Subject: Re: MS/Unix BOM FAQ again (small fix)

> > I agree, there are different ways to look at it. But the statement
> >
> > > > > A Unicode text file beginning with FEFF is big-endian, and a
> > > > > file beginning with FFFE (not a legal Unicode
> > > > > character for any other purpose) is little-endian
> >
> > is just plain wrong, since UTF-32, for example, could start with bytes
> > FE FF.
>
> Um, not legally in open interchange.
>
> Either you have big-endian UTF-32 <FE FF nn mm ..> which would correspond
> to U-FEFFnnmm ... -- and that is out-of-range for both Unicode and 10646.
>
> Or you have little-endian UTF-32 <FE FF nn 00 ..> which would correspond
> to U-00nnFFFE ..., where nn could be 00..10, but all such values are
> noncharacters, and cannot be used in open interchange.
>
> So if serialized "Unicode text" starts off <FE FF ...> and purports to be legal,
> it cannot be UTF-32, it cannot be UTF-8, and it cannot be little-endian.
>
> --Ken
>
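
For anyone who wants to check the byte arithmetic in this thread, here is a minimal Python sketch of the FE FF / FF FE summary table and of the two UTF-32 readings discussed above. The names `PREFIX_MEANING`, `interpret_prefix` and `utf32_value` are made up for illustration and are not part of any library. The result strings are copied from the summary table; the one case the table leaves open (declared UTF-32, FF FE not followed by 00 00) is treated as malformed here, which is an assumption of this sketch.

```python
# Hypothetical helper names; the result strings come from the summary table.
MALFORMED = "malformed, file corrupted"

# Meaning of the first two bytes, keyed by prefix and declared encoding form.
PREFIX_MEANING = {
    b"\xfe\xff": {
        "UTF-16": "big endian, omitted from contents",
        "UTF-16BE": "ZWNBSP",
    },
    b"\xff\xfe": {
        "UTF-16": "little endian, omitted from contents",
        "UTF-16LE": "ZWNBSP",
        "UTF-32LE": "different code points, depending on following bytes",
    },
}


def interpret_prefix(data: bytes, declared: str) -> str:
    """Describe a leading FE FF or FF FE under the declared encoding form."""
    prefix = data[:2]
    if prefix not in PREFIX_MEANING:
        return "does not begin with FE FF or FF FE"
    if prefix == b"\xff\xfe" and declared == "UTF-32":
        # The table gives "little endian" only when 00 00 follows.
        if data[2:4] == b"\x00\x00":
            return "little endian, omitted from contents"
        return MALFORMED  # assumption: the table does not cover this case
    # Any declared form not listed for this prefix falls in the malformed row.
    return PREFIX_MEANING[prefix].get(declared, MALFORMED)


def utf32_value(four_bytes: bytes, byteorder: str) -> int:
    """Read four bytes as one UTF-32 code unit ('big' or 'little')."""
    return int.from_bytes(four_bytes, byteorder)


if __name__ == "__main__":
    print(interpret_prefix(b"\xff\xfe\x00\x00", "UTF-32"))
    print(interpret_prefix(b"\xff\xfe\x01\x00", "UTF-32LE"))
    # U+1FEFF serialized as UTF-32LE starts with FF FE.
    print("\U0001FEFF".encode("utf-32-le").hex(" "))
    # <FE FF nn 00> read little-endian gives U-00nnFFFE (here nn = 05).
    print(hex(utf32_value(b"\xfe\xff\x05\x00", "little")))
    # <FE FF nn mm> read big-endian is above U+10FFFF.
    print(utf32_value(b"\xfe\xff\x00\x00", "big") > 0x10FFFF)
```

The lookup is keyed on the declared encoding form because, as the thread points out, the same two bytes can be a BOM to strip, a real character, or a sign of corruption; the bytes alone do not settle it.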