On Jul 17, 2007, at 8:40 PM, Yonik Seeley wrote:

On 7/17/07, DM Smith <[EMAIL PROTECTED]> wrote:
According to the UTF-8 spec \uFEFF is not a BOM. In UTF-8 the byte
order is always the same.

But there is a BOM for UTF-8 (even though there is no endian
component, it does serve as a marker indicating the text file is
unicode text encoded in UTF-8).

http://unicode.org/faq/utf_bom.html#29

This is all rather academic at this point as you have fixed the problem.

I stand corrected: \uFEFF (the code point) is the BOM for all UTFs, with its byte representation differing by encoding. But UTF-8 byte order is always the same, regardless of the presence of the BOM.

According to the Unicode 5.0 Standard book, Chapter 13, Section 13.6, the byte sequence of the BOM for UTF-8 is EF BB BF (3 bytes) and for UTF-16 it is FE FF or FF FE (2 bytes). It appears that the byte sequence is distinct for each Unicode encoding.

See http://www.unicode.org/unicode/uni2book/ch13.pdf#BOM
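
As a quick check of those byte sequences, here is a minimal sketch in Python (chosen only for illustration; the helper name bom_bytes is made up for this example). It encodes the single code point U+FEFF in each encoding and prints the resulting bytes:

```python
BOM = "\ufeff"  # the single code point U+FEFF


def bom_bytes(encoding: str) -> str:
    """Encode U+FEFF and return the bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in BOM.encode(encoding))


# Use the explicit -be/-le codecs: the plain "utf-16" codec would prepend
# its own BOM and pick the platform's byte order.
utf8 = bom_bytes("utf-8")
utf16_be = bom_bytes("utf-16-be")
utf16_le = bom_bytes("utf-16-le")

print("UTF-8:    ", utf8)
print("UTF-16 BE:", utf16_be)
print("UTF-16 LE:", utf16_le)
```

The same code point comes out as three bytes in UTF-8 and as two bytes, in either order, in UTF-16.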

I frequently see FE FF at the beginning of UTF-8 files. I have only seen MS editors add this. This is wrong for UTF-8 files. I was assuming that this was the junk at the beginning of the file.
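
One way to check what a file really starts with is to compare its leading bytes against the known BOM sequences. The following is only a sketch in Python, not code from this thread; the function name sniff_bom is hypothetical, and it covers only UTF-8 and UTF-16:

```python
import codecs

# Known BOM byte sequences, taken from Python's standard codecs module.
# Limitation: UTF-32 is not handled, so a UTF-32 LE BOM (FF FE 00 00)
# would be reported as UTF-16 LE by this sketch.
_BOMS = (
    ("utf-8", codecs.BOM_UTF8),          # EF BB BF
    ("utf-16-be", codecs.BOM_UTF16_BE),  # FE FF
    ("utf-16-le", codecs.BOM_UTF16_LE),  # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding_name, data_without_bom), or (None, data) if no BOM."""
    for name, bom in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfhello"))
    print(sniff_bom(b"\xfe\xff\x00h"))
    print(sniff_bom(b"hello"))
```

A file that begins with FE FF would be reported as UTF-16 big-endian here, not as UTF-8, which is why those two bytes in front of UTF-8 text would be a mismatch.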

But the junk at the beginning of the file was C2 BF. I am not at all sure what this would be.
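
For what it is worth, decoding those two bytes as UTF-8 gives a hint. This is a small Python sketch for illustration only; it shows what the bytes decode to, not how they ended up at the start of the file:

```python
import unicodedata

junk = bytes([0xC2, 0xBF])   # the two bytes seen at the start of the file
text = junk.decode("utf-8")  # C2 BF is a valid two-byte UTF-8 sequence

code_point = f"U+{ord(text):04X}"
name = unicodedata.name(text)

print(code_point, name)  # U+00BF INVERTED QUESTION MARK
```

So C2 BF is not a BOM in any encoding; it is the UTF-8 encoding of the character U+00BF.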





