In a message dated 2001-04-10 3:04:09 Pacific Daylight Time,
[EMAIL PROTECTED] writes:
> When looking at a document would it be safe to assume that if you found any
> of the following Byte Order Marks
> * 0xFFFE (UCS-2 Little Endian)
> * 0xFEFE (UCS-2 Big Endian)
should be 0xFEFF
> * 0xEFBBBF (UTF-8)
> That the document is encoded with that encoding format. That means that if I
> found the first 3 octets to be EF BB EF could I assume I am dealing with a
> UTF-8 Document.
That is usually a safe assumption and a good practice, except that if the
first two bytes are 0xFF 0xFE, you should check the next two to see if they
are 0x00 0x00 (which would signify little-endian UCS-4).
Also, think in terms of UTF-16, not UCS-2.
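
To make the check order concrete, here is a minimal Python sketch of the test described above. It is only an illustration under stated assumptions: the name sniff_bom and the table layout are hypothetical, and the encoding labels are simply Python's codec names. The one point it is meant to show is that the four-byte signatures have to be tested before the two-byte ones, so that FF FE 00 00 is not taken for UTF-16.

```python
# Hypothetical helper for illustration; sniff_bom is not an existing API.
# Longer signatures come first so that FF FE 00 00 (little-endian
# UCS-4/UTF-32) is not mistaken for FF FE (little-endian UTF-16).
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, signature_length), or (None, 0) if none matches."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    # Sample inputs are made up for the demonstration.
    for sample in (b"\xff\xfeA\x00", b"\xff\xfe\x00\x00A\x00\x00\x00", b"plain"):
        print(sample, sniff_bom(sample))
```

A result of (None, 0) means only that no signature was found, not that the text is in a legacy encoding. Also, FF FE 00 00 could in principle be UTF-16 text that begins with U+0000, so the answer is a strong hint rather than a proof.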
> Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
> character sets use Byte Order Marks?
Good question. I have not heard of any.
To follow up, what about signatures that are not necessarily byte order
marks? UTF-8 does not need a BOM, so the signature 0xEF 0xBB 0xBF is useful
for the purpose Tomás mentioned, to indicate the encoding. Do any other
character sets have such signatures?
-Doug Ewell
Fullerton, California