Tim is correct, of course. I was speaking only of what our parser needs in
order to figure out the encoding, not of what the spec legally requires. I
should have made that clearer.
----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]
Tim Bray <[EMAIL PROTECTED]> on 01/28/2000 11:22:57 AM
Please respond to [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
cc:
Subject: Re: xml encodings, java
At 11:28 AM 1/28/00 -0700, [EMAIL PROTECTED] wrote:
>Slight correction... The BOM is required for UTF-16 only if the XMLDecl
>line (<?xml...) is not present. If the XMLDecl is present then we can
>figure it out from that (though a BOM can also still be present.)
Well, only maybe.
Section 4.3.3 says: "Entities encoded in UTF-16 must begin with the Byte
Order Mark...". But then a couple of paragraphs later, it says:
    In the absence of information provided by an external transport protocol
    (e.g. HTTP or MIME), it is an error for an entity including an encoding
    declaration to be presented to the XML processor in an encoding other than
    that named in the declaration, for an encoding declaration to occur
    other than at the beginning of an external entity, or for an entity which
    begins with neither a Byte Order Mark nor an encoding declaration to use
    an encoding other than UTF-8.
So you could read this as saying that if you have an external signal,
you may omit the BOM. And some purists over in the IETF, who generally
disapprove of the BOM and think that receiving software should just shut
up and rely on transmitting software to tell it what the encoding is, have
tried to use this loophole.
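As a rough illustration of the rules quoted above, here is a minimal Python
sketch of how a forgiving processor might guess an entity's encoding from its
first bytes. The function name `sniff_encoding` and the `transport_charset`
parameter are made up for this example, and letting the external signal win
outright is an assumption of the sketch, not something the spec spells out.
It also ignores UTF-32 and EBCDIC entirely.

```python
import codecs
import re

# Matches an encoding declaration in an ASCII-compatible XMLDecl.
_ENCODING_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)

def sniff_encoding(data, transport_charset=None):
    """Guess the encoding of an XML entity from its leading bytes."""
    # Assumption: an external signal (e.g. an HTTP charset) simply wins.
    if transport_charset:
        return transport_charset
    # A BOM settles the UTF-16 byte order directly.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # No BOM: the bytes of "<?" in UTF-16 still reveal the byte order,
    # which is how a parser can cope when only the XMLDecl is present.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible start: trust the encoding declaration if there is one.
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Neither a BOM nor an encoding declaration: the entity must be UTF-8.
    return "utf-8"
```

Note that the BOM-less UTF-16 branches are the forgiving part: they accept
input that a strict reading of section 4.3.3 would reject.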
The following is the bottom line:
1. Encodings are tricky; this is the one area where it's a good thing
for XML software to be forgiving. If it can be sure it's got the
right encoding, it should try hard to proceed, even if this means
bypassing erroneous declarations or forgiving omitted BOMs.
2. It is always a good idea to prefix a UTF-16 entity with a BOM.
3. It is always a bad idea to store or transmit UTF-16 without a BOM.
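On the writing side, points 2 and 3 come down to something like the following
Python sketch. The helper name `encode_utf16_with_bom` and its `big_endian`
flag are invented for illustration; the only thing it relies on is that
Python's "utf-16-be" and "utf-16-le" codecs do not emit a BOM themselves.

```python
import codecs

def encode_utf16_with_bom(text, big_endian=True):
    """Encode text as UTF-16 in a fixed byte order, always prefixed by a BOM."""
    if big_endian:
        bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"
    else:
        bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"
    # The endian-specific codecs write no BOM, so prepend it explicitly.
    return bom + text.encode(codec)
```

Python's plain "utf-16" codec also writes a BOM when encoding, but in the
machine's native byte order; the explicit helper just pins the order down.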
-Tim