[EMAIL PROTECTED] wrote:

> The BOM is explicitly not to be interpreted as part of
> the text stream. D35 (U3, p47) states (at least for UTF-16):
> 
> "The byte order mark is not considered part of the content of the text."

Absolutely.  What that means is that on decoding, a BOM, if present, is not
translated into a character; per contra, on encoding, a BOM is prefixed to
the byte representation of the first character.
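
As a rough illustration of that round trip (a sketch only, using Python's
built-in "utf-16" codec as a stand-in for the generic UTF-16 encoding
scheme; the variable names are mine, not the standard's):

```python
import codecs

# Encoding with the generic "utf-16" codec prefixes a BOM, written in the
# platform's native byte order, to the bytes of the first character.
encoded = "A".encode("utf-16")
has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding consumes the BOM (here FF FE, little-endian); it is used to pick
# the byte order and is not translated into a character.
decoded = b"\xff\xfeA\x00".decode("utf-16")
```

The decoded text is the single character "A"; the two BOM bytes leave no
trace in it.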

> The standard doesn't ever discuss the BOM in the context of UTF-8,

See section 13.6 (page 324).

> By the way, I don't know why you singled out U+0020 here; your claim could
> equally have been made about any other character (and would have been
> equally inaccurate).

Any other character, yes; inaccurate, no.

> [U+FEFF U+0020:] An unlikely initial character sequence,

But legal.

> This isn't analogous to UTF-16 since
> D33 - D35 spell out how an initial U+FEFF is to be interpreted (though it
> would be analogous if D33 - D35 didn't make that clear - perhaps that's
> what you meant).

For the "UTF-16" encoding, yes.  For the encodings "UTF-16BE" and "UTF-16LE"
defined in D33-34, no.  However, D35 tolerates using the term "UTF-16" in
either a specific or a generic sense.
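
The difference can be seen, at least in one implementation, by decoding the
same four bytes under two labels.  This is only a sketch: it assumes that
Python's "utf-16" and "utf-16-be" codecs behave like the generic and the
specific encodings respectively, which is how they appear to be designed.

```python
# The bytes FE FF 00 20, decoded under two different labels.
data = b"\xfe\xff\x00\x20"

# Generic "UTF-16": the initial FE FF is a BOM that selects big-endian
# order and is then dropped from the text.
generic = data.decode("utf-16")

# Specific "UTF-16BE": the byte order is already fixed by the label, so the
# initial FE FF is decoded as the character U+FEFF.
specific = data.decode("utf-16-be")
```

Here `generic` is just U+0020, while `specific` is U+FEFF followed by
U+0020.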

> - A UTF-8 file that begins with the byte sequence 0xEF 0xBB 0xBF 0x20 ...
> could be interpreted as either < ZWNBSP U+0020 ... >, or as BOM < U+0020
> ... > (where I'm using angle brackets to denote the start and end of the
> content of text). Furthermore, there is nothing to indicate which
> interpretation is correct. (On this we agree.)

Yes.  And thus new charset labels need to be introduced to distinguish
the two cases. A charset label, as RFC 1345 says, "unambiguously and
completely determines which sequence of characters, if any, is
represented by each possible sequence of n-bit bytes for a certain
value of n."  The label "UTF-8" does not do so.
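
To make the two readings concrete, here is a small sketch using Python's
codec names.  Note the assumption: "utf-8-sig" is Python's own name for a
BOM-stripping UTF-8 decoder, not a registered charset label; it merely
stands in for the kind of second label I have in mind.

```python
# The byte sequence under discussion: EF BB BF 20.
data = bytes([0xEF, 0xBB, 0xBF, 0x20])

# Reading 1: the first three bytes are content, i.e. < ZWNBSP U+0020 >.
as_zwnbsp = data.decode("utf-8")

# Reading 2: the first three bytes are a BOM and not content, i.e. < U+0020 >.
as_bom = data.decode("utf-8-sig")
```

The same bytes yield two different character sequences, one of length 2
and one of length 1, depending solely on the label chosen.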

(I am not to be understood as favoring this result: it would be much
better to suppress 8-BOMs, and talk only of UTF-8.  But that's not what
Unicode 3.0 entails.)

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)
