Re: Names for UTF-8 with and without BOM

Mark Davis Sun, 03 Nov 2002 17:24:08 -0800

> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?


I personally wouldn't care if every instance of "Michael Kaplan" at the
start of a file were deleted. Not the point.

The actual point is that currently, as defined -- not as you would wish it
to be -- FEFF is an actual character, and in circumstances where it is not
clearly defined for use as a BOM, it cannot be removed without altering the
content of the text.
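
To make that concrete, here is a minimal Python sketch. It is an illustration only: the helper name strip_leading_feff and the sample text are made up for the example. It assumes Python's standard codecs, where "utf-8" keeps a leading FEFF as a character and "utf-8-sig" treats it as a signature and drops it.

```python
FEFF = "\ufeff"  # U+FEFF: ZERO WIDTH NO-BREAK SPACE, also used as the BOM


def strip_leading_feff(text: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, as a BOM-stripping reader would."""
    return text[1:] if text.startswith(FEFF) else text


# Sample text (made up for this example) that begins with U+FEFF.
original = FEFF + "word"
stripped = strip_leading_feff(original)

# The two strings differ by one code point, so stripping alters the content.
print(len(original), len(stripped))
print(original == stripped)

# The same distinction at the byte level: EF BB BF is U+FEFF encoded in UTF-8.
raw = b"\xef\xbb\xbfword"
kept = raw.decode("utf-8")         # leading U+FEFF is kept as a character
dropped = raw.decode("utf-8-sig")  # leading U+FEFF is treated as a signature
print(repr(kept), repr(dropped))
```

Whether the second decoding is legitimate depends on whether FEFF is defined as a signature in that context, which is exactly the question here.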

As I said in another message, the UTC could change this situation by
completely deprecating the use of FEFF as anything but a BOM. But it hasn't
done so yet.

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode Mailing List"
<[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM


> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that, for the purpose of dealing with THREE bytes, so many bytes
> are being wasted. :-)
>
> > Little probability that right double quote would appear at the start of
> > a document either. Doesn't mean that you are free to delete it (*and*
> > say that you are not modifying the contents).
>
> Interesting strawman there, Mark -- but there is a huge difference.
> But even if we leave in the notion of it as a character and just deprecate
> its usage and people ignore that, then we are talking about a ZERO WIDTH
> NO-BREAK SPACE. This character has the job of:
>
> 1) being invisible
> 2) not breaking text with it
>
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?
>
> The one thing that no one has ever come up with is a reasonable case where
> it would be at the beginning of the document *yet* it was not a BOM.
>
> So we have a clear semantic for it at the beginning of a file -- it's a
> BOM. Period.
>
> If there is a higher level protocol as well and the protocol and the BOM
> both match, then that is great! Considering how much redundancy there is
> in the Unicode standard about some definitions, a redundant marker for a
> file seems a very trivial issue.
>
> If there is a higher level protocol as well and they do not match, then we
> are in fantasy land bizarro world, inventing edge cases because we have
> nothing better to do. :-)  But for the sake of argument, let's pretend it's
> a real scenario -- in which case we treat it the same way as if your higher
> level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
> error.
>
> Problem solved!
>
> > I agree that when the UTC decides that a BOM is *only* to be used as a
> > signature, and that it would be ok to delete it anywhere in a document
> > (like a non-character), then we are in much better shape. This was, as a
> > matter of fact, proposed for 3.2, but not approved. If we did that for
> > 4.0, then there would be much less reason to distinguish UTF-8 'withBOM'
> > from UTF-8 'withoutBOM'.
>
> There is no reason to worry about this case and no need to delete
> anything. This is a ZERO WIDTH NO-BREAK SPACE we are talking about. The
> burden is on the people who think this is a scenario to bring proof that
> anyone is doing anything as unrealistic as this.
>
> There is an easy, clear, and unambiguous plan that can be used here which
> will always work. For once, let's not opt to complicate it without reason.
>
> MichKa
>
>
>
