Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM

> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that, for the purpose of dealing with THREE bytes, so many bytes
> are being wasted. :-)
>
> > Little probability that right double quote would appear at the start of a
> > document either. Doesn't mean that you are free to delete it (*and* say that
> > you are not modifying the contents).
>
> Interesting strawman there, Mark -- but there is a huge difference there.
> But even if we leave in the notion of it as a character and just deprecate
> its usage and people ignore that, then we are talking about a ZERO WIDTH NO
> BREAK SPACE. This character has the job of:
>
> 1) being invisible
> 2) not breaking text with it
>
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?
>
> The one thing that no one has ever come up with is a reasonable case where
> it would be at the beginning of the document *yet* it was not a BOM.
>
> So we have a clear semantic for it at the beginning of a file -- it's a BOM.
> Period.
>
> If there is a higher level protocol as well and the protocol and the BOM
> both match, then that is great! Considering how much redundancy there is in
> the Unicode standard about some definitions, a redundant marker for a file
> seems a very trivial issue.
>
> If there is a higher level protocol as well and they do not match, then we
> are in fantasy land bizarro world, inventing edge cases because we have
> nothing better to do. :-) But for the sake of argument, let's pretend it's a
> real scenario -- in which case we treat it the same way as if your higher
> level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
> error.
>
> Problem solved!
>
> > I agree that when the UTC decides that a BOM is *only* to be used as a
> > signature, and that it would be ok to delete it anywhere in a document (like
> > a non-character), then we are in much better shape. This was, as a matter of
> > fact, proposed for 3.2, but not approved. If we did that for 4.0, then there
> > would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> > 'withoutBOM'.
>
> There is no reason to worry about this case and no need to delete anything.
> This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on
> the people who think this is a scenario to bring proof that anyone is doing
> anything as unrealistic as this.
>
> There is an easy, clear, and unambiguous plan that can be used here which
> will always work. For once, let's not opt to complicate it without reason.
>
> MichKa
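
For concreteness, the rule MichKa describes (a leading U+FEFF is always read as a BOM; agreement with a higher-level protocol is fine; disagreement is an error) could be sketched roughly as below. This is only an illustration, not anything proposed in the thread: the function names `sniff_bom` and `resolve_encoding` are hypothetical, and treating a generic label such as "utf-16" as compatible with either byte order is an assumption of this sketch. It deliberately does not delete the BOM, since whether that is permitted is exactly what is in dispute.

```python
import codecs

# BOM byte sequences and the encoding each one signals.
# UTF-32 comes first because the UTF-32-LE BOM (FF FE 00 00) begins
# with the UTF-16-LE BOM (FF FE). Caveat: UTF-16-LE text whose first
# character after the BOM is U+0000 would be misread as UTF-32-LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding signalled by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def _compatible(declared, bom_encoding):
    """Assumption of this sketch: a declared label matches the BOM if it
    names the same encoding, or the same family without a byte order."""
    name = codecs.lookup(declared).name  # normalises e.g. "UTF-8" -> "utf-8"
    if name == "utf-8-sig":
        name = "utf-8"
    family = bom_encoding
    for suffix in ("-le", "-be"):
        if family.endswith(suffix):
            family = family[: -len(suffix)]
    return name in (bom_encoding, family)


def resolve_encoding(data, declared=None):
    """Pick the encoding for `data`, given an optional label from a
    higher-level protocol. A BOM that contradicts the label is an error."""
    bom_encoding = sniff_bom(data)
    if bom_encoding is None:
        return declared          # no BOM: the protocol (if any) decides
    if declared is None:
        return bom_encoding      # BOM only: the BOM decides
    if not _compatible(declared, bom_encoding):
        raise ValueError(
            "declared encoding %r conflicts with BOM for %r"
            % (declared, bom_encoding)
        )
    return bom_encoding          # both present and in agreement
```

With this sketch, `resolve_encoding(codecs.BOM_UTF8 + b"hi", "UTF-8")` yields `"utf-8"`, while labelling UTF-32 data that starts with a BOM as ISO-8859-1 raises `ValueError`, which is the mismatch case from the message above.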