Thanks for the dozens of responses discussing how consumers handle the UTF-8 BOM. That is actually not what I'm concerned with, as I have to take it as a given that there is both software that wants a UTF-8 BOM and software that doesn't want it.
Could we evaluate the need for separate identifiers for producing or describing UTF-8 with and without BOM, or viable alternatives, for use in control input to a file encoding converter program or encoding checker program?

Thanks,
Joseph

-----Original Message-----
From: Mark Davis [mailto:mark.davis@;jtcsv.com]
Sent: Sunday, November 03, 2002 12:25 PM
To: Doug Ewell; Unicode Mailing List
Cc: Murray Sargent; Joseph Boyle
Subject: Re: Names for UTF-8 with and without BOM

Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents).

I agree that when the UTC decides that a BOM is *only* to be used as a signature, and that it would be ok to delete it anywhere in a document (like a non-character), then we are in much better shape. This was, as a matter of fact proposed for 3.2, but not approved. If we did that for 4.0, then there would be much less reason to distinguish UTF-8 'withBOM' from UTF-8 'withoutBOM'.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM

> Mark Davis <mark dot davis at jtcsv dot com> wrote:
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP or they could be a BOM. The label doesn't
> > tell you.
>
> I have never understood under what circumstances a ZWNBSP would ever
> appear as the first character of a file. It wouldn't make any sense.
> A ZWNBSP prevents a word break between the preceding and following
> characters. If there *is* no preceding character, then what is the
> point of the ZWNBSP?
>
> Every time this topic comes up, I have asked why a true ZWNBSP would
> ever appear as the first character of a file. The only responses I've
> heard are:
>
> 1. It might not be a discrete file, but the second (or successive)
> piece of a file that was split up for some reason (transmission,
> etc.).
>
> In that case, the interpreting process should take its encoding cue
> from the first fragment, and should NEVER reinterpret fragments broken
> up at arbitrary points. (Imagine a process modifying a GIF or JPEG
> file, or converting CR/LF, based on fragments!) But this is not the
> point being discussed anyway; the point is whole files.
>
> 2. It could happen; Unicode allows any character to appear anywhere.
>
> Well, almost anywhere. But even so, the likelihood of a U+FEFF as
> ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
> small compared to the likelihood that the U+FEFF was intended to be a
> signature. The rare case is just too rare to invalidate the heuristic
> for the much more common case.
>
> In addition, as Michka points out, we now have U+2060 WORD JOINER,
> whose entire purpose in life is to be used as U+FEFF was formerly
> used, as a ZWNBSP. Any new Unicode text should use U+2060 and not
> U+FEFF as a word joiner. It's hard to imagine that UTC and WG2 would
> have standardized this if there was a lot of real-world text that used
> U+FEFF as ZWNBSP.
>
> -Doug Ewell
> Fullerton, California
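
For what it's worth, here is a minimal Python sketch of the kind of converter/checker control input asked about at the top of this thread. The labels UTF-8-withBOM and UTF-8-withoutBOM and the functions describe() and convert() are hypothetical names made up for illustration; they are not registered charset identifiers. Python's own codec names "utf-8" and "utf-8-sig" happen to draw the same distinction. The sketch assumes a leading U+FEFF is always a signature, which is the heuristic Doug argues for above and exactly the assumption Mark objects to, so treat it as one possible policy rather than a settled answer.

```python
import codecs

# Hypothetical control-input labels (assumptions, not registered names).
UTF8_WITH_BOM = "UTF-8-withBOM"
UTF8_WITHOUT_BOM = "UTF-8-withoutBOM"


def describe(data: bytes) -> str:
    """Checker: say which hypothetical label fits a UTF-8 byte stream."""
    # Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
    data.decode("utf-8")
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return UTF8_WITH_BOM
    return UTF8_WITHOUT_BOM


def convert(data: bytes, target: str) -> bytes:
    """Converter: re-emit UTF-8 data with or without a signature."""
    if target not in (UTF8_WITH_BOM, UTF8_WITHOUT_BOM):
        raise ValueError(f"unknown target label: {target!r}")
    # "utf-8-sig" drops one leading U+FEFF on decode if present.
    # Assumption: that character was a signature, not a content ZWNBSP.
    text = data.decode("utf-8-sig")
    # On encode, "utf-8-sig" prepends the signature; "utf-8" does not.
    codec = "utf-8-sig" if target == UTF8_WITH_BOM else "utf-8"
    return text.encode(codec)
```

A caller would pass the label as control input, for example convert(raw_bytes, "UTF-8-withoutBOM") to strip a signature before handing the data to software that does not want one. If a leading U+FEFF could be real content, as Mark argues, the converter would need a separate option to say so, since the bytes alone cannot distinguish the two cases.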