PRODUCING and DESCRIBING UTF-8 with and without BOM

Joseph Boyle Mon, 04 Nov 2002 07:27:38 -0800

Thanks for the dozens of responses discussing consumers' behavior on UTF-8
BOM. This is actually not what I'm concerned with, as I have to take it as a
given that there is both software that wants UTF-8 BOM and software that
doesn't want it.


Could we instead evaluate the need for separate identifiers for producing or
describing UTF-8 with and without BOM, or viable alternatives, for use as
control input to a file encoding converter program or encoding checker
program?
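
To make the request concrete, here is a minimal sketch of what such control
input could look like. The identifier strings "UTF-8-withBOM" and
"UTF-8-withoutBOM" and the function names produce() and describe() are
hypothetical, chosen only for illustration; no registered names are implied.
The sketch relies on Python's standard "utf-8" and "utf-8-sig" codecs.

```python
import codecs

# Hypothetical identifiers; nothing in this thread settles on actual names.
PRODUCERS = {
    "UTF-8-withBOM": "utf-8-sig",    # Python codec that writes EF BB BF first
    "UTF-8-withoutBOM": "utf-8",     # plain UTF-8, no signature
}


def produce(text: str, label: str) -> bytes:
    """Converter side: encode text as the requested identifier demands."""
    return text.encode(PRODUCERS[label])


def describe(data: bytes) -> str:
    """Checker side: report which identifier describes the given bytes."""
    data.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-withBOM"
    return "UTF-8-withoutBOM"
```

With a single "UTF-8" identifier, the converter has no way to be told which
of the two byte sequences to emit, which is the gap I am asking about.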

Thanks, Joseph

-----Original Message-----
From: Mark Davis [mailto:mark.davis@;jtcsv.com] 
Sent: Sunday, November 03, 2002 12:25 PM
To: Doug Ewell; Unicode Mailing List
Cc: Murray Sargent; Joseph Boyle
Subject: Re: Names for UTF-8 with and without BOM


There is little probability that a right double quote would appear at the
start of a document either. That doesn't mean that you are free to delete it
(*and* say that you are not modifying the contents).

I agree that when the UTC decides that a BOM is *only* to be used as a
signature, and that it would be ok to delete it anywhere in a document (like
a non-character), then we are in much better shape. This was, as a matter of
fact, proposed for 3.2, but not approved. If we did that for 4.0, then there
would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
'withoutBOM'.

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent"
<[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM


> Mark Davis <mark dot davis at jtcsv dot com> wrote:
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP, or they could be a BOM. The label doesn't
> > tell you.
>
> I have never understood under what circumstances a ZWNBSP would ever 
> appear as the first character of a file.  It wouldn't make any sense.  
> A ZWNBSP prevents a word break between the preceding and following 
> characters.  If there *is* no preceding character, then what is the 
> point of the ZWNBSP?
>
> Every time this topic comes up, I have asked why a true ZWNBSP would 
> ever appear as the first character of a file.  The only responses I've 
> heard are:
>
> 1.  It might not be a discrete file, but the second (or successive) 
> piece of a file that was split up for some reason (transmission, 
> etc.).
>
> In that case, the interpreting process should take its encoding cue 
> from the first fragment, and should NEVER reinterpret fragments broken 
> up at arbitrary points.  (Imagine a process modifying a GIF or JPEG 
> file, or converting CR/LF, based on fragments!)  But this is not the 
> point being discussed anyway; the point is whole files.
>
> 2.  It could happen; Unicode allows any character to appear anywhere.
>
> Well, almost anywhere.  But even so, the likelihood of a U+FEFF as 
> ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly 
> small compared to the likelihood that the U+FEFF was intended to be a 
> signature.  The rare case is just too rare to invalidate the heuristic 
> for the much more common case.
>
> In addition, as Michka points out, we now have U+2060 WORD JOINER, 
> whose entire purpose in life is to be used as U+FEFF was formerly 
> used, as a ZWNBSP.  Any new Unicode text should use U+2060 and not 
> U+FEFF as a word joiner.  It's hard to imagine that UTC and WG2 would 
> have standardized this if there was a lot of real-world text that used 
> U+FEFF as ZWNBSP.
>
> -Doug Ewell
>  Fullerton, California
>
>
>
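
The heuristic argued for in the quoted message (treat a leading U+FEFF in a
UTF-8 file as a signature, and leave U+FEFF anywhere else alone) can be
sketched as follows. This is only an illustration under that assumption; the
function name split_signature() is hypothetical. It returns a flag so that a
caller who shares Mark's concern can tell that three bytes were dropped and
restore them if needed.

```python
import codecs
from typing import Tuple


def split_signature(data: bytes) -> Tuple[bool, str]:
    """Decode UTF-8 bytes, treating one leading EF BB BF as a signature.

    Returns (had_signature, text). Any U+FEFF after the first character is
    kept as content, since only the initial position is ambiguous.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Assumption: a leading U+FEFF is a signature, not a real ZWNBSP.
        return True, data[len(codecs.BOM_UTF8):].decode("utf-8")
    return False, data.decode("utf-8")
```

Only one leading U+FEFF is removed; a second one immediately after it would
be kept as a content character.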
