Deepak Chand Rathore <deepakr at aztec dot soft dot net> wrote:

> But there is one concern. In some cases the UTF-8 byte stream starts
> with a BOM (e.g., when we read bytes from a text file saved with
> Notepad, using the UTF-8 option, on Windows 2000); the actual text
> starts only after the first few bytes (I suppose the first 3 bytes).
> So how do we detect whether the byte stream starts with a BOM or
> not? Do the first few bytes represent a BOM or the actual text?

What you are asking is, if a UTF-8 byte stream starts with the character
U+FEFF, should that character be treated as a signature (BOM) or as a
zero-width no-break space?

You'll probably get different responses to this, having to do with
tagging or streams broken in the middle.  My view is that a zero-width
no-break space has *no business* appearing at the start of a text
stream.  With no character to precede it, what would it prevent a break
between?  U+FEFF, or specifically the bytes EF BB BF, at the true start
of a UTF-8 stream should always be interpreted as a signature.
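
As for the detection itself, it comes down to comparing the first three
bytes against EF BB BF.  Here is a minimal sketch in Python, assuming
the bytes really do come from the true start of the stream; the function
name strip_utf8_signature is made up for illustration and is not part of
any library.

```python
import codecs
from typing import Tuple

# codecs.BOM_UTF8 is the three-byte sequence EF BB BF (U+FEFF in UTF-8).
UTF8_SIGNATURE = codecs.BOM_UTF8


def strip_utf8_signature(data: bytes) -> Tuple[bytes, bool]:
    """Hypothetical helper: return (payload, had_signature).

    Assumes `data` begins at the true start of the stream, so a leading
    EF BB BF is treated as a signature rather than as text.
    """
    if data.startswith(UTF8_SIGNATURE):
        # Drop the signature and report that it was present.
        return data[len(UTF8_SIGNATURE):], True
    # No signature: the first bytes are already actual text.
    return data, False


if __name__ == "__main__":
    payload, had_signature = strip_utf8_signature(b"\xef\xbb\xbfhello")
    print(payload, had_signature)
```

If the goal is only to decode the text, Python's built-in "utf-8-sig"
codec skips a leading signature when one is present, so an explicit
check like this is needed only when you want to know whether the
signature was there.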

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
 I don't speak for the Unicode Consortium.

