Deepak Chand Rathore <deepakr at aztec dot soft dot net> wrote:

> But, there is one concern. In some cases the utf8 byte stream starts
> with a BOM,( for eg. when we try reading bytes from a text file that
> is saved using notepad (using utf8 option )in WIN2k, after first few
> bytes( i suppose first 3 bytes), the actual text start.
> So how do we detect whether the byte stream starts with a BOM or
> not ??
> or the first few bytes represent BOM or the actual text ??

What you are asking is, if a UTF-8 byte stream starts with the character U+FEFF, should that character be treated as a signature (BOM) or as a zero-width no-break space?

You'll probably get different responses to this, having to do with tagging or streams broken in the middle. My view is that a zero-width no-break space has *no business* appearing at the start of a text stream. With no character to precede it, what would it prevent a break between?

U+FEFF, or specifically the bytes EF BB BF, at the true start of a UTF-8 stream should always be interpreted as a signature.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

I don't speak for the Unicode Consortium.
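
To illustrate the check described above, here is a minimal Python sketch, not a definitive implementation. It assumes the bytes really are taken from the true start of the stream; the function name strip_utf8_signature is made up for this example. It treats a leading EF BB BF as a signature and removes it, and leaves everything else untouched.

```python
import codecs
from typing import Tuple


def strip_utf8_signature(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, had_signature) for bytes read from the start of a stream.

    Hypothetical helper for illustration only. It assumes `data` begins at
    the true start of a UTF-8 stream, so a leading EF BB BF is treated as a
    signature rather than as a zero-width no-break space.
    """
    signature = codecs.BOM_UTF8  # the three bytes b"\xef\xbb\xbf"
    if data.startswith(signature):
        return data[len(signature):], True
    return data, False


if __name__ == "__main__":
    payload, had_signature = strip_utf8_signature(b"\xef\xbb\xbfhello")
    print(payload, had_signature)
```

If the goal is only to decode the text, Python's standard "utf-8-sig" codec does the same thing in one step: data.decode("utf-8-sig") drops a leading signature when one is present and otherwise behaves like plain UTF-8.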

