* Mark Tolonen:

"Terry Reedy" <tjre...@udel.edu> wrote in message news:hnjkuo$n1...@dough.gmane.org...
On 3/14/2010 4:40 PM, Guillermo wrote:
Adding the byte that some call a 'utf-8 bom' makes the file an invalid utf-8 file.

Not true.  From http://unicode.org/faq/utf_bom.html:

Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:
Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
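
The FAQ's statement that the BOM bytes are simply "whatever the Unicode character FEFF is converted into" can be checked from Python. This is only a small sketch: the helper name bom_bytes is made up for illustration, and the explicit -be/-le codec names are used because the plain "utf-16" and "utf-32" codecs prepend a BOM of their own.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF, the character the FAQ refers to


def bom_bytes(encoding):
    """Return the bytes U+FEFF turns into under the given encoding.

    Hypothetical helper, not part of any library.
    """
    return BOM_CHAR.encode(encoding)


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex, e.g. 'EF BB BF'."""
    return " ".join(format(b, "02X") for b in data)


if __name__ == "__main__":
    # Explicit endianness, so the codec does not add its own BOM.
    for name in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"):
        print(hex_bytes(bom_bytes(name)).ljust(14), name)
```

The standard codecs module also exposes the same byte sequences as constants such as codecs.BOM_UTF8 and codecs.BOM_UTF16_LE.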

Well, technically true, and Terry was wrong to say "There is no such thing as a utf-8 'byte order mark'. The concept is an oxymoron." It's true that as a descriptive term "byte order mark" is an oxymoron for UTF-8. But in this particular context it's not a descriptive term, and it's not only technically allowed, as you point out, but sometimes required.

However, some tools are unable to process UTF-8 files with a BOM.

The most annoying example is the GCC compiler suite, in particular g++, which in its Windows MinGW manifestation insists on UTF-8 source code without a BOM, while Microsoft's compiler needs the BOM to recognize the file as UTF-8. The only way I found to satisfy both compilers is to preprocess the source code, apart from restricting the source to ASCII, or perhaps to Windows ANSI with wide character literals restricted to ASCII (exploiting a bug in g++ that lets it handle narrow character literals with non-ASCII chars). But preprocessing is not a general solution, since the g++ preprocessor, via another bug, accepts some constructs (which then compile nicely) that the compiler doesn't accept when explicit preprocessing isn't used. So it's a mess.
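
For tools that choke on the BOM, one possible workaround (not the C++ preprocessing approach described above, just a sketch under the assumption that a build step may rewrite copies of the files) is to add or strip the three UTF-8 BOM bytes before handing the file to each tool. The function names strip_utf8_bom and add_utf8_bom are hypothetical.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_utf8_bom(data):
    """Return data with exactly one leading UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


if __name__ == "__main__":
    source = codecs.BOM_UTF8 + u"// bl\u00e5b\u00e6r\n".encode("utf-8")
    print(repr(strip_utf8_bom(source)))  # copy for a tool that rejects the BOM
    print(repr(add_utf8_bom(source)))    # copy for a tool that requires the BOM
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec does the same job: it skips a leading BOM on decoding and writes one on encoding.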


- Alf
