* Mark Tolonen:
"Terry Reedy" <tjre...@udel.edu> wrote in message
news:hnjkuo$n1...@dough.gmane.org...
On 3/14/2010 4:40 PM, Guillermo wrote:
Adding the byte that some call a 'utf-8 bom' makes the file an invalid
utf-8 file.
Not true. From http://unicode.org/faq/utf_bom.html:
Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text
is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising
the BOM will be whatever the Unicode character FEFF is converted into by
that transformation format. In that form, the BOM serves to indicate
both that it is a Unicode file, and which of the formats it is in.
Examples:
Bytes         Encoding Form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8
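
For illustration, the table above can be turned into a small BOM sniffer. This
is only a sketch, assuming Python 3; the name sniff_bom is made up for this
example, while the codecs.BOM_* constants are from the standard library. Note
that the UTF-32 signatures have to be tested before the UTF-16 ones, because
FF FE 00 00 also starts with FF FE.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE
# BOM (FF FE), so the 4-byte signatures are checked first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper: it only looks at the leading bytes, so UTF-16 LE
    text that starts with U+0000 right after its BOM is reported as UTF-32 LE.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# As the FAQ says, each signature is just U+FEFF pushed through the
# corresponding transformation format.
for bom, name in _BOMS:
    assert "\ufeff".encode(name) == bom
```

A caller would typically read the first four bytes of a file in binary mode,
pass them to sniff_bom, and fall back to some default encoding when the result
is (None, 0).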
Well, technically true, and Terry was wrong to say "There is no such thing as a
utf-8 'byte order mark'. The concept is an oxymoron." It's true that, as a
descriptive term, "byte order mark" is an oxymoron for UTF-8. But in this
particular context it's not a descriptive term, and the BOM is not only
technically allowed, as you point out, but sometimes required.
However, some tools are unable to process UTF-8 files with BOM.
The most annoying example is the GCC compiler suite, in particular g++, which in
its Windows MinGW manifestation insists on UTF-8 source code without BOM, while
Microsoft's compiler needs the BOM to recognize the file as UTF-8. The only way
I found to satisfy both compilers is to preprocess the source code, apart from
restricting the source to ASCII, or perhaps to Windows ANSI with wide character
literals restricted to ASCII (exploiting a bug in g++ that lets it handle narrow
character literals with non-ASCII chars). But preprocessing is not a general
solution, since the g++ preprocessor, via another bug, accepts some constructs
(which then compile nicely) that the compiler doesn't accept when explicit
preprocessing isn't used. So it's a mess.
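
For tools that differ only in whether they want the signature, a script run
before the build can produce both variants of a file. This is not the
preprocessing approach described above, just a rough Python 3 sketch of
adding and stripping the UTF-8 BOM; the function names are invented for this
example, and it assumes the input really is UTF-8.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_utf8_bom(data):
    """Return data with exactly one leading EF BB BF."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


# Hypothetical usage: the same source text in both flavours.
source = codecs.BOM_UTF8 + 'const char* s = "blåbær";\n'.encode("utf-8")
without_bom = strip_utf8_bom(source)   # for a tool that rejects the BOM
with_bom = add_utf8_bom(without_bom)   # for a tool that requires the BOM
```

When only text is needed, Python's "utf-8-sig" codec does the same job at the
decoding level: it skips the signature on decode if present, and writes it on
encode.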
Cheers,
- Alf