Names for UTF-8 with and without BOM

Joseph Boyle Fri, 01 Nov 2002 12:37:33 -0800

It would be useful to have official names to distinguish UTF-8 with and
without BOM (or rather three names: with, without, and agnostic). Here are a
couple of cases I'm currently involved with:


* I'm writing an encoding checker to validate a long list of text file
formats we use internally. HTML and XML only count as one format each; most
cases are file formats originated by one of our development groups without
regard to encoding issues, and which we've now tried to standardize on UTF-8
with BOM to distinguish from ASCII or codepage legacy files while still
allowing legacy files to work. In the list of file formats, the encoding
constraint field needs to distinguish UTF-8 with BOM from UTF-8 without BOM.
* We need an encoding conversion tool for text files that can output both
UTF-8 with BOM and UTF-8 without BOM. Current tools like ICU's uconv do not
support output of UTF-8 with BOM. It would be possible to add a switch
for BOM/no BOM distinct from the output charset specifier, but this
is an ugly solution, as no other encoding needs it: not even UTF-16
and UTF-32, which have separate charset names for the with-BOM and
without-BOM variants. I've discussed this with Markus Scherer, who would also
prefer distinct charset names as the means to distinguish BOM and no-BOM.
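
To make the conversion case concrete, here is a minimal sketch in Python of a converter whose output variant is selected purely by charset name. It relies on Python's own codec names, where "utf-8" writes no BOM and "utf-8-sig" writes one; those names are Python's convention, not an official registration. The function name convert_to_utf8 and the decision to drop a leading U+FEFF from the decoded input are assumptions made for illustration only.

```python
def convert_to_utf8(data: bytes, source_encoding: str, with_bom: bool) -> bytes:
    """Re-encode `data` as UTF-8, with or without a leading BOM.

    Hypothetical helper: the name and the BOM handling are illustrative.
    """
    text = data.decode(source_encoding)
    # Assumption: a leading U+FEFF in the source is a signature, not content,
    # so it is dropped here to avoid emitting two BOMs.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Python's "utf-8-sig" codec prepends EF BB BF; plain "utf-8" does not.
    return text.encode("utf-8-sig" if with_bom else "utf-8")


if __name__ == "__main__":
    legacy = "caf\u00e9".encode("cp1252")
    print(convert_to_utf8(legacy, "cp1252", with_bom=True))
    print(convert_to_utf8(legacy, "cp1252", with_bom=False))
```

The point of the sketch is that the caller picks the variant through the codec name alone, which is the behavior distinct charset names would give a tool like uconv.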


Mark Davis introduced UTF-8N for UTF-8 with no BOM a couple of years ago,
and it seems to have some currency, especially on Japanese sites for some
reason. This is the only convention I can find, and I might adopt it if
nothing else is available. However, it does not seem to have any official
status with the Unicode Consortium or the IETF, and while making UTF-8 mean
with-BOM would be convenient enough for us internally, I am sure some other
users would object strongly.

How about if we let UTF-8 keep its current status as neither requiring nor
forbidding BOM, make UTF-8N official for no-BOM, and coin another name for
with-BOM? Let's call it UTF-8BOM for the moment. Behavior for each would be:

            UTF-8BOM      UTF-8N          UTF-8
Producers   Produce BOM   Don't produce   Optional (higher protocols using UTF-8 can recommend)
Consumers   Consume BOM   Don't consume   Should probably strip BOM since initial ZWNBSP not likely
Checkers    Require BOM   Forbid          Optional (higher protocols using UTF-8 can forbid or require)
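
The table above can be sketched in Python roughly as follows. This is only an illustration of the proposed semantics: the names "UTF-8BOM" and "UTF-8N" are the proposals from this message, not registered charsets, and the functions produce, consume, and check are hypothetical. For plain "UTF-8", where the BOM is optional, the producer here defaults to no BOM unless told otherwise; that default is an assumption.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF
NAMES = ("UTF-8BOM", "UTF-8N", "UTF-8")  # proposed names, not official


def _validate(name: str) -> None:
    if name not in NAMES:
        raise ValueError(f"unknown charset name: {name!r}")


def produce(text: str, name: str, bom_for_plain: bool = False) -> bytes:
    """Producer row: UTF-8BOM writes a BOM, UTF-8N never does, UTF-8 may."""
    _validate(name)
    body = text.encode("utf-8")
    if name == "UTF-8BOM" or (name == "UTF-8" and bom_for_plain):
        return BOM + body
    return body


def consume(data: bytes, name: str) -> str:
    """Consumer row: strip a leading BOM except under UTF-8N."""
    _validate(name)
    if name != "UTF-8N" and data.startswith(BOM):
        data = data[len(BOM):]
    # Under UTF-8N a leading EF BB BF is kept and decodes to U+FEFF (ZWNBSP).
    return data.decode("utf-8")


def check(data: bytes, name: str) -> bool:
    """Checker row: require, forbid, or ignore the BOM; always need valid UTF-8."""
    _validate(name)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    has_bom = data.startswith(BOM)
    if name == "UTF-8BOM":
        return has_bom
    if name == "UTF-8N":
        return not has_bom
    return True  # plain UTF-8: BOM optional
```

A higher-level protocol that wants to forbid or require the BOM for plain UTF-8 could simply call check with "UTF-8N" or "UTF-8BOM" instead.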


(I realize UTF-8 Byte Order Mark is an oxymoron; however, BOM is established
and shorter to type than "signature", and it does not cause confusion unless
you run into release managers talking about Bills Of Materials. Perhaps it
is time to think of three other words starting with B, O, M that make a
better explanation.)
