It would be useful to have official names to distinguish UTF-8 with and without BOM (or three names: with BOM, without BOM, and agnostic). Here are a couple of examples I'm currently involved with:
* I'm writing an encoding checker to validate a long list of text file formats we use internally. HTML and XML only count as one format each; most cases are file formats originated by one of our development groups without regard to encoding issues, which we've now tried to standardize on UTF-8 with BOM. The BOM distinguishes the new files from ASCII or codepage legacy files while still allowing legacy files to work. In the list of file formats, the encoding constraint field needs to distinguish UTF-8 with BOM from UTF-8 without BOM.

* We need an encoding conversion tool for text files that can output both UTF-8 with BOM and UTF-8 without BOM. Current tools like ICU's uconv do not support output of UTF-8 with BOM. It would be possible to add an input switch for BOM/no BOM distinct from the output charset specifier, but this is an ugly solution, as it is not needed for any other encoding, not even UTF-16 and UTF-32, which have separate charset names for the with-BOM and without-BOM variants. I've discussed this with Markus Scherer, who would also prefer distinct charset names as the means to distinguish BOM and no-BOM.

Mark Davis introduced UTF-8N for UTF-8 with no BOM a couple of years ago, which seems to have some currency, especially on Japanese sites for some reason. This is the only convention I can find, and I might adopt it if nothing else is available. However, it does not seem to have any official status with the Unicode Consortium or the IETF, and while making UTF-8 mean with-BOM would be convenient enough for us internally, I am sure some other users would object strongly.

How about if we let UTF-8 keep its current status as neither requiring nor forbidding BOM, make UTF-8N official for no-BOM, and coin another name for with-BOM? Let's call it UTF-8BOM for the moment.
Behavior for each would be:

            UTF-8BOM      UTF-8N          UTF-8
Producers   Produce BOM   Don't produce   Optional (higher protocols using UTF-8 can recommend)
Consumers   Consume BOM   Don't consume   Should probably strip BOM since initial ZWNBSP not likely
Checkers    Require BOM   Forbid          Optional (higher protocols using UTF-8 can forbid or require)

(I realize UTF-8 Byte Order Mark is an oxymoron, however BOM is established and shorter to type than "signature", and does not cause confusion unless you run into release managers talking about Bills Of Materials. Perhaps it is time to think of three other words starting with B, O, M that make a better explanation.)
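
To make the producer/consumer/checker behavior proposed above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: UTF-8BOM and UTF-8N are the proposed labels, not registered charset names, and the function names check, encode, and decode as well as the bom_for_plain parameter are invented for this example. It relies on Python's standard "utf-8" and "utf-8-sig" codecs.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Hypothetical policy table keyed by the proposed (unofficial) labels.
REQUIRE, FORBID, OPTIONAL = "require", "forbid", "optional"
BOM_POLICY = {"UTF-8BOM": REQUIRE, "UTF-8N": FORBID, "UTF-8": OPTIONAL}


def check(data: bytes, charset: str) -> bool:
    """Checker: data must be well-formed UTF-8 and obey the BOM rule."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    has_bom = data.startswith(BOM)
    policy = BOM_POLICY[charset]
    if policy == REQUIRE:
        return has_bom
    if policy == FORBID:
        return not has_bom
    return True  # plain UTF-8: BOM neither required nor forbidden


def encode(text: str, charset: str, bom_for_plain: bool = False) -> bytes:
    """Producer: emit a BOM for UTF-8BOM, never for UTF-8N.

    For plain UTF-8 the choice is left to the caller (bom_for_plain),
    standing in for whatever a higher protocol recommends.
    """
    policy = BOM_POLICY[charset]
    emit = policy == REQUIRE or (policy == OPTIONAL and bom_for_plain)
    return (BOM if emit else b"") + text.encode("utf-8")


def decode(data: bytes, charset: str) -> str:
    """Consumer: strip a leading BOM except under UTF-8N."""
    if BOM_POLICY[charset] == FORBID:
        # Not consumed: a leading EF BB BF stays in the text as U+FEFF.
        return data.decode("utf-8")
    # "utf-8-sig" drops one leading BOM if present, else acts like "utf-8".
    return data.decode("utf-8-sig")
```

A converter along the lines of the second example above could then select the output form purely by charset name, e.g. encode(text, "UTF-8BOM") versus encode(text, "UTF-8N"), without a separate BOM switch.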