On 27/06/10 21:21, Benjamin R. Haskell wrote:
On Sun, 27 Jun 2010, Tony Mechelynck wrote:

On 03/05/10 23:45, Lech Lorens wrote:
[...]
I might be totally wrong basing my understanding of BOM and
character sets mainly on Wikipedia, but I thought that setting
'bomb' for utf-8 encoded files (which does not pose a risk of
misinterpreting the contents due to endianness difference) didn't
make much sense. For utf-16 that would be another thing.

http://en.wikipedia.org/wiki/Byte-order_mark


Notwithstanding its name, the BOM provides more than just endianness
detection. Actually, it is an "encoding signal" which allows detecting
all five of the following encodings, assuming a UTF-16le file won't
start with a NUL (otherwise FF FE 00 00 would be ambiguous between
UTF-32le and UTF-16le):

utf-16be    FE FF
utf-16le    FF FE
utf-8       EF BB BF
utf-32be    00 00 FE FF
utf-32le    FF FE 00 00
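
As an illustration of the table above, here is a minimal Python sketch of BOM sniffing. The names BOM_TABLE and sniff_bom are made up for this example; only the codecs.BOM_* constants come from the standard library. It makes the same assumption as above: a file starting FF FE 00 00 is taken to be UTF-32le, not UTF-16le starting with a NUL.

```python
import codecs

# Longer signatures are tested first: the UTF-32le BOM (FF FE 00 00)
# begins with the UTF-16le BOM (FF FE), so the order matters.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
]


def sniff_bom(head: bytes):
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in BOM_TABLE:
        if head.startswith(bom):
            return name
    return None
```

Passing the first four bytes of a file is enough; a None result only means "no BOM", not "not Unicode".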

For instance, when I was still on XP, I noticed that WordPad could
read UTF-8 files but only if they started with a BOM. When writing
what it called "Unicode", what it produced was UTF-16le with BOM.

Any file starting 0xEF 0xBB 0xBF can be assumed to be in UTF-8.
Distinguishing UTF-8 from Latin1 or Windows-1252 would otherwise
require scanning the whole file, checking for invalid UTF-8 byte
sequences.
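
The whole-file scan mentioned above can be sketched in Python as a strict decode. The function name looks_like_utf8 is hypothetical, and this is only a heuristic: a pure US-ASCII file is valid in all three encodings, so a True result does not prove the author meant UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the whole byte string is valid UTF-8."""
    try:
        # bytes.decode() uses strict error handling by default, so any
        # invalid byte sequence raises UnicodeDecodeError.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, "café" encoded as Latin1 ends in the lone byte 0xE9, which is not a valid UTF-8 sequence, so the check fails for it, while the UTF-8 encoding of the same word passes.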

Quoting the same Wikipedia article Lech mentioned:

"While [the] Unicode standard allows BOM in UTF-8, it does not require
or recommend it."

and paraphrasing the rest of that paragraph:

Using a BOM as the first character of a UTF-8-encoded file can cause
problems with the shebang line[1] in Unix-like systems.  And
UTF-8-capable software is often written to assume UTF-8 unless otherwise
directed, so the U+FEFF character at the start of the stream is often
interpreted incorrectly.

The Unicode UTF-{8,16,32} & BOM FAQ probably worded it better than
Wikipedia or I[2].


Yes, a UTF-8 BOM will interfere with any software that has no knowledge of Unicode and expects some particular "magic bytes" at the start, or simply won't accept 0xEF 0xBB 0xBF at the start of a document. The #! shebang is just one example.
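
To make the shebang case concrete, here is a small Python sketch. The has_shebang function is a stand-in written for this example; it only approximates the "first two bytes are #!" test that a Unix-like system applies, and is not taken from any real loader.

```python
import codecs

script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script  # EF BB BF now precedes "#!"


def has_shebang(data: bytes) -> bool:
    """Approximate the magic-bytes test: the file must start with '#!'."""
    return data[:2] == b"#!"
```

With the BOM prepended, the first two bytes are EF BB rather than "#!", so has_shebang(script_with_bom) is False even though the text looks identical in a Unicode-aware editor.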

OTOH, in filetypes where UTF-8 is but one possibility among many, the BOM is useful to specify the encoding or to confirm what was set otherwise. Examples:

- HTML charset can be set by the HTTP "Content-Type" header (in an HTTP or HTTPS transaction external to the file), in a <meta http-equiv="Content-Type" content="text/html; charset=something"> tag (replacing "something" by the charset) within the <head> section, or by a BOM. There are even official priority rules that tell browsers what to do when two or three of the above are present (and they are necessary because, I'm told, some braindead hosts will send "Content-Type: text/html; charset=iso-8859-1" for any *.htm or *.html file regardless of BOM or <meta> tags).

- CSS charset can be set by a BOM.

- XML charset can be set (IIRC) by the encoding given in the <?xml ... ?> declaration line or by a BOM.

- XHTML is both HTML and XML so the methods of both apply to it.

Personally I use the following rules of thumb:

- Add a BOM to Unicode files meant for use by a browser.
- Don't add it to UTF-8 files mostly in US-ASCII (possibly with codepoints above 0x7F in literals and comments) if they're meant for use by a shell, the 'make' utility, or a compiler.
- Some Windows programs won't read UTF-8 correctly unless a BOM is present.
- On Windows, when a system file is said to be in 'Unicode' that usually means UTF-16le with BOM.
- Vim helpfiles in a single directory must either all have a BOM, or (recommended) all lack a BOM. If some have one and others not, the ":helptags" command will abort with an error.

This does not explicitly cover all cases; when it doesn't (or in the cases where some of the above rules conflict), I proceed by analogy and by trial and error.
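
For applying these rules outside Vim (where setting 'bomb' or 'nobomb' before writing does the job), a minimal Python sketch of adding and stripping a UTF-8 BOM could look like this. The two function names are invented for the example; codecs.BOM_UTF8 is from the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM unless the data already starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

Both functions leave the input alone when there is nothing to do, so they can be run repeatedly over the same files, e.g. to bring every helpfile in one directory to the same state. When working with text rather than bytes, Python's "utf-8-sig" codec does the same thing: it writes a BOM when encoding and skips one when decoding.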


Best regards,
Tony.
--
One man's brain plus one other will produce one half as many ideas as
one man would have produced alone.  These two plus two more will
produce half again as many ideas.  These four plus four more begin to
represent a creative meeting, and the ratio changes to one quarter as
many ...
                -- Anthony Chevins

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
