On 18/05/08 18:41, Ilya Bobir wrote:
> Adri Verhoef wrote:
>> [...]
>>
>> Now do:
>>      :se fileencoding=utf8 bomb
>> [...]
>
> Note, that you probably do not want to use BOM with UTF-8.
> See http://unicode.org/faq/utf_bom.html#29 (Q: Can a UTF-8 data stream
> contain the BOM character (in UTF-8 form)? If yes, then can I still
> assume the remaining UTF-8 bytes are in big-endian order?)
>
> BOM is needed for UTF-16 and UTF-32.

The BOM can also be used in UTF-8, not to determine endianness (which is 
not relevant for UTF-8 -- one could argue that UTF-8 is always 
big-endian) but to distinguish UTF-8 from other encodings including 
UTF-16 and UTF-32. In UTF-8 as in other Unicode encodings, the BOM is 
the codepoint U+FEFF, which is represented in UTF-8 by the three bytes 
EF BB BF in that order.
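
To make the above concrete, here is a minimal Python sketch of how a
reader might use a leading BOM to tell the Unicode encodings apart. The
helper name sniff_bom is hypothetical, invented for this illustration;
only the codecs.BOM_* constants come from the Python standard library.
Real programs (Vim included) use more elaborate detection than this.

```python
import codecs

# U+FEFF encoded in UTF-8 is the three bytes EF BB BF.
bom_utf8 = "\ufeff".encode("utf-8")


def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading BOM (hypothetical helper)."""
    # UTF-32 must be checked before UTF-16, because the UTF-32 LE BOM
    # (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    candidates = (
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-8", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    )
    for name, bom in candidates:
        if data.startswith(bom):
            return name
    # No BOM: the encoding has to be determined some other way.
    return "unknown"
```

Without a BOM the function cannot decide anything, which is exactly the
situation where a UTF-8 BOM helps a reader that would otherwise assume
some 8-bit encoding.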

For instance: when I still had Win XP, I noticed that WordPad on that 
system could _read_ UTF-8 files correctly if they had a BOM, even 
though, when instructed to write a "Unicode text file", it would always 
_write_ in UTF-16 (or maybe UCS-2) little-endian.

A BOM in UTF-8 form could conceivably appear in the middle of a UTF-8 
data stream, where it represents U+FEFF in its deprecated function as 
ZERO WIDTH NO-BREAK SPACE. Another codepoint, U+2060 WORD JOINER (in 
UTF-8: E2 81 A0), is, however, nowadays preferred in this function.
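
As a small illustration of the difference between a leading U+FEFF and
one in the middle of the stream, here is a Python sketch. It assumes
Python's standard "utf-8-sig" codec, which strips one BOM at the start
of the data and nothing else; the sample bytes are made up for the
example.

```python
# EF BB BF at the start acts as a BOM; the second EF BB BF sits in the
# middle of the data and is an ordinary U+FEFF character.
raw = b"\xef\xbb\xbfab\xef\xbb\xbfcd"

# "utf-8-sig" removes only a single leading BOM.
text_sig = raw.decode("utf-8-sig")

# Plain "utf-8" keeps every U+FEFF, including the leading one.
text_plain = raw.decode("utf-8")
```

After decoding with "utf-8-sig", the U+FEFF in the middle is still
present in the text, so a program has to treat it as a character and
not as a signature.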

Best regards,
Tony.
-- 
TERRY GILLIAM PLAYED: PATSY (ARTHUR'S TRUSTY STEED), THE GREEN KNIGHT
                       SOOTHSAYER, BRIDGEKEEPER, SIR GAWAIN (THE FIRST TO BE
                       KILLED BY THE RABBIT)
                  "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD
