On 20/05/08 15:22, Mike Williams wrote:
> On 20/05/2008 12:06, Tony Mechelynck wrote:
>> On 19/05/08 23:01, Bram Moolenaar wrote:
>> [...]
>>> I'm not sure if Vim should detect (and remove) a BOM halfway through
>>> a file. You can get it with some filter commands and concatenating
>>> files. Perhaps we need a command ":delboms"? And ":delbombs" for
>>> people who can't remember the command name :-).
>>>
>> A BOM halfway through a file, if it is for the same encoding and
>> endianness as the file, is a valid (though deprecated) Unicode
>> codepoint, U+FEFF ZERO WIDTH NO-BREAK SPACE. Removing it could
>> conceivably "join" the adjoining words, which would have a bearing on
>> character shape in some scripts like Arabic or, IIUC, Devanagari. It
>> should therefore not be lightheartedly or thoughtlessly removed.
>>
>> A BOM halfway through a file, for the same encoding but the opposite
>> endianness to what comes before, has been suggested as an "endianness
>> change" marker, but IIUC this use never made it into the Unicode
>> standard. Yet it could happen if files of opposite endianness are
>> concatenated by mistake.
>
> The Unicode standard effectively defines it as follows (from section
> 16.8 - http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf). A BOM at
> the start of a file indicates the file encoding only where there is no
> external information on the encoding used. If there is external
> information that defines the file encoding, then the initial U+FEFF code
> point does not act as a BOM but as a ZERO WIDTH NO-BREAK SPACE. All
> occurrences of U+FEFF after the first codepoint are treated as
> ZERO WIDTH NO-BREAK SPACE; they do not signal endianness changes within
> the file.
>
> However, systems may define additional semantics, but those semantics
> would be specific to those systems and the serialized data would not be
> Unicode conformant.
>
> TTFN
>
> Mike
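To make the rule quoted above concrete, here is a minimal Python sketch of how a reader might apply it. It is only an illustration, not what Vim does: the helper name decode_with_bom_rule, its external_encoding parameter and the UTF-8 fallback when no BOM is found are all assumptions of mine. A leading BOM is consumed only when there is no external encoding information, and any later U+FEFF is left in the text as an ordinary character.

```python
import codecs

ZWNBSP = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# UTF-32 signatures are listed first because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom_rule(data, external_encoding=None):
    """Hypothetical helper: return (text, encoding_used).

    With external encoding information, a leading U+FEFF is not a BOM
    and stays in the text. Without it, a leading BOM selects the
    encoding and is dropped. U+FEFF anywhere else is never removed.
    """
    if external_encoding is not None:
        # These explicit codec names do not strip a leading U+FEFF.
        return data.decode(external_encoding), external_encoding
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name), name
    # Assumption: fall back to UTF-8 when there is no BOM at all.
    return data.decode("utf-8"), "utf-8"
```

For example, the bytes of a UTF-8 BOM followed by "a", U+FEFF and "b" decode to a three-character string when no encoding is supplied, and to a four-character string (with U+FEFF kept in front) when the caller passes "utf-8" explicitly.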
Note that a ZERO WIDTH NO-BREAK SPACE at the start of a file is
essentially a no-op. Other standards may thus say "disregard the BOM" in
favour of other encoding information, if present, without actually
contradicting the above. For instance, IIUC, a web page can have its
encoding defined in three different ways:

- via an HTTP "Content-Type" header with a "charset=" parameter;
- via a <meta http-equiv="Content-Type"> tag with "charset=" as part of
  its "content=" attribute;
- by a BOM.

Any number of these may be present, and (IIUC) W3C standards mandate the
order of priority in which they should be used.

Best regards,
Tony.
-- 
"I know the answer! The answer lies within the heart of all mankind!
The answer is twelve? I think I'm in the wrong building."
		-- Charles Schulz