Re: UTF-8 bomb showing up after :%!sort

Tony Mechelynck Tue, 20 May 2008 07:51:11 -0700

On 20/05/08 15:22, Mike Williams wrote:
> On 20/05/2008 12:06, Tony Mechelynck wrote:
>> On 19/05/08 23:01, Bram Moolenaar wrote:
>> [...]
>>> I'm not sure if Vim should detect (and remove) a BOM halfway through a file.
>>> You can get it with some filter commands and concatenating files.
>>> Perhaps we need a command ":delboms"?  And ":delbombs" for people who
>>> can't remember the command name :-).
>>>
>> A BOM halfway through a file, if it is for the same encoding and
>> endianness as the file, is a valid (though deprecated) Unicode codepoint,
>> U+FEFF ZERO-WIDTH NO-BREAK SPACE. Removing it could conceivably "join" the
>> adjoining words, which would affect character shaping in some scripts
>> like Arabic or, IIUC, Devanagari. It should therefore not be removed
>> lightheartedly or thoughtlessly.
>>
>> A BOM halfway through a file, for the same encoding but the opposite
>> endianness to what comes before, has been suggested as an "endianness
>> change" marker, but IIUC this use never made it into the Unicode standard.
>> Yet it could happen if files of opposite endianness are concatenated by
>> mistake.
>
> The Unicode standard effectively defines it as follows (from section
> 16.8 - http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf).  A BOM at
> the start of a file indicates the file encoding only where there is no
> external information on the encoding used.  If there is external
> information that defines the file encoding, then the initial U+FEFF code
> point does not act as a BOM but as a ZERO-WIDTH NO-BREAK SPACE.  All
> occurrences of U+FEFF after the first codepoint are treated as
> ZERO-WIDTH NO-BREAK SPACE; they do not signal endianness changes within
> the file.
>
> However, systems may define additional semantics, but those semantics
> would be specific to those systems and the serialized data would not be
> Unicode conformant.
>
> TTFN
>
> Mike
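
To make the rule quoted above concrete, here is a minimal Python sketch,
assuming two BOM-prefixed UTF-8 files have been concatenated (the situation a
filter command can produce). The helper names split_leading_bom and
interior_feff_offsets are hypothetical and are not part of Vim or of any
library. Only a U+FEFF at the very start is treated as a BOM; later ones are
reported and left in place.

```python
import codecs

ZWNBSP = "\ufeff"  # U+FEFF: a BOM at offset 0, ZERO-WIDTH NO-BREAK SPACE elsewhere


def split_leading_bom(text):
    """Return (had_bom, rest). Only a U+FEFF at index 0 is treated as a BOM."""
    if text.startswith(ZWNBSP):
        return True, text[1:]
    return False, text


def interior_feff_offsets(text):
    """Return the offsets of every U+FEFF that is not the first code point."""
    return [i for i, ch in enumerate(text) if ch == ZWNBSP and i > 0]


# Assumed input: the same BOM-prefixed UTF-8 file concatenated with itself.
part = codecs.BOM_UTF8 + "abc\n".encode("utf-8")
data = part + part

text = data.decode("utf-8")  # the plain "utf-8" codec keeps every U+FEFF
had_bom, rest = split_leading_bom(text)
interior = interior_feff_offsets(text)
```

Python's "utf-8-sig" codec behaves the same way on decoding: it drops one
leading BOM and leaves any later U+FEFF in the text.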


Note that a ZERO-WIDTH NO-BREAK SPACE at the start of a file is
essentially a no-op. Other standards may thus say "disregard the BOM"
in favour of other encoding info, if present, without effectively
contradicting the above. For instance, IIUC, a web page can have its
encoding defined in three different ways:

- via an HTTP "Content-Type" header with "charset=" attribute;
- via a <meta http-equiv="Content-Type"> tag with "charset=" as part of
  its "content=" attribute;
- by a BOM.

Any number of these may be present, and (IIUC) W3C standards specify the
order of priority in which they are to be used.
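
As a rough illustration of picking one encoding when several of these sources
are present, here is a minimal Python sketch. The function names
bom_encoding and choose_encoding are hypothetical. The priority order is left
as a parameter, and its default merely follows the order of the list above; it
is an assumption, not the order any standard actually mandates.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def bom_encoding(data):
    """Return the encoding named by a BOM at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(data, http_charset=None, meta_charset=None,
                    priority=("http", "meta", "bom"), default="utf-8"):
    """Return (source, encoding) for the first source in priority that is set.

    The default priority is an assumption made for this sketch only.
    """
    candidates = {
        "http": http_charset,
        "meta": meta_charset,
        "bom": bom_encoding(data),
    }
    for source in priority:
        if candidates[source]:
            return source, candidates[source]
    return "default", default


page = codecs.BOM_UTF8 + b"<p>hi</p>"
from_header = choose_encoding(page, http_charset="iso-8859-1")
from_bom = choose_encoding(page, http_charset="iso-8859-1",
                           priority=("bom", "http", "meta"))
```

Passing a different priority tuple is enough to model whichever order the
relevant specification turns out to require.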


Best regards,
Tony.
-- 
"I know the answer!  The answer lies within the heart of all mankind!
The answer is twelve?  I think I'm in the wrong building."
                -- Charles Schulz

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
