On 20/10/09 05:17, pansz wrote:
>
> Tony Mechelynck wrote:
>> As for using UTF-8 with BOM,
>> I have no statistics on it about what other people do, but I found it to
>> be (as the FAQ quoted above said) an excellent signature to mean that a
>> file is in UTF-8. This ought not to conflict with shell scripts, which
>> cannot have any BOM but are (normally) in 7-bit ASCII.
>
> UTF-8 came after UCS-2, so there must have been a good reason for a
> new encoding. By design, UTF-8 was meant to overcome some problems:
>
> 1. Compatibility with zero-terminated strings: no zero character '\0'
> in the middle of a string (UCS-2 normally has '\0' bytes inside the
> string, which breaks many existing functions).
>
> 2. Compatibility with Unix pipes: a file can be split into several
> files, and several files can be concatenated into a new file. (If a
> file contains a BOM, then when it is split in two the second file has
> no BOM; if two files each contain a BOM, then the concatenated file
> has a BOM inside its content.) The BOM makes it impossible to handle
> a text stream properly, hence you should *not* use a BOM on Unix-like
> systems.
>
>
> If you use Linux and insist on a BOM in UTF-8, you'll eventually hit the wall.
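
To make point 1 of the quoted text concrete, here is a minimal Python
sketch. The helper name contains_nul and the sample string are my own
choices for illustration, and "utf-16-le" stands in for UCS-2, which it
matches for characters in the Basic Multilingual Plane.

```python
def contains_nul(data: bytes) -> bool:
    """Return True if the byte string has an embedded zero byte."""
    return b"\x00" in data


# Sample text mixing ASCII and Chinese (an arbitrary example).
text = "abc 中文"

utf8_bytes = text.encode("utf-8")
# For BMP characters, UTF-16LE produces the same two bytes per
# character that UCS-2 (little-endian) would.
ucs2_bytes = text.encode("utf-16-le")

print(contains_nul(utf8_bytes))  # False: UTF-8 only emits 0x00 for U+0000
print(contains_nul(ucs2_bytes))  # True: each ASCII letter becomes XX 00
```

A C function such as strlen() would stop at the first of those zero
bytes, which is the breakage the quoted text refers to.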

I have already noticed personally (and the Unicode Consortium's FAQ also 
mentions) that use of a BOM conflicts with the #! shebang at the start 
of shell scripts. You call that "hitting the wall"? I just call that a 
warning that bash doesn't know about Unicode. My shell scripts are all 
in 7-bit ASCII, so where's the problem? For C source, I have no 
firsthand experience of whether gcc accepts a starting BOM or not, but I 
can always use "\u1234" in the middle of an ASCII string: again, no 
problem for me. I shall accept that I am helped by the fact that I don't 
write Chinese text into program sources or shell scripts; but I do 
occasionally use Chinese text in HTML, and there the presence of a BOM 
before the <!DOCTYPE and <html> lines has never caused me any trouble. I 
also occasionally use UTF-8 for *.txt files, and there I have actually 
found the BOM to be a help in making my browser and printer react the 
way I want them to.
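
As a rough illustration of the shebang conflict, here is a small Python
sketch. It assumes that the program loader recognizes a script only
when its very first two bytes are "#!"; the helper names and the sample
script are hypothetical, not taken from any particular tool.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Check whether the very first two bytes are '#!'."""
    return data.startswith(b"#!")


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A hypothetical two-line shell script, without and with a BOM.
script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script

print(starts_with_shebang(script))                    # True
print(starts_with_shebang(with_bom))                  # False: BOM comes first
print(starts_with_shebang(strip_utf8_bom(with_bom)))  # True
```

The three BOM bytes push "#!" away from offset 0, which is why keeping
scripts BOM-free (or in plain 7-bit ASCII) avoids the issue entirely.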

For concatenation of UTF-8 files (which I rarely use if ever) a U+FEFF 
codepoint somewhere in the middle MUST be interpreted as a zero-width 
no-break space, which is deprecated but legal and should not be a 
problem. If the presence of a zero-width no-break space at the start of 
a line other than the first creates problems, then I bet there are worse 
problems than that with either the file, the software handling it, or 
both. And if it is _not_ at the start of a line, then the preceding file 
was missing an end-of-line on its last line, which would have been a 
problem even without a U+FEFF after it.
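
For what it's worth, the concatenation case can be sketched in Python
as follows. This is only an illustration under my own assumptions: the
helper concat_keeping_one_bom is hypothetical, and the two "files" are
just byte strings built in memory.

```python
import codecs


def concat_keeping_one_bom(parts):
    """Concatenate UTF-8 byte strings, keeping a single leading BOM.

    Hypothetical helper: strips a leading BOM from every part, then
    puts one BOM back at the very start of the result.
    """
    bom = codecs.BOM_UTF8
    stripped = [p[len(bom):] if p.startswith(bom) else p for p in parts]
    return bom + b"".join(stripped)


first = codecs.BOM_UTF8 + "first\n".encode("utf-8")
second = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Naive concatenation, as with "cat first second": the second BOM
# ends up in the middle of the data.
naive = first + second
# The "utf-8-sig" codec removes only a BOM at the very start.
decoded = naive.decode("utf-8-sig")
print(repr(decoded))  # 'first\n\ufeffsecond\n'

clean = concat_keeping_one_bom([first, second]).decode("utf-8-sig")
print(repr(clean))    # 'first\nsecond\n'
```

In the naive result the leftover U+FEFF sits at the start of the second
line, where it is to be read as a zero-width no-break space.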

If most of your UTF-8 files are shell scripts or maybe C/C++ sources 
with Chinese literals and/or Chinese comments in them, then your 
requirements are other than mine, and quite possibly your solutions will 
be different too. You are entitled to your choices, but of course you 
should be conscious of what they imply, the way I try to remain 
conscious of what my choices imply.


Best regards,
Tony.
-- 
Never eat more than you can lift.
                -- Miss Piggy

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
