On 20/10/09 05:17, pansz wrote:
>
> Tony Mechelynck wrote:
>> As for using UTF-8 with BOM,
>> I have no statistics on it about what other people do, but I found it to
>> be (as the FAQ quoted above said) an excellent signature to mean that a
>> file is in UTF-8. This ought not to conflict with shell scripts, which
>> cannot have any BOM but are (normally) in 7-bit ASCII.
>
> utf-8 comes after ucs2, so there should be a good reason there must be a
> new encoding. by design, utf-8 should overcome some problems:
>
> 1. Zero-terminate string compatible: no zero character '\0' in the
> middle of string, (ucs2 normally have '\0' character inside the string
> which will break many existing functions)
>
> 2. Unix pipe compatible: a file can break into several files, some files
> can concatenate together to form a new file. (if file contains BOM, then
> when it break into two file the second file does not contain BOM. if two
> files contain BOM, then when they concatenate a new file the new file
> contains BOM inside its content. ) the BOM makes it impossible to handle
> text stream properly, hence you should *not* use BOM in unix-alike systems.
>
>
> If you use Linux and insist BOM in utf-8, you'll eventually hit the wall.
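
To make the two quoted points concrete, here is a minimal Python sketch.
It is an illustration only, not taken from Vim or from either poster's
setup; the helper name with_bom and the sample strings are hypothetical.
It assumes Python 3 and its standard "utf-8-sig" codec, which writes a
BOM when encoding and strips only a leading BOM when decoding.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def with_bom(text: str) -> bytes:
    """Encode text as UTF-8 with a leading BOM (hypothetical helper)."""
    return text.encode("utf-8-sig")


# Point 2: byte-level concatenation, as `cat first second` would do.
first = with_bom("one\n")
second = with_bom("two\n")
joined = first + second

# Only the BOM at the very start is treated as a signature; the second
# one survives decoding as the codepoint U+FEFF in the middle of the text.
decoded = joined.decode("utf-8-sig")

# Point 1: UTF-8 never produces a zero byte for text without NUL,
# whereas a 16-bit encoding does for every ASCII character.
sample = "vim"
has_nul_utf8 = b"\x00" in sample.encode("utf-8")
has_nul_utf16 = b"\x00" in sample.encode("utf-16-le")

print(repr(decoded), has_nul_utf8, has_nul_utf16)
```

In this sketch the decoded text keeps a U+FEFF at the start of its second
line, which is the situation discussed in the reply below.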

I have already noticed personally (and the Unicode Consortium's FAQ also
mentions) that use of a BOM conflicts with the #! shebang at the start
of shell scripts. You call that "hitting the wall"? I just call that a
warning that bash doesn't know about Unicode. My shell scripts are all
in 7-bit ASCII, so where's the problem?

For C source, I have no firsthand experience of whether gcc accepts a
starting BOM or not, but I can always use "\u1234" in the middle of an
ASCII string: again, no problem for me.

I shall accept that I am helped by the fact that I don't write Chinese
text into program sources or shell scripts; but I do occasionally use
Chinese text in HTML, and there the presence of a BOM before the
<!DOCTYPE and <html> lines has never caused me any trouble. I also
occasionally use UTF-8 for *.txt files, and there I have actually found
the BOM to be a help in making my browser and printer react the way I
want them to.

For concatenation of UTF-8 files (which I rarely use if ever) a U+FEFF
codepoint somewhere in the middle MUST be interpreted as a zero-width
no-break space, which is deprecated but legal and should not be a
problem. If the presence of a zero-width no-break space at the start of
a line other than the first creates problems, then I bet there are worse
problems than that with either the file, the software handling it, or
both. And if it is _not_ at the start of a line, then the preceding file
was missing an end-of-line on its last line, which would have been a
problem even without a U+FEFF after it.

If most of your UTF-8 files are shell scripts or maybe C/C++ sources
with Chinese literals and/or Chinese comments in them, then your
requirements are other than mine, and quite possibly your solutions will
be different too. You are entitled to your choices, but of course you
should be conscious of what they imply, the way I try to remain
conscious of what my choices imply.


Best regards,
Tony.
-- 
Never eat more than you can lift.
		-- Miss Piggy

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---