On Wed, Apr 24, 2002 at 01:37:39PM -0400, [EMAIL PROTECTED] wrote:
> Err, no. That's not the point, AFAIK. The point is that traditionally
> in UNIX there hasn't been any sort of "marker" or "tag" in the beginning,
> UNIX files being flat streams of bytes. The UNIX toolset has been built
> with this principle in mind. No metadata in the files. BOM breaks this.
Not at all true. Look at the head of a PNM file, a quintessentially Unix
file format. PNM, MP3 and PNG files all have metadata identifying them,
and they don't break under Unix systems.

> wc -c file1
>
> would have to skip the BOM to not get a wrong byte count.
>
> sort -o file5 file1
>
> would have to strip the BOM from file1 (but put it back into file5?)

The wrong byte count? wc -c file1 is basically meaningless on a Unicode
file, but at least you can assume it gives the _byte count_ (including
extraneous things like BOMs).

More importantly, how do these programs handle newlines? wc -l counts the
number of \x0A's in the file; sort splits the file based on \x0A. This
will produce nothing of value on a UTF-16 file. They could be changed to
work with UTF-16, but they won't be, as UTF-8 works just fine.

The point of file calling it data, not text, was just this: you can't
expect to throw UTF-16 through text tools and get a meaningful result.
That's why UTF-8 was created. The only sane thing to do with a UTF-16
file on Unix is to treat it as binary data, just like you would a
word-processor file. (Which are stunningly non-Unix, but coming
nonetheless. Probably for the best, though.)

-- 
David Starner - [EMAIL PROTECTED]
"It's not a habit; it's cool; I feel alive. If you don't have it you're
on the other side." - K's Choice (probably referring to the Internet)
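[Editor's note: a minimal Python sketch, not part of the original mail, illustrating the newline point above. UTF-8 only uses the byte 0x0A for a real newline, so a byte-oriented tool like wc -l gets the right answer; UTF-16 can produce 0x0A bytes inside other characters, so a byte-level scan finds spurious "newlines". The sample string is hypothetical.]

```python
# Two lines of text; U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE)
# is chosen because its UTF-16LE encoding (0A 01) contains an 0x0A byte.
text = "a\n\u010a\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8: the byte 0x0A appears only where there is a real newline,
# so counting bytes (as wc -l effectively does) matches the line count.
print(utf8.count(b"\n"))    # 2

# UTF-16LE: U+010A encodes as the bytes 0A 01, so a byte-level scan
# sees a third, spurious "newline" in the middle of a character --
# and sort, splitting at that byte, would cut the character in half.
print(utf16.count(b"\n"))   # 3
```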