On Wed, Apr 24, 2002 at 01:37:39PM -0400, [EMAIL PROTECTED] wrote:
> Err, no. That's not the point, AFAIK. The point is that traditionally
> in UNIX there hasn't been any sort of "marker" or "tag" in the beginning,
> UNIX files being flat streams of bytes. The UNIX toolset has been built
> with this principle in mind. No metadata in the files. BOM breaks this.
Not at all true. Look at the head of a PNM file, a quintessentially Unix
file format. PNM, MP3 and PNG files all have metadata identifying them,
and they don't break under Unix systems.

> wc -c file1
>
> would have to skip the BOM to not get a wrong byte count.
>
> sort -o file5 file1
>
> would have to strip the BOM from file1 (but put it back into file5?)

The wrong byte count? wc -c file1 is basically meaningless on a Unicode
file, but at least you can assume it gives the _byte count_ (including
extraneous things like BOMs).

More importantly, how do these programs handle newlines? wc -l counts the
number of \x0A's in the file; sort splits the file based on \x0A. This
will produce nothing of value on a UTF-16 file. They could be changed to
work with UTF-16, but they won't be, as UTF-8 works just fine.

The point of file calling it data, not text, was just this: you can't
expect to throw UTF-16 through text tools and get a meaningful result.
That's why UTF-8 was created. The only sane thing to do with a UTF-16
file on Unix is to treat it as binary data, just like you would a
word-processor file. (Which are stunningly non-Unix, but coming
nonetheless. Probably for the best, though.)

-- 
David Starner - [EMAIL PROTECTED]
"It's not a habit; it's cool; I feel alive. If you don't have it you're
on the other side." - K's Choice (probably referring to the Internet)
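[Editor's note: a minimal Python sketch, not part of the original mail, illustrating the newline point above. UTF-8 only uses the byte 0x0A for a real newline, so a byte-oriented tool like wc -l gets the right answer; UTF-16 can produce 0x0A bytes inside other characters, so a byte-level scan finds spurious "newlines". The sample string is hypothetical.]

```python
# Two lines of text; U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE)
# is chosen because its UTF-16LE encoding (0A 01) contains an 0x0A byte.
text = "a\n\u010a\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8: the byte 0x0A appears only where there is a real newline,
# so counting bytes (as wc -l effectively does) matches the line count.
print(utf8.count(b"\n"))    # 2

# UTF-16LE: U+010A encodes as the bytes 0A 01, so a byte-level scan
# sees a third, spurious "newline" in the middle of a character --
# and sort, splitting at that byte, would cut the character in half.
print(utf16.count(b"\n"))   # 3
```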