On Monday, 04. March 2002 19:59, Tod Harter wrote:

> > So UTF-8 is a hack, but UTF-16 not - what makes you think so? Perhaps you
> > are uncomfortable with the idea of characters with differing byte length?
>
> It is MUCH easier to write low-level code that deals with fixed-width
> chars, yes. This is fundamentally why i18n is such a big deal: it's a
> LOT harder to reliably write algorithms that have to deal with all sorts of
> different ways of representing data! One standard fundamental data type, i.e.
> short, int, long, whatever you want to call it, makes things MUCH easier.

Ah, well, then. Why bother? Use wchar_t in C. Use 'my $foo' in Perl. Use a 
plain string in Python. There you have your fundamental data type. If all 
you're talking about is the API, then just leave the transcoding to others. 
Perl already handles all that stuff for you. C does, too (if you learn 
the wcs functions, and if they work, which I don't know). The API problem is NOT 
'1 char != N bytes' but '1 char != 1 byte' - once programmers realize that 
bytes are something different from chars, there is no problem anymore, 
because then they think more abstractly, in terms of chars instead of 
bytes. The actual encoding is irrelevant then. And by the way, resistance is 
futile.
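To illustrate the '1 char != 1 byte' point: a minimal sketch in modern Python (the example string is my own; any non-ASCII text would do), where the char count and the byte count of the same text simply differ:

```python
# One string, two lengths: chars vs. bytes.
text = "héllo"                   # 5 chars; the é takes 2 bytes in UTF-8
encoded = text.encode("utf-8")   # the same text as a byte sequence

print(len(text))     # 5  - number of chars
print(len(encoded))  # 6  - number of bytes
```

Code that thinks in chars never needs to know that the underlying encoding made those two numbers differ.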

> I actually do know about UCS etc etc etc ;o). I suspect the world will not
> really ever be too concerned with the data processing of Elvish, or
> Klingon...

Up to the point when you want to sell a CMS to a historian, or a video store, 
or a library. I have done Unicode work for university libraries, and it is 
rather astounding what they demand. (Unfortunately I had to disappoint them a 
bit, because Java's Unicode support isn't all that good, the main problem being 
persuading VMs to actually get those chars onto the screen...)
Unicode wasn't done for the fun of it - there is demand, and you'd better not 
neglect it, or you lose a chance at a job.

> Yeah, actually I bet you any money that's exactly what we will end up with!
> Not least because eventually people will implement a lot of what you
> call text manipulation in hardware, and I can pretty much guarantee you
> that silicon designers are NOT going to mess around with variable-width
> character sets.

They will have to. Take Sanskrit - you simply cannot encode this language 
with one code point per glyph. The script is too complex. And yes, even 
this has practical value, as I sometimes work for a professor of Indic 
studies. And with the rise of the Asian technology market, it will become 
more and more important.
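A concrete case of "one glyph, several code points" (a sketch; the conjunct shown is a standard Devanagari example, used here for Sanskrit): the akshara kṣa renders as a single glyph but is encoded as three code points, so no fixed "one code point = one glyph" hardware assumption can hold.

```python
# Devanagari conjunct "kṣa": KA + VIRAMA + SSA fuse into ONE glyph on screen.
ksa = "\u0915\u094D\u0937"   # क + ् + ष

print(ksa)        # renders as the single conjunct glyph क्ष
print(len(ksa))   # 3 - three code points behind that one glyph
```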

> Uh, "comprehensible to the casual eye"? It's comprehensible because your
> text editor understands UTF-8!!! If it understood only UTF-16 then THAT
> would be "comprehensible to the casual eye". I guarantee you can't tell a
> memory location that's on from off with the naked eye, they're about .1
> microns across..... ;o)

I'm not talking about a text editor. I'm talking about a log file entry with 
mixed content. I'm talking about a packet dump of a network protocol for 
profiling or debugging. I'm talking about messing with proprietary data 
formats (Flash, Word, whatever) to change stuff inside - though I dislike 
that work, it is sometimes necessary. And I am talking about a text editor - 
the one you get when you are somewhere else and just want to take a quick 
look at what's going on, the one that doesn't do UTF-8, let alone UTF-16. 
Consider why XML was made a text-based format, not a binary one - because it 
is comprehensible to the naked eye. So is UTF-8.
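The dump argument in code form (a sketch; the sample HTTP line is my own invention): UTF-8 leaves every ASCII byte untouched, so the ASCII-heavy parts of a log or packet dump stay readable as-is, while UTF-16 interleaves a NUL byte into every one of those same characters.

```python
# The same line of protocol text, as raw bytes in two encodings.
line = "GET /index.html"
utf8 = line.encode("utf-8")      # identical to the ASCII bytes
utf16 = line.encode("utf-16-le") # a NUL byte after every ASCII char

print(utf8)    # b'GET /index.html'          - readable in any hex dump
print(utf16)   # b'G\x00E\x00T\x00 \x00...'  - NUL-riddled to the naked eye
```

Grepping a UTF-8 dump for `GET` works with the dumbest of tools; against the UTF-16 bytes it finds nothing.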

-- 
CU
        Joerg

PGP Public Key at http://ich.bin.kein.hoschi.de/~trouble/public_key.asc
PGP Key fingerprint = D34F 57C4 99D8 8F16 E16E  7779 CDDC 41A4 4C48 6F94
