On Monday, 04. March 2002 19:59, Tod Harter wrote:
> > So UTF-8 is a hack, but UTF-16 not - what makes you think so? Perhaps you
> > are uncomfortable with the idea of characters with differing byte length?
>
> It is MUCH easier to write low level code that deals with fixed width
> chars, yes. This is fundamentally why i18n is such a big deal is that its a
> LOT harder to reliably write algorithms that have to deal with all sorts of
> different ways of representing data! One standard fundamental data type, ie
> short, int, long, whatever you want to call it makes things MUCH easier.
Ah, well, then. Why bother? Use wchar_t in C. Use 'my $foo' in Perl. Use a
plain string in Python. There you have your fundamental data type. If all
you're talking about is the API, then just leave the transcoding to others.
Perl already handles all that for you. C does, too (if you learn the wcs
functions, and if they work, which I don't know). The API problem is NOT
'1 char != N bytes' but '1 char != 1 byte' - once programmers realize that
bytes are something different from chars, the problem disappears, because
they start thinking more abstractly, in terms of chars instead of bytes. The
actual encoding is irrelevant then. And by the way, resistance is futile.
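To make the char-vs-byte distinction concrete, here is a minimal sketch in modern Python terms (an anachronism for a 2002 thread, but the point is the same): the string API counts characters, while each encoding yields its own byte count.

```python
# str is a sequence of characters; bytes is a sequence of bytes.
text = "Grüße"  # 5 characters, regardless of encoding

utf8 = text.encode("utf-8")       # variable width: 7 bytes here
utf16 = text.encode("utf-16-be")  # 2 bytes per char here: 10 bytes

print(len(text))   # 5  -- the API counts chars
print(len(utf8))   # 7  -- the wire format counts bytes
print(len(utf16))  # 10

# Once you decode back to chars, the encoding is irrelevant.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == text
```

The same split exists in C as char[] versus wchar_t[], and in Perl as byte strings versus strings with the UTF-8 flag set.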
> I actually do know about UCS etc etc etc ;o). I suspect the world will not
> really ever be too concerned with the data processing of Elvish, or
> Klingon...
Up to the point where you want to sell a CMS to a historian, a video store,
or a library. I have done Unicode work for university libraries, and it is
rather astounding what they demand. (Unfortunately I had to disappoint them a
bit, because Java's Unicode support isn't all that good, the main problem
being persuading VMs to actually get those chars onto the screen...)
Unicode wasn't created for the fun of it - there is demand, and you had
better not neglect it, or you lose a chance at a job.
> Yeah, actually I bet you any money thats exactly what we will end up with!
> Not the least because eventually people will implement a lot of what you
> call text manipulation in hardware, and I can pretty much guarantee you
> that silicon designers are NOT going to mess around with variable width
> character sets.
They will have to. Take Sanskrit - you simply cannot encode its script with
one code point per glyph: Devanagari conjunct glyphs are composed of several
code points, so the mapping from code points to glyphs is inherently
variable. And yes, even this has practical value, as I sometimes work for a
professor of Indic studies. And with the rise of the Asian technology
market, it will only become more important.
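A small illustration of the point about conjuncts, again in modern Python terms: the Devanagari conjunct "ksha" renders as a single glyph but is encoded as three code points (KA + VIRAMA + SSA), so no fixed-width scheme gives you one unit per visible character.

```python
# One visible glyph, three code points: क + ्  + ष
ksha = "\u0915\u094d\u0937"  # क्ष

print(ksha)       # renders as a single conjunct glyph (font permitting)
print(len(ksha))  # 3 -- three code points under the hood
for cp in ksha:
    print(f"U+{ord(cp):04X}")
```

This is independent of UTF-8 vs. UTF-16: even a fixed-width encoding of code points still leaves the code-point-to-glyph mapping variable.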
> Uh, "comprehensible to the casual eye"? Its comprehensible because your
> text editor understands UTF-8!!! If it understood only UTF-16 then THAT
> would be "comprehensible to the casual eye". I guarantee you can't tell a
> memory location thats on from off with the naked eye, they're about .1
> microns across..... ;lo)
I'm not talking about a text editor. I'm talking about a log file entry with
mixed content. I'm talking about a packet dump of a network protocol, made
for profiling or debugging. I'm talking about poking around in proprietary
data formats (Flash, Word, whatever) to change stuff inside - though I
dislike that work, it is sometimes necessary. And I am talking about a text
editor - the one you get when you are somewhere else and just want a quick
look at what's going on, the one that doesn't do UTF-8, let alone UTF-16.
Consider why XML was made a text-based format rather than a binary one:
because it is comprehensible to the naked eye. So is UTF-8.
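The dump argument is easy to demonstrate (sketched here in Python): ASCII characters encode as themselves in UTF-8, so a raw hex dump of mostly-ASCII data stays readable, while UTF-16 interleaves NUL bytes everywhere.

```python
# Why UTF-8 survives a raw dump: ASCII bytes encode as themselves.
line = "GET /index.html"

print(line.encode("utf-8").hex(" "))
# 47 45 54 20 2f 69 6e 64 65 78 2e 68 74 6d 6c  -- readable as ASCII

print(line.encode("utf-16-le").hex(" "))
# 47 00 45 00 54 00 ...  -- every other byte is NUL
```

Any tool that merely passes ASCII through unmolested (an old editor, grep, a packet sniffer) therefore copes with UTF-8 data, which is exactly the "comprehensible to the casual eye" property.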
--
CU
Joerg
PGP Public Key at http://ich.bin.kein.hoschi.de/~trouble/public_key.asc
PGP Key fingerprint = D34F 57C4 99D8 8F16 E16E 7779 CDDC 41A4 4C48 6F94