At 04:14 PM 3/5/2001 -0800, Hong Zhang wrote:
> > >Here is an example, "re`sume`" takes 6 characters in Latin-1, but
> > >could take 8 characters in Unicode. All Perl functions that directly
> > >deal with character position and length will be sensitive to encoding.
> > >I wonder how we should handle this case.
> >
> > My first inclination is to force normalization on any data we manipulate.
>
>That was one of the reasons I proposed UTF-8 string encoding. If we don't
>do normalization (by keeping multiple encodings), we have to avoid using
>character position, string length, ord(), since they are encoding-specific.
Unless I really, *really* misread the Unicode standard (which is distinctly
possible), normalization has nothing to do with encoding, and the encoding
we choose doesn't make any difference to character position, string
length, or ord() if we define them to work on characters rather than
bytes. That doesn't mean it's not a problem; it's just a different problem.
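To make the distinction concrete, here is a rough sketch. It is in Python rather than Perl purely for illustration, and it assumes Python 3's standard unicodedata module; nothing here is a proposal for how perl should expose this. It shows that the character count changes with the normalization form, the byte count changes with the encoding, and the two vary independently.

```python
import unicodedata

# "resume" with both accents, using precomposed e-acute (U+00E9).
composed = "r\u00e9sum\u00e9"

# The same text in decomposed form: plain "e" followed by U+0301
# (combining acute accent).
decomposed = unicodedata.normalize("NFD", composed)

# Character (code point) counts depend on the normalization form only.
composed_chars = len(composed)
decomposed_chars = len(decomposed)

# Byte counts depend on the encoding chosen for storage.
latin1_bytes = len(composed.encode("latin-1"))
utf8_bytes = len(composed.encode("utf-8"))
utf16_bytes = len(composed.encode("utf-16-le"))
decomposed_utf8_bytes = len(decomposed.encode("utf-8"))

# Normalizing to one agreed form makes the two spellings compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(composed_chars, decomposed_chars)
print(latin1_bytes, utf8_bytes, utf16_bytes, decomposed_utf8_bytes)
print(same_after_nfc)
```

Note that the "8" in the original example can come from either source: the decomposed form has 8 characters, and the composed form takes 8 bytes in UTF-8. That coincidence is probably part of why the two issues get conflated.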
>Perl users will have to face all kinds of problems when they try to deal
>with individual characters.
Most won't, honestly. At a guess, 90% of perl's current userbase doesn't
care about Unicode for any reason other than XML, and I'd bet most of the
XML users don't realize that XML has anything to do with Unicode in the
first place.
That's not to say perl shouldn't handle this all properly and as easily as
possible, because it should. We just can't do it at the expense of the
current users.
>In any case, we need to make sure that regexes do not have any problems
>with normalization.
Once we choose a normalized format, that's half the battle. Granted, it's
the easy half... :)
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk