At 04:14 PM 3/5/2001 -0800, Hong Zhang wrote:
> > >Here is an example, "re`sume`" takes 6 characters in Latin-1, but
> > >could take 8 characters in Unicode. All Perl functions that directly
> > >deal with character position and length will be sensitive to encoding.
> > >I wonder how we should handle this case.
> >
> > My first inclination is to force normalization on any data we manipulate.
>
>That was one of the reasons I proposed UTF-8 string encoding. If we don't
>do normalization (by keeping multiple encodings), we have to avoid using
>character position, string length, and ord(), since they are encoding specific.

Unless I really, *really* misread the Unicode standard (which is distinctly
possible), normalization has nothing to do with encoding, and the encoding
we choose doesn't make any difference to character position, string
length, or ord() if we define them to work on characters rather than
bytes. That doesn't mean there's no problem; it's just a different problem.
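
To make the distinction concrete, here is a minimal sketch. It is written in Python only because its standard library exposes the Unicode normalization forms directly; it is not a proposal for how Perl should do this. The sample string and the describe() helper are illustrative assumptions. It shows that the character count changes with the normalization form (composed versus decomposed), while the choice of encoding only changes the byte count.

```python
import unicodedata

# Sample word with precomposed accented letters (illustrative assumption).
composed = unicodedata.normalize("NFC", "r\u00e9sum\u00e9")
# Same text with each accent split out as a separate combining mark.
decomposed = unicodedata.normalize("NFD", composed)

def describe(s):
    """Hypothetical helper: character count plus byte counts per encoding."""
    return {
        "chars": len(s),                            # counts code points
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(s.encode("utf-16-le")),
    }

composed_info = describe(composed)
decomposed_info = describe(decomposed)

print(composed_info)
print(decomposed_info)
```

The "chars" value is identical whichever encoding is used to store the string, but it differs between the two normalization forms, which is the "different problem" mentioned above.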

>Perl users will have to face all kinds of problems when they try to deal
>with individual characters.

Most won't, honestly. At a guess, 90% of Perl's current userbase doesn't
care about Unicode for any reason other than XML, and I'd bet most of the
XML users don't realize that XML has anything to do with Unicode in the
first place.

That's not to say Perl shouldn't handle this all properly and as easily as
possible, because it should. We just can't do it at the expense of the
current users.

>In any case, we need to make sure that regexes do not have any problems with
>normalization.

Once we choose a normalized format, that's half the battle. Granted, it's 
the easy half... :)
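
As a rough illustration of the easy half, the sketch below (again Python, and again only an assumption about how one might demonstrate the idea, not a design for Perl's regex engine) searches for a precomposed pattern in text that uses combining accents. The nfc() helper and the sample strings are hypothetical. The raw search misses; normalizing both pattern and subject to the same form first makes it succeed.

```python
import re
import unicodedata

def nfc(s):
    """Hypothetical helper: bring a string into Normalization Form C."""
    return unicodedata.normalize("NFC", s)

pattern_text = "r\u00e9sum\u00e9"               # precomposed accents
subject = "my re\u0301sume\u0301 is attached"   # combining accents

# Without normalization the code point sequences differ, so no match.
raw_match = re.search(pattern_text, subject)

# With both sides in the same normalized form, the match is found.
normalized_match = re.search(nfc(pattern_text), nfc(subject))

print(raw_match is None)
print(normalized_match is not None)
```

The hard half, which this does not address, is everything the regex engine itself has to get right once it works on normalized data.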

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
