Re: PDD 4: Internal data types

Dan Sugalski Tue, 06 Mar 2001 14:06:06 -0800
At 01:21 PM 3/6/2001 -0800, Hong Zhang wrote:
> > Unless I really, *really* misread the unicode standard (which is
> > distinctly
> > possible) normalization has nothing to do with encoding,
>
>I understand what you are trying to say. But it is not very easy in
>practice. The normalization has something to do with encoding. If you 
>compare two strings with the same encoding, of course you don't have to 
>care about it. But if you compare two strings with different encodings 
>(what Perl 6 will do), you have to care about it. The 6-character
>"re`sume`" in latin-1 encoding should equal the 8-character decomposed
>Unicode string. That is what people would expect. If the language does not
>handle it, some library will do it.

I was a bit sloppy there. When I spoke of encoding, I meant Unicode 
encoding. It shouldn't matter to most perl code whether a string is UTF-8, 
UTF-16, or UTF-32. It does certainly matter when comparing, say, latin-1 
and Unicode, and for that we'd need to transform to a common encoding.
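
A minimal sketch of that comparison, in Python rather than Perl since Perl 6 
has no settled API for this yet. The helper name same_text is made up for 
illustration, the accent in "re`sume`" is assumed to be the acute accent of 
"résumé", and NFC is picked arbitrarily as the common normalization form:

```python
import unicodedata


def same_text(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Decode both byte strings to Unicode, normalize to NFC, then compare.

    same_text is a hypothetical helper; NFC is an arbitrary choice here.
    """
    ua = unicodedata.normalize("NFC", a.decode(a_enc))
    ub = unicodedata.normalize("NFC", b.decode(b_enc))
    return ua == ub


# Precomposed form, stored as latin-1: 6 characters.
latin1_bytes = "r\u00e9sum\u00e9".encode("latin-1")

# Decomposed form (e followed by U+0301 COMBINING ACUTE ACCENT),
# stored as UTF-8: 8 code points.
decomposed_bytes = "re\u0301sume\u0301".encode("utf-8")

print(same_text(latin1_bytes, "latin-1", decomposed_bytes, "utf-8"))
```

Without the normalization step the two decoded strings compare unequal, 
which is the case Hong is worried about.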

> > and the encoding
> > we choose doesn't make any difference to the character position, string
> > length, or ord stuff if we define them to work on characters rather than
> > bytes. Which doesn't mean it's not a problem, it's just a different
> > problem.
>
>Anyway, that is the problem I tried to raise; a different problem is still
>a problem.

True, but at least we have the real problem to deal with.

>I am not sure what definition of "character" you are using. The
>single codepoint "e`" can also be expressed by two codepoints in Unicode.
>So ord("e`") will return a different value depending on its own encoding.

Maybe. That depends on the rules we put forth for Unicode. If the rule is 
"We always decompose" or "we always combine", then it doesn't matter, since 
perl will mash the string as appropriate.
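
As a rough illustration (Python, not Perl; first_ord is a hypothetical 
helper, and "e`" is assumed to stand for e with an acute accent), fixing one 
normalization rule makes the first code point the same no matter which form 
the string arrived in:

```python
import unicodedata


def first_ord(s: str, form: str) -> int:
    """Return the code point of the first character after normalizing.

    first_ord is a made-up name; form is "NFC" (always combine)
    or "NFD" (always decompose).
    """
    return ord(unicodedata.normalize(form, s)[0])


composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# "We always combine": both inputs yield U+00E9.
print(hex(first_ord(composed, "NFC")), hex(first_ord(decomposed, "NFC")))

# "We always decompose": both inputs yield U+0065.
print(hex(first_ord(composed, "NFD")), hex(first_ord(decomposed, "NFD")))
```

The value differs between the two rules, but under either rule it no longer 
depends on how the input happened to be written.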

>The concepts of character position, string length, and ord() all
>depend on encoding. If Perl 6 uses only one encoding, everything will be
>just fine. Otherwise, someone has to handle this problem.

It seems to me they depend more on the normalization rules than the 
encoding, though I suppose you could consider normalization part of 
encoding. (I'm not sure that's right, but I'm not sure it's wrong either)
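
A small sketch of the distinction, again in Python with made-up helper names 
(char_lengths, normalized_lengths) and the word "résumé" assumed as the test 
string: the character count survives a round trip through any of the Unicode 
encodings, but changes with the normalization form.

```python
import unicodedata


def char_lengths(text: str) -> dict:
    """Character count after a round trip through each Unicode encoding."""
    return {
        enc: len(text.encode(enc).decode(enc))
        for enc in ("utf-8", "utf-16", "utf-32")
    }


def normalized_lengths(text: str) -> dict:
    """Character count under composed (NFC) and decomposed (NFD) forms."""
    return {
        form: len(unicodedata.normalize(form, text))
        for form in ("NFC", "NFD")
    }


text = "r\u00e9sum\u00e9"  # precomposed "résumé"

print(char_lengths(text))        # same count for every encoding
print(normalized_lengths(text))  # count differs between NFC and NFD
```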

> > >Perl users will have to face all kinds of problems when they try to deal
> > >with individual characters.
> >
> > Most won't, honestly. At a guess, 90% of perl's current userbase doesn't
> > care about Unicode for any reason other than XML,
>
>I totally agree with you on this. That was not my point. What I tried to
>express is what Perl 6 should do for people who do care about it. I would
>like to see the solution, be it part of the language or some Unicode library.

Definitely part of the language, at least as far as your average 
programmer's concerned. I can't think of a reasonable way to do otherwise.

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk