Re: PDD 4: Internal data types

Hong Zhang Thu, 22 Mar 2001 10:58:13 -0800
> > The normalization has something to do with encoding. If you compare two
> > strings with the same encoding, of course you don't have to care about it.
>
> Of course you do. Think about it.

I said "you don't have to". You can use "==" for codepoint comparison, and
something like "Normalizer.compare(a, b)" for lexical comparison, like Java.
It may not be the best solution, but it is doable and acceptable.
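
To illustrate the two kinds of comparison, here is a minimal sketch in Perl. It assumes the Unicode::Normalize module is available; normalized_eq() is a hypothetical helper named only for this example, not an existing or proposed builtin.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);   # assumption: module is installed

# U+03AC GREEK SMALL LETTER ALPHA WITH TONOS (precomposed form)
my $composed   = "\x{03AC}";
# U+03B1 GREEK SMALL LETTER ALPHA + U+0301 COMBINING ACUTE ACCENT (tonos)
my $decomposed = "\x{03B1}\x{0301}";

# Codepoint comparison: plain "eq" looks at the codepoints as stored.
my $codepoint_equal = ($composed eq $decomposed) ? 1 : 0;

# Normalized comparison: hypothetical helper that compares copies of
# both operands brought to the same normalization form (NFD here).
sub normalized_eq {
    my ($x, $y) = @_;
    return NFD($x) eq NFD($y) ? 1 : 0;
}

my $canonical_equal = normalized_eq($composed, $decomposed);

printf "codepoint equal: %d\n", $codepoint_equal;
printf "canonical equal: %d\n", $canonical_equal;
```

The plain comparison reports the strings as different, while the normalized comparison treats them as equal.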

> If I'm comparing "(Greek letter lower case alpha with tonos)" with "(Greek
> letter lower case alpha)(+tonos)" I want them to compare equal. One string is
> normalized, the other isn't; how they're encoded is irrelevant, you still have
> to care about normalization. (This is where Perl 5 currently falls over)
>
> Normalization has utterly nothing at all to do with encoding. Nothing.

Please, let's not fight over wording. For most encodings I know of, the concept
of normalization does not even exist. What is your definition of normalization?

> Now, since we have to normalize strings in some cases (like the comparison
> above) when the user hasn't explicitly asked for it, let's not make things
> like length() and substr() dependent on whether or not the string is
> normalized, eh? The *last* thing I want to happen is this:
>
>     $a = "(Greek letter lower case alpha with tonos)"
>     print length $a; # 1
>     if ($a eq "(Greek letter lower case alpha)(+tonos)") {
>         # (Which it damned well ought to)
>
>         print length $a; # 2! HA! Surprise! $a had to be normalized!
>     }

I fully understand this. This is one of the reasons I propose UTF-8 as the
sole encoding. If length() and substr() depend on the string's internal
encoding, are they still useful? Who can handle this magic length()?

I still believe UTF-8 is the best choice. Random string access is just
not important, at least to me.

Let's not fight over string encoding. I would like to see some suggestions
about how to handle normalization transparently. Making length()/substr()
depend on encoding/normalization (whatever they are) does not make sense to me.
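
As a rough illustration of the concern, the sketch below shows how length() and substr() give different answers for the two normalization forms of the same text, and that comparing through temporary normalized copies leaves the original string alone. It assumes the Unicode::Normalize module; the variable names are made up for the example, and this is only one possible approach, not a proposal for the internals.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);   # assumption: module is installed

# Precomposed alpha with tonos: a single codepoint, U+03AC.
my $text = "\x{03AC}";

# The same text reports different lengths depending on the form.
my $len_nfc = length(NFC($text));   # composed form
my $len_nfd = length(NFD($text));   # alpha + combining tonos

# substr() on the decomposed form returns the bare alpha, U+03B1.
my $first_nfd = substr(NFD($text), 0, 1);

# Compare using temporary normalized copies; $text itself is not modified,
# so its length is the same before and after the comparison.
my $before = length($text);
my $equal  = (NFD($text) eq NFD("\x{03B1}\x{0301}")) ? 1 : 0;
my $after  = length($text);

printf "NFC length %d, NFD length %d\n", $len_nfc, $len_nfd;
printf "equal %d, length before %d, after %d\n", $equal, $before, $after;
```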

Hong