At 11:14 AM 03-22-2001 -0800, Hong Zhang wrote:
>Please not fight on wording. For most encodings I know of, the concept of
>normalization does not even exist. What is your definition of normalization?
To me, the usual definition of "normalization" is conversion of something
into a standard form, especially when there are multiple equivalent forms
it could be in.
Since there are multiple ways within Unicode to express a single character,
all of which Unicode considers identical, conversion into a single common
form is necessary for comparison purposes.
Example: The sequence of Unicode code points 006E 0061 0069 0308 0076 0065
and the sequence 006E 0061 00EF 0076 0065 both represent the same string in
Unicode (the English word "naive", with a diaeresis over the i). Both
represent 5-character strings, and both are supposed to compare
identically. However, they use a different sequence of code points to
represent one particular character: the 'i' with a diaeresis: 0069 0308
versus 00EF.
If we let $naive5 and $naive6 be variables containing the two example
strings, what do we want as the value of the following expressions?
$naive5 eq $naive6;
length($naive5);
length($naive6);
and so forth.
As far as my very limited understanding of the Unicode standard goes, they
should compare equal, and both have a length of 5. But their encoded byte
sequences may not be identical.
>I fully understand this. This is one of the reasons I propose sole UTF-8
>encoding. If length() and substr() depend on string internal encoding,
>are they still useful? Who can handle this magic length().
UTF-8 encoding doesn't fix the above problem. UTF-8 would still encode the
two strings differently, because they have different code point
sequences. For that matter, so would any of the other encoding
suggestions. As such, for the above problem, the choice of encoding is
pretty much a non-issue.
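As a quick check of that claim, the following sketch encodes both example
sequences as UTF-8 and dumps the bytes. It assumes the Encode module is
available; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The two code point sequences from the example.
my $precomposed = "\x{006E}\x{0061}\x{00EF}\x{0076}\x{0065}";
my $decomposed  = "\x{006E}\x{0061}\x{0069}\x{0308}\x{0076}\x{0065}";

# Encode each sequence as UTF-8 bytes.
my $utf8_pre = encode('UTF-8', $precomposed);
my $utf8_dec = encode('UTF-8', $decomposed);

# Hex dump of the encoded bytes.
my $hex_pre = uc unpack('H*', $utf8_pre);   # 6E61C3AF7665
my $hex_dec = uc unpack('H*', $utf8_dec);   # 6E6169CC887665

print "precomposed: $hex_pre (", length($utf8_pre), " bytes)\n";
print "decomposed:  $hex_dec (", length($utf8_dec), " bytes)\n";
print "bytes identical? ", ($utf8_pre eq $utf8_dec ? "yes" : "no"), "\n";
```

The byte sequences differ (6 bytes versus 7), so a byte-wise comparison of
the UTF-8 forms still reports the two strings as unequal.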