At 11:14 AM 03-22-2001 -0800, Hong Zhang wrote:

>Please not fight on wording. For most encodings I know of, the concept of
>normalization does not even exist. What is your definition of normalization?

To me, the usual definition of "normalization" is conversion of something 
into a standard form, especially when there are multiple equivalent forms 
it could be in.

Since there are multiple ways within Unicode to express a single character 
that are considered (by Unicode) to be identical, conversion into a single 
common form is necessary for comparison purposes.

Example:  The sequence of Unicode code points 006E 0061 0069 0308 0076 0065 
and the sequence 006E 0061 00EF 0076 0065 both represent the same string in 
Unicode (the English word "naive", with a diaeresis over the i).  Both 
represent 5-character strings, and both are supposed to compare 
identically.  However, they use a different sequence of code points to 
represent one particular character: the 'i' with a diaeresis: 0069 0308 
versus 00EF.

If we have $naive5 and $naive6 be variables containing the two example 
strings, what do we want as the value of the following expressions?

   $naive5 eq $naive6;
   length($naive5);
   length($naive6);
and so forth.

As far as my very limited understanding of the Unicode standard goes, they 
should compare equal, and both have a length of 5.  But their encoded byte 
sequences may not be identical.

>I fully understand this. This is one of the reasons I propose sole UTF-8
>encoding. If length() and substr() depend on string internal encoding,
>are they still useful? Who can handle this magic length().

UTF-8 encoding doesn't fix the above problem.  UTF-8 would still encode the 
two strings differently, because they have different code point 
sequences.  For that matter, so would any of the other encoding 
suggestions.   As such, for the above problem, encoding is pretty much a 
non-issue.
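
A small sketch of that point, using the Encode and Unicode::Normalize 
modules.  The hex_bytes helper and the assignment of the two sequences to 
$naive5 and $naive6 are my own assumptions for illustration.  It dumps the 
bytes of both strings in two encodings, then again after normalizing.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC);

# Assumption: same two strings as in the example above.
my $naive6 = "\x{006E}\x{0061}\x{0069}\x{0308}\x{0076}\x{0065}";
my $naive5 = "\x{006E}\x{0061}\x{00EF}\x{0076}\x{0065}";

# Hypothetical helper: hex dump of a string's bytes in a given encoding.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return unpack("H*", encode($encoding, $string));
}

# The byte sequences differ in every encoding, because the
# underlying code point sequences differ.
my $utf8_5  = hex_bytes("UTF-8",    $naive5);   # 6e61c3af7665
my $utf8_6  = hex_bytes("UTF-8",    $naive6);   # 6e6169cc887665
my $utf16_5 = hex_bytes("UTF-16BE", $naive5);   # 006e006100ef00760065
my $utf16_6 = hex_bytes("UTF-16BE", $naive6);   # 006e00610069030800760065

# Only after normalization do the encoded bytes match.
my $norm_5 = hex_bytes("UTF-8", NFC($naive5));
my $norm_6 = hex_bytes("UTF-8", NFC($naive6));

print "UTF-8:     $utf8_5 vs $utf8_6\n";
print "UTF-16BE:  $utf16_5 vs $utf16_6\n";
print "NFC+UTF-8: $norm_5 vs $norm_6\n";
```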
