Nicholas Clark wrote:

On Wed, Nov 12, 2003 at 01:57:14PM -0500, Dan Sugalski wrote:


You're going to run into problems no matter what you do, and as
transcoding could happen with each comparison, arguably you need to make a
local copy of the string for each comparison, as otherwise you run the
risk of significant data loss as a string gets transcoded back and forth
across a lossy boundary.


I think that this rules out what I was going to ask/suggest, having read
Leo's patch. I was wondering why there wasn't a straight memcmp of the
two strings whenever their encodings were the same. I presume that there
are some encodings where two different binary representations are considered
"equal", hence we can't blindly assume that a byte compare is sufficient.

It's even worse than that. Unicode has characters that can be represented
by several different code-point sequences, even ignoring the encoding
issue. See the Unicode standard for a discussion of normalization and
string comparisons. Unicode has what are called compatibility characters:
when a character set was added to Unicode as a lump, characters that were
duplicated elsewhere were left in so that the included set could still be
a contiguous code-point range. And there are pre-composed versions of
characters that are also buildable from a base character plus one or more
combining characters. For example, the first 256 code-points of Unicode
are the same as Latin-1, so code-point 0x00E4 is the character
lower-a-umlaut, but that can also be represented by the pair of
code-points 0x0061 and 0x0308, which is lower-a followed by the combining
umlaut. This is why Unicode defines normalization rules for preprocessing
a string before comparison.
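
To make that concrete, here's a quick sketch in Python (the standard
unicodedata module does the normalization; this just demonstrates the
behaviour, it obviously isn't Parrot code):

    import unicodedata

    precomposed = "\u00E4"        # LATIN SMALL LETTER A WITH DIAERESIS
    decomposed  = "\u0061\u0308"  # 'a' followed by COMBINING DIAERESIS

    # Different code-point sequences, so a blind compare says "not equal"
    print(precomposed == decomposed)                 # False

    # After normalizing both sides to the same form, they compare equal
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))  # True

    # A compatibility-style duplicate: ANGSTROM SIGN vs. A WITH RING ABOVE
    print("\u212B" == "\u00C5")                      # False
    print(unicodedata.normalize("NFC", "\u212B") ==
          unicodedata.normalize("NFC", "\u00C5"))    # True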


And even when the sequence of Unicode code-points is the same, some
encodings have multiple byte sequences for the same code-point. For
example, there are two ways a code-point larger than 0xFFFF (Unicode has
code-points up to 0x10FFFF) can show up in a UTF-8 byte stream: as a pair
of 16-bit surrogate code-points each encoded as a 3-byte sequence (the
CESU-8 form), or properly as a single 4-byte UTF-8 sequence. Not to
mention malformed "overlong" UTF-8 sequences, where a small value is
encoded using more bytes than necessary by not stripping off leading
zero bits.
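
Again a quick Python sketch of the same point (the "surrogatepass" error
handler is Python's escape hatch for encoding lone surrogates, used here
only to build the CESU-8-style byte string):

    ch = "\U0001D11E"             # MUSICAL SYMBOL G CLEF, above 0xFFFF

    utf8 = ch.encode("utf-8")     # proper UTF-8: a single 4-byte sequence
    print(utf8, len(utf8))        # b'\xf0\x9d\x84\x9e' 4

    # CESU-8 style: split into a UTF-16 surrogate pair, then encode each
    # 16-bit surrogate as a 3-byte sequence (6 bytes total). A lenient
    # decoder would read both forms back to the same code-point.
    hi, lo = divmod(ord(ch) - 0x10000, 0x400)
    pair = chr(0xD800 + hi) + chr(0xDC00 + lo)
    cesu8 = pair.encode("utf-8", "surrogatepass")
    print(cesu8, len(cesu8))      # b'\xed\xa0\xb4\xed\xb4\x9e' 6

    # Overlong form: '/' (0x2F) padded out to two bytes; a strict decoder
    # rejects it, while a sloppy one would accept it as another '/'.
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected overlong form:", e)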


In general to compare Unicode you have to normalize both strings first.
As Dan said in his blog, Unicode support is a big pain.
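
In code, the usual shape is something like this (a hypothetical helper
for illustration, not anything Parrot actually exposes):

    import unicodedata

    def unicode_eq(a, b, form="NFC"):
        # Compare two strings after normalizing both to the same form
        return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

    print(unicode_eq("\u00E4", "\u0061\u0308"))   # True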

--
[EMAIL PROTECTED]
[EMAIL PROTECTED]


