Nicholas Clark wrote:

On Wed, Nov 12, 2003 at 01:57:14PM -0500, Dan Sugalski wrote:


You're going to run into problems no matter what you do, and as
transcoding could happen with each comparison, arguably you need to make a
local copy of the string for each comparison, as otherwise you run the
risk of significant data loss as a string gets transcoded back and forth
across a lossy boundary.


I think that this rules out what I was going to ask/suggest, having read
Leo's patch. I was wondering why there wasn't a straight memcmp of the
two strings whenever their encodings were the same. I presume that there
are some encodings where two different binary representations are considered
"equal", hence we can't blindly assume that a byte compare is sufficient.

It's even worse than that. Unicode has characters that can be represented
by several different code-point sequences, even ignoring the encoding
issue. See the Unicode standard for a discussion of normalization and
string comparisons. Unicode has what are called compatibility characters:
when a character set was added to Unicode as a lump, characters that were
duplicated elsewhere were left in so that the included set could still be
a contiguous code-point range. And there are pre-composed versions of
characters that are also buildable from a base character plus one or more
combining characters. For example, the first 256 code-points of Unicode
are the same as Latin-1, so code-point 0x00E4 is the character
lower-a-umlaut, but that can also be represented by the pair of
code-points 0x0061 and 0x0308, which is lower-a followed by the combining
umlaut. This is why Unicode defines normalization rules for preprocessing
a string before comparison.
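
To make that concrete, here's a quick sketch in Python (the standard
unicodedata module does the normalization; this just demonstrates the
behaviour, it obviously isn't Parrot code):

    import unicodedata

    precomposed = "\u00E4"        # LATIN SMALL LETTER A WITH DIAERESIS
    decomposed  = "\u0061\u0308"  # 'a' followed by COMBINING DIAERESIS

    # Different code-point sequences, so a blind compare says "not equal"
    print(precomposed == decomposed)                 # False

    # After normalizing both sides to the same form, they compare equal
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))  # True

    # A compatibility-style duplicate: ANGSTROM SIGN vs. A WITH RING ABOVE
    print("\u212B" == "\u00C5")                      # False
    print(unicodedata.normalize("NFC", "\u212B") ==
          unicodedata.normalize("NFC", "\u00C5"))    # True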


And even when the sequence of Unicode code-points is the same, some
encodings have multiple byte sequences for the same code-point. For
example, there are two ways a code-point larger than 0xFFFF (Unicode has
code-points up to 0x10FFFF) can show up in a UTF-8 byte stream: as a pair
of 16-bit surrogate code-points each encoded as a 3-byte sequence (the
CESU-8 form), or properly as a single 4-byte UTF-8 sequence. Not to
mention malformed "overlong" UTF-8 sequences, where a small value is
encoded using more bytes than necessary by not stripping off leading
zero bits.
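
Again a quick Python sketch of the same point (the "surrogatepass" error
handler is Python's escape hatch for encoding lone surrogates, used here
only to build the CESU-8-style byte string):

    ch = "\U0001D11E"             # MUSICAL SYMBOL G CLEF, above 0xFFFF

    utf8 = ch.encode("utf-8")     # proper UTF-8: a single 4-byte sequence
    print(utf8, len(utf8))        # b'\xf0\x9d\x84\x9e' 4

    # CESU-8 style: split into a UTF-16 surrogate pair, then encode each
    # 16-bit surrogate as a 3-byte sequence (6 bytes total). A lenient
    # decoder would read both forms back to the same code-point.
    hi, lo = divmod(ord(ch) - 0x10000, 0x400)
    pair = chr(0xD800 + hi) + chr(0xDC00 + lo)
    cesu8 = pair.encode("utf-8", "surrogatepass")
    print(cesu8, len(cesu8))      # b'\xed\xa0\xb4\xed\xb4\x9e' 6

    # Overlong form: '/' (0x2F) padded out to two bytes; a strict decoder
    # rejects it, while a sloppy one would accept it as another '/'.
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected overlong form:", e)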


In general to compare Unicode you have to normalize both strings first.
As Dan said in his blog, Unicode support is a big pain.
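
In code, the usual shape is something like this (a hypothetical helper
for illustration, not anything Parrot actually exposes):

    import unicodedata

    def unicode_eq(a, b, form="NFC"):
        # Compare two strings after normalizing both to the same form
        return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

    print(unicode_eq("\u00E4", "\u0061\u0308"))   # True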

--
[EMAIL PROTECTED]
[EMAIL PROTECTED]


