On Wed, Apr 21, 2004 at 11:04:02AM +0100, Tim Bunce wrote: : > Hashes should handle various types of built-in key strings properly : > by default. : : What is "properly" for string?
The way it oughta, whatever that is... I was aiming to set policy rather than implementation there. :-) : Is it to hash the "sequence of integers" : as if they're 32 bits wide even if they're less? Is that sufficient? That would be one way. The point being that the hash mustn't tell you that two strings are different when they would compare the same, even if they are in different internal representations to begin with. It's okay if the hash occassionally says two strings are the same when in fact they'd compare differently. The actual weakness is likely to be in the definition of comparison rather than the definition of the hash function, especially if we let people specify the standards of comparison for the hash keys. That says that the hash function has to either be weaker than the weakest specifiable comparison, or we have to be able to "weaken" the hash such that it doesn't lie about what might match. That sounds like research... Well, it's probably not that bad. Much like with other sorting problems, all you have to do is keep track of a canonicalized key in addition to the "real" key. The hash is always calculated off of the canonicalized key rather than the actual key. (Whether you choose to store or recreate the canonical key is one of those space/time tradeoffs that "use less" was originally intended to solve...) If Unicode makes your brain hurt, just think of it in terms of case sensitivity. We could have a hash that was case insensitive by always calculating the hash on a lower-cased key, and by doing comparisons between lower-cased keys (notionally, at least). So in Unicode terms, there are probably some speed benefits if you know your keys are already canonicalized to the form required by the comparison and hash functions. That implies that the state of canonicalization must be strongly typed (presumably dynamically in Perl). Canonicalization is one of those things you really don't want to do redundantly. Larry