The hashtable currently has correctness and efficiency problems
resulting from string encodings (and their interaction with garbage
collection).

Correctness example:

Currently, string hash values are computed by going straight to the
actual data and running a fairly typical hashing algorithm. This
results in identical strings with different encodings having differing
hash values. Not only that, but when walking through the chain of
buckets to look up a key, the comparison *does* take encoding into
account, so if the lookup key happens to land in the same bucket as an
equivalent stored string, it will be found. The upshot is that a key
stored under one encoding is sometimes found and sometimes missed when
looked up under a different one. A clear bug, fixable by canonicalizing
the encoding before computing the hash value.
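
To make the failure mode concrete, here is a tiny, self-contained C
sketch (not Parrot code; the byte hash below is just a stand-in for
whatever the hashtable actually uses) in which the same text produces
two different hash values purely because of how it happens to be
encoded:

    /* Hypothetical illustration: the text "é" stored in Latin-1 is the
     * single byte 0xE9, but stored in UTF-8 it is the two bytes
     * 0xC3 0xA9.  A hash taken over the raw bytes therefore gives two
     * different values for the same string. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash_bytes(const uint8_t *p, size_t len)
    {
        uint32_t h = 5381;          /* djb2-style byte hash, as a stand-in */
        while (len--)
            h = h * 33 + *p++;
        return h;
    }

    int main(void)
    {
        const uint8_t latin1[] = { 0xE9 };       /* "é" in Latin-1 */
        const uint8_t utf8[]   = { 0xC3, 0xA9 }; /* "é" in UTF-8   */

        printf("latin1: %u\n", (unsigned)hash_bytes(latin1, sizeof latin1));
        printf("utf8:   %u\n", (unsigned)hash_bytes(utf8,   sizeof utf8));

        /* The two values differ, so the same key can land in different
         * buckets depending on its encoding.  Hashing the decoded
         * codepoints (both forms decode to the single codepoint U+00E9)
         * gives one value. */
        return 0;
    }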

Performance example:

When resizing a hash, all existing buckets are swept through and each
one's hash value is checked to see if it needs to be moved. Every
comparison potentially requires a transcoding. Not only that, but
transcoding can trigger a collection, so you always have to go through
the Buffer header structure to find each bucket, even when walking
along a chain. Assuming the overwhelmingly common case is no
transcoding (and hence no collection), this slows hashes by
interfering with compiler and hardware optimizations.
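
As a rough illustration of that last point, with invented structure and
function names rather than the real hash internals: if computing a
key's hash may allocate (and therefore may trigger a collection that
moves buffers), the sweep has to go back through the Buffer header for
every bucket; if it provably cannot allocate, the base pointer can be
loaded once and kept in a register. Summing the hashes below just
stands in for "recompute each entry's hash and move it":

    /* Hypothetical types, loosely modelled on a movable-buffer GC; the
     * real structures differ. */
    typedef struct Bucket Bucket;

    typedef struct Buffer {
        Bucket *bufstart;   /* may change when a collection moves the data */
    } Buffer;

    struct Bucket {
        void *key;
    };

    /* Imaginary hashing entry points, defined elsewhere in this sketch's
     * world: the first may transcode (and so may allocate and collect),
     * the second is guaranteed not to. */
    unsigned hash_key_may_allocate(void *key);
    unsigned hash_key_no_allocate(void *key);

    /* GC-safe sweep: every access goes back through the Buffer header,
     * because the call might have moved the bucket array. */
    unsigned long sweep_safe(Buffer *hdr, unsigned nbuckets)
    {
        unsigned long total = 0;
        for (unsigned i = 0; i < nbuckets; i++)
            total += hash_key_may_allocate(hdr->bufstart[i].key);
        return total;
    }

    /* If hashing cannot allocate, the base pointer can be hoisted out of
     * the loop and kept in a register, which is what lets the compiler
     * and the hardware prefetcher do their jobs. */
    unsigned long sweep_fast(Buffer *hdr, unsigned nbuckets)
    {
        Bucket *b = hdr->bufstart;
        unsigned long total = 0;
        for (unsigned i = 0; i < nbuckets; i++)
            total += hash_key_no_allocate(b[i].key);
        return total;
    }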

Memory example:

Right now, we seem to use STRINGs with a 'singlebyte' encoding for raw
binary data (opaque bitstrings with no assumed semantics). If I
understand correctly, the 'singlebyte' encoding is unusable as a
canonical encoding because it cannot hold the full semantics of, e.g., a
UTF-32 string. So if your hash keys are large binary wads, they get
exploded by a factor of 4 when canonicalized to the most likely
candidate canonical encoding, UTF-32.

Is my understanding of the situation correct? If so, what's the best
way to get correct, fast hashes? And as a tertiary concern, can they
be small too?

Brainstormed possibilities for fixing the correctness bugs and
changing time/space utilization:

 - The obvious solution: always use UTF-32 as the encoding for hash
keys. (sometimes faster, sometimes slower, uses more space)

 - For hashing and/or comparison, go through the string_ord interface
to pull out each codepoint in turn. (slow, uses least space)

 - Associate a canonical encoding with each hashtable. If it is not
given on creation, initialize it to the encoding of the first string
stored in the hash. When a string with a different encoding is stored
in the hash, transcode it. If it cannot be transcoded (because the
hash's canonical encoding is incapable of storing the new key), widen
the hash's canonical encoding and transcode all previously stored
keys. A rough sketch of this scheme follows the list. (faster for more
things, uses intermediate space)

 - Add a string_hash entry to the encoding vtable. This is probably a
good idea in any case; hashes really have no business looking at their
keys' private parts. See the vtable sketch after the list. (faster for
some things, no effect on space, more overhead for adding encodings,
more likely to stay correct with new encodings)

 - As above, but assert that the string_hash vtable entry is
guaranteed not to trigger a collection. (Preferably by not allocating
any memory in the first place!) I think this really only helps the
resizing case, since normally there are more string comparisons than
hash computations anyway. (faster lookups, fast resizes, complicates
encoding implementation)

 - Provide a string_compare that is guaranteed not to trigger a
collection. For example, implement this by providing iterator entries
in the encoding vtable and iterating through the codepoints of the two
strings in parallel; the vtable sketch after this list includes such
iterators. But that could be pretty slow too. And you have to put the
iterator structures somewhere without triggering a collection!
(unknown effect on speed, uses least space)
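
To pin down the canonical-encoding-per-hashtable idea, here is a rough
sketch. Every name in it (encoding_can_represent,
widest_common_encoding, string_transcode, and so on) is invented for
the illustration rather than taken from the real string API:

    #include <stddef.h>

    /* "Widen" here just means "switch to an encoding that can represent
     * everything both encodings can". */
    typedef struct encoding encoding;
    typedef struct STRING   STRING;

    const encoding *string_encoding(const STRING *s);
    int             encoding_can_represent(const encoding *enc, const STRING *s);
    STRING         *string_transcode(STRING *s, const encoding *to); /* may allocate */
    const encoding *widest_common_encoding(const encoding *a, const encoding *b);

    typedef struct {
        const encoding *canonical;   /* NULL until the first key arrives */
        STRING        **keys;        /* stored keys, already canonicalized */
        size_t          nkeys;
    } Hash;

    /* Bring one key into the hash's canonical encoding, widening that
     * encoding (and re-transcoding every stored key) when it cannot
     * represent the new key. */
    STRING *canonicalize_key(Hash *h, STRING *key)
    {
        if (h->canonical == NULL)
            h->canonical = string_encoding(key);  /* adopt the first key's encoding */

        if (!encoding_can_represent(h->canonical, key)) {
            h->canonical = widest_common_encoding(h->canonical,
                                                  string_encoding(key));
            /* Rare, but O(n) when it happens; if hash values are taken
             * over the canonical bytes, they (and the bucket layout)
             * must be recomputed here as well. */
            for (size_t i = 0; i < h->nkeys; i++)
                h->keys[i] = string_transcode(h->keys[i], h->canonical);
        }

        if (string_encoding(key) != h->canonical)
            key = string_transcode(key, h->canonical);

        return key;
    }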
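
And to make the string_hash vtable entry and the iterator-based
comparison concrete, another sketch, again with invented shapes for
STRING and the encoding vtable rather than the real ones. One answer to
the "where do the iterator structures live" worry is simply the
caller's stack, so neither hashing nor comparing ever allocates, and
therefore neither can trigger a collection:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct encoding encoding;

    typedef struct {
        const void     *data;     /* raw bytes in whatever encoding 'enc' names */
        size_t          bytelen;
        const encoding *enc;
    } STRING;

    /* A small iterator kept on the caller's stack, so stepping through
     * codepoints never allocates. */
    typedef struct {
        const STRING *s;
        size_t        byteoff;
    } CODEPOINT_ITER;

    struct encoding {
        /* Hash by codepoints, so equal strings hash equally no matter
         * how they happen to be stored.  Must not allocate. */
        uint32_t (*string_hash)(const STRING *s, uint32_t seed);

        /* Iterator entries: initialize, then return the next codepoint,
         * or -1 at end of string.  Must not allocate. */
        void    (*iter_init)(const STRING *s, CODEPOINT_ITER *it);
        int32_t (*iter_next)(CODEPOINT_ITER *it);
    };

    /* A default string_hash an encoding could install: hash the decoded
     * codepoints, never the raw bytes. */
    uint32_t string_hash_by_codepoints(const STRING *s, uint32_t seed)
    {
        CODEPOINT_ITER it;
        uint32_t h = seed;
        int32_t  c;
        s->enc->iter_init(s, &it);
        while ((c = s->enc->iter_next(&it)) >= 0)
            h = h * 33 + (uint32_t)c;
        return h;
    }

    /* Encoding-independent, collection-free comparison: walk the two
     * strings' codepoints in parallel.  Slower than comparing
     * identically encoded data byte for byte, but it needs no
     * transcoding buffer. */
    int string_compare_nogc(const STRING *a, const STRING *b)
    {
        CODEPOINT_ITER ia, ib;
        a->enc->iter_init(a, &ia);
        b->enc->iter_init(b, &ib);
        for (;;) {
            int32_t ca = a->enc->iter_next(&ia);
            int32_t cb = b->enc->iter_next(&ib);
            if (ca != cb)
                return ca < cb ? -1 : 1;
            if (ca < 0)      /* both strings exhausted at the same point */
                return 0;
        }
    }

A per-encoding string_hash could of course be specialized (UTF-32 can
hash its 32-bit units directly, since they already are codepoints),
with something like string_hash_by_codepoints as the generic fallback.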

A last comment - with GC, speed and space are more interdependent than
usual. The more space you use, the more frequently you need to
collect.
