On 05/25/12 21:43, Ulf Zibis wrote:
To me it seems that computing the murmur hash is more expensive, especially on short strings, than the old hash algorithm.
It is definitely slower, but it also definitely has better statistical properties (fewer collisions when used in hash tables).

In its original (C) context, Murmur hash is often about as fast as other string hashes because it operates on 4-byte words rather than 1-byte elements at a time. So even though the per-element cost is much higher, it takes 1/4 the steps. When done on Java char[]s, though, it can only process 2 chars at a time. (Plus, we cannot necessarily read 32 bits at a time because of possible byte-swapping.) So it is a struggle to make it only twice as slow.

This means that any application that hashes strings only once will be slower than with the (old) hashCode, but any application that uses the (cached) hashes more than a few times will tend to run faster than the old hashCode version because of the higher-quality hash codes. A few tests so far confirm this.

Because performance is so sensitive to this re-use factor, it is hard to test the expected performance impact without other JVM/JDK changes. To get a better handle on this, I put together an emulation (using Unsafe) that stashes the murmur3 hash in the current String.hash field. This is not exactly correct for String objects that might be concurrently hashed by different threads (one via the hash32 emulation, one via plain hashCode), but it might be useful for people to try in contexts where this can't happen. See code at http://gee.cs.oswego.edu/dl/wwwtmp/Althash.java
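To make the two-chars-per-word point concrete, here is a minimal sketch of a Murmur3-style 32-bit hash over a Java char[], packing two 16-bit chars into each 32-bit word. The constants are the standard Murmur3_x86_32 block and finalization constants; the class and method names are my own illustration, not the actual code in Althash.java.

```java
// Sketch: Murmur3_x86_32-style hash consuming a char[] two chars
// (32 bits) per mixing step. Illustrative only; not Althash.java.
public final class Murmur3CharSketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    public static int hash32(char[] data, int seed) {
        int h = seed;
        int i = 0;
        // Body: pack two chars into one little-endian 32-bit word.
        for (; i + 1 < data.length; i += 2) {
            int k = data[i] | (data[i + 1] << 16);
            k *= C1;
            k = Integer.rotateLeft(k, 15);
            k *= C2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        // Tail: at most one leftover char (a 2-byte tail in Murmur3 terms).
        if (i < data.length) {
            int k = data[i];
            k *= C1;
            k = Integer.rotateLeft(k, 15);
            k *= C2;
            h ^= k;
        }
        // Finalization: mix in the byte length, then avalanche.
        h ^= data.length * 2;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash32("hello".toCharArray(), 0));
        System.out.println(hash32("hellp".toCharArray(), 0));
    }
}
```

Note how the body loop does roughly five arithmetic ops per two chars, versus one multiply-add per char in the old `h = 31*h + c` hashCode; that is the per-step cost Doug is weighing against the halved step count.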
So I think a developer should have influence on the hash algorithm used by a hash map.
I've had to make a lot of internal adjustments in hash tables to counteract crummy user hashCode() algorithms over the years, so I think that the less control, the better :-) -Doug
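One example of the kind of internal adjustment Doug mentions is the supplemental hash function that java.util.HashMap (JDK 6/7 era) applied to every user-supplied hashCode(), so that hashes differing only in their high bits don't all collide in a power-of-two-sized table. The `spread` method below uses those JDK constants; the surrounding demo class is my own illustration.

```java
// Demo: defensive bit-spreading applied on top of user hashCode().
public final class SupplementalHashDemo {
    // Supplemental hash (JDK 6/7 HashMap.hash(int)): XORs higher bits
    // downward so low-order index bits see the whole hash.
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int tableMask = 15;  // a 16-bucket table indexes with hash & 15
        // A "crummy" hashCode whose values differ only above bit 16:
        int a = 1 << 16, b = 2 << 16, c = 3 << 16;
        // Raw hashes all land in bucket 0...
        System.out.println((a & tableMask) + " " + (b & tableMask)
                + " " + (c & tableMask));       // prints "0 0 0"
        // ...but spread hashes land in distinct buckets.
        System.out.println((spread(a) & tableMask) + " "
                + (spread(b) & tableMask) + " "
                + (spread(c) & tableMask));     // prints "1 2 3"
    }
}
```

This spreading happens inside the map, invisible to callers, which is the sense in which the table can compensate for weak hash functions without handing users any extra control.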
