On May 28, 2012, at 12:34, Doug Lea wrote:

> On 05/25/12 21:43, Ulf Zibis wrote:
>> To me it seems that computing the murmur hash is more expensive than the
>> old hash algorithm, especially on short strings.
> 
> It is definitely slower, but also definitely has better statistical
> properties (fewer collisions when used in hash tables).
> 
> In its original (C) context, Murmur hash is often about as fast as other
> string hashes, because it operates on 4-byte words rather than 1-byte
> elements at a time. So even though the per-element cost is much higher, it
> takes 1/4 the steps. When done on Java char[]s though, it can only
> process 2 chars at a time. (Plus, we cannot necessarily read 32 bits at a
> time because of possible byte-swapping.) So it is a struggle to make
> it only twice as slow.

An intrinsic and/or native version might be added later if it turns out that 
the Java murmur3 implementation is a performance bottleneck. So far it hasn't 
been a particular bottleneck. Implementations are also free to change to a 
different algorithm for performance, security or scalability reasons as they 
see fit.
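
To make the two-chars-per-step point concrete, here is roughly what that loop
looks like in Java. This is only a sketch following the public MurmurHash3
x86_32 reference (the class and method names here are made up); it is not the
JDK's internal implementation.

final class Murmur3Sketch {

    // MurmurHash3 (x86_32 variant) over a char[], packing two 16-bit chars
    // into each 32-bit block so the main loop advances two chars per step.
    static int hash32(char[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int i = 0;

        // Body: consume two chars (one 32-bit block) per iteration.
        for (; i + 1 < data.length; i += 2) {
            int k1 = data[i] | (data[i + 1] << 16);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: a single trailing char when the length is odd.
        if (i < data.length) {
            int k1 = data[i];
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
        }

        // Finalization: mix in the length (in bytes) and avalanche the bits.
        h1 ^= data.length * 2;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }
}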

> This means that any application that hashes strings only once will
> be slower than using (old) hashCode, but any application that
> uses the (cached) hashes more than a few times will tend to run
> faster than the old hashCode version because of the higher quality hash codes.
> A few tests so far confirm this.

This was my result as well. Caching makes most of the difference between 
String.hashCode (the legacy algorithm) and String.hash32 (the murmur3 
implementation) disappear. I actually tested with a version of hashCode that 
recalculated the result 2X, 5X and 10X in a loop to look at the impact of 
caching. The conclusion was that even a 5X more expensive hashing algorithm was 
bearable as long as caching was present. There are still many cases where 
caching can't be present, such as parsing, where all of the items being hashed 
are new objects with no previously cached hash value. These cases require that 
we provide a reasonably performing hash algorithm. Murmur3 is slower than the 
legacy algorithm, but not slow enough to cripple overall usage, and its better 
distribution really shines on larger tables.
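
For reference, the caching in question is the same compute-once-and-reuse idiom 
String.hashCode has always used, so the hash function's cost is paid only the 
first time a given object is hashed. A minimal sketch of the pattern on a 
made-up key class (CachedKey is hypothetical, not JDK code):

import java.util.Arrays;

final class CachedKey {
    private final char[] value;
    private int hash;                // 0 means "not computed yet", as in String

    CachedKey(char[] value) {
        this.value = value.clone();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            for (char c : value) {
                h = 31 * h + c;      // legacy polynomial hash; any algorithm fits here
            }
            hash = h;                // cache for subsequent lookups
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedKey && Arrays.equals(value, ((CachedKey) o).value);
    }
}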

>> So I think a developer should have some influence on the hash algorithm to be
>> used by a hash map,
> 
> I've had to make a lot of internal adjustments in hash tables to
> counteract crummy user hashCode() algorithms over the years,
> so I think that the less control, the better :-)

I am in agreement with Doug. I've seen checked-in code with "return 0;" as the 
hashCode implementation for classes used as hash keys. One example, from 
someone who clearly didn't understand hashing, used a different random number 
as the hashCode result for each class.... 
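
To illustrate, here are two hypothetical keys along those lines (neither is the 
actual code I saw):

final class ZeroHashKey {
    private final String name;

    ZeroHashKey(String name) { this.name = name; }

    @Override
    public boolean equals(Object o) {
        return o instanceof ZeroHashKey && name.equals(((ZeroHashKey) o).name);
    }

    // Every instance hashes to the same bucket, so HashMap lookups with this
    // key type degrade to a linear scan over all entries.
    @Override
    public int hashCode() { return 0; }
}

final class PerClassRandomKey {
    // A "random" constant chosen once per class is no better: all instances
    // of the class still collide with one another.
    private static final int CLASS_HASH = 0x3ac5e821; // arbitrary

    private final String name;

    PerClassRandomKey(String name) { this.name = name; }

    @Override
    public boolean equals(Object o) {
        return o instanceof PerClassRandomKey && name.equals(((PerClassRandomKey) o).name);
    }

    @Override
    public int hashCode() { return CLASS_HASH; }
}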

Mike
