Jimmy, Jing Lv wrote:
> I've done some further study on hashCode and HashMap. I find it
> very interesting that the RI may have done something to
> identityHashCode, just as Sergey said. According to Tim's
> micro-benchmark on JIRA-4046 [1], Harmony classlib + IBM VME shows
> that only 4 buckets are used, each containing 250 elements, while the
> RI used all 16 buckets and each of them contains 62-64 elements.
> What's more, if we print the RI's objects' hashcodes, they end in all
> kinds of digits (e.g. 5b8827, 147c1db ...), while Harmony + IBM VME
> returns the exact address (not sure), which ends in an even digit
> (e.g. 77387738, 77787778 ...). I haven't tried DRLVM yet (is it just
> like IBM VME?)
I'd say that the IBM VME should do a better job!

> So if we do a little modification to the bucket index calculation in
> Tim's test, e.g.,
>     int bucket = hashCode >> 1 & (numBuckets - 1);
> it shows all buckets are used, though some of them get many more
> elements than others. And if we modify it as:
>     int bucket = hashCode >> 2 & (numBuckets - 1);
> it behaves just as the RI does: every bucket gets nearly the same
> number of elements.
> If we apply this (hashCode >> 2 & (numBuckets - 1)) to the HashMap
> bucket index algorithm, and run some micro-benchmarks like
> put/get/remove 10000 times [2], it improves by 7% (~28s to ~26s on my
> desktop).

That's cool (and of course you could rescale the divisor instead of
shifting the hashCode each time and get the same effect).

> But it is probably not a good idea for HashMap, since some
> user-defined hashcodes will work very badly with this algorithm. Some
> very famous benchmarks, like SPECjbb, will immediately drop 10-15% in
> score. It may not be a proper algorithm for the classlib bucket index
> calculation.

Agreed, and with apologies for sounding like a broken record, IMHO the
classlib code should not optimize for an inadequate distribution of an
undefined hashcode algorithm (unless it is so prevalent that it is a de
facto standard).

> So it still requires an algorithm like the one we discussed before,
> though it is not easy at all. If we agree on using &(bucketSize - 1)
> (because it is much faster than % prime),

I think we have agreed on that...

> we should find a way to gather all 32 bits of information into
> log2(bucketSize) bits. The operators '+', '|', '^', '>>' are used for
> this purpose (they are fast themselves). However, the bucket size is
> usually too small (at the beginning it is 16, i.e. 4 bits) to store
> all 32 bits of difference, so we have to lose some information after
> all.

Exactly, no amount of bit-twiddling will help you contain 32 bits of
information in 4 bits! The knack is knowing which four bits to select
from the hashCode.
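To make the effect concrete, here is a small self-contained sketch (not
Harmony's actual code) of the experiment described above: identity-hash
values that look like 4-byte-aligned addresses always have their two low
bits zero, so the plain `& (numBuckets - 1)` mask wastes them, while a
small right shift spreads the keys across all buckets.

```java
// Demonstrates why address-like hash codes cluster under a plain mask.
public class BucketDemo {
    static int bucket(int hashCode, int shift, int numBuckets) {
        return (hashCode >> shift) & (numBuckets - 1);
    }

    static int bucketsUsed(int shift) {
        int numBuckets = 16;
        boolean[] used = new boolean[numBuckets];
        // Simulate address-style hash codes: 4-byte aligned,
        // so the low 2 bits are always zero.
        for (int i = 0; i < 1000; i++) {
            int hash = 0x77380000 + i * 4;
            used[bucket(hash, shift, numBuckets)] = true;
        }
        int count = 0;
        for (boolean u : used) if (u) count++;
        return count;
    }

    public static void main(String[] args) {
        // With no shift only buckets 0, 4, 8, 12 are ever hit;
        // shifting by 2 first makes all 16 buckets reachable.
        System.out.println("shift 0: " + bucketsUsed(0) + " buckets used");
        System.out.println("shift 2: " + bucketsUsed(2) + " buckets used");
    }
}
```

This reproduces the 4-buckets-vs-16-buckets observation from the
micro-benchmark, under the assumption that the VM's identityHashCode is
essentially an aligned object address.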
Unless you know the distribution of the hashCode values is in fact a
subset of all the possible 32-bit values allowed by the spec, you can
only assume that they are random -- at which point you can select any
four bits.

> Let's first take two hypotheses: 1) hashcodes (and user-defined
> hashcodes) are evenly distributed in themselves; 2) however, the
> lower bits are more important than the higher bits (they cannot be
> equally important, especially with small bucket counts).

You mean more important to the hashmap implementation because they are
used by &(numBuckets - 1) to compute the bucket index, right?

> Under these two hypotheses, we can focus on the lower bits. This
> addresses Yuri's problem: if changes in the upper bits have no effect
> on the bucket index, it does not matter (as may happen in most
> cases). However, the index computation cannot be too complex; 4 or 5
> operators may be enough to distribute hashcodes more evenly across
> the buckets while avoiding extra cost in the lookup.
> After reading some papers and hash algorithms (most of them are too
> complex and time-consuming for us), currently I have an expression
> here:
>     index = hashcode ? hashcode >> a ? hashcode >> b (? hashcode >> c)
> where "?" stands for one of the operators '^', '+', '|', '&', and
> a, b, c stand for numbers less than 32, with a < b < c (I don't know
> if the last operator and c can be omitted).
> Our task is to find the best a, b, c and the operators between them
> to improve hashcode performance. Currently the best expressions I've
> found are:
>     (hash ^ hash >> 9) + hash >> 23
> or
>     (hash ^ hash >> 9) + (hash >> 15 | hash >> 23)
> Both of them work quite well on my micro-benchmark and improve
> SPECjbb a little (at least no worse than using "hash" alone :) )
> But I still believe it can be improved a lot.

I think you are optimizing for a particular implementation of
identityHashCode, and the class library code should be agnostic.
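For reference, here is one possible reading of the first candidate
expression as a supplemental mixing step. Note that in Java '>>' binds
tighter than '^' but looser than '+', so the expressions above are
ambiguous without explicit parentheses; the grouping below is an
assumption, meant only to illustrate the general technique of folding
higher bits into the low bits before masking.

```java
// Sketch of a supplemental hash "mixer" applied before the bucket mask.
public class Mixer {
    // Assumed grouping of "(hash ^ hash >> 9) + hash >> 23":
    // XOR and add right-shifted copies so high-order bits
    // influence the low-order bits used for the bucket index.
    static int mix(int hash) {
        return (hash ^ (hash >> 9)) + (hash >> 23);
    }

    static int bucketsUsed() {
        boolean[] used = new boolean[16];
        // Address-like hashcodes: 4-byte aligned, low 2 bits always zero.
        for (int i = 0; i < 1000; i++) {
            int hash = 0x77380000 + i * 4;
            used[mix(hash) & 15] = true;
        }
        int count = 0;
        for (boolean u : used) if (u) count++;
        return count;
    }

    public static void main(String[] args) {
        // The raw mask would hit only 4 buckets on these inputs;
        // mixing first reaches all 16.
        System.out.println(bucketsUsed() + " of 16 buckets used");
    }
}
```

On these aligned-address inputs the mixed index reaches all 16 buckets,
whereas masking the raw value would reach only 4; whether it helps or
hurts other hashcode distributions is exactly the open question in this
thread.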
> To sum up: 1) I believe Sergey's idea does help a lot in
> hash-related classes, which requires the VM guys to do some
> modification to identityHashCode (a simple guess: object address >> 2?)

I agree that the VM should do its best job at producing a good
distribution of values for identityHashCode.

> 2) For user-defined hashcodes and HashMap itself, based on
> &(bucketSize - 1), we can find good parameters for the expression I
> listed above, or find some similar expression to use, so that the
> values are distributed more evenly across the buckets.

Sorry for being so dumb, but how can you devise an expression that will
produce a better bucket index from a hashcode if you don't know anything
about the characteristics of the hashcode?

Regards,
Tim
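The "object address >> 2" guess in point 1) can be illustrated in a few
lines. This is a toy model, not any VM's actual implementation: for
4-byte-aligned addresses the two low bits are always zero, so dropping
them before handing the value out as an identity hash lets a simple
`& (numBuckets - 1)` mask see real variation.

```java
// Toy model of a VM-side identityHashCode tweak for aligned addresses.
public class IdentityHashSketch {
    // Hypothetical VM-side change: shift away the alignment bits.
    static int identityHash(long address) {
        return (int) (address >> 2);
    }

    static int bucketsUsed(boolean shifted) {
        boolean[] used = new boolean[16];
        for (int k = 0; k < 256; k++) {
            long addr = 0x77387738L + 4L * k; // 4-byte-aligned "addresses"
            int hash = shifted ? identityHash(addr) : (int) addr;
            used[hash & 15] = true;
        }
        int count = 0;
        for (boolean u : used) if (u) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println("raw address:  " + bucketsUsed(false) + " buckets");
        System.out.println("address >> 2: " + bucketsUsed(true) + " buckets");
    }
}
```

With the raw address only 4 of 16 buckets are ever used; with the shift
all 16 are, which is why fixing this on the VM side keeps the classlib
code agnostic about the hashcode's origin.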
