A number of readers took exception with my claim that
I don't trust users to  write a correct hashCode
implementation.  The assumption was that writing a
proper hash is isn't that hard, and my asserting that
people generally get it wrong was an insult to their
intelligence.  Not so.

Computing a uniform hash is hard.  Thus, lots of solid
developers will end up writing non-uniform hashes,
which will have a non-trivial performance impact on
their applications.

Consider the seemingly trivial example of a primary key
containing two integers:

        public class MyKey implements java.io.Serializable {

          public int a;
          public int b;

          public boolean equals(Object object) {
            if(object == null || !(object instanceof MyKey)) {
              return false;
            }
            MyKey that = (MyKey) object;
            return this.a == that.a && this.b == that.b;
          }

          public int hashCode() {
            return a ^ b;
          }

        }

This was not very hard, was it?  Unfortunately, it was
wrong.

(You might now stop reading, and try to figure out the
bug in the above code.  I will explain it in what follows.
Kudos if you do see the bug.  Hint: the bug is not in the
equals method, but in the hashCode method.)

How could our hash be wrong?  All we are doing is xor'ing
the values?  So long as a and b are uniform, then their xor
should also be uniform, right?  Right.

However, in real applications a and b are generally not
uniform.  Typically, numbers fall into specific ranges, in
the real world, which ranges are typically much smaller than
allowable range of the numeric type.  (If this were not the
case, we would have wanted our numbers to be "long", or
"BigDecimal", to make sure there was room for all the
possible values).

So, in real life a and b will fall into some range.  Let's
say that a falls into the range [0-A], and b falls into
the range [0-B].  Then our hashCode above will fall into
a range:

        [0-(A+B)]

The problem is that we have A*B elements, so we are squeezing
A*B elements into a range of A+B.  Thus, on average, we will
have:

        A*B/(A+B)

elements per hash.  This is horrible.  If we plug in some
numbers (A=100, B=100) we will have:

        10000/200 = 50

So we have 50 elements sharing a single hash.  For large
ranges things get worse.  This will turn all the EJB containers
nice little O(1) lookups (performed on Hashtables, etc.)
into O(n) lookups (since the Hashtables will deteriorate
into linked-lists, due to the overloaded hashing).

So, how can we fix this problem?  Think about it this way:
most numbers in a computer use only a few of right-most digits
available.  Thus, to implement a uniform hash across two
numbers, we need to have some of the digits on the left, and
some on the right.  Logical shifting and/or rotation comes to
mind.

I will leave the final hashCode implementation as an exercise
to the reader.  Hopefully, I made my case.  Implementing a
uniform hash is hard.  If you implement it wrong it can wreak
havoc on your performance.  IMO, it is very much a system-level
function, and something that would be best done by the
application server.

Of course, as I previously mentioned, it was an oversight that
our implementing these methods for the user would prevent the
user from using equals/hashCode to do something fancier than
simply comparing the value of two objects.  This was absolutely
an oversight, and we will look to address this limitation in a
future release.

But my basic point stands, I hope.  Before you decide to flame,
answer honestly:

1) Would you have implemented the above seemingly trivial
hashCode method correctly?

2) How long would it take you to discover that this non-uniform
hashCode implementation was the performance bottleneck in your
system?

-jkw

===========================================================================
To unsubscribe, send email to [EMAIL PROTECTED] and include in the body
of the message "signoff EJB-INTEREST".  For general help, send email to
[EMAIL PROTECTED] and include in the body of the message "help".

Reply via email to