--IPiIw4QAe+
Content-Type: text/plain; charset=us-ascii
Content-Description: message body text
Content-Transfer-Encoding: 7bit

Charles Cazabon <[EMAIL PROTECTED]> wrote:

>Dave Sill <[EMAIL PROTECTED]> wrote:
>> 
>> You're right. The "hashing" used here is a simple modulo.
>[...]
>> I can't see that primality would do anything special here.
>
>It does -- a large series of random numbers, modulo some number I, will result
>in an even distribution of results if and only if I is prime.  If I isn't
>prime, the results are skewed noticeably towards the low end.

Hmm.

On first reading that, I didn't believe it. I couldn't imagine how
the primality of the divisor could "magically" guarantee an even
distribution.

The first thing I did was Google for "hash prime modulo even
distribution". That turns up many repetitions of Charles' assertion,
without proof or explanation. I did find one clue, though, at:

http://www.cs.rpi.edu/courses/spring01/cs2/wksht22/wksht22.html

Which says:

  Research has shown that you get a more even distribution of hash
  values, and thus fewer collisions, if you choose your table size to
  be a prime number.

Being a ``Profile, don't speculate'' kind of guy, I decided to write a
little program to test modulo hashes, which is attached to this
message for your entertainment.

The result is that I can't see any effect of primality of the hash
table size on distribution. For example:

  $ ./hash 16
  size=16, reps=100000000, seed=0
  0: 6250114
  1: 6250151
  2: 6249941
  3: 6249981
  4: 6249971
  5: 6250134
  6: 6250221
  7: 6250195
  8: 6249542
  9: 6249840
  10: 6250200
  11: 6249700
  12: 6250055
  13: 6250101
  14: 6249832
  15: 6250022
  mean=6250000.000000, variance=36840.000000
  stddev=191.937485 (0.003071%)

The table size, 16, is about as non-prime as you can get, but the
distribution is quite even. Repeating with a table size of 17 shows no
improvement:

  $ ./hash 17
  size=17, reps=100000000, seed=0
  0: 5882787
  1: 5880754
  2: 5883273
  3: 5880598
  4: 5881230
  5: 5880577
  6: 5885196
  7: 5878233
  8: 5874942
  9: 5887715
  10: 5881680
  11: 5889068
  12: 5888613
  13: 5879609
  14: 5882129
  15: 5882443
  16: 5881153
  mean=5882352.000000, variance=13348593.000000
  stddev=3653.572754 (0.062111%)
  
So, I'm not sure exactly what was determined in the research mentioned
above, but it looks to me like everyone's heard the conclusion so many
times that they just accept it. I suspect it's only applicable when
the integers being hashed are fairly close to the size of the table.
  
-Dave


--IPiIw4QAe+
Content-Type: application/octet-stream
Content-Description: modulo hash tester
Content-Disposition: attachment;
        filename="hash.c"
Content-Transfer-Encoding: base64

I2luY2x1ZGUgPHN0ZGlvLmg+CiNpbmNsdWRlIDxzdGRsaWIuaD4KI2luY2x1ZGUgPG1hdGgu
aD4KCm1haW4oaW50IGFyZ2MsIGNoYXIgKiphcmd2KQp7CiAgaW50IGhhc2hbMTAwMDBdLCBp
OwogIGludCBzaXplPTE2LCByZXBzPTEwMDAwMDAwMCwgc2VlZD0wOwogIGxvbmcgajsKICBm
bG9hdCBtZWFuLCB2YXJpYW5jZT0wLCBzdGRkZXY7CgogIGlmIChhcmdjID49IDIpIHsKICAg
IHNzY2FuZiAoYXJndlsxXSwgIiVkIiwgJnNpemUpOwogICAgaWYgKHNpemUgPiAxMDAwMCkK
ICAgICAgZXhpdCAoMSk7CiAgfQogIGlmIChhcmdjID49IDMpCiAgICBzc2NhbmYgKGFyZ3Zb
Ml0sICIlZCIsICZyZXBzKTsKICBpZiAoYXJnYyA+PSA0KQogICAgc3NjYW5mIChhcmd2WzNd
LCAiJWQiLCAmc2VlZCk7CiAgbWVhbj1yZXBzL3NpemU7CiAgcHJpbnRmICgic2l6ZT0lZCwg
cmVwcz0lZCwgc2VlZD0lZFxuIiwgc2l6ZSwgcmVwcywgc2VlZCk7CiAgZm9yIChpPTA7IGk8
c2l6ZTsgaSsrKSB7CiAgICBoYXNoW2ldPTA7CiAgfQogIHNyYW5kNDggKHNlZWQpOwogIGZv
ciAoaT0wOyBpPHJlcHM7IGkrKykgewogICAgaj1scmFuZDQ4KCk7CiAgICBoYXNoW2olc2l6
ZV0rKzsKICB9CiAgZm9yIChpPTA7IGk8c2l6ZTsgaSsrKSB7CiAgICBwcmludGYgKCIlZDog
JWRcbiIsIGksIGhhc2hbaV0pOwogICAgdmFyaWFuY2UgKz0gcG93KGhhc2hbaV0gLSBtZWFu
LCAyLjApOwogIH0KICB2YXJpYW5jZSAvPSAoZmxvYXQpc2l6ZS0xOwogIHN0ZGRldj1zcXJ0
KHZhcmlhbmNlKTsKICBwcmludGYgKCJtZWFuPSVmLCB2YXJpYW5jZT0lZlxuIiwgbWVhbiwg
dmFyaWFuY2UpOwogIHByaW50ZiAoInN0ZGRldj0lZiAoJWYlKVxuIiwgc3RkZGV2LCBzdGRk
ZXYvbWVhbioxMDApOwp9Cg==

--IPiIw4QAe+--

Reply via email to