--IPiIw4QAe+
Content-Type: text/plain; charset=us-ascii
Content-Description: message body text
Content-Transfer-Encoding: 7bit
Charles Cazabon <[EMAIL PROTECTED]> wrote:
>Dave Sill <[EMAIL PROTECTED]> wrote:
>>
>> You're right. The "hashing" used here is a simple modulo.
>[...]
>> I can't see that primality would do anything special here.
>
>It does -- a large series of random numbers, modulo some number I, will result
>in an even distribution of results if and only if I is prime. If I isn't
>prime, the results are skewed noticeably towards the low end.
Hmm.
On first reading that, I didn't believe it. I couldn't imagine how
the primality of the divisor could "magically" guarantee an even
distribution.
The first thing I did was Google for "hash prime modulo even
distribution". That turns up many repetitions of Charles' assertion,
without proof or explanation. I did find one clue, though, at:
http://www.cs.rpi.edu/courses/spring01/cs2/wksht22/wksht22.html
Which says:
Research has shown that you get a more even distribution of hash
values, and thus fewer collisions, if you choose your table size to
be a prime number.
Being a ``Profile, don't speculate'' kind of guy, I decided to write a
little program to test modulo hashes, which is attached to this
message for your entertainment.
The result is that I can't see any effect of primality of the hash
table size on distribution. For example:
$ ./hash 16
size=16, reps=100000000, seed=0
0: 6250114
1: 6250151
2: 6249941
3: 6249981
4: 6249971
5: 6250134
6: 6250221
7: 6250195
8: 6249542
9: 6249840
10: 6250200
11: 6249700
12: 6250055
13: 6250101
14: 6249832
15: 6250022
mean=6250000.000000, variance=36840.000000
stddev=191.937485 (0.003071%)
The table size, 16, is about as non-prime as you can get, but the
distribution is quite even. Repeating with a table size of 17 shows no
improvement:
$ ./hash 17
size=17, reps=100000000, seed=0
0: 5882787
1: 5880754
2: 5883273
3: 5880598
4: 5881230
5: 5880577
6: 5885196
7: 5878233
8: 5874942
9: 5887715
10: 5881680
11: 5889068
12: 5888613
13: 5879609
14: 5882129
15: 5882443
16: 5881153
mean=5882352.000000, variance=13348593.000000
stddev=3653.572754 (0.062111%)
So, I'm not sure exactly what was determined in the research mentioned
above, but it looks to me like everyone's heard the conclusion so many
times that they just accept it. I suspect it's only applicable when
the integers being hashed are fairly close to the size of the table.
-Dave
--IPiIw4QAe+
Content-Type: application/octet-stream
Content-Description: modulo hash tester
Content-Disposition: attachment;
filename="hash.c"
Content-Transfer-Encoding: base64
I2luY2x1ZGUgPHN0ZGlvLmg+CiNpbmNsdWRlIDxzdGRsaWIuaD4KI2luY2x1ZGUgPG1hdGgu
aD4KCm1haW4oaW50IGFyZ2MsIGNoYXIgKiphcmd2KQp7CiAgaW50IGhhc2hbMTAwMDBdLCBp
OwogIGludCBzaXplPTE2LCByZXBzPTEwMDAwMDAwMCwgc2VlZD0wOwogIGxvbmcgajsKICBm
bG9hdCBtZWFuLCB2YXJpYW5jZT0wLCBzdGRkZXY7CgogIGlmIChhcmdjID49IDIpIHsKICAg
IHNzY2FuZiAoYXJndlsxXSwgIiVkIiwgJnNpemUpOwogICAgaWYgKHNpemUgPiAxMDAwMCkK
ICAgICAgZXhpdCAoMSk7CiAgfQogIGlmIChhcmdjID49IDMpCiAgICBzc2NhbmYgKGFyZ3Zb
Ml0sICIlZCIsICZyZXBzKTsKICBpZiAoYXJnYyA+PSA0KQogICAgc3NjYW5mIChhcmd2WzNd
LCAiJWQiLCAmc2VlZCk7CiAgbWVhbj1yZXBzL3NpemU7CiAgcHJpbnRmICgic2l6ZT0lZCwg
cmVwcz0lZCwgc2VlZD0lZFxuIiwgc2l6ZSwgcmVwcywgc2VlZCk7CiAgZm9yIChpPTA7IGk8
c2l6ZTsgaSsrKSB7CiAgICBoYXNoW2ldPTA7CiAgfQogIHNyYW5kNDggKHNlZWQpOwogIGZv
ciAoaT0wOyBpPHJlcHM7IGkrKykgewogICAgaj1scmFuZDQ4KCk7CiAgICBoYXNoW2olc2l6
ZV0rKzsKICB9CiAgZm9yIChpPTA7IGk8c2l6ZTsgaSsrKSB7CiAgICBwcmludGYgKCIlZDog
JWRcbiIsIGksIGhhc2hbaV0pOwogICAgdmFyaWFuY2UgKz0gcG93KGhhc2hbaV0gLSBtZWFu
LCAyLjApOwogIH0KICB2YXJpYW5jZSAvPSAoZmxvYXQpc2l6ZS0xOwogIHN0ZGRldj1zcXJ0
KHZhcmlhbmNlKTsKICBwcmludGYgKCJtZWFuPSVmLCB2YXJpYW5jZT0lZlxuIiwgbWVhbiwg
dmFyaWFuY2UpOwogIHByaW50ZiAoInN0ZGRldj0lZiAoJWYlKVxuIiwgc3RkZGV2LCBzdGRk
ZXYvbWVhbioxMDApOwp9Cg==
--IPiIw4QAe+--