@lightness1024 - fair criticism of C++ STL.

Modulo-prime reduction of the hash code to a table address/array index is slow 
because division is slow. In my tests, using modulo can sometimes **triple** the 
entire lookup time for simple integer keys due to the slowness of division! I 
recommend you consult Knuth's The Art of Computer Programming, Volume 3 (section 
6.4 in my 1998 edition, but his treatment mostly dates back to the 60s). He 
motivates why a prime may be a good choice and then immediately moves on to a 
multiplicative hash. If you really need to mix the bits, a multiply & shift is 
much faster than modulo. While the upper bits of the half-product do have more 
entropy, masking seems about as good in my experience and the code is slightly 
simpler due to the close relation of the mask and table size.
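To make that concrete, here is a minimal sketch (the proc names are made up and a power-of-two table size is assumed) of the multiply & shift and masking reductions I mean:

```nim
# Minimal sketch: two division-free ways to reduce a hash code to a table index.
proc multShiftIndex(key: uint64; bits: int): uint64 =
  ## Multiply & shift: keep the upper bits of the product, which mix the most.
  const knuthMult = 11400714819323198485'u64   # ~2^64 / golden ratio, a la Knuth
  (key * knuthMult) shr (64 - bits)            # table has 2^bits slots

proc maskIndex(hashCode, mask: uint64): uint64 =
  ## Masking: mask = tableSize - 1 when tableSize is a power of two.
  hashCode and mask
```

Either way the table size stays a power of two, growing is just doubling, and no division ever happens on a lookup.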

Knuth goes on at some length about choosing multipliers - about 10X as much 
length as he gives the modulo prime. For some reason (perhaps an overzealous 
affection for number theory among the CS theorists who write algorithms books, 
or maybe just simplicity?), "modulo prime" got repeated everywhere - in 
classes, lectures, textbooks, code, etc. - and sadly "stuck" in programming 
culture needlessly. Python even has an array of primes just below powers of 
two. The pedagogical situation seems to have been improving in the past decade 
or so.

In the context of Nim's Table, the most trivial identity hash is used for 
integers. If that is not good enough, a user can override the system hash 
functions with their own better-scrambled hash code producer. Believe it or not 
(I always recommend testing on your own real data...), the trivial hash is often 
plenty fast. While adequacy of randomization is data dependent, with linear 
probing and modern CPU cache prefetching even large collision clusters can be 
scanned quickly (especially compared to many "improved randomness" mitigations).
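In code, overriding looks roughly like this (a sketch with a hypothetical key type; Nim's Table just picks up whichever `hash` overload is in scope for the key):

```nim
import std/[hashes, tables]

type UserId = distinct uint64   # hypothetical key type with badly mixed low bits

proc hash*(x: UserId): Hash =
  ## Scramble the bits instead of taking the identity hash that plain ints get.
  let h = uint64(x) * 11400714819323198485'u64
  Hash(h shr 32)                # keep the higher-entropy upper bits

proc `==`*(a, b: UserId): bool {.borrow.}

var ids = initTable[UserId, string]()
ids[UserId(42)] = "answer"      # lookups and inserts use the hash proc above
```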

Here are two examples of improved randomness failing. You could use a super 
expensive cryptographic hash function to get super random hash code bits, but 
the time you spend computing it would likely be much more than the extra time 
spent scanning the bigger clusters a weaker hash produces. Another example is 
double hashing, which hops all over the table in a very cache-unfriendly way. 
The clusters are smaller, but the number of cache misses goes way up. To avoid 
dealing with hardware variation, people, including smart researchers, often use 
**machine-neutral surrogates** for actual time like "number of comparisons" or 
"cluster sizes". This can create all sorts of confusion and cross-talk when the 
surrogates are not very faithful to real machines.
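For intuition, here are the two probe sequences side by side (a sketch with hypothetical names; the mask is tableSize - 1 for a power-of-two table):

```nim
# Sketch: slot visited on attempt n when probing for a key.
proc linearProbe(h, n, mask: uint64): uint64 =
  ## Consecutive slots: each cache line holds several, and the prefetcher
  ## streams in the next lines, so even a long cluster scans cheaply.
  (h + n) and mask

proc doubleHashProbe(h1, h2, n, mask: uint64): uint64 =
  ## A key-dependent stride (forced odd so every slot is reachable in a
  ## power-of-two table) jumps around; most probes are fresh cache misses.
  (h1 + n * (h2 or 1)) and mask
```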

Also, this is slightly off-topic for Nim, but I looked at your package and you 
probably have a subtle bug. I don't see you counting "tombstone"/deleted 
markers - it is usually desirable to reorganize the table if there are too many 
of them. See [this thread](https://forum.nim-lang.org/t/829) for details. You 
also might like to read the [thread on hash functions](https://forum.nim-lang.org/t/1730).
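By "counting" I mean bookkeeping along these lines (a hedged sketch; the names and the 25% threshold are just illustrative, not from your code):

```nim
type SlotState = enum sFree, sUsed, sTombstone

proc markDeleted(states: var seq[SlotState]; i: int; tombstones: var int) =
  ## Deleting leaves a marker so later probes keep walking past this slot.
  states[i] = sTombstone
  inc tombstones

proc needsRebuild(states: seq[SlotState]; tombstones: int): bool =
  ## Once deleted markers pile up, lookups scan long dead chains even though
  ## the table looks sparse; rebuilding (re-inserting live slots) clears them.
  tombstones * 4 > states.len   # illustrative threshold: more than 25% tombstones
```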
