I recently learned about Fibonacci hashing (aka Knuth's Multiplicative
method) [1] as a way to have a power-of-two number of buckets and
still get decent distribution of incoming hash function values across
buckets for most real world scenarios, even when the user's hash
function is itself not uniformly distributed.  The key point is that
multiplying by the inverse of the golden ratio, and then taking the
high bits of the result with a shift, narrowing of a 32- or 64-bit
hash value down to the power-of-two number of buckets can be done more
efficiently than computing the remainder of a division by a prime
number.

[1] 
https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

But when trying to test it out in m4 sources to see if I could get
faster performance by swapping how I narrow a string hash down to a
bucket number, I noticed two things: hash.c is hard-coded to insisting
on a prime number of buckets, AND it does so in a manner that looks
unsafe:

In hash.c, there is nothing guaranteeing that candidate is odd on
entry:

> static size_t _GL_ATTRIBUTE_PURE
> compute_bucket_size (size_t candidate, const Hash_tuning *tuning)
> {
>   if (!tuning->is_n_buckets)
>     {
>       float new_candidate = candidate / tuning->growth_threshold;
>       if ((float) SIZE_MAX <= new_candidate)
>         goto nomem;
>       candidate = new_candidate;
>     }
>   candidate = next_prime (candidate);

But in next-prime.c, there is this contract:

> /* Return true if CANDIDATE is a prime number or 1.
>    CANDIDATE should be an odd number >= 1.  */
> static bool _GL_ATTRIBUTE_CONST
> is_prime (size_t candidate)
> {
>   size_t divisor = 3;
>   size_t square = divisor * divisor;
> 
>   for (;;)
>     {
>       if (square > candidate)
>         return true;
>       if ((candidate % divisor) == 0)
>         return false;
>       /* Increment divisor by 2.  */
>       divisor++;
>       square += 4 * divisor;
>       divisor++;
>     }
> }

which looks like it will mis-behave when handed an even number.
Should hash.c be patched to ensure 'next_prime (candidate | 1)', at
least as long as it insists on using a prime number of buckets with
integer modulo?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org


Reply via email to