Thanks for highlighting this. I did not use the original paper and based
the implementation on Wikipedia.

I think the issue is that we use i in [0, k); we can correct this by using
i in [1, k]. The order of operations inside the loop would not change, but we
would have to decrement i in the array index for the assignment, giving:

x, y := a(δ) MOD m, b(δ) MOD m
for i := 1 .. k
    f[i - 1] := x
    x := (x + y) MOD m
    y := (y + i) MOD m
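
Tracing the first few iterations with a = h1(x) and b = h2(x), ignoring the
MOD m reduction, gives

    a, a + b, a + 2 * b + 1, a + 3 * b + 4, a + 4 * b + 10, ...

which matches the sequence given in the paper.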

This would fix the issue, but it leaves the inelegant extra computation at the
end of the final iteration whose result is never used. Note that the current
implementation has this extra computation as well. We can remove it by reducing
the loop size by one and emitting the first index outside the loop, as in the
sketch below.
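
For concreteness, here is a minimal standalone Java sketch of that shape (not
the actual Commons Collections code; the method signature and the plain int[]
result are just for illustration):

static int[] indices(long hash1, long hash2, int k, int m) {
    int[] f = new int[k];
    long x = Long.remainderUnsigned(hash1, m);   // x := a(δ) MOD m
    long y = Long.remainderUnsigned(hash2, m);   // y := b(δ) MOD m
    f[0] = (int) x;                              // first index emitted outside the loop
    for (int i = 1; i < k; i++) {
        x = (x + y) % m;                         // x := (x + y) MOD m
        y = (y + i) % m;                         // y := (y + i) MOD m
        f[i] = (int) x;                          // f[i] := x
    }
    return f;
}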

Applying this change to the current repo produces 4 failures in
EnhancedDoubleHasherTest; these are corrected by updating the array of
expected indices.

Regards,

Alex




On Sat, 18 May 2024 at 07:23, Juan Manuel Gimeno Illa <jmgim...@gmail.com>
wrote:

> The code in the implementation of the EnhancedDoubleHasher is based on the
> wikipedia article https://en.wikipedia.org/wiki/Double_hashing.
>
> The code that appears in the article is:
>
> struct key; /// Opaque
> /// Use other data types when needed. (Must be unsigned for guaranteed wrapping.)
> extern unsigned int h1(struct key const *), h2(struct key const *);
>
> /// Calculate k hash values from two underlying hash functions
> /// h1() and h2() using enhanced double hashing.  On return,
> ///     hashes[i] = h1(x) + i*h2(x) + (i*i*i - i)/6.
> /// Takes advantage of automatic wrapping (modular reduction)
> /// of unsigned types in C.
> void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n)
> {
>     unsigned int a = h1(x), b = h2(x), i;
>
>     for (i = 0; i < n; i++) {
>         hashes[i] = a;
>         a += b; // Add quadratic difference to get cubic
>         b += i; // Add linear difference to get quadratic
>                 // i++ adds constant difference to get linear
>     }
> }
>
> And it is based on code from the paper "Bloom Filters in Probabilistic
> Verification" by Peter C.
> Dillinger and Panagiotis Manolios (
> https://www.khoury.northeastern.edu/~pete/pub/bloom-filters-verification.pdf
> )
>
> But the code that appears on that paper is:
>
> x, y := a(δ) MOD m, b(δ) MOD m
> f[0] := x
> for i := 1 .. k-1
>     x := (x + y) MOD m
>     y := (y + i) MOD m
>     f[i] := x
>
> which does the assignments to x, y (a, b in the wikipedia article) in a
> different order.
>
> The same algorithm appears in the thesis "Adaptive Approximate State Storage"
> (http://peterd.org/pcd-diss.pdf)
>
> algorithm " Enhanced DoubleHashing"
>
> input x : "the element being added or queried"
> input m : "number of bits in bit vector"
> input k : "number of indices to compute"
> output g[0 .. k - 1] : "array of indices"
> uses h1, h2 : "hash functions"
>
> begin
>     a := h1(x)
>     b := h2(x)
>     g[0] := a
>     for i := 1 .. k - 1
>     begin
>         a := (a + b) MOD m
>         b := (b + i) MOD m
>         g[i] := a
>     end
> end
>
> It's easy to see that the enhanced double hash, which computes
>
>     hash[i] = h1(x) + i * h2(x) + (i^3 - i) / 6
>
> with a = h1(x) and b = h2(x), generates the sequence:
>
>     a, a + b, a + 2 * b + 1, a + 3 * b + 4, a + 4 * b + 10, ...
>
> And both your implementation and the Wikipedia article generate
>
>     a, a + b, a + 2 * b, a + 3 * b + 1, a + 4 * b + 4, ...
>
> So the last term in the sum lags one step behind.
>
> The solution, I think, is as simple as swapping the assignments of index
> and inc in your implementation.
>
> I suppose the impact of this is nil, but I'm not expert enough to judge that.
>
> Thanks,
>
> Juan Manuel Gimeno
>
