Re: symmetric toeplitz hashing

Aisha Tammy Sat, 13 Jun 2020 14:30:06 -0700

On 6/13/20 9:19 AM, Theo Buehler wrote:
> On Sat, Jun 13, 2020 at 08:46:13AM -0400, David Higgs wrote:
>> On Fri, Jun 12, 2020 at 9:41 AM Theo Buehler <[email protected]> wrote:
>>
>>> I finally found the time to think about the mathematics of this some
>>> more and I'm now convinced that it's a sound construction. I hope that
>>> one or the other observation below will be useful for you.
>>>
>>> The hash as it is now can be proved to produce values in the full range
>>> of uint16_t, so that's good.
>>>
>>> As we discussed already, you can simplify the construction further.
>>>
>>
>> [snip]
>>
>>
>>> Next, I don't particularly like this magic STOEPLITZ_KEYSEED number, but
>>> I guess we can live with it.
>>>
>>> Another option would be to generate the key seed randomly. You will get
>>> a "good" hash that spreads out over all 16 bit values if and only if the
>>> random value has an odd number of binary digits.
>>>
>>> I haven't thought hard about this, but I don't immediately see a better
>>> way of generating such numbers than:
>>>
>>> int
>>> stoeplitz_good_seed(uint16_t seed)
>>> {
>>>         int             ones = 0;
>>>
>>>         while (seed > 0) {
>>>                 ones += seed % 2;
>>>                 seed /= 2;
>>>         }
>>>
>>>         return ones % 2;
>>> }
>>>
>>> uint16_t
>>> stoeplitz_seed_init(void)
>>> {
>>>         uint16_t        seed;
>>>
>>>         do {
>>>                 seed = arc4random() & UINT16_MAX;
>>>         } while (!stoeplitz_good_seed(seed));
>>>
>>>         return seed;
>>> }
>>>
>>> This will loop as long as it needs to get a good toeplitz key seed.
>>> Each time there is a 50% chance that it will find one, so this will
>>> need to loop n times with probability 1 / 2**n. This is basically the
>>> same situation as for arc4random_uniform() with an upper bound that is
>>> not a power of two.
>>>
>>
>> I neither read nor understand the math but assuming you've described it
>> correctly, you can do this more simply by realizing that a random bad seed
>> can be made into a good seed by toggling one (random) bit.
> 
> Yes. The thing is that you need to convince yourself that this is still
> uniformly distributed over the wanted numbers. But it's correct. In
> fact, it's enough to flip a fixed bit, so you can get away with one call
> to arc4random().


Its not immediately obvious because this is not always true.
Only true if your bad seed was at least surjective onto the desired codomain.

> 
>> You also
>> replace stoeplitz_good_seed() with __builtin_parity() and perhaps leverage
>> some intrinsics?
> 
> Others have pointed out off-list that one can use __builtin_popcount(),
> but __builtin_parity() is exactly what I want. Is it available on all
> architectures?
> 
>>
>> uint16_t
>> stoeplitz_seed_init(void) {
>>     uint16_t seed;
>>     seed = arc4random() & UINT16_MAX;
>>     if (!stoeplitz_good_seed(seed))
>>         seed ^= 1 << (arc4random() % 16);
>> }
>>
>> --david
>

Re: symmetric toeplitz hashing

Reply via email to