Dmitry Rubanovich added the comment:

Tim,

I am not testing randomly.  The scripts calculate secondary hash for **each** 
value in a ring to see how often this results in duplicate (or triplicate, 
etc.) values.  And they do it for each (mod 2**i) ring for i between 8 and 20.

Then (mostly as an afterthought) the scripts calculate how often each scheme 
results in a collision if more than 1 secondary hash index is calculated.  I 
used 3 secondary indexes as a demonstration of the "afterthought" part.

I do understand that the logic relies on being able to reach each value in 
0..-1+2**i in the worst case.  Isn't that though accomplished by my latest 
proposal? It was this:

"...unroll the 1st secondary hash key to use j = 2*j + P + 1.  So try to test 
for it before the loop.  But leave 5*j + P + 1 in the loop as is."

In the code that would mean changing of the loop in, for example, 
lookdict_index() from 

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    }
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;
    }
}

to this:

size_t perturb;
....

perturb = hash;
i = mask & ((i << 1) + perturb + 1); /* <---- key line */
ix = dk_get_index(k, i);
if (ix == index) {
    return i;
}
if (ix == DKIX_EMPTY) {
    return DKIX_EMPTY;
}

for (;;) {
    /* nothing changes in this loop */
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
    ix = dk_get_index(k, i);
    if (ix == index) {
        return i;
    }
    if (ix == DKIX_EMPTY) {
        return DKIX_EMPTY;
    }
}

And, of course, it would mean adding the same precheck in front of all loops 
which go through this secondary index calculation.

This prevents preventable collisions for the hashes, h, such that

h mod 2**k is equal to (h >> 5) mod 2**k,

where 2**k is the dict size.  This frequency of such occurrence for each dict 
size is what's printed by 

print_collision_counts(py3_6_1_lookdict_perturb) 

in either of the attached scripts.  Given that, for instance, it's 91 for 
dict's of size 256, it seems rather ripe for improvement.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30671>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to