Re: [Python-Dev] Counting collisions for the win

Frank Sievertsen Mon, 23 Jan 2012 00:55:03 -0800

Hello,

I'd still prefer to see a randomized hash()-function (at least for 3.3).


But to protect against the attacks it would be sufficient to use
randomization for collision resolution in dicts (and sets).

What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached.

The benefits:

* protection against the known attacks
* hash(X) stays stable and the same
* dict order is only changed when there are many collisions
* doctests will not break
* enhanced collision resolution
* RNG doesn't have to be initialized in smaller programs
* nearly no slowdown of most dicts
* second hash-function is only used for keys with higher collision-rate
* lower probability to leak secrets
* possibility to use different secrets for each dict

The drawback:

* need to add a second hash-function
* slower than using one hash-function only, when > 20 collisions
* need to add this to container-types? (if used for py3.3)
* need to expose this to the user? (if used for py3.3)
* works only for datatypes with this new function
* possible to implement without breaking ABI?

The following code is meant for explanation purpose only:

for(perturb = hash; ; perturb >>= 5) {
    i = (i << 2) + i + perturb + 1;

    if((collisions++) == 20) {
        // perturb is already zero after 13 rounds.
        // 20 collisions are rare.

        // you can add && (ma_mask > 256) to make 100% sure
        // that it's not used for smaller dicts.

        if(Py_TYPE(key)->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH) {
            // If type has a randomized hash, use this now for lookup
            i = perturb = PyObject_RandomizedHash(key));
        }
   .....


If I got this right we could add a new function "tp_randomized_hash"
to 3.3 release. But can we also add functions in older releases, without
breaking ABI?

If not, can we implement this somehow using a flag?

FOR OLDER RELEASE < 3.3:

Py_hash_t
PyObject_RandomizedHash(PyVarObject *o) {
    PyTypeObject *tp = Py_TYPE(v);
    if(! (tp->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH))
        return -1;

    global_flags_somewhere->USE_RANDOMIZED_HASH = 1;
    return (*tp->tp_hash)(v);
}

.... and in unicodeobject.c: (and wherever we need randomization)

static Py_hash_t
unicode_hash(PyUnicodeObject *self)
{
    Py_ssize_t len;
    Py_UNICODE *p;
    Py_hash_t x;
    Py_hash_t prefix=0;
    Py_hash_t suffix=0;

    if(global_flags_somewhere->USE_RANDOMIZED_HASH) {
        global_flags_somewhere->USE_RANDOMIZED_HASH = 0;

initialize_rng_if_not_already_done_and_return_seed(&prefix,&suffix);


    ..... (and don't cache in this case) .....


It's ugly, but if I understand this correctly, the GIL will
protect us against race-conditions, right?

Hello, internals experts: Would this work or is there a better way to do
this without breaking the ABI?

Frank
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Counting collisions for the win

Reply via email to