[issue13703] Hash collision security issue

Paul McMillan Thu, 05 Jan 2012 13:40:09 -0800

Paul McMillan <p...@mcmillan.ws> added the comment:

Marc-Andre: Victor already pasted the relevant part of my code:
http://bugs.python.org/issue13703#msg150568
The link to the fuller version, with revision history and a copy of the code 
before I modified it is here:
https://gist.github.com/0a91e52efa74f61858b5


>Why? The attack doesn't work with short strings? What do you call a "short 
>string"?

Well, the demonstrated collision is for 16 character ascii strings. Worst case 
UTF-8, we're looking at 3 manipulable bytes per character, but they may be 
harder to collide since some of those bytes are fixed.

> only be making it harder for script kiddies, but as soon as someone
> crypt-analysis the used hash algorithm, you're lost again.

Not true. What I propose is to make the amount of information necessary to 
analyze and generate collisions impractically large. My proposed hash function 
is certainly broken if you brute force the lookup table. There are undoubtedly 
other problems with it too. The point is that it's hard enough. We aren't going 
for perfect security - we're going for enough to make this attack impractical.

What are the downsides to counting collisions? For one thing, it's something 
that needs to be kept track of on a per-dict basis, and can't be cached the way 
the hash results are. How do you choose a good value for the limit? If you set 
it to something conservative, you still pay the collision price every time a 
dict  is created to discover that the keys collide. This means that it's 
possible to feed to bad data up to exactly the limit, and suddenly the python 
app is inexplicably slow. If you set the limit too aggressively, then sometimes 
valid data gets caught, and python randomly dies in hard to debug ways with an 
error the programmer has never seen in testing and cannot reproduce.

It adds a new way to kill most python applications, and so programs are going 
to have to be re-written to cope with it. It also introduces a new place to 
cause errors - if the WSGI server dies, it's hard for my application to catch 
that and recover gracefully.

>... not in Python itself, but if you consider all the types in Python
> extensions and classes implementing __hash__ in user code, the number
> of hash functions to fix quickly becomes unmanageable.

When we looked at the Django project, we wouldn't have anything to fix since 
ours end up relying on the python internal values eventually. I suspect a lot 
of other code is similar.

Mark said:
>What is the mechanism by which the attacker can determine the seeds?

The biggest information leak is probably the ordering in which dict entries are 
returned. This can be used to deduce the underlying hash values. This is much 
easier than trying to do it via timing.

> But that's not the issue we are supposed to be dealing with.
> A single (genuinely random) seed will deal with the attack described in 
> the talk and it is (almost) as fast as using 0 as a seed.

This is not true. A single random seed shifts the hash table, but does not 
actually prevent an attacker from generating collisions. Please see my other 
posts on the topic here and on the mailing list.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13703>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue13703] Hash collision security issue

Reply via email to