Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-18 Thread Victor Stinner
2012/1/18 Martin v. Löwis mar...@v.loewis.de:
 For 3.3 onwards, I'm skeptical whether all this configuration support is
 really necessary. I think a much smaller patch which leaves no choice
 would be more appropriate.

The configuration helps unit testing: see changes on Lib/test/*.py in
my last patch. I hesitate to say that the configuration is required
for tests. Anyway, users upgrading from Python 3.2 to 3.3 may need to
keep the same hash function and may not care about security (e.g. programs
running locally with trusted data).

Victor
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-18 Thread Hrvoje Niksic

On 01/17/2012 09:29 PM, Martin v. Löwis wrote:

I(0) = H & MASK
PERTURB(0) = H
I(n+1) = (5*I(n) + 1 + PERTURB(n)) & MASK
PERTURB(n+1) = PERTURB(n) >> 5

So if two objects O1 and O2 have the same hash value H, the sequence of
probed indices is the same for any MASK value. It will be a different
sequence, yes, but they will still collide on each and every slot.

This is the very nature of open addressing.


Open addressing can still deploy a collision resolution mechanism 
without this property. For example, double hashing uses a different hash 
function (applied to the key) to calculate PERTURB(0). To defeat it, the 
attacker would have to produce keys that hash the same using both hash 
functions.
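As an editorial illustration of the scheme Hrvoje describes, here is a toy double-hashing probe over integer keys. This is not CPython's probing scheme; the function and constants are invented for the example. The probe step comes from a second hash of the key, so two keys that collide on the initial slot usually diverge on later probes:

```python
# Toy sketch of double hashing (illustrative, NOT CPython's scheme).

def probe_sequence(key, table_size, max_probes=6):
    """Return the first few slots probed for an integer key."""
    i = key % table_size                     # first hash: initial slot
    # Second hash: forced odd so it is coprime with a power-of-two
    # table size, guaranteeing the sequence eventually visits every slot.
    step = ((key // table_size) | 1) % table_size or 1
    seq = []
    for _ in range(max_probes):
        seq.append(i)
        i = (i + step) % table_size
    return seq

# Keys 3 and 35 collide on slot 3 of a 16-slot table, but their probe
# sequences diverge afterwards because their second hashes differ.
```

An attacker would have to find keys that agree under both hashes to keep the probe chains colliding, which is exactly the point made above.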


Double hashing is not a good general solution for Python dicts because 
it complicates the interface of hash tables that support arbitrary keys. 
Still, it could be considered for dicts with known key types (built-ins 
could hardcode the alternative hash function) or for SafeDicts, if they 
are still considered.


Hrvoje


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread martin

It doesn't change anything, you will still get collisions.



That depends right? If the collision is because they all have the same
hash(), yes. It might be different if it is because the secondary hashing
(or whatever it's called :-) causes collisions.


But Python deals with the latter case just fine already. The open hashing
approach relies on the dict resizing enough to prevent collisions after
the dictionary has grown. Unless somebody can demonstrate a counter example,
I believe this discussion is a red herring.

Plus: if an attacker could craft keys that deliberately cause collisions
because of the dictionary size, they could likely also craft keys in the same
number that collide on actual hash values, bringing us back to the original
problem.

Regards,
Martin




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Stephen J. Turnbull
mar...@v.loewis.de writes:
   It doesn't change anything, you will still get collisions.
  
  
   That depends right? If the collision is because they all have the same
   hash(), yes. It might be different if it is because the secondary hashing
   (or whatever it's called :-) causes collisions.
  
  But Python deals with the latter case just fine already. The open hashing
  approach relies on the dict resizing enough to prevent collisions after
  the dictionary has grown. Unless somebody can demonstrate a counter example,
  I believe this discussion is a red herring.
 
  Plus: if an attacker could craft keys that deliberately cause collisions
  because of the dictionary size, they could likely also craft keys in the same
  number that collide on actual hash values, bringing us back to the original
  problem.

I thought that the original problem was that with N insertions in the
dictionary, by repeatedly inserting different keys generating the same
hash value an attacker could arrange that the cost of finding an open
slot is O(N), and thus the cost of N insertions is O(N^2).
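The quadratic behaviour Stephen describes is easy to reproduce without any knowledge of the string hash, using a hypothetical key type (invented here for the demonstration) whose hash is constant:

```python
# Demonstration of the O(N^2) behaviour: every key hashes to the same
# value, so each insert walks the entire existing collision chain.

class Evil:
    eq_calls = 0                  # count the (slow) equality comparisons

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 0                  # all keys collide on the same hash

    def __eq__(self, other):
        Evil.eq_calls += 1
        return isinstance(other, Evil) and self.n == other.n

d = {}
for i in range(200):
    d[Evil(i)] = None

# Inserting the i-th key compares it against roughly i existing keys,
# so the total number of __eq__ calls grows roughly as N*(N-1)/2
# rather than staying near N.
```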

If so, frequent resizing could make the attacker's problem much more
difficult, as the distribution of secondary probes should change with
each resize.


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Victor Stinner
 I thought that the original problem was that with N insertions in the
 dictionary, by repeatedly inserting different keys generating the same
 hash value an attacker could arrange that the cost of finding an open
 slot is O(N), and thus the cost of N insertions is O(N^2).

 If so, frequent resizing could make the attacker's problem much more
 difficult, as the distribution of secondary probes should change with
 each resize.

The attack creates 60,000 strings (or more) with exactly the same hash
value. A dictionary uses hash(str) & DICT_MASK to compute the bucket
index, where DICT_MASK is the number of buckets minus one. If all
strings have the same hash value, we always start in the same bucket
and the key has to be compared to all previous strings to find the
next empty bucket. The attack works because a LOT of strings are
compared and comparing strings is slow.

If hash(str1) & DICT_MASK == hash(str2) & DICT_MASK but
hash(str1) != hash(str2), the strings are not compared (this is a common
optimization in Python), and so the attack would not be successful
(it would be slow, but not as slow as comparing two strings).
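The optimization Victor mentions can be sketched as follows. This is a simplified linear table, not CPython's C implementation; the slot layout and names are invented. Each slot caches the key's full hash, and the potentially slow key comparison runs only on an exact hash match:

```python
# Sketch of the lookup optimization: compare cached full hashes first,
# fall back to the slow key comparison only on an exact hash match.

class Slot:
    __slots__ = ("h", "key", "value")
    def __init__(self, h, key, value):
        self.h, self.key, self.value = h, key, value

def lookup(slots, key):
    """Return (value, number_of_slow_key_comparisons)."""
    h = hash(key)
    comparisons = 0
    for slot in slots:            # stands in for walking a probe sequence
        if slot.h != h:
            continue              # cheap integer compare rejects the slot
        comparisons += 1
        if slot.key == key:       # slow comparison only on full-hash match
            return slot.value, comparisons
    return None, comparisons
```

Keys that collide only on hash(str) & DICT_MASK are rejected by the cheap integer compare, which is why the attack needs full-hash collisions.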

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Victor Stinner
I finished my patch transforming hash(str) to a randomized hash
function, see random-8.patch attached to the issue:
http://bugs.python.org/issue13703

The remaining question is which random number generator should be used
on Windows to initialize the hash secret (CryptoGen adds an overhead
of 10%, at least when the DLL is loaded dynamically), read the issue
for the details.

I plan to commit my fix to Python 3.3 if it is accepted. Then write a
simplified version to Python 3.2 and backport it to 3.1. Then backport
the simplified fix to 2.7, and finally to 2.6.

The vulnerability has been public for one month; it may be time to fix
it before it is widely exploited.

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Jeremy Sanders
Victor Stinner wrote:

 If hash(str1) & DICT_MASK == hash(str2) & DICT_MASK but
 hash(str1) != hash(str2), the strings are not compared (this is a common
 optimization in Python), and so the attack would not be successful
 (it would be slow, but not as slow as comparing two strings).

It's a shame the hash function can't take a second salt parameter to include 
in the hash. Each dict could have its own salt, generated from a quick 
pseudo-random generator.

Jeremy





Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Jeremy Sanders
Jeremy Sanders wrote:

 Victor Stinner wrote:
 
 If hash(str1) & DICT_MASK == hash(str2) & DICT_MASK but
 hash(str1) != hash(str2), the strings are not compared (this is a common
 optimization in Python), and so the attack would not be successful
 (it would be slow, but not as slow as comparing two strings).
 
 It's a shame the hash function can't take a second salt parameter to
 include in the hash. Each dict could have its own salt, generated from a
 quick pseudo-random generator.

Please ignore... forgot that the hashes are cached for strings!

Jeremy





Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Martin v. Löwis
 I thought that the original problem was that with N insertions in the
 dictionary, by repeatedly inserting different keys generating the same
 hash value an attacker could arrange that the cost of finding an open
 slot is O(N), and thus the cost of N insertions is O(N^2).
 
 If so, frequent resizing could make the attacker's problem much more
 difficult, as the distribution of secondary probes should change with
 each resize.

Not sure what you mean by distribution of secondary probes.

Let H be the initial hash value, and let MASK be the current size
of the dictionary. Then I(n), the sequence of dictionary indices
being probed, is computed as

   I(0) = H & MASK
   PERTURB(0) = H
   I(n+1) = (5*I(n) + 1 + PERTURB(n)) & MASK
   PERTURB(n+1) = PERTURB(n) >> 5

So if two objects O1 and O2 have the same hash value H, the sequence of
probed indices is the same for any MASK value. It will be a different
sequence, yes, but they will still collide on each and every slot.

This is the very nature of open addressing. If it *wouldn't* try all
indices in the probe sequence, it may not be possible to perform
the lookup for a key correctly.
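The recurrence above transcribes directly to Python. Since the sequence is a pure function of H and MASK, any two objects with the same hash probe identical slots for every MASK; resizing changes MASK and hence the sequence, but the objects still collide slot-for-slot:

```python
# Direct transcription of the probe recurrence described above.

def probe_indices(h, mask, n=8):
    """First n slot indices probed for hash value h with table mask."""
    perturb = h
    i = h & mask
    seq = [i]
    for _ in range(n - 1):
        i = (5 * i + 1 + perturb) & mask
        perturb >>= 5
        seq.append(i)
    return seq

# The sequence depends only on (h, mask): two objects whose hashes are
# both h share every probed slot, for any table size.
```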

Regards,
Martin


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Martin v. Löwis
 I plan to commit my fix to Python 3.3 if it is accepted. Then write a
 simplified version to Python 3.2 and backport it to 3.1.

I'm opposed to any change to the hash values of strings in maintenance
releases, so I guess I'm opposed to your patch in principle.

See my next message for an alternative proposal.

 The vulnerability has been public for one month; it may be time to fix
 it before it is widely exploited.

I don't think there is any urgency. The vulnerability has been known for
more than five years now. From creating a release to the point where
the change actually arrives at end users, many months will pass.

Regards,
Martin


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Victor Stinner
 I plan to commit my fix to Python 3.3 if it is accepted. Then write a
 simplified version to Python 3.2 and backport it to 3.1.

 I'm opposed to any change to the hash values of strings in maintenance
 releases, so I guess I'm opposed to your patch in principle.

If randomized hashes cannot be turned on by default, an alternative is
to switch them off by default and add an option (command line option,
environment variable, etc.) to enable them.

 The vulnerability has been public for one month; it may be time to fix
 it before it is widely exploited.

 I don't think there is any urgency. The vulnerability has been known for
 more than five years now. From creating a release to the point where
 the change actually arrives at end users, many months will pass.

In 2003, Python was not seen as vulnerable. Maybe because the hash
function is different from Perl's hash function, or because nobody tried
to generate collisions. Today it is clear that Python is vulnerable
(the 64-bit version is also affected), and it's really fast to generate
collisions using the right algorithm.
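To see why collision generation is fast in principle, here is a deliberately weak toy rolling hash (invented for this note, NOT CPython's string hash). The trick behind the published attacks is the same: find two equal-length blocks with equal hashes, and every concatenation of those blocks collides, yielding exponentially many colliding strings:

```python
# Toy illustration of fast collision generation with an invented hash.

import itertools

def toy_hash(s):
    """A weak 32-bit rolling hash, for illustration only."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

# "ab" and "bC" are equal-length blocks with the same toy hash:
# 31*97 + 98 == 31*98 + 67 == 3105.
blocks = ["ab", "bC"]

# Because h(x + y) == h(x) * 31**len(y) + h(y) (mod 2**32), all 2**8
# concatenations of 8 blocks share a single hash value.
colliding = ["".join(p) for p in itertools.product(blocks, repeat=8)]
```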

Why is it taking so long to fix the vulnerability in Python, when it
was fixed quickly in Ruby? (They chose to use a randomized hash.)

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread martin

If randomized hashes cannot be turned on by default, an alternative is
to switch them off by default and add an option (command line option,
environment variable, etc.) to enable them.


That won't really fix the problem. If people install a new release because
it fixes a vulnerability, it had better do so.


In 2003, Python was not seen as vulnerable. Maybe because the hash
function is different from Perl's hash function, or because nobody tried
to generate collisions. Today it is clear that Python is vulnerable
(the 64-bit version is also affected), and it's really fast to generate
collisions using the right algorithm.


There is the common vulnerability to the threat of confusing threats
with vulnerabilities [1]. Python was vulnerable all the time, and nobody
claimed otherwise. It's just that nobody saw it as a threat. I still
don't see it as a practical threat, as there are many ways that people
use in practice to protect against this threat already. But I understand
that others feel threatened now.


Why is it taking so long to fix the vulnerability in Python, when it
was fixed quickly in Ruby? (They chose to use a randomized hash.)


Because the risk of breakage for Python is much higher than it is for Ruby.

Regards,
Martin

[1] http://jps.anl.gov/Volume4_iss2/Paper3-RGJohnston.pdf



Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Gregory P. Smith
On Tue, Jan 17, 2012 at 12:52 PM, Martin v. Löwis mar...@v.loewis.de wrote:

  I plan to commit my fix to Python 3.3 if it is accepted. Then write a
  simplified version to Python 3.2 and backport it to 3.1.

 I'm opposed to any change to the hash values of strings in maintenance
 releases, so I guess I'm opposed to your patch in principle.


Please at least consider his patch for 3.3 onwards then.  Changing the hash
seed per interpreter instance / process is the right thing to do going
forward.

What to do on maintenance releases is a separate discussion.

-gps


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-17 Thread Martin v. Löwis
Am 18.01.2012 07:06, schrieb Gregory P. Smith:
 
 On Tue, Jan 17, 2012 at 12:52 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 
  I plan to commit my fix to Python 3.3 if it is accepted. Then write a
  simplified version to Python 3.2 and backport it to 3.1.
 
 I'm opposed to any change to the hash values of strings in maintenance
 releases, so I guess I'm opposed to your patch in principle.
 
 
 Please at least consider his patch for 3.3 onwards then.  Changing the
 hash seed per interpreter instance / process is the right thing to do
 going forward.

For 3.3 onwards, I'm skeptical whether all this configuration support is
really necessary. I think a much smaller patch which leaves no choice
would be more appropriate.

Regards,
Martin


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Gregory P. Smith
On Sun, Jan 15, 2012 at 9:44 AM, Guido van Rossum gu...@python.org wrote:

 On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel stefan...@behnel.de wrote:

 Guido van Rossum, 15.01.2012 17:10:
  On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
  Terry Reedy, 14.01.2012 06:43:
  On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
 
   It is perfectly okay to break existing users who had anything depending
   on ordering of internal hash tables. Their code was already broken.
 
   Given that the doc says "Return the hash value of the object", I do not
   think we should be so hard-nosed. The above clearly implies that there is
   such a thing as *the* Python hash value for an object. And indeed, that
   has been true across many versions. If we had written "Return a hash
   value for the object, which can vary from run to run", the case would be
   different.
 
   Just a side note, but I don't think hash() is the right place to
   document this.
 
  You mean we shouldn't document that the hash() of a string will vary per
  run?

 No, I mean that the hash() builtin function is not the right place to
 document the behaviour of a string hash. That should go into the string
 object documentation.

  Although, arguably, it may be worth mentioning in the docs of hash() that,
  in general, hash values of builtin types are bound to the lifetime of the
  interpreter instance (or entire runtime?) and may change after restarts. I
  think that's a reasonable restriction to document that prominently, even if
  it will only apply to str for the time being.


 Actually it will apply to a lot more than str, because the hash of
 (immutable) compound objects is often derived from the hash of the
 constituents, e.g. hash of a tuple.


  Hashing is a protocol in Python, just like indexing or iteration.
   Nothing keeps an object from changing its hash value due to modification,
 
   Eh? There's a huge body of cultural awareness that only immutable objects
  should define a hash, implying that the hash remains constant during the
  object's lifetime.
 
  and that would even be valid in the face of the usual dict lookup
   invariants if changes are only applied while the object is not referenced
   by any dict.
 
  And how would you know it isn't?

  Well, if it's an object with a mutable hash then it's up to the application
  defining that object to make sure it's used in a sensible way. Immutability
  just makes your life easier. I can imagine that an object gets removed from
  a dict (say, a cache), modified and then reinserted, and I think it's valid
  to allow the modification to have an impact on the hash in this case, in
  order to accommodate for any changes to equality comparisons due to the
  modification.


 That could be considered valid only in a very abstract, theoretical,
 non-constructive way, since there is no protocol to detect removal from a
 dict (and you cannot assume an object is used in only one dict at a time).


 That being said, it seems that the Python docs actually consider constant
 hashes a requirement rather than a virtue.

 http://docs.python.org/glossary.html#term-hashable

 
  An object is hashable if it has a hash value which never changes during its
 lifetime (it needs a __hash__() method), and can be compared to other
 objects (it needs an __eq__() or __cmp__() method). Hashable objects which
 compare equal must have the same hash value.
 

  It also seems to me that the wording "has a hash value which never changes
  during its lifetime" makes it pretty clear that the lifetime of the hash
  value is not guaranteed to supersede the lifetime of the object (although
  that's a rather muddy definition - memory lifetime? or pickle-unpickle as
  well?).


 Across pickle-unpickle it's not considered the same object. Pickling at
 best preserves values.


Updating the docs to explicitly clarify this sounds like a good idea.  How
does this wording to be added to the glossary.rst hashing section sound?

Hash values may not be stable across Python processes and must not be
used for storage or otherwise communicated outside of a single Python
session.
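The proposed wording is easy to demonstrate under the randomization being discussed. This sketch assumes a CPython that honours the PYTHONHASHSEED environment variable (the knob later releases ended up providing); with different seeds, the same string hashes differently in different processes:

```python
# Demonstration of per-process hash instability via PYTHONHASHSEED.

import os
import subprocess
import sys

def hash_in_subprocess(s, seed):
    """Compute hash(s) in a fresh interpreter with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({s!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# A hash stored by one process is meaningless to a process started with
# a different seed, which is exactly why persisting hashes is forbidden.
```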

-gps


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Paul McMillan
 As I understand it, the way the attack works is that a *single*
 malicious request from the attacker can DoS the server by eating CPU
 resources while evaluating a massive collision chain induced in a dict
 by attacker supplied data. Explicitly truncating the collision chain
 boots them out almost immediately (likely with a 500 response for an
 internal server error), so they no longer affect other events, threads
 and processes on the same machine.

This is only true in the specific attack presented at 28c3. If an
attacker can insert data without triggering the attack, it's possible
to produce (in the example of a web application) urls that (regardless
of the request) always produce pathological behavior. For example, a
collection of pathological usernames might make it impossible to list
users (and so choose which ones to delete) without resorting to
removing the problem data at an SQL level.

This is why the "simply throw an error" solution isn't a complete fix.
Making portions of an interface unusable for regular users is clearly
a bad thing, and is clearly applicable to other types of poisoned data
as well. We need to detect collisions and work around them
transparently.

 However, such an app would have been crippled by the original DoS
 anyway, since its performance would have been gutted - the collision
 chain limiting just means it will trigger exceptions for the cases
 that would have been insanely slow.

We can do better than saying "it would have been broken before, it's
broken differently now". The universal hash function idea has merit,
and for practical purposes hash randomization would fix this too
(since colliding data is only likely to collide within a single
process, persistent poisoning is far less feasible).

-Paul


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Tim Delaney
On 17 January 2012 09:23, Paul McMillan p...@mcmillan.ws wrote:

 This is why the "simply throw an error" solution isn't a complete fix.
 Making portions of an interface unusable for regular users is clearly
 a bad thing, and is clearly applicable to other types of poisoned data
 as well. We need to detect collisions and work around them
 transparently.


What if in a pathological collision (e.g. > 1000 collisions), we increased
the size of a dict by a small but random amount? Should be transparent,
have negligible speed penalty, maximal reuse of existing code, and should be
very difficult to attack since the dictionary would change size in a (near)
non-deterministic manner when being attacked (i.e. first attack causes
non-deterministic remap, next attack should fail).

It should also have near-zero effect on existing tests and frameworks since
we would only get the non-deterministic behaviour in pathological cases,
which we would presumably need new tests for.

Thoughts?

Tim Delaney


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Tim Delaney
On 17 January 2012 10:14, Tim Delaney timothy.c.dela...@gmail.com wrote:

 On 17 January 2012 09:23, Paul McMillan p...@mcmillan.ws wrote:

  This is why the "simply throw an error" solution isn't a complete fix.
 Making portions of an interface unusable for regular users is clearly
 a bad thing, and is clearly applicable to other types of poisoned data
 as well. We need to detect collisions and work around them
 transparently.


 What if in a pathological collision (e.g. > 1000 collisions), we increased
 the size of a dict by a small but random amount? Should be transparent,
 have negligible speed penalty, maximal reuse of existing code, and should be
 very difficult to attack since the dictionary would change size in a (near)
 non-deterministic manner when being attacked (i.e. first attack causes
 non-deterministic remap, next attack should fail).

 It should also have near-zero effect on existing tests and frameworks
 since we would only get the non-deterministic behaviour in pathological
 cases, which we would presumably need new tests for.

 Thoughts?


And one thought I had immediately after hitting send is that there could be
an attack of the form "build a huge dict, then hit it with something that
causes it to rehash due to > 1000 collisions". But that's not really going
to be any worse than just building a huge dict and hitting a resize anyway.

Tim Delaney


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Victor Stinner
2012/1/17 Tim Delaney timothy.c.dela...@gmail.com:
 What if in a pathological collision (e.g. > 1000 collisions), we increased
 the size of a dict by a small but random amount?

It doesn't change anything, you will still get collisions.

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-16 Thread Guido van Rossum
On Mon, Jan 16, 2012 at 4:16 PM, Victor Stinner 
victor.stin...@haypocalc.com wrote:

 2012/1/17 Tim Delaney timothy.c.dela...@gmail.com:
  What if in a pathological collision (e.g. > 1000 collisions), we increased
  the size of a dict by a small but random amount?

 It doesn't change anything, you will still get collisions.


That depends right? If the collision is because they all have the same
hash(), yes. It might be different if it is because the secondary hashing
(or whatever it's called :-) causes collisions.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-15 Thread Stefan Behnel
Terry Reedy, 14.01.2012 06:43:
 On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
 
 It is perfectly okay to break existing users who had anything depending
 on ordering of internal hash tables. Their code was already broken.
 
 Given that the doc says "Return the hash value of the object", I do not
 think we should be so hard-nosed. The above clearly implies that there is
 such a thing as *the* Python hash value for an object. And indeed, that has
 been true across many versions. If we had written "Return a hash value for
 the object, which can vary from run to run", the case would be different.

Just a side note, but I don't think hash() is the right place to document
this. Hashing is a protocol in Python, just like indexing or iteration.
Nothing keeps an object from changing its hash value due to modification,
and that would even be valid in the face of the usual dict lookup
invariants if changes are only applied while the object is not referenced
by any dict. So the guarantees do not depend on the function hash() and may
be even weaker than your above statement.

Stefan



Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-15 Thread Guido van Rossum
On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel stefan...@behnel.de wrote:

 Terry Reedy, 14.01.2012 06:43:
  On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
 
  It is perfectly okay to break existing users who had anything depending
  on ordering of internal hash tables. Their code was already broken.
 
  Given that the doc says "Return the hash value of the object", I do not
  think we should be so hard-nosed. The above clearly implies that there is
  such a thing as *the* Python hash value for an object. And indeed, that has
  been true across many versions. If we had written "Return a hash value for
  the object, which can vary from run to run", the case would be different.

 Just a side note, but I don't think hash() is the right place to document
 this.


You mean we shouldn't document that the hash() of a string will vary per
run?


 Hashing is a protocol in Python, just like indexing or iteration.
 Nothing keeps an object from changing its hash value due to modification,


Eh? There's a huge body of cultural awareness that only immutable objects
should define a hash, implying that the hash remains constant during the
object's lifetime.


 and that would even be valid in the face of the usual dict lookup
 invariants if changes are only applied while the object is not referenced
 by any dict.


And how would you know it isn't?


 So the guarantees do not depend on the function hash() and may
 be even weaker than your above statement.


There are no actual guarantees for hash(), but lots of rules for
well-behaved hashes.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-15 Thread Stefan Behnel
Guido van Rossum, 15.01.2012 17:10:
 On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
 Terry Reedy, 14.01.2012 06:43:
 On 1/13/2012 8:58 PM, Gregory P. Smith wrote:

 It is perfectly okay to break existing users who had anything depending
 on ordering of internal hash tables. Their code was already broken.

  Given that the doc says "Return the hash value of the object", I do not
  think we should be so hard-nosed. The above clearly implies that there is
  such a thing as *the* Python hash value for an object. And indeed, that has
  been true across many versions. If we had written "Return a hash value for
  the object, which can vary from run to run", the case would be different.

 Just a side note, but I don't think hash() is the right place to document
 this.
 
 You mean we shouldn't document that the hash() of a string will vary per
 run?

No, I mean that the hash() builtin function is not the right place to
document the behaviour of a string hash. That should go into the string
object documentation.

Although, arguably, it may be worth mentioning in the docs of hash() that,
in general, hash values of builtin types are bound to the lifetime of the
interpreter instance (or entire runtime?) and may change after restarts. I
think that's a reasonable restriction to document that prominently, even if
it will only apply to str for the time being.


 Hashing is a protocol in Python, just like indexing or iteration.
 Nothing keeps an object from changing its hash value due to modification,
 
 Eh? There's a huge body of cultural awareness that only immutable objects
 should define a hash, implying that the hash remains constant during the
 object's lifetime.
 
 and that would even be valid in the face of the usual dict lookup
 invariants if changes are only applied while the object is not referenced
 by any dict.
 
 And how would you know it isn't?

Well, if it's an object with a mutable hash then it's up to the application
defining that object to make sure it's used in a sensible way. Immutability
just makes your life easier. I can imagine that an object gets removed from
a dict (say, a cache), modified and then reinserted, and I think it's valid
to allow the modification to have an impact on the hash in this case, in
order to accommodate any changes to equality comparisons due to the
modification.

That being said, it seems that the Python docs actually consider constant
hashes a requirement rather than a virtue.

http://docs.python.org/glossary.html#term-hashable


An object is hashable if it has a hash value which never changes during its
lifetime (it needs a __hash__() method), and can be compared to other
objects (it needs an __eq__() or __cmp__() method). Hashable objects which
compare equal must have the same hash value.


It also seems to me that the wording "has a hash value which never changes
during its lifetime" makes it pretty clear that the lifetime of the hash
value is not guaranteed to supersede the lifetime of the object (although
that's a rather muddy definition - memory lifetime? or pickle-unpickle as
well?).

However, this entry in the glossary only seems to have appeared with Py2.6,
likely as a result of the abc changes. So it won't help in defending a
change to the hash function.


 So the guarantees do not depend on the function hash() and may
 be even weaker than your above statement.
 
 There are no actual guarantees for hash(), but lots of rules for
 well-behaved hashes.

Absolutely.

Stefan



Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-15 Thread Gregory P. Smith
On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel stefan...@behnel.de wrote:

 It also seems to me that the wording "has a hash value which never changes
  during its lifetime" makes it pretty clear that the lifetime of the hash
 value is not guaranteed to supersede the lifetime of the object (although
 that's a rather muddy definition - memory lifetime? or pickle-unpickle as
 well?).


Lifetime to me means of that specific instance of the object. I would not
expect that to survive pickle-unpickle.


 However, this entry in the glossary only seems to have appeared with Py2.6,
 likely as a result of the abc changes. So it won't help in defending a
 change to the hash function.


Ugh, I really hope there is no code out there depending on the hash
function being the same across a pickle and unpickle boundary.
 Unfortunately the hash function was last changed in 1996 in
http://hg.python.org/cpython/rev/839f72610ae1 so it is possible someone
somewhere has written code blindly assuming that non-guarantee is true.

-gps


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-15 Thread Antoine Pitrou
On Sun, 15 Jan 2012 17:46:36 +0100
Stefan Behnel stefan...@behnel.de wrote:
 Guido van Rossum, 15.01.2012 17:10:
  On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
  Terry Reedy, 14.01.2012 06:43:
  On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
 
  It is perfectly okay to break existing users who had anything depending
  on ordering of internal hash tables. Their code was already broken.
 
  Given that the doc says "Return the hash value of the object", I do not
  think we should be so hard-nosed. The above clearly implies that there is
  such a thing as *the* Python hash value for an object. And indeed, that has
  been true across many versions. If we had written "Return a hash value for
  the object, which can vary from run to run", the case would be different.
 
  Just a side note, but I don't think hash() is the right place to document
  this.
  
  You mean we shouldn't document that the hash() of a string will vary per
  run?
 
 No, I mean that the hash() builtin function is not the right place to
 document the behaviour of a string hash. That should go into the string
 object documentation.

No, but we can document that *any* hash() value can vary between runs
without being specific about which builtin types randomize their
hashes right now.

Regards

Antoine.




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-15 Thread Guido van Rossum
On Sun, Jan 15, 2012 at 8:46 AM, Stefan Behnel stefan...@behnel.de wrote:

 Guido van Rossum, 15.01.2012 17:10:
  On Sun, Jan 15, 2012 at 6:30 AM, Stefan Behnel wrote:
  Terry Reedy, 14.01.2012 06:43:
  On 1/13/2012 8:58 PM, Gregory P. Smith wrote:
 
  It is perfectly okay to break existing users who had anything
 depending
  on ordering of internal hash tables. Their code was already broken.
 
  Given that the doc says "Return the hash value of the object", I do not
  think we should be so hard-nosed. The above clearly implies that there is
  such a thing as *the* Python hash value for an object. And indeed, that has
  been true across many versions. If we had written "Return a hash value for
  the object, which can vary from run to run", the case would be different.
 
  Just a side note, but I don't think hash() is the right place to
 document
  this.
 
  You mean we shouldn't document that the hash() of a string will vary per
  run?

 No, I mean that the hash() builtin function is not the right place to
 document the behaviour of a string hash. That should go into the string
 object documentation.

 Although, arguably, it may be worth mentioning in the docs of hash() that,
 in general, hash values of builtin types are bound to the lifetime of the
 interpreter instance (or entire runtime?) and may change after restarts. I
 think that's a reasonable restriction to document that prominently, even if
 it will only apply to str for the time being.


Actually it will apply to a lot more than str, because the hash of
(immutable) compound objects is often derived from the hash of the
constituents, e.g. hash of a tuple.
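This propagation is easy to demonstrate with a toy object whose hash we control (a sketch, not CPython's actual tuple-hash algorithm; the `FixedHash` class is a hypothetical stand-in):

```python
# A tuple's hash is derived from its elements' hashes, so randomizing
# str hashes would also perturb the hash of any tuple containing a str.
# FixedHash is a toy object whose hash we set directly.

class FixedHash:
    def __init__(self, h):
        self._h = h

    def __hash__(self):
        return self._h

a, b = FixedHash(1), FixedHash(2)

# Identical structure, different element hash => different tuple hash.
print(hash((a,)) != hash((b,)))  # True
```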


  Hashing is a protocol in Python, just like indexing or iteration.
  Nothing keeps an object from changing its hash value due to
 modification,
 
  Eh? There's a huge body of cultural awareness that only immutable objects
  should define a hash, implying that the hash remains constant during the
  object's lifetime.
 
  and that would even be valid in the face of the usual dict lookup
  invariants if changes are only applied while the object is not
 referenced
  by any dict.
 
  And how would you know it isn't?

 Well, if it's an object with a mutable hash then it's up to the application
 defining that object to make sure it's used in a sensible way. Immutability
 just makes your life easier. I can imagine that an object gets removed from
 a dict (say, a cache), modified and then reinserted, and I think it's valid
 to allow the modification to have an impact on the hash in this case, in
 order to accommodate any changes to equality comparisons due to the
 modification.


That could be considered valid only in a very abstract, theoretical,
non-constructive way, since there is no protocol to detect removal from a
dict (and you cannot assume an object is used in only one dict at a time).


 That being said, it seems that the Python docs actually consider constant
 hashes a requirement rather than a virtue.

 http://docs.python.org/glossary.html#term-hashable

 
 An object is hashable if it has a hash value which never changes during its
 lifetime (it needs a __hash__() method), and can be compared to other
 objects (it needs an __eq__() or __cmp__() method). Hashable objects which
 compare equal must have the same hash value.
 

 It also seems to me that the wording "has a hash value which never changes
  during its lifetime" makes it pretty clear that the lifetime of the hash
 value is not guaranteed to supersede the lifetime of the object (although
 that's a rather muddy definition - memory lifetime? or pickle-unpickle as
 well?).


Across pickle-unpickle it's not considered the same object. Pickling at
best preserves values.

However, this entry in the glossary only seems to have appeared with Py2.6,
 likely as a result of the abc changes. So it won't help in defending a
 change to the hash function.


  So the guarantees do not depend on the function hash() and may
  be even weaker than your above statement.
 
  There are no actual guarantees for hash(), but lots of rules for
  well-behaved hashes.

 Absolutely.


-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Stephen J. Turnbull
Jack Diederich writes:
  On Thu, Jan 12, 2012 at 9:57 PM, Guido van Rossum gu...@python.org wrote:
   Hm... I started out as a big fan of the randomized hash, but thinking more
   about it, I actually believe that the chances of some legitimate app having
  1000 collisions are way smaller than the chances that somebody's code will
   break due to the variable hashing.
  
  Python's dicts are designed to avoid hash conflicts by resizing and
  keeping the available slots bountiful.  1000 conflicts sounds like a
  number that couldn't be hit accidentally

I may be missing something, but AIUI, with the resize, the search for
an unused slot after collision will be looking in a different series
of slots, so the N counter for the N^2 behavior resets on resize.  If
not, you can delete this message now.

If so, since (a) in the error-on-many-collisions approach we're adding
a test here for collision count anyway and (b) we think this is almost
never gonna happen, can't we defuse the exploit by just resizing the
dict after 1000 collisions, with strictly better performance than the
error approach, and almost current performance for normal input?

In order to prevent attackers from exploiting every 1000th collision
to force out-of-memory, the expansion factor for collision-induced
resizing could be very small.  (I don't know if that's possible in
the Python dict implementation, if the algorithm requires something
like doubling the dict size on every resize this is right out, of
course.)

Or, since this is an error/rare path anyway, offer the user a choice
of an error or a resize on hitting 1000 collisions?


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Antoine Pitrou
On Sat, 14 Jan 2012 04:45:57 +0100
mar...@v.loewis.de wrote:
  What an implementation looks like:
 
   http://pastebin.com/9ydETTag
 
  some stuff to be filled in, but this is all that is really required.
 
 I think this statement (and the patch) is wrong. You also need to change
 the byte string hashing, at least for 2.x. This I consider the biggest
 flaw in that approach - other people may have written string-like objects
 which continue to compare equal to a string but now hash different.

They're unlikely to have rewritten the hash algorithm by hand -
especially given the caveats wrt. differences between Python integers
and C integers.
Rather, they would have returned the hash() of the equivalent str or
unicode object.

Regards

Antoine.




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Antoine Pitrou
On Sat, 14 Jan 2012 13:55:22 +1100
Steven D'Aprano st...@pearwood.info wrote:
 On 14/01/12 12:58, Gregory P. Smith wrote:
 
  I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be
  back ported to any Python version.
 
  It is perfectly okay to break existing users who had anything depending on
  ordering of internal hash tables. Their code was already broken.
 
 For the record:
 
steve@runes:~$ python -c "print(hash('spam ham'))"
-376510515
steve@runes:~$ jython -c "print(hash('spam ham'))"
2054637885

Not to mention:

$ ./python -c "print(hash('spam ham'))"
-6071355389066156083

(64-bit CPython)

Regards

Antoine.




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread martin

I think this statement (and the patch) is wrong. You also need to change
the byte string hashing, at least for 2.x. This I consider the biggest
flaw in that approach - other people may have written string-like objects
which continue to compare equal to a string but now hash different.


They're unlikely to have rewritten the hash algorithm by hand -
especially given the caveats wrt. differences between Python integers
and C integers.


See the CHAR_HASH macro in
http://hg.python.org/cpython/file/e78f00dbd7ae/Modules/expat/xmlparse.c

It's not *that* unlikely that more copies of that algorithm exist.

Regards,
Martin




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Martin v. Löwis
Am 13.01.2012 18:08, schrieb Mark Dickinson:
 On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum gu...@python.org wrote:
 How
 pathological the data needs to be before the collision counter triggers? I'd
 expect *very* pathological.
 
 How pathological do you consider the set
 
 {1 << n for n in range(2000)}
 
 to be?  

I think this is not a counter-example for the proposed algorithm (at
least not in the way I think it should be implemented).

Those values may collide on the slot in the set, but they don't collide
on the actual hash value.

So in order to determine whether the collision limit is exceeded, we
shouldn't count colliding slots, but colliding hash values (which we
will all encounter during an insert).

 though admittedly only around 30 collisions per hash value.

I do consider the case of hashing integers with only one bit set
pathological. However, this can be overcome by factoring the magnitude
of the number into the hash as well.

Regards,
Martin


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Martin v. Löwis
Am 14.01.2012 01:37, schrieb Benjamin Peterson:
 2012/1/13 Guido van Rossum gu...@python.org:
 Really? Even though you came up with it specifically to prove me wrong?
 
 Coming up with a counterexample now invalidates it?

There are two concerns here:
- is it possible to come up with an example of constructed values that
  show many collisions in a way that poses a threat? To this, the answer
  is apparently yes, and the proposed reaction is to hard-limit the
  number of collisions accepted by the implementation.
- then, *assuming* such a limitation is in place: is it possible to come
  up with a realistic application that would break under this
  limitation. Mark's example is no such realistic application, instead,
  it is yet another example demonstrating collisions using constructed
  values (although the specific example would continue to work fine
  even under the limitation).

A valid counterexample would have to come from a real application, or
at least from a scenario that is plausible for a real application.

Regards,
Martin


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Gregory P. Smith
My patch example does change the bytes object hash as well as Unicode.
On Jan 13, 2012 7:46 PM, mar...@v.loewis.de wrote:

 What an implementation looks like:

  http://pastebin.com/9ydETTag

 some stuff to be filled in, but this is all that is really required.


 I think this statement (and the patch) is wrong. You also need to change
 the byte string hashing, at least for 2.x. This I consider the biggest
 flaw in that approach - other people may have written string-like objects
 which continue to compare equal to a string but now hash different.

 Regards,
 Martin





Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Gregory P. Smith
FWIW the quick change i pastebin'ed is basically covered by the change
already under review in http://bugs.python.org/review/13704/show.  I've
made my comments and suggestions there.

I looked into Modules/expat/xmlparse.c and it has an odd copy of the old
string hash algorithm entirely for its own internal use and its own
internal hash table implementations.  That module is likely vulnerable to
creatively crafted documents for the same reason.  With 13704 and the
public API it provides to get the random hash seed, that module could
simply be updated to use that in its own hash implementation.

As for when to enable it or not, I unfortunately have to agree, despite my
wild desires we can't turn on the hash randomization change by default in
anything prior to 3.3.

-gps

On Sat, Jan 14, 2012 at 11:17 AM, Gregory P. Smith g...@krypto.org wrote:

 My patch example does change the bytes object hash as well as Unicode.
 On Jan 13, 2012 7:46 PM, mar...@v.loewis.de wrote:

  What an implementation looks like:

  http://pastebin.com/9ydETTag

 some stuff to be filled in, but this is all that is really required.


 I think this statement (and the patch) is wrong. You also need to change
 the byte string hashing, at least for 2.x. This I consider the biggest
 flaw in that approach - other people may have written string-like objects
 which continue to compare equal to a string but now hash different.

 Regards,
 Martin






Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Steven D'Aprano

Victor Stinner wrote:


- Marc Andre Lemburg proposes to fix the vulnerability directly in
dict (for any key type). The patch raises an exception if a lookup
causes more than 1000 collisions.



Am I missing something? How does this fix the vulnerability? It seems to me 
that the only thing this does is turn one sort of DOS attack into another sort 
of DOS attack: hostile users will just cause hash collisions until an 
exception is raised and the application falls over.


Catching these exceptions, and recovering from them (how?), would be the 
responsibility of the application author. Given that developers are unlikely 
to ever see 1000 collisions by accident, or even realise that it could happen, 
I don't expect that many people will do that -- until they personally get bitten.




--
Steven


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Steven D'Aprano

Guido van Rossum wrote:

On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith g...@krypto.org wrote:


It is perfectly okay to break existing users who had anything depending on
ordering of internal hash tables. Their code was already broken. We 
*will* provide a flag and/or environment variable that can be set to turn the
feature off at their own peril which they can use in their test harnesses
that are stupid enough to use doctests with order dependencies.



No, that is not how we usually take compatibility between bugfix releases.
"Your code is already broken" is not an argument to break forcefully what
worked (even if by happenstance) before. The difference between CPython and
Jython (or between different CPython feature releases) also isn't relevant
-- historically we have often bent over backwards to avoid changing
behavior that was technically undefined, if we believed it would affect a
significant fraction of users.

I don't think anyone doubts that this will break lots of code (at least,
the arguments I've heard have been "their code is broken", not "nobody does
that").


I don't know about lots of code, but it will break at least one library (or 
so I'm told):


http://mail.python.org/pipermail/python-list/2012-January/1286535.html




--
Steven


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-14 Thread Nick Coghlan
On Sun, Jan 15, 2012 at 2:42 PM, Steven D'Aprano st...@pearwood.info wrote:
 Victor Stinner wrote:

 - Marc Andre Lemburg proposes to fix the vulnerability directly in
 dict (for any key type). The patch raises an exception if a lookup
 causes more than 1000 collisions.



 Am I missing something? How does this fix the vulnerability? It seems to me
 that the only thing this does is turn one sort of DOS attack into another
 sort of DOS attack: hostile users will just cause hash collisions until an
 exception is raised and the application falls over.

 Catching these exceptions, and recovering from them (how?), would be the
 responsibility of the application author. Given that developers are unlikely
 to ever see 1000 collisions by accident, or even realise that it could
 happen, I don't expect that many people will do that -- until they
 personally get bitten.

As I understand it, the way the attack works is that a *single*
malicious request from the attacker can DoS the server by eating CPU
resources while evaluating a massive collision chain induced in a dict
by attacker supplied data. Explicitly truncating the collision chain
boots them out almost immediately (likely with a 500 response for an
internal server error), so they no longer affect other events, threads
and processes on the same machine.

In some ways, the idea is analogous to the way we implement explicit
recursion limiting in an attempt to avoid actually blowing the C stack
- we take a hard-to-detect-and-hard-to-handle situation (i.e. blowing
the C stack or malicious generation of long collision chains in a
dict) and replace it with something that is easy to detect and can be
handled by normal exception processing (i.e. a recursion depth
exception or one reporting an excessive number of slot collisions in a
dict lookup).

That then makes the default dict implementation safe from this kind of
attack by default, and use cases that are getting that many collisions
legitimately can be handled in one of two ways:
- switch to a more appropriate data type (if you're getting that many
collisions with benign data, a dict is probably the wrong container to
be using)
- offer a mechanism (command line switch or environment variable) to
turn the collision limiting off

Now, where you can still potentially run into problems is if a single
shared dict is used to store both benign and malicious data - if the
malicious data makes it into the destination dict before the exception
finally gets triggered, and then benign data also happens to trigger
the same collision chain, then yes, the entire app may fall over.
However, such an app would have been crippled by the original DoS
anyway, since its performance would have been gutted - the collision
chain limiting just means it will trigger exceptions for the cases
that would have been insanely slow.
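The mechanism Nick describes can be sketched with a minimal open-addressing table (illustrative only, not CPython's dict; `MAX_COLLISIONS`, `CollisionLimitError`, and `LimitedTable` are made-up names, though the probe recurrence is the one CPython's dict uses):

```python
# Illustrative sketch of collision-chain limiting.  A lookup that probes
# too many slots raises an ordinary, catchable exception instead of
# silently burning CPU -- analogous to the recursion limit.

MAX_COLLISIONS = 1000

class CollisionLimitError(Exception):
    pass

class LimitedTable:
    """Tiny fixed-size open-addressing table (no resizing)."""

    def __init__(self, size=8):          # size must be a power of two
        self.mask = size - 1
        self.slots = [None] * size       # each slot: None or (key, value)

    def _probe(self, key):
        # Same probe recurrence as CPython's dict: i = 5*i + 1 + perturb.
        h = hash(key)
        i = h & self.mask
        perturb = h
        for _ in range(MAX_COLLISIONS):
            yield i
            perturb >>= 5
            i = (5 * i + 1 + perturb) & self.mask
        raise CollisionLimitError("too many hash collisions for %r" % (key,))

    def set(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                raise KeyError(key)
            if entry[0] == key:
                return entry[1]
```

A probe chain that never terminates (here, looking up an absent key in a full table) then fails fast with `CollisionLimitError` after `MAX_COLLISIONS` slots instead of cycling forever.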

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Frank Sievertsen

Am 13.01.2012 02:24, schrieb Victor Stinner:

My patch doesn't fix the DoS, it just make the attack more complex.
The attacker cannot pregenerate data for an attack: (s)he has first to
compute the hash secret, and then compute hash collisions using the
secret. The hash secret is at least 64 bits long (128 bits on a 64-bit
system). So I hope that computing collisions requires a lot of CPU
time (is slow) to make the attack ineffective with today computers.
Unfortunately it requires only a few seconds to compute enough 32bit 
collisions on one core with no precomputed data.  I'm sure it's possible 
to make this less than a second.


In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) 
^ suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) 
is possible.
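Frank's observation can be checked with a toy model. The exact shape below (the classic multiply-XOR string hash wrapped in a random prefix and suffix) is an assumption about the proposed scheme, but the property holds for any hash that XORs the suffix in at the end: changing the suffix shifts every hash by the same XOR constant, so it preserves every collision.

```python
# Toy model (assumed shape of the proposed randomized string hash:
# random prefix as initial state, random suffix XORed in at the end).
MULT = 1000003
MASK = (1 << 64) - 1

def randomized_hash(s, prefix, suffix):
    h = prefix
    for c in s:
        h = ((h * MULT) ^ ord(c)) & MASK
    h ^= len(s)
    return h ^ suffix          # the suffix only shifts the final value

prefix = 0x9E3779B9            # arbitrary demo values
h_spam_0 = randomized_hash("spam", prefix, suffix=0)
h_spam_1 = randomized_hash("spam", prefix, suffix=0xDEADBEEF)
h_ham_0 = randomized_hash("ham", prefix, suffix=0)
h_ham_1 = randomized_hash("ham", prefix, suffix=0xDEADBEEF)

# Changing the suffix XORs every hash by the same constant...
print(h_spam_0 ^ h_spam_1 == h_ham_0 ^ h_ham_1 == 0xDEADBEEF)  # True
# ...so whether two strings collide does not depend on the suffix,
# which is why precomputation from the tail remains possible.
```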


So the question is: How difficult is it to guess the seed?

Frank


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Victor Stinner
 Unfortunately it requires only a few seconds to compute enough 32bit
 collisions on one core with no precomputed data.

Are you running the hash function backward to generate strings with
the same value, or are you trying something more like brute forcing?

And how do you get the hash secret? You need it to run an attack.

 In fact, since hash(X) == hash(Y) is independent of the suffix [ hash(X) ^
 suffix == hash(Y) ^ suffix ], a lot of precomputation (from the tail) is
 possible.

My change adds also a prefix (a prefix and a suffix). I don't know if
it changes anything for generating collisions.

 So the question is: How difficult is it to guess the seed?

I wrote some remarks about that in the issue. For example:

(hash("\0")^1) ^ (hash("\0\0")^2) gives ((prefix * 103) &
HASH_MASK) ^ ((prefix * 103**2) & HASH_MASK)

I suppose that you don't directly have the full output of hash(str) in
practice, but hash(str) & DICT_MASK, where DICT_MASK is the size of the
internal dict array minus 1. For example, for a dictionary of 65,536
items, the mask is 0x1FFFF and so cannot give you more than 17 bits of
hash(str) output. I still don't know how difficult it is to retrieve
hash(str) bits from repr(dict).
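A rough sketch of the arithmetic here (illustrative helper names, not any real API): the slot an entry first probes is hash(key) & mask, so a table with 2**17 slots can expose at most the low 17 bits of a hash through its layout.

```python
def initial_slot(key, num_slots):
    """Slot a key initially probes in a table of num_slots (power of 2)."""
    return hash(key) & (num_slots - 1)

def observable_bits(num_slots):
    """Upper bound on hash bits recoverable from the initial slot alone."""
    return (num_slots - 1).bit_length()

# A dict holding 65,536 items has grown its internal table to (roughly)
# 2**17 slots, so the mask is 0x1FFFF and at most 17 bits are exposed.
print(hex(2 ** 17 - 1))          # 0x1ffff
print(observable_bits(2 ** 17))  # 17
```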

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Lennart Regebro
On Fri, Jan 13, 2012 at 02:24, Victor Stinner
victor.stin...@haypocalc.com wrote:
 - Glenn Linderman proposes to fix the vulnerability by adding a new
 safe dict type (only accepting string keys). His proof-of-concept
 (SafeDict.py) uses a secret of 64 random bits and uses it to compute
 the hash of a key.

This is my preferred solution. The vulnerability is basically only in
the dictionary you keep the form data you get from a request. This
solves it easily and nicely. It can also be a separate module
installable for Python 2, which many web frameworks still use, so it
can be practical implementable now, and not in a couple of years.

Then again, nothing prevents us from having both this, *and* one of
the other solutions.  :-)

//Lennart


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Frank Sievertsen



Unfortunately it requires only a few seconds to compute enough 32bit
collisions on one core with no precomputed data.

Are you running the hash function backward to generate strings with
the same value, or are you trying something more like brute forcing?


If you try brute force to hit a specific target, you'll find only one 
good string every 4 billion tries. That's why you first blow up 
your target:


You start backward from an arbitrary target value. Brute-forcing 
backward over 3 characters, for example, gives you 16 million 
intermediate values that you know will end up in your target value.


Those 16 million values now form a huge target for brute-forcing 
forward: every 256 tries you'll hit one of them.



And how do you get the hash secret? You need it to run an attack.


I don't know. This was meant as an answer to the quoted text So I hope 
that computing collisions requires a lot of CPU time (is slow) to make 
the attack ineffective with today computers..


What I wanted to say is: The security relies on the fact that the 
attacker can't guess the prefix, not that he can't precompute the values 
and it takes hours or days to compute the collisions. If the prefix 
leaks out of the application, then the rest is trivial and done in a few 
seconds. The suffix is not important for the collision-prevention, but 
it will probably make it much harder to guess the prefix.


I don't know an effective way to get the prefix either, (if the 
application doesn't leak full hash(X) values).


Frank


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread And Clover

On 2012-01-13 11:20, Lennart Regebro wrote:

The vulnerability is basically only in the dictionary you keep the
form data you get from a request.


I'd have to disagree with this statement. The vulnerability is anywhere 
that creates a dictionary (or set) from attacker-provided keys. That 
would include HTTP headers, RFC822-family subheaders and parameters, the 
environ, input taken from JSON or XML, and so on - and indeed hash 
collision attacks are not at all web-specific.


The problem with having two dict implementations is that a caller would 
have to tell libraries that use dictionaries which implementation to 
use. So for example an argument would have to be passed to json.load[s] 
to specify whether the input was known-sane or potentially hostile.


Any library that could ever use dictionaries to process untrusted input, *or 
any library that used another library that did*, would have to pass such 
a flag through, which would quickly get very unwieldy indeed... or else 
they'd have to just always use safedict, in which case we're in pretty 
much the same position as we are with changing dict anyway.


--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobi...@gmail.com


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Mark Dickinson
On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum gu...@python.org wrote:
 How
 pathological the data needs to be before the collision counter triggers? I'd
 expect *very* pathological.

How pathological do you consider the set

   {1 << n for n in range(2000)}

to be?  What about the set:

   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}

?  The ~2000 elements of the latter set have only 61 distinct hash
values on a 64-bit machine, so there will be over 2000 total collisions
involved in creating this set (though admittedly only around 30
collisions per hash value).
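The claim is easy to check directly (a sketch; the exact count of 61 assumes a 64-bit CPython build, where numeric hashes are computed modulo 2**61 - 1, under which powers of two cycle with period 61):

```python
# ~2100 distinct floats, but only 61 distinct hash values on a 64-bit build.
powers = {2.0 ** n for n in range(-1074, 1024)}
hashes = {hash(x) for x in powers}
print(len(powers), len(hashes))
```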

-- 
Mark


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Guido van Rossum
On Fri, Jan 13, 2012 at 9:08 AM, Mark Dickinson dicki...@gmail.com wrote:

 On Fri, Jan 13, 2012 at 2:57 AM, Guido van Rossum gu...@python.org
 wrote:
  How
  pathological the data needs to be before the collision counter triggers?
 I'd
  expect *very* pathological.

 How pathological do you consider the set

   {1 << n for n in range(2000)}

 to be?  What about the set:

   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}

 ?  The ~2000 elements of the latter set have only 61 distinct hash
 values on a 64-bit machine, so there will be over 2000 total collisions
 involved in creating this set (though admittedly only around 30
 collisions per hash value).


Hm... So how does the collision counting work for this case?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Mark Dickinson
On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum gu...@python.org wrote:
 How pathological do you consider the set

   {1 << n for n in range(2000)}

 to be?  What about the set:

   ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}

  ?  The ~2000 elements of the latter set have only 61 distinct hash
  values on a 64-bit machine, so there will be over 2000 total collisions
 involved in creating this set (though admittedly only around 30
 collisions per hash value).

 Hm... So how does the collision counting work for this case?

Ah, my bad.  It looks like the ieee754_powers_of_two set is safe---IIUC,
it's the number of collisions involved in a single key-set operation
that's limited.  So a dictionary with keys {1 << n for n in range(2000)}
is fine, but a dictionary with keys {1 << (61*n) for n in range(2000)}
is not:

>>> {1 << (n*61): True for n in range(2000)}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
KeyError: 'too many hash collisions'
[67961 refs]

I'd still not consider this particularly pathological, though.
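Why the second dictionary trips the limit is easy to see: on a 64-bit build every key of the form 1 << (61*n) hashes to exactly the same value, so each insert replays the whole collision chain (a quick check, assuming the 64-bit numeric hash modulus 2**61 - 1):

```python
# 2**(61*n) mod (2**61 - 1) == 1 for every n, so on a 64-bit build all
# of these keys collide on the single hash value 1.
keys = [1 << (61 * n) for n in range(100)]
hashes = {hash(k) for k in keys}
print(hashes)
```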

-- 
Mark


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Guido van Rossum
On Fri, Jan 13, 2012 at 10:13 AM, Mark Dickinson dicki...@gmail.com wrote:

 On Fri, Jan 13, 2012 at 5:43 PM, Guido van Rossum gu...@python.org
 wrote:
  How pathological do you consider the set
 
    {1 << n for n in range(2000)}
 
  to be?  What about the set:
 
ieee754_powers_of_two = {2.0**n for n in range(-1074, 1024)}
 
  ?  The ~2000 elements of the latter set have only 61 distinct hash
  values on a 64-bit machine, so there will be over 2000 total collisions
  involved in creating this set (though admittedly only around 30
  collisions per hash value).
 
  Hm... So how does the collision counting work for this case?

 Ah, my bad.  It looks like the ieee754_powers_of_two set is safe---IIUC,
 it's the number of collisions involved in a single key-set operation
 that's limited.  So a dictionary with keys {1 << n for n in range(2000)}
 is fine, but a dictionary with keys {1 << (61*n) for n in range(2000)}
 is not:

  >>> {1 << (n*61): True for n in range(2000)}
 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
 KeyError: 'too many hash collisions'
 [67961 refs]

 I'd still not consider this particularly pathological, though.


Really? Even though you came up with it specifically to prove me wrong?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Benjamin Peterson
2012/1/13 Guido van Rossum gu...@python.org:
 Really? Even though you came up with it specifically to prove me wrong?

Coming up with a counterexample now invalidates it?



-- 
Regards,
Benjamin


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Antoine Pitrou
On Thu, 12 Jan 2012 18:57:42 -0800
Guido van Rossum gu...@python.org wrote:
 Hm... I started out as a big fan of the randomized hash, but thinking more
 about it, I actually believe that the chances of some legitimate app having
 1000 collisions are way smaller than the chances that somebody's code will
 break due to the variable hashing.

Breaking due to variable hashing is deterministic: you notice it as
soon as you upgrade (and then you use PYTHONHASHSEED to disable
variable hashing). That seems better than unpredictable breaking when
some legitimate collision chain happens.
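The disabling knob Antoine mentions can be demonstrated from Python itself (as eventually shipped, the variable is PYTHONHASHSEED; setting it to 0 disables randomization, so hashes are stable across interpreter runs):

```python
import os
import subprocess
import sys

# Run two fresh interpreters with randomization disabled; the hash of a
# string is then identical across runs.
cmd = [sys.executable, '-c', "print(hash('spam'))"]
env = {**os.environ, 'PYTHONHASHSEED': '0'}
a = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout
b = subprocess.run(cmd, env=env, capture_output=True, text=True).stdout
print(a == b)
```

Without PYTHONHASHSEED=0, two runs would almost certainly print different values.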

Regards

Antoine.




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Victor Stinner
 - Glenn Linderman proposes to fix the vulnerability by adding a new
 safe dict type (only accepting string keys). His proof-of-concept
 (SafeDict.py) uses a secret of 64 random bits and uses it to compute
 the hash of a key.

We could mix Marc's collision counter with SafeDict idea (being able
to use a different secret for each dict): use hash(key, secret)
(simple example: hash(secret+key)) instead of hash(key) in dict (and
set), and change the secret if we have more than N collisions. But it
would slow down all dict lookup (dict creation, get, set, del, ...).
And getting new random data can also be slow.

SafeDict and hash(secret+key) lose the benefit of the cached hash
result. Because the hash result depends on an argument, we cannot cache
the result anymore, and we have to recompute the hash for each lookup
(even if you look up the same key twice or more).
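A minimal sketch of the per-dict-secret idea (a hypothetical `SaltedDict`, not any of the actual patches). It also illustrates the cached-hash problem: every lookup constructs a fresh salted key, so its hash has to be recomputed each time:

```python
import os

class SaltedDict:
    """Toy dict wrapper that mixes a per-instance secret into string keys."""

    def __init__(self):
        self._secret = os.urandom(16)   # per-dict secret
        self._d = {}

    def _salted(self, key):
        # A new bytes object is built on every call, so its hash is
        # recomputed for every lookup -- the cached-hash benefit is lost.
        return self._secret + key.encode('utf-8')

    def __setitem__(self, key, value):
        self._d[self._salted(key)] = value

    def __getitem__(self, key):
        return self._d[self._salted(key)]

    def __contains__(self, key):
        return self._salted(key) in self._d

d = SaltedDict()
d['spam'] = 1
print(d['spam'], 'spam' in d)
```

Two instances use different secrets, so collisions crafted against one instance are useless against another.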

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Guido van Rossum
On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou solip...@pitrou.net wrote:

 On Thu, 12 Jan 2012 18:57:42 -0800
 Guido van Rossum gu...@python.org wrote:
  Hm... I started out as a big fan of the randomized hash, but thinking
 more
  about it, I actually believe that the chances of some legitimate app
 having
  1000 collisions are way smaller than the chances that somebody's code
 will
  break due to the variable hashing.

 Breaking due to variable hashing is deterministic: you notice it as
 soon as you upgrade (and then you use PYTHONHASHSEED to disable
 variable hashing). That seems better than unpredictable breaking when
 some legitimate collision chain happens.


Fair enough. But I'm now uncomfortable with turning this on for bugfix
releases. I'm fine with making this the default in 3.3, just not in 3.2,
3.1 or 2.x -- it will break too much code and organizations will have to
roll back the release or do extensive testing before installing a bugfix
release -- exactly what we *don't* want for those.

FWIW, I don't believe in the SafeDict solution -- you never know which
dicts you have to change.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Gregory P. Smith
On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum gu...@python.org wrote:

 On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou solip...@pitrou.netwrote:

 On Thu, 12 Jan 2012 18:57:42 -0800
 Guido van Rossum gu...@python.org wrote:
  Hm... I started out as a big fan of the randomized hash, but thinking
 more
  about it, I actually believe that the chances of some legitimate app
 having
  1000 collisions are way smaller than the chances that somebody's code
 will
  break due to the variable hashing.

 Breaking due to variable hashing is deterministic: you notice it as
 soon as you upgrade (and then you use PYTHONHASHSEED to disable
 variable hashing). That seems better than unpredictable breaking when
 some legitimate collision chain happens.


 Fair enough. But I'm now uncomfortable with turning this on for bugfix
 releases. I'm fine with making this the default in 3.3, just not in 3.2,
 3.1 or 2.x -- it will break too much code and organizations will have to
 roll back the release or do extensive testing before installing a bugfix
 release -- exactly what we *don't* want for those.

 FWIW, I don't believe in the SafeDict solution -- you never know which
 dicts you have to change.


Agreed.

Of the three options Victor listed only one is good.

I don't like *SafeDict*.  *-1*.  It puts the onus on the coder to always
get everything right with regards to data that came from outside the
process never ending up hashed in a non-safe dict or set *anywhere*.
 Safe needs to be the default option for all hash tables.

I don't like the *too many hash collisions* exception. *-1*. It provides
non-deterministic application behavior for data driven applications with no
way for them to predict when it'll happen or where and prepare for it. It
may work in practice for many applications but is simply odd behavior.

I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be
back ported to any Python version.

It is perfectly okay to break existing users who had anything depending on
ordering of internal hash tables. Their code was already broken. We
*will* provide a flag and/or environment variable that can be set to
turn the
feature off at their own peril which they can use in their test harnesses
that are stupid enough to use doctests with order dependencies.

This approach worked fine for Perl 9 years ago.
https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371

-gps


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Glenn Linderman

On 1/13/2012 5:35 PM, Victor Stinner wrote:

- Glenn Linderman proposes to fix the vulnerability by adding a new
safe dict type (only accepting string keys). His proof-of-concept
(SafeDict.py) uses a secret of 64 random bits and uses it to compute
the hash of a key.

We could mix Marc's collision counter with SafeDict idea (being able
to use a different secret for each dict): use hash(key, secret)
(simple example: hash(secret+key)) instead of hash(key) in dict (and
set), and change the secret if we have more than N collisions. But it
would slow down all dict lookup (dict creation, get, set, del, ...).
And getting new random data can also be slow.

SafeDict and hash(secret+key) lose the benefit of the cached hash
result. Because the hash result depends on an argument, we cannot cache
the result anymore, and we have to recompute the hash for each lookup
(even if you look up the same key twice or more).

Victor


So integrating SafeDict into dict, so it could be automatically converted, 
would mean changing the data structures underneath dict.  Given that, a 
technique for hash caching could be created that isn't quite as good as 
the one in place, but may be less expensive than not caching the 
hashes.  It would also take more space: a second dict, internally, as 
well as the secret.


So once the collision counter reaches some threshold (since there would 
be a functional fallback, it could be much lower than 1000), the secret 
is obtained, and the keys are rehashed using hash(secret+key).  Now when 
lookups occur, the object id of the key and the hash of the key are used 
as the index and hash(secret+key) is stored as a cached value.  This 
would only benefit lookups by the same object, other objects with the 
same key value would be recalculated (at least the first time).  Some 
limit on the number of cached values would probably be appropriate.  
This would add complexity, of course, in trying to save time.


An alternate solution would be to convert a dict to a tree once the 
number of collisions produces poor performance.  Converting to a tree 
would result in O(log N) instead of O(1) lookup performance, but that is 
better than the degenerate case of O(N) which is produced by the 
excessive number of collisions resulting from an attack.  This would 
require new tree code to be included in the core, of course, probably a 
red-black tree, which stays balanced.


In either of these cases, the conversion is expensive, because a 
collision threshold must first be reached to determine the need for 
conversion, so the hash could already contain lots of data.  If it were 
too expensive, the attack could still be effective.


Another solution would be to change the collision code, so that 
colliding keys don't produce O(N) behavior, but some other behavior.  
Each colliding entry could convert that entry to a tree of entries, 
perhaps.  This would require no conversion of bad dicts, and an attack 
could at worst convert O(1) performance to O(log N).
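A sketch of that last idea, using a sorted list as a stand-in for the per-entry tree (bisect gives the O(log N) lookup among colliding keys; a real implementation would use a balanced tree and would also have to handle keys that aren't mutually orderable):

```python
import bisect

class CollisionBucket:
    """Holds entries whose hashes collide; lookup is O(log N), not O(N)."""

    def __init__(self):
        self._keys = []     # kept sorted
        self._values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value         # overwrite existing key
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def lookup(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)
```

Note that list.insert keeps insertion linear; only the lookup is logarithmic here. A real balanced tree (as in the red-black suggestion above) would make inserts O(log N) as well.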


Clearly these ideas are more complex than adding randomization, but 
adding randomization doesn't seem to produce immunity from attack 
when data about the randomness is leaked.


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Gregory P. Smith


 Clearly these ideas are more complex than adding randomization, but adding
 randomization doesn't seem to produce immunity from attack when data
 about the randomness is leaked.


Which will not normally happen.

I'm firmly in the camp that believes the random seed can be probed and
determined by creatively injecting values and measuring the timing of things.
But doing that is difficult, time-intensive, and bandwidth-intensive, so the
per-process random hash seed is good enough.

There's another elephant in the room here: if you want to avoid this attack,
use a 64-bit Python build, as it uses 64-bit hash values that are
significantly more difficult to force a collision on.
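Whether a build uses 64-bit hashes can be checked at runtime (sys.hash_info is available from Python 3.2 on):

```python
import sys

# Hash width in bits: 64 on a 64-bit build, 32 on a 32-bit build.
print(sys.hash_info.width)
print(sys.maxsize > 2**32)   # True on a 64-bit build
```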

-gps


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Gregory P. Smith
btw, Tim's commit message on this one is amusingly relevant. :)

 http://hg.python.org/cpython/diff/8d2f2cb9/Objects/dictobject.c


On Fri, Jan 13, 2012 at 6:25 PM, Gregory P. Smith g...@krypto.org wrote:


  Clearly these ideas are more complex than adding randomization, but
  adding randomization doesn't seem to produce immunity from attack, when
  data about the randomness is leaked.


 Which will not normally happen.

 I'm firmly in the camp that believes the random seed can be probed and
 determined by creatively injecting values and measuring timing of things.
  But doing that is difficult and time and bandwidth intensive so the per
 process random hash seed is good enough.

 There's another elephant in the room here, if you want to avoid this
 attack use a 64-bit Python build as it uses 64-bit hash values that are
 significantly more difficult to force a collision on.

 -gps



Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Steven D'Aprano

On 14/01/12 12:58, Gregory P. Smith wrote:


I do like *randomly seeding the hash*. *+1*. This is easy. It can easily be
back ported to any Python version.

It is perfectly okay to break existing users who had anything depending on
ordering of internal hash tables. Their code was already broken.


For the record:

steve@runes:~$ python -c "print(hash('spam ham'))"
-376510515
steve@runes:~$ jython -c "print(hash('spam ham'))"
2054637885

So it is already the case that Python code that assumes stable hashing is 
broken.

For what it's worth, I'm not convinced that we should be overly concerned by 
"poor saps" (Guido's words) who rely on accidents of implementation regarding 
hash. We shouldn't break their code unless we have a good reason, but this 
strikes me as a good reason. The documentation for hash certainly makes no 
promise about stability, and relying on it strikes me as about as sensible as 
relying on the stability of error messages.


I'm also not convinced that the option to raise an exception after 1000 
collisions actually solves the problem. That relies on the application being 
re-written to catch the exception and recover from it (how?). Otherwise, all 
it does is change the attack vector from "cause an indefinite number of hash 
collisions" to "cause 999 hash collisions followed by crashing the application 
with an exception", which doesn't strike me as much of an improvement.


+1 on random seeding. Default to on in 3.3+ and default to off in older 
versions, which allows people to avoid breaking their code until they're ready 
for it to be broken.




--
Steven


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Gregory P. Smith
On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith g...@krypto.org wrote:


 On Fri, Jan 13, 2012 at 5:38 PM, Guido van Rossum gu...@python.orgwrote:

 On Fri, Jan 13, 2012 at 5:17 PM, Antoine Pitrou solip...@pitrou.netwrote:

 On Thu, 12 Jan 2012 18:57:42 -0800
 Guido van Rossum gu...@python.org wrote:
  Hm... I started out as a big fan of the randomized hash, but thinking
 more
  about it, I actually believe that the chances of some legitimate app
 having
  1000 collisions are way smaller than the chances that somebody's code
 will
  break due to the variable hashing.

 Breaking due to variable hashing is deterministic: you notice it as
 soon as you upgrade (and then you use PYTHONHASHSEED to disable
 variable hashing). That seems better than unpredictable breaking when
 some legitimate collision chain happens.


 Fair enough. But I'm now uncomfortable with turning this on for bugfix
 releases. I'm fine with making this the default in 3.3, just not in 3.2,
 3.1 or 2.x -- it will break too much code and organizations will have to
 roll back the release or do extensive testing before installing a bugfix
 release -- exactly what we *don't* want for those.

 FWIW, I don't believe in the SafeDict solution -- you never know which
 dicts you have to change.


 Agreed.

 Of the three options Victor listed only one is good.

  I don't like *SafeDict*.  *-1*.  It puts the onus on the coder to
  always get everything right with regards to data that came from outside the
  process never ending up hashed in a non-safe dict or set *anywhere*.
   Safe needs to be the default option for all hash tables.

 I don't like the *too many hash collisions* exception. *-1*. It
 provides non-deterministic application behavior for data driven
 applications with no way for them to predict when it'll happen or where and
 prepare for it. It may work in practice for many applications but is simply
 odd behavior.

 I do like *randomly seeding the hash*. *+1*. This is easy. It can easily
 be back ported to any Python version.

 It is perfectly okay to break existing users who had anything depending on
 ordering of internal hash tables. Their code was already broken. We 
 *will*provide a flag and/or environment variable that can be set to turn the
 feature off at their own peril which they can use in their test harnesses
 that are stupid enough to use doctests with order dependencies.


What an implementation looks like:

 http://pastebin.com/9ydETTag

Some stuff is still to be filled in, but this is all that is really required:
add logic to allow a particular seed to be specified or forced to 0 from the
command line or environment, add the logic to grab random bytes, and add the
autoconf glue to disable it.  Done.

-gps


 This approach worked fine for Perl 9 years ago.
 https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371

 -gps



Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread martin

What an implementation looks like:

 http://pastebin.com/9ydETTag

some stuff to be filled in, but this is all that is really required.


I think this statement (and the patch) is wrong. You also need to change
the byte string hashing, at least for 2.x. This I consider the biggest
flaw in that approach - other people may have written string-like objects
which continue to compare equal to a string but now hash differently.

Regards,
Martin




Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Guido van Rossum
On Fri, Jan 13, 2012 at 5:58 PM, Gregory P. Smith g...@krypto.org wrote:

 It is perfectly okay to break existing users who had anything depending on
 ordering of internal hash tables. Their code was already broken. We 
 *will*provide a flag and/or environment variable that can be set to turn the
 feature off at their own peril which they can use in their test harnesses
 that are stupid enough to use doctests with order dependencies.


No, that is not how we usually treat compatibility between bugfix releases.
"Your code is already broken" is not an argument to break forcefully what
worked (even if by happenstance) before. The difference between CPython and
Jython (or between different CPython feature releases) also isn't relevant
-- historically we have often bent over backwards to avoid changing
behavior that was technically undefined, if we believed it would affect a
significant fraction of users.

I don't think anyone doubts that this will break lots of code (at least,
the arguments I've heard have been "their code is broken", not "nobody does
that").

This approach worked fine for Perl 9 years ago.
 https://rt.perl.org/rt3//Public/Bug/Display.html?id=22371


I don't know what the Perl attitude about breaking undefined behavior
between micro versions was at the time. But ours is pretty clear -- don't
do it.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Terry Reedy

On 1/13/2012 8:58 PM, Gregory P. Smith wrote:


It is perfectly okay to break existing users who had anything depending
on ordering of internal hash tables. Their code was already broken.


Given that the doc says "Return the hash value of the object", I do not 
think we should be so hard-nosed. The above clearly implies that there 
is such a thing as *the* Python hash value for an object. And indeed, 
that has been true across many versions. If we had written "Return a 
hash value for the object, which can vary from run to run", the case 
would be different.


--
Terry Jan Reedy



Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Jack Diederich
On Thu, Jan 12, 2012 at 9:57 PM, Guido van Rossum gu...@python.org wrote:
 Hm... I started out as a big fan of the randomized hash, but thinking more
 about it, I actually believe that the chances of some legitimate app having
1000 collisions are way smaller than the chances that somebody's code will
 break due to the variable hashing.

Python's dicts are designed to avoid hash conflicts by resizing and
keeping the available slots bountiful.  1000 conflicts sounds like a
number that couldn't be hit accidentally unless you had a single dict
using a terabyte of RAM (i.e. if Titus Brown doesn't object, we're
good).   The hashing also tries to exploit cache locality, but that is
very unlikely to produce one thousand conflicts by chance.  If you get
that many, there is an attack.

 This is depending on how the counting is done (I didn't look at MAL's
 patch), and assuming that increasing the hash table size will generally
 reduce collisions if items collide but their hashes are different.

The patch counts conflicts on an individual insert and not lifetime
conflicts.  Looks sane to me.
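A sketch of what per-insert counting means inside the open-addressing probe loop (a toy table, not MAL's actual patch; the probing recurrence follows the one quoted earlier in this thread):

```python
def lookup_with_limit(table, key, limit=1000):
    """Look up key in a toy open-addressed table of (key, value) or None
    slots, raising once a single lookup exceeds `limit` collisions."""
    mask = len(table) - 1                 # table size is a power of two
    h = hash(key) & (2**64 - 1)           # treat the hash as unsigned
    i = h & mask
    perturb = h
    collisions = 0                        # reset per lookup, not per dict
    while table[i] is not None:
        k, v = table[i]
        if k == key:
            return v
        collisions += 1
        if collisions > limit:
            raise KeyError('too many hash collisions')
        i = (5 * i + 1 + perturb) & mask  # CPython-style probe recurrence
        perturb >>= 5
    raise KeyError(key)

table = [None] * 8
table[hash('x') & 7] = ('x', 1)
print(lookup_with_limit(table, 'x'))
```

The counter lives entirely inside one lookup, which is why a dict whose keys merely share *some* collisions over its lifetime never trips the limit.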

 That said, even with collision counting I'd like a way to disable it without
 changing the code, e.g. a flag or environment variable.

Agreed.  Paranoid people can turn the behavior off and if it ever were
to become a problem in practice we could point people to a solution.

-Jack


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-13 Thread Nick Coghlan
On Sat, Jan 14, 2012 at 4:24 PM, Jack Diederich jackd...@gmail.com wrote:
 This is depending on how the counting is done (I didn't look at MAL's
 patch), and assuming that increasing the hash table size will generally
 reduce collisions if items collide but their hashes are different.

 The patch counts conflicts on an individual insert and not lifetime
 conflicts.  Looks sane to me.

Having a hard limit on the worst-case behaviour certainly sounds like
an attractive prospect. And there's nothing to worry about in terms of
secrecy or sufficient randomness - by default, attackers cannot
generate more than 1000 hash collisions in one lookup, period.

 That said, even with collision counting I'd like a way to disable it without
 changing the code, e.g. a flag or environment variable.

 Agreed.  Paranoid people can turn the behavior off and if it ever were
 to become a problem in practice we could point people to a solution.

Does MAL's patch allow the limit to be set on a per-dict basis
(including setting it to None to disable collision limiting
completely)? If people have data sets that need to tolerate that kind
of collision level (and haven't already decided to move to a data
structure other than the builtin dict), then it may make sense to
allow them to remove the limit when using trusted input.

For maintenance versions though, it would definitely need to be
possible to switch it off without touching the code.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


[Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-12 Thread Victor Stinner
Many people proposed their own idea to fix the vulnerability, but only
3 wrote a patch:

- Glenn Linderman proposes to fix the vulnerability by adding a new
safe dict type (only accepting string keys). His proof-of-concept
(SafeDict.py) uses a secret of 64 random bits and uses it to compute
the hash of a key.
- Marc Andre Lemburg proposes to fix the vulnerability directly in
dict (for any key type). The patch raises an exception if a lookup
causes more than 1000 collisions.
- I propose to fix the vulnerability only in the Unicode hash (not for
other types). My patch adds a random secret initialized at startup (it
can be disabled or fixed using an environment variable).

--

I consider that Glenn's proposition is not applicable in practice
because all applications and all libraries would have to be patched to
use the new safe dict type.

Some people are concerned by possible regression introduced by Marc's
proposition: his patch may raise an exception for legitimate data.

My proposition tries to be just secure enough with a low (runtime
performance) overhead. My patch is becoming huge (and so backporting is
more complex), whereas Marc's patch is very simple and so trivial to
backport.

--

It is still unclear to me if the fix should be enabled by default for
Python < 3.3. Because the overhead (of my patch) is low, I would
prefer to enable the fix by default, to protect everyone with a simple
Python upgrade.

I prefer to explain how to explicitly disable the randomized hash
(PYTHONHASHSEED=0) (or how to fix application bugs) to people having
trouble with the randomized hash, instead of leaving the hole open by
default.

--

We might change hash() for types other than str, but it looks like web
servers are only affected through dicts with string keys.

We may use Paul's hash function if mine is not secure enough.

My patch doesn't fix the DoS outright; it just makes the attack more
complex. The attacker cannot pregenerate data for an attack: (s)he first
has to recover the hash secret, and then compute hash collisions using
that secret. The hash secret is at least 64 bits long (128 bits on a
64-bit system), so I hope that computing collisions is slow enough to
make the attack ineffective on today's computers.
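In pure Python, the shape of the randomization looks roughly like this. It is a sketch only: the real patch is written in C inside the Unicode object implementation, and the exact mixing of the two random words may differ:

```python
import os

# Two random words; in the real patch they are drawn once at interpreter
# startup, here simply at import time.
_PREFIX = int.from_bytes(os.urandom(8), "little")
_SUFFIX = int.from_bytes(os.urandom(8), "little")
_MASK = (1 << 64) - 1  # emulate 64-bit C arithmetic

def randomized_str_hash(s, prefix=_PREFIX, suffix=_SUFFIX):
    """Seeded variant of the classic FNV-style string hash (sketch)."""
    if not s:
        return 0
    x = (prefix ^ (ord(s[0]) << 7)) & _MASK
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & _MASK
    x ^= len(s)
    x ^= suffix
    return x
```

With prefix = suffix = 0 this degenerates to (roughly) the historical hash, which is why a "disabled" mode costs nothing at runtime. Without knowing prefix and suffix, an attacker cannot precompute strings that collide.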

--

I plan to write a clean patch for Python 3.3, then a simpler patch for
3.1 and 3.2 (duplicate the os.urandom code to keep it unchanged, maybe
don't create a new random.c file, maybe don't touch the test suite even
though the patch breaks many tests), and finally write patches for
Python 2.6 and 2.7.

Details about my patch:

- I tested it on Linux (32 and 64 bits) and Windows (Seven, 64 bits)
- a new PYTHONSEED environment variable allows controlling the
randomized hash: PYTHONSEED=0 completely disables the randomization
(restoring the previous behaviour), while PYTHONSEED=value uses a fixed
seed for processes that share data and need the same hash values
(multiprocessing users?)
- no overhead on hash(str)
- no startup overhead on Linux
- startup overhead is 10% on Windows (see the issue; I propose another
solution with a startup overhead of 1%)
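The variable eventually shipped under the name PYTHONHASHSEED (as used earlier in this mail). On a Python that supports it (3.3+), the effect of a fixed seed is easy to observe from child processes; a small demo, where the helper name is mine:

```python
import os
import subprocess
import sys

def child_hash(s, seed):
    """hash(s) computed in a child interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", s],
        env=env, text=True,
    )
    return int(out)

# A fixed seed makes string hashes reproducible across processes
# (useful when several processes share hashed data)...
assert child_hash("abc", 42) == child_hash("abc", 42)
# ...while different seeds almost always give different, unpredictable values.
print(child_hash("abc", 1), child_hash("abc", 2))
```

PYTHONHASHSEED=0 disables the randomization entirely, which is the escape hatch for users who depend on stable hash values.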

The patch is not finished; some tests are still failing because of the
randomized hash.

--

FYI, PHP released version 5.3.9, adding a max_input_vars directive to
prevent attacks based on hash collisions (CVE-2011-4885).

Victor


Re: [Python-Dev] Status of the fix for the hash collision vulnerability

2012-01-12 Thread Guido van Rossum
Hm... I started out as a big fan of the randomized hash, but thinking more
about it, I actually believe that the chances of some legitimate app having
1000 collisions are way smaller than the chances that somebody's code will
break due to the variable hashing. In fact we know for a fact that the
latter will break code, since it changes the order of items in a dict. This
affects many tests written without this in mind, and I assume there will be
some poor sap out there who uses Python's hash() function to address some
external persistent hash table or some other external data structure. How
pathological does the data need to be before the collision counter triggers?
I'd expect *very* pathological.
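The ordering breakage is easy to reproduce with a key type whose hash mixes in a seed, mimicking randomized string hashing. A sketch (the `SeededKey` name is mine; note that modern dicts preserve insertion order, so a set is used here to expose the hash-dependent ordering):

```python
class SeededKey:
    """A key whose hash depends on a seed, like a randomized string hash."""
    def __init__(self, name, seed):
        self.name = name
        self.seed = seed

    def __hash__(self):
        return hash((self.seed, self.name))

    def __eq__(self, other):
        return isinstance(other, SeededKey) and self.name == other.name

def iteration_order(seed):
    """Iteration order of a hash-ordered container built with this seed."""
    keys = {SeededKey(n, seed) for n in ("spam", "ham", "eggs", "bacon", "toast")}
    return [k.name for k in keys]

# Same seed gives the same order; a different seed usually shuffles it,
# which is exactly what breaks order-sensitive tests.
print(iteration_order(0))
print(iteration_order(1))
```

Any test that asserts on iteration order, serialized output, or repr() of such a container passes or fails depending on the seed of the day.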

This depends on how the counting is done (I didn't look at MAL's
patch), and assumes that increasing the hash table size will generally
reduce collisions when items collide but their hashes are different.

That said, even with collision counting I'd like a way to disable it
without changing the code, e.g. a flag or environment variable.

--Guido


-- 
--Guido van Rossum (python.org/~guido)