[Python-Dev] Counting collisions for the win

2012-02-16 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-January/115715.html
Frank Sievertsen wrote:

On 20.01.2012 13:08, Victor Stinner wrote:
 I'm surprised we haven't seen bug reports about it from users
 of 64-bit Pythons long ago
 A Python dictionary only uses the lower bits of a hash value. If your
 dictionary has less than 2**32 items, the dictionary order is exactly
 the same on 32 and 64 bits system: hash32(str) & mask == hash64(str)
 & mask for mask = 2**32-1.

 No, that's not true.
 Whenever a collision happens, other bits are mixed in very fast.

 Frank

Bits are mixed in quickly from a denial-of-service standpoint, but
Victor is correct from a "Why don't the tests already fail?" standpoint.

A dict with 2**12 slots, holding over 2700 entries, will be far larger
than most test cases -- particularly those with visible output.  In a
dict that size, 32-bit and 64-bit machines will still probe the same
first, second, third, fourth, fifth, and sixth slots.  Even in the
rare cases when there are at least 6 collisions, the next slots may
well be either the same, or close enough that it doesn't show up in a
changed iteration order.
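
To make the probe arithmetic concrete, here is a small pure-Python sketch;
the recurrence mirrors the pre-3.3 lookdict code, and the two hash values
are made up purely for illustration:

PERTURB_SHIFT = 5

def probe_slots(h, mask, n):
    # Mirror CPython's open-addressing recurrence:
    #   i = i*5 + perturb + 1;  perturb >>= PERTURB_SHIFT
    # The slot actually inspected is always i & mask.
    i = h & mask
    perturb = h
    slots = [i]
    while len(slots) < n:
        i = i * 5 + perturb + 1
        perturb >>= PERTURB_SHIFT
        slots.append(i & mask)
    return slots

mask = 2**12 - 1                      # a dict with 4096 slots
h32 = 0x9e3779b9                      # hypothetical 32-bit hash
h64 = (0xdeadbeef << 32) | h32        # 64-bit hash sharing the low 32 bits

print(probe_slots(h32, mask, 7))
print(probe_slots(h64, mask, 7))
# The first six slots are identical; only the seventh probe, which pulls
# perturb bits above bit 31 into play, can differ.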

-jJ

-- 

If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ



Re: [Python-Dev] Counting collisions for the win

2012-01-24 Thread Frank Sievertsen


Interesting idea, and I see it would avoid conversions.  What happens
if the data are also removed from the hash?  So you enter 20
colliding keys, then 20 more that get randomized, then delete 18
of the first 20.  Can you still find the second 20 keys?  Does it take
two sets of probes, somehow?



That's no problem, because the dict doesn't really free a slot; it
replaces the value with a dummy value.

These slots are later reused for new values, or the whole dict is
recreated and resized.
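
For readers unfamiliar with the dummy-slot trick, here is a toy
open-addressing table (not CPython's implementation, only an illustration
of the idea): deleted entries become a DUMMY marker, so later lookups keep
probing past them, and insertions may reuse the slot.

DUMMY = object()        # marks a deleted slot
EMPTY = None            # marks a never-used slot
PERTURB_SHIFT = 5

class ToyDict:
    """Toy open-addressing table; no resizing, for illustration only."""

    def __init__(self, size=8):
        self.slots = [EMPTY] * size
        self.mask = size - 1

    def _probe(self, key):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF
        i = perturb = h
        while True:
            yield i & self.mask
            i = i * 5 + perturb + 1
            perturb >>= PERTURB_SHIFT

    def __setitem__(self, key, value):
        reuse = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                # Prefer reusing the first dummy slot seen along the way.
                self.slots[i if reuse is None else reuse] = (key, value)
                return
            if slot is DUMMY:
                if reuse is None:
                    reuse = i
            elif slot[0] == key:
                self.slots[i] = (key, value)
                return

    def __getitem__(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                raise KeyError(key)       # a truly empty slot ends the search
            if slot is not DUMMY and slot[0] == key:
                return slot[1]

    def __delitem__(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                raise KeyError(key)
            if slot is not DUMMY and slot[0] == key:
                self.slots[i] = DUMMY      # keep the probe chain intact
                return

d = ToyDict()
d["a"] = 1
d["b"] = 2
del d["a"]
print(d["b"])   # keys inserted after (or probing past) "a" are still found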

Frank


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Frank Sievertsen

Hello,

I'd still prefer to see a randomized hash()-function (at least for 3.3).

But to protect against the attacks it would be sufficient to use
randomization for collision resolution in dicts (and sets).

What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached.

The benefits:

* protection against the known attacks
* hash(X) stays stable and the same
* dict order is only changed when there are many collisions
* doctests will not break
* enhanced collision resolution
* RNG doesn't have to be initialized in smaller programs
* nearly no slowdown of most dicts
* second hash-function is only used for keys with higher collision-rate
* lower probability to leak secrets
* possibility to use different secrets for each dict

The drawback:

* need to add a second hash-function
* slower than using one hash-function only, when > 20 collisions
* need to add this to container-types? (if used for py3.3)
* need to expose this to the user? (if used for py3.3)
* works only for datatypes with this new function
* possible to implement without breaking ABI?

The following code is meant for explanation purposes only:

for (perturb = hash; ; perturb >>= 5) {
    i = (i << 2) + i + perturb + 1;

    if ((collisions++) == 20) {
        // perturb is already zero after 13 rounds.
        // 20 collisions are rare.

        // you can add "&& (ma_mask > 256)" to make 100% sure
        // that it's not used for smaller dicts.

        if (Py_TYPE(key)->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH) {
            // If the type has a randomized hash, use it now for the lookup
            i = perturb = PyObject_RandomizedHash(key);
        }
        ...


If I got this right we could add a new function tp_randomized_hash
to 3.3 release. But can we also add functions in older releases, without
breaking ABI?

If not, can we implement this somehow using a flag?

FOR OLDER RELEASES < 3.3:

Py_hash_t
PyObject_RandomizedHash(PyObject *v) {
    PyTypeObject *tp = Py_TYPE(v);
    if (!(tp->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH))
        return -1;

    global_flags_somewhere->USE_RANDOMIZED_HASH = 1;
    return (*tp->tp_hash)(v);
}

... and in unicodeobject.c (and wherever we need randomization):

static Py_hash_t
unicode_hash(PyUnicodeObject *self)
{
    Py_ssize_t len;
    Py_UNICODE *p;
    Py_hash_t x;
    Py_hash_t prefix = 0;
    Py_hash_t suffix = 0;

    if (global_flags_somewhere->USE_RANDOMIZED_HASH) {
        global_flags_somewhere->USE_RANDOMIZED_HASH = 0;
        initialize_rng_if_not_already_done_and_return_seed(&prefix, &suffix);
    }

    ... (and don't cache in this case) ...


It's ugly, but if I understand this correctly, the GIL will
protect us against race-conditions, right?

Hello, internals experts: Would this work or is there a better way to do
this without breaking the ABI?

Frank


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Glenn Linderman

On 1/23/2012 12:53 AM, Frank Sievertsen wrote:


What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached. 


So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.


Compared to SafeDict as a programmer tool, fallback-dict has these benefits:

* No need to change program (or library) source to respond to an attack
* Order is preserved until the collision threshold has been reached
* Performance is preserved until the collision threshold has been reached

and costs:

* converting the dict from one hash to the other by rehashing all the keys.

Compared to always randomizing the hash, fallback-dict has these benefits:

* hash (and performance) is deterministic: programs running on the same 
data set will have the same performance characteristic, unless the 
collision threshold is reached for that data set.
* lower probability to leak secrets, because each attacked set/dict can 
have its own secret, randomized hash seed
* patch would not need to include RNG initialization during startup, 
lowering the impact on short-running programs.


What is not clear is how much SafeDict degrades performance when it is
used; non-cached hashes will definitely have an impact.  I'm not sure
whether an implementation of fallback-dict in C code would be
significantly faster than the implementation of SafeDict in Python, so
I don't know whether comparing the performance of SafeDict and dict
would be representative of the two stages of fallback-dict performance.
Certainly the performance cost of SafeDict would be an upper bound on
the performance cost of fallback-dict once conversion takes place, but
that would not measure the conversion cost itself.  The performance of
fallback-dict does have to be significantly better than the performance
of dict with collisions to be beneficial, and if the conversion cost is
significant, triggering conversions could itself become an attack vector.


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Frank Sievertsen



On 23.01.2012 19:25, Glenn Linderman wrote:
So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.


and costs:

* converting the dict from one hash to the other by rehashing all the 
keys.


That's not exactly what it does: it calls the randomized hash-function
only for those keys that didn't find a free slot after 20 collisions,
and it uses this value only for the further collision resolution.

So the value of hash() is used for the first 20 slots; randomized_hash()
is used after that.

1st try: slot[i = perturb = hash];
2nd try: slot[i = i * 5 + 1 + (perturb >>= 5)]
3rd try: slot[i = i * 5 + 1 + (perturb >>= 5)]
...
20th try: slot[i = i * 5 + 1 + (perturb >>= 5)]
21st try: slot[i = perturb = randomized_hash(key)]   <== HERE
22nd try: slot[i = i * 5 + 1 + (perturb >>= 5)]

This is also why there is no conversion needed. It's a
per-key/per-lookup rule.
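
A pure-Python sketch of this probe sequence (assuming a hypothetical
randomized_hash(); the per-process salt below is only for illustration,
not part of the proposal):

import random

_SALT = random.getrandbits(64)         # stand-in for the secret

def randomized_hash(key):
    # hypothetical second hash function; not the real patch
    return hash((_SALT, key)) & 0xFFFFFFFFFFFFFFFF

def probe_slots(key, mask, tries):
    h = hash(key) & 0xFFFFFFFFFFFFFFFF
    i = perturb = h                    # 1st try: driven by the cached hash()
    slots = []
    for attempt in range(1, tries + 1):
        slots.append(i & mask)
        if attempt == 20:              # 20 collisions: switch hash functions
            i = perturb = randomized_hash(key)
        else:                          # 2nd..20th and 22nd.. tries
            perturb >>= 5
            i = i * 5 + 1 + perturb
    return slots

print(probe_slots("example", 2**10 - 1, 25))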

Frank


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Glenn Linderman

On 1/23/2012 10:58 AM, Frank Sievertsen wrote:



On 23.01.2012 19:25, Glenn Linderman wrote:
So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.


and costs:

* converting the dict from one hash to the other by rehashing all the 
keys.


That's not exactly what it does: it calls the randomized hash-function
only for those keys that didn't find a free slot after 20 collisions,
and it uses this value only for the further collision resolution.

So the value of hash() is used for the first 20 slots; randomized_hash()
is used after that.

1st try: slot[i = perturb = hash];
2nd try: slot[i = i * 5 + 1 + (perturb >>= 5)]
3rd try: slot[i = i * 5 + 1 + (perturb >>= 5)]
...
20th try: slot[i = i * 5 + 1 + (perturb >>= 5)]
21st try: slot[i = perturb = randomized_hash(key)]   <== HERE
22nd try: slot[i = i * 5 + 1 + (perturb >>= 5)]

This is also why there is no conversion needed. It's a
per-key/per-lookup rule.

Frank


Interesting idea, and I see it would avoid conversions.  What happens if
the data are also removed from the hash?  So you enter 20 colliding
keys, then 20 more that get randomized, then delete 18 of the first
20.  Can you still find the second 20 keys?  Does it take two sets of
probes, somehow?


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread M.-A. Lemburg
Frank Sievertsen wrote:
 Hello,
 
 I'd still prefer to see a randomized hash()-function (at least for 3.3).
 
 But to protect against the attacks it would be sufficient to use
 randomization for collision resolution in dicts (and sets).
 
 What if we use a second (randomized) hash-function in case there
 are many collisions in ONE lookup. This hash-function is used only
 for collision resolution and is not cached.

This sounds a lot like what I'm referring to as "universal hash function"
in the discussion on the ticket:

http://bugs.python.org/issue13703#msg150724
http://bugs.python.org/issue13703#msg150795
http://bugs.python.org/issue13703#msg151813

However, I don't like the term "random" in there. It's better to make
the approach deterministic to avoid issues with not being able
to easily reproduce Python application runs for debugging purposes.

If you find that the data is manipulated, simply incrementing the
universal hash parameter and rehashing the dict with that parameter
should be enough to solve the issue (if not, which is highly unlikely,
the dict will simply reapply the fix). No randomness needed.
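
As a rough illustration of the idea (a toy parametrized hash family,
not the algorithm from the ticket):

def parametrized_hash(s, k=0):
    # A toy member of a hash family selected by the integer parameter k.
    # Changing k changes which strings collide, deterministically.
    h = 5381 + k
    for ch in s:
        h = (h * (33 + 2 * k) + ord(ch)) & 0xFFFFFFFF
    return h

keys = ["abc", "abd", "xyz"]
for k in (0, 1, 2):
    print(k, [parametrized_hash(s, k) & 0xFF for s in keys])

# If a dict detects too many collisions under parameter k, it can rehash
# itself with k+1: reproducible across runs, no random seed required.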

BTW: I attached a demo script to the ticket which demonstrates both
types of collisions using integers.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 23 2012)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Janzert

On 1/23/2012 1:25 PM, Glenn Linderman wrote:

On 1/23/2012 12:53 AM, Frank Sievertsen wrote:


What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached.


So this sounds like SafeDict, but putting it under the covers and
automatically converting from dict to SafeDict after a collision
threshold has been reached.  Let's call it fallback-dict.



If you're going to essentially switch data structures dynamically
anyway, why not just switch to something that doesn't have n**2 worst
case performance?


Janzert



Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Victor Stinner
 This seed is chosen randomly at runtime, but cannot
 change once chosen.

The hash is used to compare objects: if hash(obj1) != hash(obj2),
objects are considered different. So two strings must have the same
hash if their value is the same.
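
A small example of this invariant (equal objects must hash equal, or dict
lookups silently miss):

class BadKey:
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.name == other.name
    def __hash__(self):
        return id(self)          # wrong: equal objects get different hashes

d = {BadKey("a"): 1}
print(BadKey("a") in d)          # False: the lookup probes the wrong slots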

 Salt could also be an appropriate term here, but since salt is
 generally changed on a per-use basis (a single process may use many
 different salts), seed is more correct, since this value is only
 chosen once per process.

We may use a different salt per dictionary.

Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Lennart Regebro
On Sun, Jan 22, 2012 at 11:11, Victor Stinner
victor.stin...@haypocalc.com wrote:
 This seed is chosen randomly at runtime, but cannot
 change once chosen.

 The hash is used to compare objects: if hash(obj1) != hash(obj2),
 objects are considered different. So two strings must have the same
 hash if their value is the same.

 Salt could also be an appropriate term here, but since salt is
 generally changed on a per-use basis (a single process may use many
 different salts), seed is more correct, since this value is only
 chosen once per process.

 We may use a different salt per dictionary.

Can we do that? I was thinking of ways to not raise errors when we get
over a collision count, but instead somehow change the way the
dictionary behaves when we get over the collision count, but I
couldn't come up with something. Somehow adding a salt would be one
possibility. But I don't see how it's doable except for the
string-keys only case mentioned before.

But I might just be lacking imagination. :-)

//Lennart


Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Antoine Pitrou

I think this thread is approaching the recursion limit.
Be careful not to blow the stack :)

Regards

Antoine.


On Sun, 22 Jan 2012 20:53:41 +0100
Lennart Regebro rege...@gmail.com wrote:
 On Sun, Jan 22, 2012 at 11:11, Victor Stinner
 victor.stin...@haypocalc.com wrote:
  This seed is chosen randomly at runtime, but cannot
  change once chosen.
 
  The hash is used to compare objects: if hash(obj1) != hash(obj2),
  objects are considered different. So two strings must have the same
  hash if their value is the same.
 
  Salt could also be an appropriate term here, but since salt is
  generally changed on a per-use basis (a single process may use many
  different salts), seed is more correct, since this value is only
  chosen once per process.
 
  We may use a different salt per dictionary.
 
 Can we do that? I was thinking of ways to not raise errors when we get
 over a collision count, but instead somehow change the way the
 dictionary behaves when we get over the collision count, but I
 couldn't come up with something. Somehow adding a salt would be one
 possibility. But I don't see how it's doable except for the
 string-keys only case mentioned before.
 
 But I might just be lacking imagination. :-)
 
 //Lennart




Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Paul McMillan
 We may use a different salt per dictionary.

If we're willing to re-hash everything on a per-dictionary basis. That
doesn't seem reasonable given our existing usage.


Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Lennart Regebro
On Mon, Jan 23, 2012 at 06:02, Paul McMillan p...@mcmillan.ws wrote:
 We may use a different salt per dictionary.

 If we're willing to re-hash everything on a per-dictionary basis. That
 doesn't seem reasonable given our existing usage.

Well, if we get crazy amounts of collisions, re-hashing with a new
salt to get rid of those collisions seems quite reasonable to me...

//Lennart


Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Stephen J. Turnbull
Lennart Regebro writes:
  On Mon, Jan 23, 2012 at 06:02, Paul McMillan p...@mcmillan.ws wrote:
   We may use a different salt per dictionary.
  
   If we're willing to re-hash everything on a per-dictionary basis. That
   doesn't seem reasonable given our existing usage.
  
  Well, if we get crazy amounts of collisions, re-hashing with a new
  salt to get rid of those collisions seems quite reasonable to me...

But doesn't the whole idea of a hash table fall flat on its face if
you need to worry about crazy amounts of collisions (outside of
deliberate attacks)?



Re: [Python-Dev] Counting collisions for the win

2012-01-22 Thread Tim Delaney
On 23 January 2012 16:49, Lennart Regebro rege...@gmail.com wrote:

 On Mon, Jan 23, 2012 at 06:02, Paul McMillan p...@mcmillan.ws wrote:
  We may use a different salt per dictionary.
 
  If we're willing to re-hash everything on a per-dictionary basis. That
  doesn't seem reasonable given our existing usage.

 Well, if we get crazy amounts of collisions, re-hashing with a new
 salt to get rid of those collisions seems quite reasonable to me...


Actually, this looks like it has the seed of a solution in it. I haven't
scrutinised the following beyond "it sounds like it could work" - it could
well contain nasty flaws.

Assumption: We only get an excessive number of collisions during an attack
(directly or indirectly).
Assumption: Introducing a salt into hashes will change those hashes
sufficiently to mitigate the attack (all discussion of randomising hashes
makes this assumption).

1. Keep the current hashing (for all dictionaries) i.e. just using
hash(key).

2. Count collisions.

3. If any key hits X collisions, change that dictionary to use a random salt
for hashes (at least for str and unicode keys). This salt would be
remembered for the dictionary (see the sketch after this list).

Consequence: The dictionary would need to be rebuilt when an attack was
detected.
Consequence: Hash caching would no longer occur for this dictionary, making
most operations more expensive.
Consequence: Anything relying on the iteration order of a dictionary which
has suffered excessive conflicts would fail.

4. (Optional) in 3.3, provide a way to get a dictionary with random salt
(i.e. not wait for attack).
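
A rough pure-Python sketch of step 3 above (the collision trigger itself
would live in C; here rebuild_with_salt() has to be called explicitly, and
_SaltedKey is a hypothetical helper, not part of any actual patch):

import random

class _SaltedKey:
    """Wrapper that hashes a str key together with a per-dict salt."""
    __slots__ = ("key", "salt")

    def __init__(self, key, salt):
        self.key, self.salt = key, salt

    def __hash__(self):
        return hash((self.salt, self.key))   # precomputed collisions no longer collide

    def __eq__(self, other):
        return (isinstance(other, _SaltedKey)
                and self.key == other.key and self.salt == other.salt)

class SaltableDict(dict):
    _salt = None

    def _wrap(self, key):
        if self._salt is not None and isinstance(key, str):
            return _SaltedKey(key, self._salt)
        return key

    def rebuild_with_salt(self):
        # Step 3: remember a random salt for this dict and rehash every key.
        items = [(k.key if isinstance(k, _SaltedKey) else k, v)
                 for k, v in self.items()]
        self.clear()
        self._salt = random.getrandbits(64)
        for k, v in items:
            self[k] = v

    def __setitem__(self, key, value):
        super().__setitem__(self._wrap(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._wrap(key))

d = SaltableDict()
d["spam"] = 1
d.rebuild_with_salt()        # what the C code would do after X collisions
print(d["spam"])             # still 1, but now stored under a salted hash

As the consequences above note, the wrapped keys no longer benefit from the
cached string hash, and the dict's iteration order changes after the rebuild.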

Tim Delaney


Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Matthew Woodcraft
Victor Stinner  victor.stin...@haypocalc.com wrote:
 I propose to solve the hash collision vulnerability by counting
 collisions [...]

 We now know all issues of the randomized hash solution, and I
 think that there are more drawbacks than advantages. IMO the
 randomized hash is overkill to fix the hash collision issue.


For web frameworks, forcing an exception is less harmful than forcing a
many-second delay, but I think it's hard to be confident that there
aren't other vulnerable applications where it's the other way round.


Web frameworks like the exception because they already have backstop
exception handlers, and anyway they use short-lived processes and keep
valuable data in databases rather than process memory.

Web frameworks don't like the delay because they allow unauthenticated
users to submit many requests (including multiple requests in parallel),
and they normally expect each response to take little cpu time.


But many programs are not like this.

What about a log analyser or a mailing list archiver or a web crawler or
a game server or some other kind of program we haven't considered?

-M-



Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Guido van Rossum
On Sat, Jan 21, 2012 at 7:50 AM, Matthew Woodcraft
matt...@woodcraft.me.uk wrote:
 Victor Stinner  victor.stin...@haypocalc.com wrote:
 I propose to solve the hash collision vulnerability by counting
 collisions [...]

 We now know all issues of the randomized hash solution, and I
 think that there are more drawbacks than advantages. IMO the
 randomized hash is overkill to fix the hash collision issue.


 For web frameworks, forcing an exception is less harmful than forcing a
 many-second delay, but I think it's hard to be confident that there
 aren't other vulnerable applications where it's the other way round.


 Web frameworks like the exception because they already have backstop
 exception handlers, and anyway they use short-lived processes and keep
 valuable data in databases rather than process memory.

 Web frameworks don't like the delay because they allow unauthenticated
 users to submit many requests (including multiple requests in parallel),
 and they normally expect each response to take little cpu time.


 But many programs are not like this.

 What about a log analyser or a mailing list archiver or a web crawler or
 a game server or some other kind of program we haven't considered?

If my log crawler ended up taking minutes per log entry instead of
milliseconds I'd have to kill it anyway. Web crawlers are huge
multi-process systems that are as robust as web servers, or more. Game
servers are just web apps.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread David Malcolm
On Fri, 2012-01-20 at 16:55 +0100, Frank Sievertsen wrote:
 Hello,
 
 I still see at least two ways to create a DOS attack even with the
 collision-counting-patch.

[snip description of two types of attack on the collision counting
approach]

 What to do now?
 I think it's not smart to reduce the number of allowed collisions
 dramatically AND count all slot-collisions at the same time.

Frank: did you see the new approach I proposed in:
http://bugs.python.org/issue13703#msg151735
http://bugs.python.org/file24289/amortized-probe-counting-dmalcolm-2012-01-21-003.patch

(repurposes the ma_smalltable region of large dictionaries to add
tracking of each such dict's average iterations taken per modification,
and raise an exception when it exceeds a particular ratio)

I'm interested in hearing how it holds up against your various test
cases, or what flaws there are in it.

Thanks!
Dave



Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Jared Grubb
On 20 Jan 2012, at 10:49, Brett Cannon wrote:
 Why can't we have our cake and eat it too?
 
 Can we do hash randomization in 3.3 and use the hash count solution for 
 bugfix releases? That way we get a basic fix into the bugfix releases that 
 won't break people's code (hopefully) but we go with a more thorough (and IMO 
 correct) solution of hash randomization starting with 3.3 and moving forward. 
 We aren't breaking compatibility in any way by doing this since it's a 
 feature release anyway where we change tactics. And it can't be that much 
 work since we seem to have patches for both solutions. At worst it will make 
 merging commits for those files affected by the patches, but that will most 
 likely be isolated and not a common collision (and less of any issue once 3.3 
 is released later this year).
 
 I understand the desire to keep backwards-compatibility, but collision 
 counting could cause an error in some random input that someone didn't expect 
 to cause issues whether they were under a DoS attack or just had some 
 unfortunate input from private data. The hash randomization, though, is only 
 weak if someone is attacked, not if they are just using Python with their own 
 private data.

I agree; it sounds really odd to throw an exception since nothing is actually 
wrong and there's nothing the caller would do about it to recover anyway. 
Rather than throwing an exception, maybe you just reseed the random value for 
the hash:
 * this would solve the security issue that someone mentioned about being able 
to deduce the hash because if they keep being mean it'll change anyway
 * for bugfix, start off without randomization (seed==0) and start to use it 
only when the collision count hits the threshold
 * for release, reseeding when you hit a certain threshold still seems like a 
good idea as it will make lookups/insertions better in the long-run

AFAIUI, Python already doesn't guarantee order stability when you insert 
something into a dictionary, as in the worst case the dictionary has to resize 
its hash table, and then the order is freshly jumbled again.

Just my two cents.

Jared


Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Paul McMillan
On Sat, Jan 21, 2012 at 4:19 PM, Jared Grubb jared.gr...@gmail.com wrote:
 I agree; it sounds really odd to throw an exception since nothing is actually 
 wrong and there's nothing the caller would do about it to recover anyway. 
 Rather than throwing an exception, maybe you just reseed the random value for 
 the hash

This is nonsense. You have to determine the random seed at startup,
and it has to be uniform for the entire life of the process. You can't
change it after Python has started.

-Paul


Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Steven D'Aprano

Paul McMillan wrote:

On Sat, Jan 21, 2012 at 4:19 PM, Jared Grubb jared.gr...@gmail.com wrote:

I agree; it sounds really odd to throw an exception since nothing is actually 
wrong and there's nothing the caller would do about it to recover anyway. 
Rather than throwing an exception, maybe you just reseed the random value for 
the hash


This is nonsense. You have to determine the random seed at startup,
and it has to be uniform for the entire life of the process. You can't
change it after Python has started.


I may have a terminology problem here. I expect that a random seed must change 
every time it is used, otherwise the pseudorandom number generator using it 
just returns the same value each time. Should we be talking about a salt 
rather than a seed?




--
Steven



Re: [Python-Dev] Counting collisions for the win

2012-01-21 Thread Paul McMillan
 I may have a terminology problem here. I expect that a random seed must
 change every time it is used, otherwise the pseudorandom number generator
 using it just returns the same value each time. Should we be talking about a
 salt rather than a seed?

You should read the several other threads, the bug, as well as the
implementation and patch under discussion. Briefly, Python string
hashes are calculated once per string, and then used in many places.
You can't change the hash value for a string during program execution
without breaking everything. The proposed change modifies the starting
value of the hash function to include a process-wide randomly
generated seed. This seed is chosen randomly at runtime, but cannot
change once chosen. Using the seed changes the final output of the
hash to be unpredictable to an attacker, solving the underlying
problem.

Salt could also be an appropriate term here, but since salt is
generally changed on a per-use basis (a single process may use many
different salts), seed is more correct, since this value is only
chosen once per process.
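
A minimal sketch of what that means, using a toy string hash rather than
CPython's real one (the constants and structure here are illustrative only):

import os

_SEED = int.from_bytes(os.urandom(8), "little")   # chosen once, at process start

def seeded_hash(s):
    h = _SEED                          # the seed replaces the fixed start value
    for ch in s:
        h = ((h * 1000003) ^ ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h

# Within one process, equal strings still hash equal (dicts keep working),
# but an attacker who cannot learn _SEED cannot precompute colliding keys.
print(seeded_hash("collision") == seeded_hash("collision"))   # True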

-Paul


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Martin v. Löwis
 The last solution is very simple: count collisions and raise an
 exception if it hits a limit. The patch is something like 10 lines
 whereas the randomized hash is closer to 500 lines, add a new
 file, change Visual Studio project file, etc. First I thought that it
 would break more applications than the randomized hash

The main issue with that approach is that it allows a new kind of attack.

An attacker now needs to find 1000 colliding keys, and submit them
one-by-one into a database. The limit will not trigger, as those are
just database insertions.

Now, if the application also has a need to read the entire database
table into a dictionary, that will suddenly break, and not for the
attacker (which would be ok), but for the regular user of the
application or the site administrator.

So it may be that this approach actually simplifies the attack, making
the cure worse than the disease.

Regards,
Martin


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Frank Sievertsen

The main issue with that approach is that it allows a new kind of attack.


Indeed, I posted another example: http://bugs.python.org/msg151677

This kind of fix can be used in a specific application or maybe in a
special-purpose framework, but not on the level of a general-purpose
language.

Frank


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Nick Coghlan
On Fri, Jan 20, 2012 at 7:34 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 The main issue with that approach is that it allows a new kind of attack.

 An attacker now needs to find 1000 colliding keys, and submit them
 one-by-one into a database. The limit will not trigger, as those are
 just database insertions.

 Now, if the application also has a need to read the entire database
 table into a dictionary, that will suddenly break, and not for the
 attacker (which would be ok), but for the regular user of the
 application or the site administrator.

 So it may be that this approach actually simplifies the attack, making
 the cure worse than the disease.

Ouch, I think you're right. So hash randomisation may be the best
option, and admins will need to test for themselves to see if it
breaks things...

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Victor Stinner
2012/1/20 Ivan Kozik i...@ludios.org:
 I'd like to point out that an attacker is not limited to sending just
 one dict full of colliding keys.  Given a 22ms stall ...

The presented attack produces a stall of at least 30 seconds (5
minutes or more if there is no time limit in the application), not
0.022 seconds. You have to send a lot of requests to produce a DoS if a
single request just eats 22 ms. I suppose that there are a lot of
other kinds of requests that take much longer than 22 ms, even valid
requests.

 Another issue is that even with a configurable limit, different
 modules can't have their own limits.  One module might want a
 relatively safe raise-at-100, and another module creating massive
 dicts might want raise-at-1000.  How does a developer know whether
 they can raise or lower the limit, given that they use a bunch of
 different modules?

Python becomes really slow when you have more than N collisions
(O(n^2) problem). If an application hits this limit with valid data,
it is time to use another data structure or use a different hash
function. We have to do more tests to choose N correctly, but
according to my first tests, it looks like N=1000 is a safe limit.

Marc-Andre's patch doesn't count all collisions, but only
collisions that require comparing objects. When two objects have the
same hash value, the open addressing algorithm searches for a free bucket.
If a bucket is not free but has a different hash value, the objects
are not compared and the collision counter is not incremented. The
limit is only reached when you have N objects having the same hash
value modulo the size of the hash table (hash(str) & DICT_MASK).

When there are not enough empty buckets (it comes before all buckets
are full), Python resizes the dictionary (it does something like size
= size * 2) and so it uses at least one more bit of the hash each time
the dictionary is resized. Collisions are very likely with a small
dictionary, but become rarer each time the dictionary is resized. This
means that the number of potential collisions (with valid data)
decreases when the dictionary grows. Tell me if I am wrong.
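
A quick way to see this effect with non-malicious data (a sketch using
random 64-bit values in place of real string hashes):

import random

def largest_bucket(hashes, table_size):
    # Group hashes by their masked (low) bits, the way a dict of this size
    # would, and report the worst pile-up.
    mask = table_size - 1
    groups = {}
    for h in hashes:
        groups.setdefault(h & mask, []).append(h)
    return max(len(g) for g in groups.values())

hashes = [random.getrandbits(64) for _ in range(100000)]
for size in (2**8, 2**12, 2**16, 2**20):
    print(size, largest_bucket(hashes, size))

# With valid (random-looking) data the largest group shrinks quickly as the
# table grows, which is why a limit like N=1000 is hard to hit by accident.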


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Victor Stinner
 I'm surprised we haven't seen bug reports about it from users
 of 64-bit Pythons long ago

A Python dictionary only uses the lower bits of a hash value. If your
dictionary has less than 2**32 items, the dictionary order is exactly
the same on 32 and 64 bits system: hash32(str) & mask == hash64(str) &
mask for mask = 2**32-1.


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Frank Sievertsen

No, that's not true.
Whenever a collision happens, other bits are mixed in very fast.

Frank

On 20.01.2012 13:08, Victor Stinner wrote:

I'm surprised we haven't seen bug reports about it from users
of 64-bit Pythons long ago

A Python dictionary only uses the lower bits of a hash value. If your
dictionary has less than 2**32 items, the dictionary order is exactly
the same on 32 and 64 bits system: hash32(str) & mask == hash64(str)
& mask for mask = 2**32-1.


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Victor Stinner
2012/1/20 Frank Sievertsen fr...@sievertsen.de:
 No, that's not true.
 Whenever a collision happens, other bits are mixed in very fast.

Oh, I didn't know that. So the dict order is only the same if there is
no collision.

Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Victor Stinner
 The main issue with that approach is that it allows a new kind of attack.

 An attacker now needs to find 1000 colliding keys, and submit them
 one-by-one into a database. The limit will not trigger, as those are
 just database insertions.

 Now, if the application also has a need to read the entire database
 table into a dictionary, that will suddenly break, and not for the
 attacker (which would be ok), but for the regular user of the
 application or the site administrator.

Oh, good catch. But I would not call it a new kind of attack, it is
just a particular case of the hash collision vulnerability.

Counting collisions doesn't solve this case, but it doesn't make the
situation worse than before. Raising an exception quickly is better
than stalling for minutes, even if I agree that it is not the best
behaviour. The best would be to answer quickly with the expected
result :-) (using a different data structure or a different hash
function?)

Right now, I don't see any counter measure against this case.

Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Barry Warsaw
On Jan 20, 2012, at 01:50 PM, Victor Stinner wrote:

Counting collisions doesn't solve this case, but it doesn't make the
situation worse than before. Raising an exception quickly is better
than stalling for minutes, even if I agree that it is not the best
behaviour.

ISTM that adding the possibility of raising a new exception on dictionary
insertion is *more* backward incompatible than changing dictionary order,
which for a very long time has been known to not be guaranteed.  You're
running some application, you upgrade Python because you apply all security
fixes, and suddenly you're starting to get exceptions in places you can't
really do anything about.  Yet those exceptions are now part of the documented
public API for dictionaries.  This is asking for trouble.  Bugs will suddenly
start appearing in that application's tracker and they will seem to the
application developer like Python just added a new public API in a security
release.

OTOH, if you change dictionary order and *that* breaks the application, then
the bugs submitted to the application's tracker will be legitimate bugs that
have to be fixed even if nothing else changed.

So I still think we should ditch the paranoia about dictionary order changing,
and fix this without counting.  A little bit of paranoia could creep back in
by disabling the hash fix by default in stable releases, but I think it would
be fine to make that a compile-time option.

-Barry


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Barry Warsaw
On Jan 20, 2012, at 03:18 PM, Nick Coghlan wrote:

On Fri, Jan 20, 2012 at 2:54 PM, Carl Meyer c...@oddbird.net wrote:
 I don't have the expertise to speak otherwise to the alternatives for
 fixing the collisions vulnerability, but I don't believe it's accurate
 to presume that Django would not want to fix a dict-ordering dependency,
 and use that as a justification for one approach over another.

It's more a matter of wanting deployment of a security fix to be as
painless as possible - a security fix that system administrators can't
deploy because it breaks critical applications may as well not exist.

True, but collision counting is worse IMO.  It's just as likely (maybe) that
an application would start getting new exceptions on dictionary insertion as
it would failures due to dictionary order changes.  Unfortunately, in the
former case it's because Python just added a new public API in a security
release (the new exception *is* public API).  In the latter case, no new API
was added, but something exposed an already existing bug in the application.
That's still a bug in the application even if counting was added.  It's also a
bug that any number of changes in the environment, or OS vendor deployment,
could have triggered.

-1 for collision counting.

-Barry


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Barry Warsaw
On Jan 20, 2012, at 03:15 PM, Nick Coghlan wrote:

With the 1000 collision limit in place, the attacker sends their
massive request, the affected dict quickly hits the limit, throws an
unhandled exception which is then caught by the web framework and
turned into a 500 Error response (or whatever's appropriate for the
protocol being attacked).

Let's just be clear about it: this exception is new public API.  Changing
dictionary order is not.

For me, that comes down firmly on the side of the latter rather than the
former for stable releases.

-Barry


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Antoine Pitrou
On Fri, 20 Jan 2012 13:50:18 +0100
Victor Stinner victor.stin...@haypocalc.com wrote:

  The main issue with that approach is that it allows a new kind of attack.
 
  An attacker now needs to find 1000 colliding keys, and submit them
  one-by-one into a database. The limit will not trigger, as those are
  just database insertions.
 
  Now, if the application also has a need to read the entire database
  table into a dictionary, that will suddenly break, and not for the
  attacker (which would be ok), but for the regular user of the
  application or the site administrator.
 
 Oh, good catch. But it would not call it a new kind of attack, it is
 just a particular case of the hash collision vulnerability.
 
 Counting collisions doesn't solve this case, but it doesn't make the
 situation worse than before. Raising an exception quickly is better
 than stalling for minutes, even if I agree that it is not the best
 behaviour.

Actually, it *is* worse because stalling for seconds or minutes may not
be a problem in some cases (e.g. some batch script that gets run
overnight).

Regards

Antoine.




Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 1:34 AM, Martin v. Löwis mar...@v.loewis.de wrote:

  The last solution is very simple: count collisions and raise an
  exception if it hits a limit. The patch is something like 10 lines
  whereas the randomized hash is closer to 500 lines, add a new
  file, change Visual Studio project file, etc. First I thought that it
  would break more applications than the randomized hash

 The main issue with that approach is that it allows a new kind of attack.

 An attacker now needs to find 1000 colliding keys, and submit them
 one-by-one into a database. The limit will not trigger, as those are
 just database insertions.

 Now, if the application also has a need to read the entire database
 table into a dictionary, that will suddenly break, and not for the
 attacker (which would be ok), but for the regular user of the
 application or the site administrator.

 So it may be that this approach actually simplifies the attack, making
 the cure worse than the disease.


It would be a pretty lousy app that tried to load the contents of an entire
database into a dict. It seems that this would require much more knowledge
of what the app is trying to do before a successful attack can be mounted.
So I don't think this is worse than the original attack -- I think it
requires much more ingenuity of an attacker. (I'm thinking that the
original attack is trivial once the set of 65000 colliding keys is public
knowledge, which must be only a matter of time.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Frank Sievertsen

Hello,

I still see at least two ways to create a DOS attack even with the
collison-counting-patch.

I assumed that it's possible to send ~500KB of payload to the
application.

1. It's fully deterministic which slots the dict will look up.
Since we don't count slot-collisions, but only hash-value-collisions,
this can be exploited easily by creating strings with hash-values
along the lookup path of an arbitrary (short) string.

So first we pick an arbitrary string. Then calculate which slots it will
visit on the way to the first empty slot. Then we create strings with
hash-values for these slots.

This attack first injects the strings to fill all the slots that the
one short string will want to visit. Then it adds THE SAME string
again and again. Since the entry is already there, nothing will be
added, no additional collisions happen, and no exception is raised.

$ ls -l super.txt
-rw-r--r-- 1 fx5 fx5 52 20. Jan 10:19 super.txt
$ tail -n3 super.txt
FX5
FX5
FX5
$ wc -l super.txt
9 super.txt
$ time python -c 'dict((unicode(l[:-1]), 0) for l in open("super.txt"))'
real    0m52.724s
user    0m51.543s
sys     0m0.028s

2. The second attack exploits the fact that 1000 allowed string
comparisons are still a lot of work.
First I added 999 strings that collide with the one-byte string "a". In
some applications a zero-byte string might work even better. Then I
can add many thousands of "a"s, just like in the first attack.

$ ls -l 1000.txt
-rw-r--r-- 1 fx5 fx5 50 20. Jan 16:15 1000.txt
$ head -n 3 1000.txt
7hLci00
4wVFm10
_rZJU50
$ wc -l 1000.txt
247000 1000.txt
$ tail -n 3 1000.txt
a
a
a
$ time python -c 'dict((unicode(l[:-1]), 0) for l in open("1000.txt"))'
real    0m17.408s
user    0m15.897s
sys     0m0.008s

Of course the first attack is far more efficient. One could argue
that 16 seconds is not enough for an attack. But maybe it's possible
to send 1MB, have zero-byte strings, and since, for example, Django
does 5 lookups per query-string, this will keep it busy for ~80 seconds on
my PC.

What to do now?
I think it's not smart to reduce the number of allowed collisions
dramatically AND count all slot-collisions at the same time.

Frank


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Victor Stinner
 (I'm thinking that the original
 attack is trivial once the set of 65000 colliding keys is public knowledge,
 which must be only a matter of time.)

I have a program able to generate collisions: it takes 1 second to
compute 60,000 colliding strings on a desktop computer. So the
security of the randomized hash is based on the fact that the attacker
cannot compute the secret.

Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Victor Stinner
 So I still think we should ditch the paranoia about dictionary order changing,
 and fix this without counting.

The randomized hash has other issues:

 - its security is based on its secret, whereas it looks to be easy to
compute it (see more details in the issue)
 - my patch only changes hash(str), whereas other developers asked me
to patch also bytes, int and other types

hash(bytes) can be changed. But changing hash(int) may easily leak the
secret. We may use a different secret for each type, but if it is easy
to compute the int hash secret, dictionaries using int keys are still
vulnerable.
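
To see why hash(int) is the delicate case: CPython's integer hash is
essentially the value itself (hash(1) == 1), so a naive randomization
scheme would hand the secret straight to an attacker. A hypothetical
illustration (not any proposed patch):

secret = 0x1234ABCD                   # hypothetical per-process secret

def randomized_int_hash(n):
    # naive scheme: just mix the secret into the integer hash
    return n ^ secret

# Anyone who can observe one hash value recovers the secret immediately:
print(hex(randomized_int_hash(0)))    # prints 0x1234abcd, i.e. the secret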

--

There is no perfect solutions, drawbacks of each solution should be compared.

Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Antoine Pitrou
On Fri, 20 Jan 2012 17:17:24 +0100
Victor Stinner victor.stin...@haypocalc.com wrote:
  So I still think we should ditch the paranoia about dictionary order changing,
  and fix this without counting.
 
 The randomized hash has other issues:
 
  - its security is based on its secret, whereas it looks to be easy to
 compute it (see more details in the issue)

How do you compute the secret? I see two possibilities:

- the application leaks the hash() values: this sounds unlikely since I
  don't see the use case for it;

- the application shows the dict iteration order (e.g. order of HTML
  attributes): then we could add a second per-dictionary secret so that
  the iteration order of a single dict doesn't give any useful
  information about the hash function.

But the bottom line for me is the following:

- randomized hashes eliminate the possibility to use a single exploit
  for all Python-powered applications: for each application, the
  attacker now has to find a way to extract the secret;

- collision counting doesn't eliminate the possibility of generic
  exploits, as Frank Sievertsen has just shown in
  http://mail.python.org/pipermail/python-dev/2012-January/115726.html

Regards

Antoine.




Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Lennart Regebro
On Fri, Jan 20, 2012 at 01:48, Victor Stinner
victor.stin...@haypocalc.com wrote:
  - the limit should be configurable: a new function in the sys module
 should be enough. It may be private (or replaced by an environment
 variable?) in stable versions

I'd like to see both. I would like both the programmer and the user
to be able to control what the limit is.

//Lennart


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Thu, Jan 19, 2012 at 8:06 PM, Ivan Kozik i...@ludios.org wrote:

 No, I wasn't happy with termination.  I wanted to treat it just like a
 JSON decoding error, and send the appropriate response.


So was this attack actually being mounted on your service regularly? I'd
think it would be sufficient to treat it as a MemoryError -- unavoidable,
if it happens things are really bad, and hopefully you'll crash quickly and
some monitor process restarts your service. That's a mechanism that you
should have anyway.


 I actually forgot to mention the main reason I abandoned the
 stop-at-N-collisions approach.  I had a server with a dict that stayed
 in memory, across many requests.  It was being populated with
 identifiers chosen by clients.  I couldn't have my server stay broken
 if this dict filled up with a bunch of colliding keys.  (I don't think
 I could have done another thing either, like nuke the dict or evict
 some keys.)


What would your service do if it ran out of memory?

Maybe one tweak to the collision counting would be that the exception needs
to inherit from BaseException (like MemoryError) so most generic exception
handlers don't actually handle it. (Style note: never use except:, always
use except Exception:.)
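
A tiny illustration of why deriving from BaseException matters here (the
exception name is made up; KeyboardInterrupt and SystemExit behave the
same way):

class TooManyCollisions(BaseException):   # hypothetical name, for illustration
    pass

def insert_under_attack():
    try:
        raise TooManyCollisions("dict under attack")
    except Exception:
        # Generic handlers like this one do NOT catch BaseException
        # subclasses, so the error propagates instead of being swallowed.
        print("swallowed")

try:
    insert_under_attack()
except TooManyCollisions:
    print("escaped the generic handler; only an explicit handler catches it")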

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 1:57 AM, Frank Sievertsen py...@sievertsen.de wrote:

  The main issue with that approach is that it allows a new kind of attack.


 Indeed, I posted another example: http://bugs.python.org/msg151677

 This kind of fix can be used in a specific application or maybe in a
 special-purpose framework, but not on the level of a general-purpose
 language.


Right. We are discussing this issue (for weeks now...) because it makes
pretty much any Python app that takes untrusted data vulnerable, especially
web apps, and after extensive analysis we came to the conclusion that
defenses in the framework or in the app are really hard to do, very
disruptive for developers, whereas preventing the attack by a modification
of the dict or hash algorithms would fix it for everybody. And moreover,
the attack would work against pretty much any Python web app using a set of
evil strings computed once (hence encouraging script kiddies to just fire
their fully-automatic weapon at random websites).

The new attacks that are now being considered require analysis of how the
website is implemented, how it uses and stores data, etc. So an attacker
has to sit down and come up with an attack tailored to a specific website.
That can be dealt with on an ad-hoc basis.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 5:10 AM, Barry Warsaw ba...@python.org wrote:

 On Jan 20, 2012, at 01:50 PM, Victor Stinner wrote:

 Counting collisions doesn't solve this case, but it doesn't make the
 situation worse than before. Raising an exception quickly is better
 than stalling for minutes, even if I agree that it is not the best
 behaviour.

 ISTM that adding the possibility of raising a new exception on dictionary
 insertion is *more* backward incompatible than changing dictionary order,
 which for a very long time has been known to not be guaranteed.  You're
 running some application, you upgrade Python because you apply all security
 fixes, and suddenly you're starting to get exceptions in places you can't
 really do anything about.  Yet those exceptions are now part of the
 documented
 public API for dictionaries.  This is asking for trouble.  Bugs will
 suddenly
 start appearing in that application's tracker and they will seem to the
 application developer like Python just added a new public API in a security
 release.


Dict insertion can already raise an exception: MemoryError. I think we
should be safe if the new exception also derives from BaseException. We
should actually seriously consider just raising MemoryException, since
introducing a new built-in exception in a bugfix release is also very
questionable: code explicitly catching or raising it would not work on
previous bugfix releases of the same feature release.

OTOH, if you change dictionary order and *that* breaks the application, then
 the bugs submitted to the application's tracker will be legitimate bugs
 that
 have to be fixed even if nothing else changed.


There are lots of things that are undefined according to the language spec
(and quite possibly known to vary between versions or platforms or
implementations like PyPy or Jython) but which we would never change in a
bugfix release.

So I still think we should ditch the paranoia about dictionary order
 changing,
 and fix this without counting.  A little bit of paranoia could creep back
 in
 by disabling the hash fix by default in stable releases, but I think it
 would
 be fine to make that a compile-time option.


I'm sorry, but I don't want to break a user's app with a bugfix release and
say haha your code was already broken you just didn't know it.

Sure, the dict order already varies across Python implementations, possibly
across 32/64 bits or operating systems. But many organizations (I know a
few :-) have a very large installed software base, created over many years
by many people with varying skills, that is kept working in part by very
carefully keeping the environment as constant as possible. This means that
the target environment is much more predictable than it is for the typical
piece of open source software.

Sure, a good Python developer doesn't write apps or tests that depend on
dict order. But time and again we see that not everybody writes perfect
code every time. Especially users writing in-house apps (as opposed to
frameworks shared as open source) are less likely to always use the most
robust, portable algorithms in existence, because they may know with much
more certainty that their code will never be used on certain combinations
of platforms. For example, I rarely think  about whether code I write might
not work on IronPython or Jython, or even CPython on Windows. And if
something I wrote suddenly needs to be ported to one of those, well, that's
considered a port and I'll just accept that it might mean changing a few
things.

The time to break a dependency on dict order is not with a bugfix release
but with a feature release: those are more likely to break other things as
well anyway, and users are well aware that they have to test everything and
anticipate having to fix some fraction of their code for each feature
release. OTOH we have established a long and successful track record of
conservative bugfix releases that don't break anything. (I am aware of
exactly one thing that was broken by a bugfix release in application code I
am familiar with.)

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 5:20 AM, Barry Warsaw ba...@python.org wrote:

 Let's just be clear about it: this exception is new public API.  Changing
 dictionary order is not.


Not if you raise MemoryError or BaseException.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
This is the first objection I have seen against collision-counting that
might stand.

On Fri, Jan 20, 2012 at 7:55 AM, Frank Sievertsen py...@sievertsen.de wrote:

 Hello,

 I still see at least two ways to create a DOS attack even with the
 collision-counting patch.

 I assumed that it's possible to send ~500KB of payload to the
 application.

 1. It's fully deterministic which slots the dict will lookup.
 Since we don't count slot-collisions, but only hash-value-collisions
 this can be exploited easily by creating strings with the hash-values
 along the lookup-way of an arbitrary (short) string.

 So first we pick an arbitrary string. Then calculate which slots it will
 visit on the way to the first empty slot. Then we create strings with
 hash-values for these slots.

 This attack first injects the strings to fill all the slots that the
 one short string will want to visit. Then it adds THE SAME string
 again and again. Since the entry is already there, nothing will be added
 and no additional collisions happen, no exception raised.

 $ ls -l super.txt
 -rw-r--r-- 1 fx5 fx5 52 20. Jan 10:19 super.txt
 $ tail -n3 super.txt
 FX5
 FX5
 FX5
 $ wc -l super.txt
 9 super.txt
 $ time python -c 'dict((unicode(l[:-1]), 0) for l in open("super.txt"))'
 real    0m52.724s
 user    0m51.543s
 sys     0m0.028s
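
For readers unfamiliar with the dict internals being discussed, here is a
rough sketch of the probe order, adapted from the comments in CPython's
Objects/dictobject.c. It only illustrates how the slots visited by a lookup
can be predicted once the hash values are known; it is not Frank's attack
code, and the table size, example key, and unsigned-hash masking are
illustrative simplifications.

    # Sketch of CPython's open-addressing probe order (see Objects/dictobject.c):
    # j = 5*j + perturb + 1; perturb >>= PERTURB_SHIFT; next slot is j & mask.
    # Knowing a key's hash therefore tells you exactly which slots a lookup of
    # that key will visit, which is what the attack pre-fills.
    PERTURB_SHIFT = 5

    def probe_sequence(h, table_size, count):
        mask = table_size - 1            # table sizes are powers of two
        h &= (1 << 64) - 1               # treat the hash as unsigned (64-bit assumed)
        j = h & mask
        perturb = h
        slots = [j]
        for _ in range(count - 1):
            j = 5 * j + perturb + 1
            perturb >>= PERTURB_SHIFT
            slots.append(j & mask)
        return slots

    # First ten slots probed for an arbitrary short string in a 32-slot table:
    print(probe_sequence(hash("FX5"), 32, 10))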

 2. The second attack actually attacks that 1000 allowed string
 comparisons are still a lot of work.
 First I added 999 strings that collide with a one-byte string "a". In
 some applications a zero-byte string might work even better. Then I
 can add many thousands of the "a"s, just like the first attack.

 $ ls -l 1000.txt
 -rw-r--r-- 1 fx5 fx5 50 20. Jan 16:15 1000.txt
 $ head -n 3 1000.txt
 7hLci00
 4wVFm10
 _rZJU50
 $ wc -l 1000.txt
 247000 1000.txt
 $ tail -n 3 1000.txt
 a
 a
 a
 $ time python -c 'dict((unicode(l[:-1]), 0) for l in open("1000.txt"))'
 real    0m17.408s
 user    0m15.897s
 sys     0m0.008s
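
A toy model of the cost being described here, using a class with a fixed
__hash__ in place of real colliding strings: the repeated key is already
present, so no new entry is ever created (and nothing trips an insert-time
counter), yet every duplicate insert still compares against a long chain of
equal-hash entries. The numbers and names are illustrative only.

    # Toy cost model for the second attack: 999 keys sharing one hash value,
    # then the same key inserted over and over.
    import time

    class Collider(object):
        def __init__(self, name):
            self.name = name
        def __hash__(self):
            return 42                    # every Collider lands in the same probe chain
        def __eq__(self, other):
            return isinstance(other, Collider) and self.name == other.name

    d = {}
    for i in range(999):
        d[Collider("key%d" % i)] = 0
    d[Collider("a")] = 0                 # the key that will be re-inserted

    start = time.time()
    for _ in range(10000):
        d[Collider("a")] = 0             # duplicate insert: walks the chain, adds nothing
    print("10000 duplicate inserts: %.2f seconds" % (time.time() - start))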

 Of course the first attack is far more efficient. One could argue
 that 16 seconds is not enough for an attack. But maybe it's possible
 to send 1MB, have zero-byte strings, and since, for example, django
 does 5 lookups per query-string, this will keep it busy for ~80 seconds on
 my pc.

 What to do now?
 I think it's not smart to reduce the number of allowed collisions
 dramatically
 AND count all slot-collisions at the same time.

 Frank





-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Georg Brandl
On 20.01.2012 19:15, Guido van Rossum wrote:

 OTOH, if you change dictionary order and *that* breaks the application, 
 then
 the bugs submitted to the application's tracker will be legitimate bugs 
 that
 have to be fixed even if nothing else changed.
 
 
 There are lots of things that are undefined according to the language spec 
 (and
 quite possibly known to vary between versions or platforms or implementations
 like PyPy or Jython) but which we would never change in a bugfix release.
 
 So I still think we should ditch the paranoia about dictionary order 
 changing,
 and fix this without counting.  A little bit of paranoia could creep back 
 in
 by disabling the hash fix by default in stable releases, but I think it 
 would
 be fine to make that a compile-time option.
 
 
 I'm sorry, but I don't want to break a user's app with a bugfix release and 
 say
 haha your code was already broken you just didn't know it.
 
 Sure, the dict order already varies across Python implementations, possibly
 across 32/64 bits or operating systems. But many organizations (I know a few 
 :-)
 have a very large installed software base, created over many years by many
 people with varying skills, that is kept working in part by very carefully
 keeping the environment as constant as possible. This means that the target
 environment is much more predictable than it is for the typical piece of open
 source software.

I agree.  This applies to 3.2 and 2.7, but even more to 3.1 and 2.6, which are
in security-fix mode.

Even if relying on dict order is a bug right now, I believe it happens many
times more often in code bases out there than dicts that are filled with many
many colliding keys.  So even if we can honestly blame the programmer in the
former case, the users applying the security fix will have the same bad
experience and won't likely care if we claim undefined behavior.  This means
that it seems preferable to go with the situation where you have fewer breakages
in total.

Not to mention that changing dict order is likely to lead to much more subtle
bugs than a straight MemoryError on a dict insert.

Also, another advantage of collision counting I haven't seen in the discussion
yet is that it's quite trivial to provide an API in sys to turn it on or off --
while turning on or off randomized hashes has to be done before Python starts
up, i.e. at build time or with an environment variable or flag.
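
The sys functions below are purely hypothetical -- nothing like them exists in
any Python release -- and are shown only to sketch the kind of runtime switch
being described, guarded so the snippet still runs today.

    import sys

    # Hypothetical API, shown only to illustrate the idea of a runtime toggle.
    if hasattr(sys, "setcollisionlimit"):
        sys.setcollisionlimit(1000)      # enable counting, trigger at 1000 collisions
        # sys.setcollisionlimit(0)       # or disable the check entirely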

Georg



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Brett Cannon
On Fri, Jan 20, 2012 at 13:15, Guido van Rossum gu...@python.org wrote:

 On Fri, Jan 20, 2012 at 5:10 AM, Barry Warsaw ba...@python.org wrote:

 On Jan 20, 2012, at 01:50 PM, Victor Stinner wrote:

 Counting collision doesn't solve this case, but it doesn't make the
 situation worse than before. Raising quickly an exception is better
 than stalling for minutes, even if I agree than it is not the best
 behaviour.

 ISTM that adding the possibility of raising a new exception on dictionary
 insertion is *more* backward incompatible than changing dictionary order,
 which for a very long time has been known to not be guaranteed.  You're
 running some application, you upgrade Python because you apply all
 security
 fixes, and suddenly you're starting to get exceptions in places you can't
 really do anything about.  Yet those exceptions are now part of the
 documented
 public API for dictionaries.  This is asking for trouble.  Bugs will
 suddenly
 start appearing in that application's tracker and they will seem to the
 application developer like Python just added a new public API in a
 security
 release.


 Dict insertion can already raise an exception: MemoryError. I think we
 should be safe if the new exception also derives from BaseException. We
 should actually eriously consider just raising MemoryException, since
 introducing a new built-in exception in a bugfix release is also very
 questionable: code explicitly catching or raising it would not work on
 previous bugfix releases of the same feature release.

 OTOH, if you change dictionary order and *that* breaks the application,
 then
 the bugs submitted to the application's tracker will be legitimate bugs
 that
 have to be fixed even if nothing else changed.


 There are lots of things that are undefined according to the language spec
 (and quite possibly known to vary between versions or platforms or
 implementations like PyPy or Jython) but which we would never change in a
 bugfix release.

 So I still think we should ditch the paranoia about dictionary order
 changing,
 and fix this without counting.  A little bit of paranoia could creep back
 in
 by disabling the hash fix by default in stable releases, but I think it
 would
 be fine to make that a compile-time option.


 I'm sorry, but I don't want to break a user's app with a bugfix release
 and say haha your code was already broken you just didn't know it.

 Sure, the dict order already varies across Python implementations,
 possibly across 32/64 bits or operating systems. But many organizations (I
 know a few :-) have a very large installed software base, created over many
 years by many people with varying skills, that is kept working in part by
 very carefully keeping the environment as constant as possible. This means
 that the target environment is much more predictable than it is for the
 typical piece of open source software.

 Sure, a good Python developer doesn't write apps or tests that depend on
 dict order. But time and again we see that not everybody writes perfect
 code every time. Especially users writing in-house apps (as opposed to
 frameworks shared as open source) are less likely to always use the most
 robust, portable algorithms in existence, because they may know with much
 more certainty that their code will never be used on certain combinations
 of platforms. For example, I rarely think  about whether code I write might
 not work on IronPython or Jython, or even CPython on Windows. And if
 something I wrote suddenly needs to be ported to one of those, well, that's
 considered a port and I'll just accept that it might mean changing a few
 things.

 The time to break a dependency on dict order is not with a bugfix release
 but with a feature release: those are more likely to break other things as
 well anyway, and uses are well aware that they have to test everything and
 anticipate having to fix some fraction of their code for each feature
 release. OTOH we have established a long and successful track record of
 conservative bugfix releases that don't break anything. (I am aware of
 exactly one thing that was broken by a bugfix release in application code I
 am familiar with.)


Why can't we have our cake and eat it too?

Can we do hash randomization in 3.3 and use the hash count solution for
bugfix releases? That way we get a basic fix into the bugfix releases that
won't break people's code (hopefully) but we go with a more thorough (and
IMO correct) solution of hash randomization starting with 3.3 and moving
forward. We aren't breaking compatibility in any way by doing this since
it's a feature release anyway where we change tactics. And it can't be that
much work since we seem to have patches for both solutions. At worst it
will complicate merging commits for those files affected by the patches, but that
will most likely be isolated and not a common collision (and less of an
issue once 3.3 is released later this year).

I understand the desire to keep backwards-compatibility, 

Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Terry Reedy

On 1/20/2012 11:17 AM, Victor Stinner wrote:


There is no perfect solutions, drawbacks of each solution should be compared.


Amen.

One possible attack that has been described for a collision counting 
dict depends on knowing precisely the trigger point. So let 
MAXCOLLISIONS either be configurable or just choose a random count 
between M and N, say 700 and 999.


It would not hurt to have alternate patches available in case a 
particular Python-powered site comes under prolonged attack. Though 
given our minuscule share of the market, that is much less likely than 
an attack on a PHP- or MS-powered site.


--
Terry Jan Reedy



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Donald Stufft
Even if a MemoryException is raised I believe that is still a fundamental 
change in the documented contract of the dictionary API. I don't believe there is a 
way to fix this without breaking someone's application. The major difference I 
see between the two solutions is that counting will break people's applications 
who are otherwise following the documented API contract of dictionaries, and 
randomization will break people's applications who are violating the documented 
API contract of dictionaries. 

Personally I feel that the lesser of two evils is to reward those who followed 
the documentation, and not reward those who didn't.

So +1 for Randomization as the only option in 3.3, and off by default with a 
flag or environment variable in bug fixes. I think it's the only way to proceed 
that won't hurt people who have followed the documented behavior. 


On Friday, January 20, 2012 at 1:49 PM, Brett Cannon wrote:

 
 
 On Fri, Jan 20, 2012 at 13:15, Guido van Rossum gu...@python.org 
 (mailto:gu...@python.org) wrote:
  On Fri, Jan 20, 2012 at 5:10 AM, Barry Warsaw ba...@python.org 
  (mailto:ba...@python.org) wrote:
   On Jan 20, 2012, at 01:50 PM, Victor Stinner wrote:
   
   Counting collision doesn't solve this case, but it doesn't make the
   situation worse than before. Raising quickly an exception is better
   than stalling for minutes, even if I agree than it is not the best
   behaviour.
   
   ISTM that adding the possibility of raising a new exception on dictionary
   insertion is *more* backward incompatible than changing dictionary order,
   which for a very long time has been known to not be guaranteed.  You're
   running some application, you upgrade Python because you apply all 
   security
   fixes, and suddenly you're starting to get exceptions in places you can't
   really do anything about.  Yet those exceptions are now part of the 
   documented
   public API for dictionaries.  This is asking for trouble.  Bugs will 
   suddenly
   start appearing in that application's tracker and they will seem to the
   application developer like Python just added a new public API in a 
   security
   release.
  
  Dict insertion can already raise an exception: MemoryError. I think we 
  should be safe if the new exception also derives from BaseException. We 
  should actually eriously consider just raising MemoryException, since 
  introducing a new built-in exception in a bugfix release is also very 
  questionable: code explicitly catching or raising it would not work on 
  previous bugfix releases of the same feature release.
  
   OTOH, if you change dictionary order and *that* breaks the application, 
   then
   the bugs submitted to the application's tracker will be legitimate bugs 
   that
   have to be fixed even if nothing else changed.
  
  There are lots of things that are undefined according to the language spec 
  (and quite possibly known to vary between versions or platforms or 
  implementations like PyPy or Jython) but which we would never change in a 
  bugfix release.
  
   So I still think we should ditch the paranoia about dictionary order 
   changing,
   and fix this without counting.  A little bit of paranoia could creep back 
   in
   by disabling the hash fix by default in stable releases, but I think it 
   would
   be fine to make that a compile-time option.
  
  I'm sorry, but I don't want to break a user's app with a bugfix release and 
  say haha your code was already broken you just didn't know it.
  
  Sure, the dict order already varies across Python implementations, possibly 
  across 32/64 bits or operating systems. But many organizations (I know a 
  few :-) have a very large installed software base, created over many years 
  by many people with varying skills, that is kept working in part by very 
  carefully keeping the environment as constant as possible. This means that 
  the target environment is much more predictable than it is for the typical 
  piece of open source software.
  
  Sure, a good Python developer doesn't write apps or tests that depend on 
  dict order. But time and again we see that not everybody writes perfect 
  code every time. Especially users writing in-house apps (as opposed to 
  frameworks shared as open source) are less likely to always use the most 
  robust, portable algorithms in existence, because they may know with much 
  more certainty that their code will never be used on certain combinations 
  of platforms. For example, I rarely think  about whether code I write might 
  not work on IronPython or Jython, or even CPython on Windows. And if 
  something I wrote suddenly needs to be ported to one of those, well, that's 
  considered a port and I'll just accept that it might mean changing a few 
  things.
  
  The time to break a dependency on dict order is not with a bugfix release 
  but with a feature release: those are more likely to break other things as 
  well anyway, and uses are well aware that they have to test 

Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Case Van Horsen
On Fri, Jan 20, 2012 at 8:17 AM, Victor Stinner
victor.stin...@haypocalc.com wrote:
 So I still think we should ditch the paranoia about dictionary order 
 changing,
 and fix this without counting.

 The randomized hash has other issues:

  - its security is based on its secret, whereas it looks to be easy to
 compute it (see more details in the issue)
  - my patch only changes hash(str), whereas other developers asked me
 to patch also bytes, int and other types

Changing hash(int) on a bugfix release will cause issues with
extensions (gmpy, sage, probably others) that calculate the hash of
numerical objects.


 hash(bytes) can be changed. But changing hash(int) may leak easily the
 secret. We may use a different secret for each type, but if it is easy
 to compute int hash secret, dictionaries using int are still
 vulnerable.

 --

 There is no perfect solutions, drawbacks of each solution should be compared.

 Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Terry Reedy

On 1/20/2012 10:55 AM, Frank Sievertsen wrote:

Hello,

I still see at least two ways to create a DOS attack even with the
collison-counting-patch.



2. The second attack actually attacks that 1000 allowed string
comparisons are still a lot of work.
First I added 999 strings that collide with a one-byte string a. In
some applications a zero-byte string might work even better. Then I
can add a many thousand of the a's, just like the first attack.


If 1000 were replaced by, for instance, random.randint(700,1000) the 
dict could not be set to have an exception triggered with one other 
entry (which I believe was Martin's idea). But I suppose you would say 
that 699 entries would still make for much work.


The obvious defense for this particular attack is to reject duplicate 
keys. Perhaps there should be write-once string sets and dicts available.
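
A rough sketch of that write-once idea (not a proposed API): a dict that
refuses duplicate keys makes the repeat-the-same-key trick fail on the first
repetition instead of being ground through thousands of times.

    class WriteOnceDict(dict):
        """Sketch only: reject any attempt to store a key that already exists."""
        def __setitem__(self, key, value):
            if key in self:
                raise ValueError("duplicate key: %r" % (key,))
            dict.__setitem__(self, key, value)

    d = WriteOnceDict()
    d["a"] = 1
    try:
        d["a"] = 2
    except ValueError as exc:
        print(exc)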


This gets to the point that there is no best blind defense to all 
possible attacks.


--
Terry Jan Reedy



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Tres Seaver

On 01/20/2012 02:04 PM, Donald Stufft wrote:

 Even if a MemoryException is raised I believe that is still a 
 fundamental change in the documented contract of dictionary API.

How so?  Dictionary inserts can *already* raise that error.

 I don't believe there is a way to fix this without breaking someones 
 application. The major differences I see between the two solutions is
  that counting will break people's applications who are otherwise 
 following the documented api contract of dictionaries,

Do you have a case in mind where legitimate user data (not crafted as
part of a DoS attack) would trip the 1000-collision limit?  How likely is
it that such cases exist in already-deployed applications, compared to
the known breakage in existing applications due to hash randomization?

 and randomization will break people's applications who are violating 
 the documented api contract of dictionaries.
 
 Personally I feel that the lesser of two evils is to reward those who
  followed the documentation, and not reward those who didn't.

Except that I think your set is purely hypothetical, while the second set
is *lots* of deployed applications.


Tres.
- -- 
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Donald Stufft


On Friday, January 20, 2012 at 2:36 PM, Tres Seaver wrote:

 
 On 01/20/2012 02:04 PM, Donald Stufft wrote:
 
  Even if a MemoryException is raised I believe that is still a 
  fundamental change in the documented contract of dictionary API.
  
 
 
 How so? Dictionary inserts can *already* raise that error.
Because it's raising it for a fundamentally different thing. You have plenty 
of memory, but we decided to add an arbitrary limit that has nothing to do with 
memory and pretend you are out of memory anyways.
 
  I don't believe there is a way to fix this without breaking someones 
  application. The major differences I see between the two solutions is
  that counting will break people's applications who are otherwise 
  following the documented api contract of dictionaries,
  
 
 
 Do you have a case in mind where legitimate user data (not crafted as
 part of a DoS attack) would trip the 1000-collision limit? How likely is
 it that such cases exist in already-deployed applications, compared to
 the known breakage in existing applications due to hash randomization?
 
 

I don't, but as there's never been a limit on how many collisions a dictionary 
can have, this would be a fundamental change in the documented (and 
undocumented) abilities of a dictionary. Dictionary key order has never been 
guaranteed, is documented to not be relied on, and already changes depending on 
whether you are using 32-bit, 64-bit, Jython, PyPy, etc., or, as someone else pointed out, due to 
any number of possible improvements to dict. The counting solution violates the 
existing contract in order to serve people who themselves are violating the 
contract. Even with their violation the method that I +1'd still serves to not 
break existing applications by default.
 
  and randomization will break people's applications who are violating 
  the documented api contract of dictionaries.
  
  Personally I feel that the lesser of two evils is to reward those who
  followed the documentation, and not reward those who didn't.
  
 
 
 Except that I think your set is purely hypothetical, while the second set
 is *lots* of deployed applications.
 
 

Which is why I believe that it should be off by default on the bugfix, but 
easily enabled. (Flag, env var, whatever). That allows people to upgrade to a 
bugfix without breaking their application, and if this vulnerability affects 
them, they can enable it.

I think collision counting is at best a bandaid and not a proper fix, 
stemming from a desire not to break existing applications in a bugfix release, 
a problem better solved by implementing the real fix and allowing people to 
control (only on the bugfix; on 3.3+ it should be forced on always) whether they 
have it enabled or not.
 
 
 Tres.
 - -- 
 ===
 Tres Seaver +1 540-429-0999 tsea...@palladion.com 
 (mailto:tsea...@palladion.com)
 Palladion Software Excellence by Design http://palladion.com
 


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Ethan Furman

Donald Stufft wrote:
Even if a MemoryException is raised I believe that is still a 
fundamental change in the documented contract of dictionary API. I don't 
believe there is a way to fix this without breaking someones 
application. The major differences I see between the two solutions is 
that counting will break people's applications who are otherwise 
following the documented api contract of dictionaries, and randomization 
will break people's applications who are violating the documented api 
contract of dictionaries. 

Personally I feel that the lesser of two evils is to reward those who 
followed the documentation, and not reward those who didn't.


So +1 for Randomization as the only option in 3.3, and off by default 
with a flag or environment variable in bug fixes. I think it's the only 
way to proceed that won't hurt people who have followed the documented 
behavior.


+1

~Ethan~


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 11:51 AM, Donald Stufft donald.stu...@gmail.com wrote:

  On Friday, January 20, 2012 at 2:36 PM, Tres Seaver wrote:

 On 01/20/2012 02:04 PM, Donald Stufft wrote:

 Even if a MemoryException is raised I believe that is still a
 fundamental change in the documented contract of dictionary API.

 How so? Dictionary inserts can *already* raise that error.

 Because it's raising it for a fundamentally different thing. You have
 plenty of memory, but we decided to add an arbitrary limit that has nothing
 to do with memory and pretend you are out of memory anyways.


Actually due to fragmentation that can already happen.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Terry Reedy

On 1/20/2012 2:51 PM, Donald Stufft wrote:


I think the counting collision is at best a bandaid and not a proper fix
stemmed from a desire to not break existing applications on a bugfix
release ...


My opinion of counting is better than yours, but even conceding the 
theoretical, purity argument, our release process is practical as well. 
There have been a few occasions when fixes to bugs in our code have been 
delayed from a bugfix release to the next feature release -- because the 
fix would break too much code depending on the bug.


Some years ago there was a proposal that we should deliberately tweak 
hash() to break 'buggy' code that depended on it not changing. This 
never happened. So it has been left de facto constant, to the extent it 
is, for some years.


--
Terry Jan Reedy



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Ben Wolfson
On Fri, Jan 20, 2012 at 2:11 PM, Terry Reedy tjre...@udel.edu wrote:
 On 1/20/2012 2:51 PM, Donald Stufft wrote:

 I think the counting collision is at best a bandaid and not a proper fix
 stemmed from a desire to not break existing applications on a bugfix
 release ...

 My opinion of counting is better than yours, but even conceding the
 theoretical, purity argument, our release process is practical as well.
 There have been a few occasions when fixes to bugs in our code have been
 delayed from a bugfix release to the next feature release -- because the fix
 would break too much code depending on the bug.

AFAICT Brett's suggestion (which had occurred to me as well, but I'm
not a core developer by any stretch) seemed to get lost in the debate:
would it be possible to go with collision counting for bugfix releases
and hash randomization for new feature releases? (Brett made it here:
http://mail.python.org/pipermail/python-dev/2012-January/115740.html.)

-- 
Ben Wolfson
Human kind has used its intelligence to vary the flavour of drinks,
which may be sweet, aromatic, fermented or spirit-based. ... Family
and social life also offer numerous other occasions to consume drinks
for pleasure. [Larousse, Drink entry]


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Donald Stufft
I believe that either solution has the potential to break existing applications, 
so to ensure that no applications are broken there will need to be a method of 
disabling it. I also believe that, to maintain the backwards compatibility that 
Python has traditionally had in bugfix releases, either solution will need 
to default to off. 

Given those 2 things that I believe, I don't think that the argument should be 
which solution will break less, because I believe both will need to be off by 
default, but which solution more adequately solves the underlying problem. 


On Friday, January 20, 2012 at 5:11 PM, Terry Reedy wrote:

 On 1/20/2012 2:51 PM, Donald Stufft wrote:
 
  I think the counting collision is at best a bandaid and not a proper fix
  stemmed from a desire to not break existing applications on a bugfix
  release ...
  
 
 
 My opinion of counting is better than yours, but even conceding the 
 theoretical, purity argument, our release process is practical as well. 
 There have been a few occasions when fixes to bugs in our code have been 
 delayed from a bugfix release to the next feature release -- because the 
 fix would break too much code depending on the bug.
 
 Some years ago there was a proposal that we should deliberately tweak 
 hash() to break 'buggy' code that depended on it not changing. This 
 never happened. So it has been left de facto constant, to the extent it 
 is, for some years.
 
 -- 
 Terry Jan Reedy
 


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Frank Sievertsen

On 20.01.2012 16:33, Guido van Rossum wrote:
(I'm thinking that the original attack is trivial once the set of 
65000 colliding keys is public knowledge, which must be only a matter 
of time.



I think it's very likely that this will happen soon.

For ASP and PHP, attack payloads are publicly available.
PHP and ASP have patches to limit the number of query variables.

We're very lucky that there's no public payload for python yet,
and all non-public software and payload I'm aware of is based
upon my software.

But this can change at any moment. It's not really difficult to
write software to create 32-bit collisions.

Frank


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 2:33 PM, Ben Wolfson wolf...@gmail.com wrote:
 On Fri, Jan 20, 2012 at 2:11 PM, Terry Reedy tjre...@udel.edu wrote:
 On 1/20/2012 2:51 PM, Donald Stufft wrote:

 I think the counting collision is at best a bandaid and not a proper fix
 stemmed from a desire to not break existing applications on a bugfix
 release ...

 My opinion of counting is better than yours, but even conceding the
 theoretical, purity argument, our release process is practical as well.
 There have been a few occasions when fixes to bugs in our code have been
 delayed from a bugfix release to the next feature release -- because the fix
 would break too much code depending on the bug.

 AFAICT Brett's suggestion (which had occurred to me as well, but I'm
 not a core developer by any stretch) seemed to get lost in the debate:
 would it be possible to go with collision counting for bugfix releases
 and hash randomization for new feature releases? (Brett made it here:
 http://mail.python.org/pipermail/python-dev/2012-January/115740.html.)

I made it earlier.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 2:35 PM, Frank Sievertsen py...@sievertsen.de wrote:
 Am 20.01.2012 16:33, schrieb Guido van Rossum:

 (I'm thinking that the original attack is trivial once the set of 65000
 colliding keys is public knowledge, which must be only a matter of time.



 I think it's very likely that this will happen soon.

 For ASP and PHP there is attack-payload publicly available.
 PHP and ASP have patches to limit the number of query-variables.

 We're very lucky that there's no public payload for python yet,
 and all non-public software and payload I'm aware of is based
 upon my software.

 But this can change any moment. It's not really difficult to
 write software to create 32bit-collisions.

While we're debating the best fix, could we allow people to at least
protect themselves against script-kiddies by offering fixes to cgi.py,
django, webob and a few other popular frameworks that limits forms to
1000 keys? (I suppose it's really only POST requests that are
vulnerable to script kiddies, because of the length restriction on
URLs.)
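
A sketch of the kind of framework-level guard being suggested, independent of
any particular framework; the 1000-field cap and the function name are
arbitrary choices for illustration. Counting separators in the raw urlencoded
body rejects an oversized request before a single key is hashed into a dict.

    MAX_FORM_FIELDS = 1000               # illustrative limit, mirroring the suggestion above

    def check_field_count(body):
        # An application/x-www-form-urlencoded body has one field per
        # '&'-separated chunk, so counting separators is a cheap pre-check.
        if body.count("&") + 1 > MAX_FORM_FIELDS:
            raise ValueError("too many form fields")

    check_field_count("a=1&b=2&c=3")         # passes
    # check_field_count("x=1&" * 100000)     # would raise ValueError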

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Steven D'Aprano

Guido van Rossum wrote:

On Fri, Jan 20, 2012 at 11:51 AM, Donald Stufft donald.stu...@gmail.com wrote:


 On Friday, January 20, 2012 at 2:36 PM, Tres Seaver wrote:

On 01/20/2012 02:04 PM, Donald Stufft wrote:

Even if a MemoryException is raised I believe that is still a
fundamental change in the documented contract of dictionary API.

How so? Dictionary inserts can *already* raise that error.

Because it's raising it for a fundamentally different thing. You have
plenty of memory, but we decided to add an arbitrary limit that has nothing
to do with memory and pretend you are out of memory anyways.



Actually due to fragmentation that can already happen.


Whether you have run out of total memory, or a single contiguous block, it is 
still a memory error.


An arbitrary limit on collisions is not a memory error. If we were designing 
this API from scratch, would anyone propose using MemoryError for "you have 
reached a limit on collisions"? It has nothing to do with memory. Using 
MemoryError for something which isn't a memory error is ugly.


How about RuntimeError?



--
Steven



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
It should derive from BaseException.

--Guido van Rossum (sent from Android phone)
On Jan 20, 2012 4:59 PM, Steven D'Aprano st...@pearwood.info wrote:

 Guido van Rossum wrote:

 On Fri, Jan 20, 2012 at 11:51 AM, Donald Stufft donald.stu...@gmail.com wrote:

   On Friday, January 20, 2012 at 2:36 PM, Tres Seaver wrote:

 On 01/20/2012 02:04 PM, Donald Stufft wrote:

 Even if a MemoryException is raised I believe that is still a
 fundamental change in the documented contract of dictionary API.

 How so? Dictionary inserts can *already* raise that error.

 Because it's raising it for a fundamentally different thing. You have
 plenty of memory, but we decided to add an arbitrary limit that has
 nothing
 to do with memory and pretend you are out of memory anyways.


 Actually due to fragmentation that can already happen.


 Whether you have run out of total memory, or a single contiguous block, it
 is still a memory error.

 An arbitrary limit on collisions is not a memory error. If we were
 designing this API from scratch, would anyone propose using MemoryError for
 you have reached a limit on collisions? It has nothing to do with memory.
 Using MemoryError for something which isn't a memory error is ugly.

 How about RuntimeError?



 --
 Steven



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Steven D'Aprano

Terry Reedy wrote:

On 1/20/2012 11:17 AM, Victor Stinner wrote:

There is no perfect solutions, drawbacks of each solution should be 
compared.


Amen.

One possible attack that has been described for a collision counting 
dict depends on knowing precisely the trigger point. So let 
MAXCOLLISIONS either be configureable or just choose a random count 
between M and N, say 700 and 999.


Have I missed something? Why wouldn't the attacker simply target 1000 
collisions, and if the collision triggers at 700 instead of 1000, that's a bonus?



--
Steven



Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Steven D'Aprano

Guido van Rossum wrote:

It should derive from BaseException.


RuntimeError meets that requirement, and it is an existing exception so there 
are no issues with introducing a new built-in exception to a point release.


py issubclass(RuntimeError, BaseException)
True


--
Steven


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Benjamin Peterson
2012/1/20 Steven D'Aprano st...@pearwood.info:
 Guido van Rossum wrote:

 It should derive from BaseException.


 RuntimeError meets that requirement, and it is an existing exception so
 there are no issues with introducing a new built-in exception to a point
 release.

 py issubclass(RuntimeError, BaseException)
 True

Guido meant a direct subclass.



-- 
Regards,
Benjamin


Re: [Python-Dev] Counting collisions for the win

2012-01-20 Thread Guido van Rossum
On Fri, Jan 20, 2012 at 6:33 PM, Steven D'Aprano st...@pearwood.info wrote:
 Guido van Rossum wrote:

 It should derive from BaseException.

 RuntimeError meets that requirement, and it is an existing exception so
 there are no issues with introducing a new built-in exception to a point
 release.

 py issubclass(RuntimeError, BaseException)
 True

Sorry, I was ambiguous. I meant it should not derive from Exception.
It goes RuntimeError -> StandardError -> Exception -> BaseException.

-- 
--Guido van Rossum (python.org/~guido)


[Python-Dev] Counting collisions for the win

2012-01-19 Thread Victor Stinner
Hi,

I've been working on the hash collision issue for 2 or 3 weeks now. I have
evaluated all the solutions and I think that I now have a good knowledge
of the problem and how it should be solved. The major issue is to have
a minor or no impact on applications (don't break backward
compatibility). I saw three major solutions:

 - use a randomized hash
 - use two hashes, a randomized hash and the actual hash kept for
backward compatibility
 - count collisions on dictionary lookup

Using a randomized hash does break a lot of tests (e.g. tests relying
on the representation of a dictionary). The patch is huge, too big to
backport it directly on stable versions. Using a randomized hash may
also break (indirectly) real applications because the application
output is also somehow randomized. For example, in the Django test
suite, the HTML output is different at each run. Web browsers may
render the web page differently, or crash, or ... I don't think that
Django would like to sort attributes of each HTML tag, just because we
wanted to fix a vulnerability.

Randomized hashing also has a major issue: if the attacker is able to
compute the secret, (s)he can easily compute collisions and exploit
the hash collision vulnerability again. I don't know exactly how
complex it is to compute the secret, but our hash function is weak (it
is far from being cryptographic, it is really simple to run it
backward). If someone writes a fast function to compute the secret, we
will go back to the same point.

IMO using two hashes has the same disadvantages as the randomized hash
solution, while being more complex to implement.

The last solution is very simple: count collisions and raise an
exception if the count hits a limit. The patch is something like 10 lines,
whereas the randomized hash is closer to 500 lines, adds a new
file, changes the Visual Studio project file, etc. First I thought that it
would break more applications than the randomized hash, but I tried on
Django: the test suite fails with a limit of 20 collisions, but not
with a limit of 50 collisions, whereas the patch uses a limit of 1000
collisions. According to my basic tests, a limit of 35 collisions
requires a dictionary with more than 10,000,000 integer keys to raise
an error. I am not talking about the attack, but valid data.
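
A pure-Python sketch of the counting idea (the real patch sits in the C lookup
routine in Objects/dictobject.c). The linear probing, the 1000 limit, and the
choice of RuntimeError are simplifications; the sketch only illustrates the
control flow of counting probes that hit a different key and bailing out past
the limit.

    COLLISION_LIMIT = 1000               # the limit discussed above; arbitrary here

    def insert(table, key, value):
        # Simplified open addressing with linear probing (not CPython's
        # perturb-based scheme); the point is only where the counter lives.
        mask = len(table) - 1            # table size is a power of two
        i = hash(key) & mask
        collisions = 0
        while table[i] is not None:
            stored_key, _ = table[i]
            if stored_key == key:        # same key: overwrite, nothing to count
                break
            collisions += 1
            if collisions > COLLISION_LIMIT:
                raise RuntimeError("too many hash collisions")
            i = (i + 1) & mask
        table[i] = (key, value)

    table = [None] * 8
    insert(table, "spam", 1)
    insert(table, "eggs", 2)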

More details about my tests on the Django test suite:
http://bugs.python.org/issue13703#msg151620

--

I propose to solve the hash collision vulnerability by counting
collisions because it does fix the vulnerability with a minor or no
impact on applications or backward compatibility. I don't see why we
should use a different fix for Python 3.3. If counting collisions
solves the issue for stable versions, it is also enough for Python
3.3. We now know all issues of the randomized hash solution, and I
think that there are more drawbacks than advantages. IMO the
randomized hash is overkill to fix the hash collision issue.

I just have some requests on Marc-Andre Lemburg's patch:

 - the limit should be configurable: a new function in the sys module
should be enough. It may be private (or replaced by an environment
variable?) in stable versions
 - the set type should also be patched (I didn't check if it is
vulnerable or not using the patch)
 - the patch has no test! (a class with a fixed hash should be enough
to write a test)
 - the limit must be documented somewhere
 - the exception type should be different than KeyError

Victor


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Guido van Rossum
On Thu, Jan 19, 2012 at 4:48 PM, Victor Stinner 
victor.stin...@haypocalc.com wrote:

 Hi,

 I'm working on the hash collision issue since 2 or 3 weeks. I
 evaluated all solutions and I think that I have now a good knowledge
 of the problem and how it should be solved. The major issue is to have
 a minor or no impact on applications (don't break backward
 compatibility). I saw three major solutions:

  - use a randomized hash
  - use two hashes, a randomized hash and the actual hash kept for
 backward compatibility
  - count collisions on dictionary lookup

 Using a randomized hash does break a lot of tests (e.g. tests relying
 on the representation of a dictionary). The patch is huge, too big to
 backport it directly on stable versions. Using a randomized hash may
 also break (indirectly) real applications because the application
 output is also somehow randomized. For example, in the Django test
 suite, the HTML output is different at each run. Web browsers may
 render the web page differently, or crash, or ... I don't think that
 Django would like to sort attributes of each HTML tag, just because we
 wanted to fix a vulnerability.

 Randomized hash has also a major issue: if the attacker is able to
 compute the secret, (s)he can easily compute collisions and exploit
 the hash collision vulnerability again. I don't know exactly how
 complex it is to compute the secret, but our hash function is weak (it
 is far from being cryptographic, it is really simple to run it
 backward). If someone writes a fast function to compute the secret, we
 will go back to the same point.

 IMO using two hashes has the same disavantages of the randomized hash
 solution, whereas it is more complex to implement.

 The last solution is very simple: count collision and raise an
 exception if it hits a limit. The path is something like 10 lines
 whereas the randomized hash is more close to 500 lines, add a new
 file, change Visual Studio project file, etc. First I thaught that it
 would break more applications than the randomized hash, but I tried on
 Django: the test suite fails with a limit of 20 collisions, but not
 with a limit of 50 collisions, whereas the patch uses a limit of 1000
 collisions. According to my basic tests, a limit of 35 collisions
 requires a dictionary with more than 10,000,000 integer keys to raise
 an error. I am not talking about the attack, but valid data.

 More details about my tests on the Django test suite:
 http://bugs.python.org/issue13703#msg151620

 --

 I propose to solve the hash collision vulnerability by counting
 collisions because it does fix the vulnerability with a minor or no
 impact on applications or backward compatibility. I don't see why we
 should use a different fix for Python 3.3. If counting collisons
 solves the issue for stable versions, it is also enough for Python
 3.3. We now know all issues of the randomized hash solution, and I
 think that there are more drawbacks than advantages. IMO the
 randomized hash is overkill to fix the hash collision issue.


+1


 I just have some requests on Marc Andre Lemburg patch:

  - the limit should be configurable: a new function in the sys module
 should be enough. It may be private (or replaced by an environment
 variable?) in stable versions
  - the set type should also be patched (I didn't check if it is
 vulnerable or not using the patch)
  - the patch has no test! (a class with a fixed hash should be enough
 to write a test)
  - the limit must be documented somwhere
  - the exception type should be different than KeyError

 Victor




-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Ivan Kozik
On Fri, Jan 20, 2012 at 00:48, Victor Stinner
victor.stin...@haypocalc.com wrote:
 I propose to solve the hash collision vulnerability by counting
 collisions because it does fix the vulnerability with a minor or no
 impact on applications or backward compatibility. I don't see why we
 should use a different fix for Python 3.3. If counting collisons
 solves the issue for stable versions, it is also enough for Python
 3.3. We now know all issues of the randomized hash solution, and I
 think that there are more drawbacks than advantages. IMO the
 randomized hash is overkill to fix the hash collision issue.

I'd like to point out that an attacker is not limited to sending just
one dict full of colliding keys.  Given a 22ms stall for a dict full
of 1000 colliding keys, and 100 such objects inside a parent object
(perhaps JSON), you can stall a server for 2.2+ seconds.  Going with
the raise-at-1000 approach doesn't solve the problem for everyone.

In addition, because the raise-at-N-collisions approach raises an
exception, everyone who wants to handle this error condition properly
has to change their code to catch a previously-unexpected exception.
(I know they're usually still better off with the fix, but why force
many people to change code when you can actually fix the hashing
problem?)

Another issue is that even with a configurable limit, different
modules can't have their own limits.  One module might want a
relatively safe raise-at-100, and another module creating massive
dicts might want raise-at-1000.  How does a developer know whether
they can raise or lower the limit, given that they use a bunch of
different modules?

I actually went with this stop-at-N-collisions approach by patching my
CPython a few years ago, where I limiting dictobject and setobject's
critical `for` loop to 100 iterations (I realize this might handle
fewer than 100 collisions.)  This worked fine until I tried to compile
PyPy, where the translator blew up due to a massive dict.  This,
combined with the second problem (needing to catch an exception), led
me to abandon this approach and write Securetypes, which has a
securedict that uses SHA-1.  Not that I like this either; I think I'm
happy with the randomize-hash() approach.
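
Not Ivan's actual Securetypes code -- just a sketch of the general approach he
names: give each mapping its own random salt and derive key hashes from SHA-1
over salt plus key, so colliding keys cannot be precomputed without knowing
the salt. All class and attribute names are illustrative.

    import hashlib, os, struct

    class _SaltedKey(object):
        # Wraps a string key so its hash comes from salted SHA-1 rather than
        # the predictable built-in string hash.
        def __init__(self, salt, key):
            self.salt = salt
            self.key = key
        def __hash__(self):
            digest = hashlib.sha1(self.salt + self.key.encode("utf-8")).digest()
            return struct.unpack("<q", digest[:8])[0]
        def __eq__(self, other):
            return isinstance(other, _SaltedKey) and self.key == other.key

    class SecureDict(object):
        # Minimal illustration, not a drop-in dict replacement.
        def __init__(self):
            self._salt = os.urandom(16)
            self._data = {}
        def __setitem__(self, key, value):
            self._data[_SaltedKey(self._salt, key)] = value
        def __getitem__(self, key):
            return self._data[_SaltedKey(self._salt, key)]

    d = SecureDict()
    d["user"] = "frank"
    print(d["user"])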

Ivan


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Guido van Rossum
On Thu, Jan 19, 2012 at 7:32 PM, Ivan Kozik i...@ludios.org wrote:

 On Fri, Jan 20, 2012 at 00:48, Victor Stinner
 victor.stin...@haypocalc.com wrote:
  I propose to solve the hash collision vulnerability by counting
  collisions because it does fix the vulnerability with a minor or no
  impact on applications or backward compatibility. I don't see why we
  should use a different fix for Python 3.3. If counting collisons
  solves the issue for stable versions, it is also enough for Python
  3.3. We now know all issues of the randomized hash solution, and I
  think that there are more drawbacks than advantages. IMO the
  randomized hash is overkill to fix the hash collision issue.

 I'd like to point out that an attacker is not limited to sending just
 one dict full of colliding keys.  Given a 22ms stall for a dict full
 of 1000 colliding keys, and 100 such objects inside a parent object
 (perhaps JSON), you can stall a server for 2.2+ seconds.  Going with
 the raise-at-1000 approach doesn't solve the problem for everyone.


It's just a DoS attack. Those won't go away. We just need to raise the
effort needed for the attacker. The original attack would cause something
like 5 minutes of CPU usage per request (with a set of colliding keys that
could be computed once and used to attack every Python-run website in the
world). That's at least 2 orders of magnitude worse.

In addition, because the raise-at-N-collisions approach raises an
 exception, everyone who wants to handle this error condition properly
 has to change their code to catch a previously-unexpected exception.
 (I know they're usually still better off with the fix, but why force
 many people to change code when you can actually fix the hashing
 problem?)


Why would anybody need to change their code? Every web framework worth its
salt has a top-level error catcher that logs the error, serves a 500
response, and possibly does other things like email the admin.
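
(As a sketch of such a top-level catcher, not any particular framework's
implementation: WSGI middleware along these lines would turn any unexpected
exception, including whatever a collision limit might raise, into a logged
500 with no per-application code changes.)

    import logging
    import sys

    def catch_all(app):
        """Wrap a WSGI app so any unexpected exception, including a hypothetical
        'too many collisions' error, becomes a logged 500 instead of a crash."""
        def wrapped(environ, start_response):
            try:
                return app(environ, start_response)
            except Exception:
                logging.exception("unhandled error serving %s",
                                  environ.get("PATH_INFO", "?"))
                start_response("500 Internal Server Error",
                               [("Content-Type", "text/plain")],
                               sys.exc_info())
                return [b"internal server error\n"]
        return wrapped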


 Another issue is that even with a configurable limit, different
 modules can't have their own limits.  One module might want a
 relatively safe raise-at-100, and another module creating massive
 dicts might want raise-at-1000.  How does a developer know whether
 they can raise or lower the limit, given that they use a bunch of
 different modules?


I don't think it needs to be configurable. There just needs to be a way to
turn it off.


 I actually went with this stop-at-N-collisions approach by patching my
 CPython a few years ago, where I limited dictobject and setobject's
 critical `for` loop to 100 iterations (I realize this might handle
 fewer than 100 collisions.)  This worked fine until I tried to compile
 PyPy, where the translator blew up due to a massive dict.


I think that's because your collision-counting algorithm was much more
primitive than MAL's.


 This,
 combined with the second problem (needing to catch an exception), led
 me to abandon this approach and write Securetypes, which has a
 securedict that uses SHA-1.  Not that I like this either; I think I'm
 happy with the randomize-hash() approach.


Why did you need to catch the exception? Were you not happy with the
program simply terminating with a traceback when it got attacked?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Steven D'Aprano

Victor Stinner wrote:


The last solution is very simple: count collisions and raise an
exception if it hits a limit. ...
According to my basic tests, a limit of 35 collisions
requires a dictionary with more than 10,000,000 integer keys to raise
an error. I am not talking about the attack, but valid data.
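
(The actual proposal targets CPython's C dict implementation; purely as an
illustration of the mechanism, a simplified pure-Python open-addressing table
that counts probes per insert could look like the following. It is not
CPython's real dict and omits resizing.)

    COLLISION_LIMIT = 35    # the limit mentioned above

    class CountingTable:
        """Simplified open-addressing table (NOT CPython's real dict: no
        resizing, fixed power-of-two size) that counts probes per insert and
        raises once the limit is exceeded."""

        def __init__(self, size=64):                 # size must be a power of two
            self.mask = size - 1
            self.slots = [None] * size               # slot: None or (hash, key, value)

        def insert(self, key, value):
            h = hash(key)
            perturb = h & 0xFFFFFFFFFFFFFFFF         # treat the hash as unsigned
            i = h & self.mask
            collisions = 0
            while True:
                entry = self.slots[i]
                if entry is None:                    # empty slot: store here
                    self.slots[i] = (h, key, value)
                    return
                if entry[0] == h and entry[1] == key:
                    self.slots[i] = (h, key, value)  # same key: overwrite
                    return
                collisions += 1
                if collisions > COLLISION_LIMIT:
                    raise RuntimeError("too many hash collisions inserting %r" % (key,))
                i = (5 * i + 1 + perturb) & self.mask   # CPython-style probe sequence
                perturb >>= 5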


You might think that 10 million keys is a lot of data, but that's only about 
100 MB worth. I already see hardware vendors advertising computers with 6 GB 
RAM as entry level, e.g. the HP Pavilion starts with 6GB expandable to 16GB. 
I expect that there are already people using Python who will unpredictably hit 
that limit by accident, and the number will only grow as computers get more 
memory.


With a limit of 35 collisions, it only takes 35 keys to force a dict to 
raise an exception, if you are an attacker able to select colliding keys. 
We're trying to defend against an attacker who is able to force collisions, 
not one who is waiting for accidental collisions. I don't see that causing the 
dict to raise an exception helps matters: it just changes the attack from 
"keep the dict busy indefinitely" to "cause an exception and crash the 
application."
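
(Producing such colliding keys is indeed cheap. On CPython builds where
sys.hash_info.modulus is 2**61 - 1, which is the usual 64-bit case, integers
that differ by that modulus hash identically, and integer hashing is not
affected by string-hash randomization.)

    import sys

    M = sys.hash_info.modulus           # 2**61 - 1 on typical 64-bit CPython builds

    # 35 distinct integer keys sharing one hash value, generated for free.
    colliding_keys = [1 + i * M for i in range(35)]

    assert len(set(colliding_keys)) == 35
    assert len({hash(k) for k in colliding_keys}) == 1

    d = dict.fromkeys(colliding_keys)   # every insert probes the same slot chain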


This moves responsibility for dealing with collisions out of the dict and into the 
application code. Instead of solving the problem in one place (the built-in 
dict), now every application that uses dicts has to identify which dicts can be 
attacked, and deal with the exception.


That pushes the responsibility for security onto people who are the least 
willing or able to deal with it: the average developer, who neither 
understands nor cares about security, or if they do care, they can't convince 
their manager to care.


I suppose an exception is an improvement over the application hanging 
indefinitely, but I'd hardly call it a fix.


Ruby uses randomized hashes. Are there any other languages with a dict or 
mapping class that raises an exception on too many collisions?



--
Steven


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Ivan Kozik
On Fri, Jan 20, 2012 at 03:48, Guido van Rossum gu...@python.org wrote:
 I think that's because your collision-counting algorithm was much more
 primitive than MAL's.

Conceded.

 This,
 combined with the second problem (needing to catch an exception), led
 me to abandon this approach and write Securetypes, which has a
 securedict that uses SHA-1.  Not that I like this either; I think I'm
 happy with the randomize-hash() approach.


 Why did you need to catch the exception? Were you not happy with the program
 simply terminating with a traceback when it got attacked?

No, I wasn't happy with termination.  I wanted to treat it just like a
JSON decoding error, and send the appropriate response.

I actually forgot to mention the main reason I abandoned the
stop-at-N-collisions approach.  I had a server with a dict that stayed
in memory, across many requests.  It was being populated with
identifiers chosen by clients.  I couldn't have my server stay broken
if this dict filled up with a bunch of colliding keys.  (I don't think
I could have done anything else either, like nuking the dict or evicting
some keys.)

Ivan


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Carl Meyer

Hi Victor,

On 01/19/2012 05:48 PM, Victor Stinner wrote:
[snip]
 Using a randomized hash may
 also break (indirectly) real applications because the application
 output is also somehow randomized. For example, in the Django test
 suite, the HTML output is different at each run. Web browsers may
 render the web page differently, or crash, or ... I don't think that
 Django would like to sort attributes of each HTML tag, just because we
 wanted to fix a vulnerability.

I'm a Django core developer, and if it is true that our test-suite has a
dictionary-ordering dependency that is expressed via HTML attribute
ordering, I consider that a bug and would like to fix it. I'd be
grateful for, not resentful of, a change in CPython that revealed the
bug and prompted us to fix it. (I presume that it is true, as it sounds
like you experienced it directly; I don't have time to play around at
the moment, but I'm surprised we haven't seen bug reports about it from
users of 64-bit Pythons long ago). I can't speak for the core team, but
I doubt there would be much disagreement on this point: ideally Django
would run equally well on any implementation of Python, and as far as I
know none of the alternative implementations guarantee hash or
dict-ordering compatibility with CPython.
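
(As one illustration, not Django's actual approach, of how such a test
dependency can be removed: HTML attribute order can be normalized before
comparison.)

    from html.parser import HTMLParser

    class TagCollector(HTMLParser):
        """Collect (tag, attributes-sorted-by-name) pairs so a comparison does
        not depend on the order in which attributes happened to be emitted."""
        def __init__(self):
            super().__init__()
            self.tags = []
        def handle_starttag(self, tag, attrs):
            self.tags.append((tag, sorted(attrs, key=lambda item: item[0])))

    def normalized(html):
        collector = TagCollector()
        collector.feed(html)
        return collector.tags

    # Two renderings that differ only in attribute order compare equal:
    assert (normalized('<input type="text" name="q">')
            == normalized('<input name="q" type="text">'))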

I don't have the expertise to speak otherwise to the alternatives for
fixing the collisions vulnerability, but I don't believe it's accurate
to presume that Django would not want to fix a dict-ordering dependency,
and use that as a justification for one approach over another.

Carl


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Nick Coghlan
On Fri, Jan 20, 2012 at 2:00 PM, Steven D'Aprano st...@pearwood.info wrote:
 With a limit of 35 collisions, it only takes 35 keys to force a dict to
 raise an exception, if you are an attacker able to select colliding keys.
 We're trying to defend against an attacker who is able to force collisions,
 not one who is waiting for accidental collisions. I don't see that causing
 the dict to raise an exception helps matters: it just changes the attack
 from "keep the dict busy indefinitely" to "cause an exception and crash the
 application."

No, that's fundamentally misunderstanding the nature of the attack.
The reason the hash collision attack is a problem is that it allows
you to DoS a web service in a way that requires minimal client side
resources but can have a massive effect on the server. The attacker is
making a single request that takes the server an inordinately long
time to process, consuming CPU resources all the while, and likely
preventing the handling of any other requests (especially for an
event-based server, since the attack is CPU based, bypassing all use
of asynchronous IO).

With the 1000 collision limit in place, the attacker sends their
massive request, the affected dict quickly hits the limit, throws an
unhandled exception which is then caught by the web framework and
turned into a 500 Error response (or whatever's appropriate for the
protocol being attacked).

If a given web service doesn't *already* have a catch-all handler to
keep an unexpected exception from bringing the entire service down,
then DoS attacks like this one are the least of its worries.

As for why other languages haven't gone this way, I have no idea.
There are lots of details relating to a language's hash and hash map
design that will drive how suitable randomisation is as an answer, and
it also depends greatly on how you decide to characterise the threat.

FWIW, Victor's analysis in the opening post of this thread matches the
conclusions I came to a few days ago, although he's been over the
alternatives far more thoroughly than I have.

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Nick Coghlan
On Fri, Jan 20, 2012 at 2:54 PM, Carl Meyer c...@oddbird.net wrote:
 I don't have the expertise to speak otherwise to the alternatives for
 fixing the collisions vulnerability, but I don't believe it's accurate
 to presume that Django would not want to fix a dict-ordering dependency,
 and use that as a justification for one approach over another.

It's more a matter of wanting deployment of a security fix to be as
painless as possible - a security fix that system administrators can't
deploy because it breaks critical applications may as well not exist.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Counting collisions for the win

2012-01-19 Thread Glenn Linderman

On 1/19/2012 8:54 PM, Carl Meyer wrote:

[snip]

I'm a Django core developer, and if it is true that our test-suite has a
dictionary-ordering dependency that is expressed via HTML attribute
ordering, I consider that a bug and would like to fix it. I'd be
grateful for, not resentful of, a change in CPython that revealed the
bug and prompted us to fix it.
[snip]

Carl


It might be a good idea to have a way to seed the hash with some value 
to allow testing with different dict orderings. That would let tests be 
developed on one Python implementation and still be immune to the 
different orderings of other implementations. However, randomizing the 
hash not only fails to solve the problem for long-running applications, 
it also makes performance non-deterministic from one run to the next, 
even with exactly the same data: a different (random) seed could 
sporadically cause collisions in data that usually performs well, with 
little explanation for it and little way to reproduce the problem in 
order to report or understand it.
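
(CPython's hash randomization exposes exactly this kind of knob through the
PYTHONHASHSEED environment variable: with randomization enabled, a fixed seed
makes the ordering reproducible from run to run, which addresses the
reproducibility concern above. A minimal sketch, checked from a parent
process via sys.executable:)

    import os
    import subprocess
    import sys

    def child_hash(seed):
        """hash('abc') as seen by a child interpreter started with a fixed seed."""
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        out = subprocess.check_output(
            [sys.executable, "-c", "print(hash('abc'))"], env=env)
        return int(out.decode().strip())

    assert child_hash(12345) == child_hash(12345)   # same seed: reproducible ordering
    print(child_hash(1), child_hash(2))             # different seeds: (almost surely) different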