Hi!

>> Lets [try] to quantify the probability of reaching the collision limit C
> with a hashtable of size N and assuming a random hash distribution. The
> upper bound for this should be (N choose C) * (1/N)^(C-1), with (1/N)^(C-1)
> being the probability that C elements hash to one value and (N over C) the
> number of C element subsets of an N element set. For large N and N >> C we
> approximate (N choose C) to (e*N/C)^C / sqrt(2pi*C). As such our upper
> bound becomes N * (e/C)^C / sqrt(2pi*C). Choosing N = 2^31 (largest
> possible hashtable) we get for C = 20 probability 8.9E-10 and for C = 100
> probability 2.3E-149. The patch uses C = 1000.

I assume you're talking here about per-hashtable limit?
Here I notice two things:

1. Hash keys are not really random. Which can both improve and make it
worse, but we have a lot of keys named "key" and not a lot of keys named
"\x07\xF5\xDD".

2. This does not account for the fact that if we have two keys used by
app colliding, we'll probably see this collision a lot. Of course, if
we're counting per-hashtable, that would make is somewhat less, but still.

To validate this theory, I applied the following patch:

https://gist.github.com/smalyshev/2c3c085998ba81d6c163a27386aaacd8

And run composer update on one of my projects. I got these lines:

Destroyed hash with coll = 48454
Destroyed hash with coll = 41871
Destroyed hash with coll = 41828
Destroyed hash with coll = 40535
Destroyed hash with coll = 32774
Destroyed hash with coll = 22782
Destroyed hash with coll = 21381
Destroyed hash with coll = 15232
Destroyed hash with coll = 14745
Destroyed hash with coll = 12216
Destroyed hash with coll = 11742
Destroyed hash with coll = 11625
Destroyed hash with coll = 8916
Destroyed hash with coll = 6820
Destroyed hash with coll = 5911
Destroyed hash with coll = 5429
Destroyed hash with coll = 3390
Destroyed hash with coll = 1188

I.e. there were 18 hashes with more than 1000 collisions on the course
of the run. Is there something wrong with my count or am I counting
wrong things (I didn't see the patch for limiting counts so maybe I'm
still counting wrong things).

> In other words, it is extremely unlikely that you hit the collision limit
> by accident, with non-malicious data. So no, the user does not have to
> discriminate normal from malicious causes. If the limit triggers, it's
> malicious.

I'm not so sure of this now - unless, again, I misunderstand what we're
counting.

-- 
Stas Malyshev
smalys...@gmail.com

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Reply via email to