On 06.09.2015 02:08, Maxim Dounin wrote:

Well, not, I don't confuse anything. For sure, brute force attack
on a 128 bit hash requires approximately 2^64 attempts.

That is, a single nginx instance with 2^64 cached resources will
likely show up a collision. But that's not a number of resources
you'll be able to store on a single node - in particular, because
64-bit address space wouldn't be enough to address that many
cached items.

To obtain a collision of a 128-bit hash with at least 1%
probability, you'll need more than 10^18 resources cached on a
single node, which is not even close to a something possible as
well.

Assuming 1 billion of keys (which is way more than a single nginx
node can handle now, and will require about 125G of memory for a
cache keys zone), probability of a collision is less than 10^(-20).

Quoting https://en.wikipedia.org/wiki/Birthday_attack [2]:

For comparison, 10^(-18) to 10^(-15) is the uncorrectable bit
error rate of a typical hard disk.

1) I will try to explain you, that is not quite true with a small approximation: let our hash value be exact one byte large (8 bit), it can obtain 2^8 = 256 different hash values (and let it be perfect distributed). The relative frequency to encounter a collision - same hash for any random another inside this interval (Hv0 - Hv255) will be also 256, because circa each 256-th char sequence will obtain the same hash value (Hv0).
Will be such hash safe? Of course not, never.
But if we will hash any character sequences with max length 16 bytes, we will have 256^16 (~= 10^38) different variations of binary string (keys). The relation (and the probability (Pc) to have a collision for two any random strings) would be only (10^38/256 - 1)/10^38 * (10^38/256 - 2)/(10^38 - 1) ~= 0.000015. Small, right? But the relative frequency (Hrc) is still 256! This can be explained with really large count of different sequences, and so with large count of hash values (Hv1-Hv255) that are not equal with (Hv0).

But let us resolve the approximation: the hash value obtain 2^128 (~= 10^38), but how many values maximum should be hashed? It's unknown. Let our keys contain maximum 100 bytes, the count of variations of all possible strings will be 256^100 (~= 10^240). The probability to encounter of a collision and the relative frequency to encounter a collision will be a several order smaller (1e-78), but the relation between Hrc and Pc is comparable to example above (in the sense of relation between of both). And in this relation is similar (un)safe. Yes less likely (well 8 vs 128 bits) but still "unsafe".

And we can have keys with the length of 500 bytes...

And don't compare the probability of error rate in hard disks with probability of a collision for hashes of any *pseudo-random* two strings (stress mark to "pseudo-random"). This is in about the same as to compare a warm with soft.

I can write here larger two pages formulas to prove it. But... forget the probabilities and this approximation... we come to point 2.

2) For the *caching* it's at all not required to have such "safe" hash functions: - The hash function should create reasonably perfect distributed values; - The hash function should be fast as possible (we can get MurmurHash or something similar, significantly faster than md5); - We should always compare the keys, after cache entry with hash value was found, to be sure exact the same key was found; But that does not make our cache slower, because the generation of hash value can be faster through algorithms faster as md5 (for example the MMH3 is up to 90% faster as MD5); - In the very less likely case of collision we will just forget the cached entry with previous key or save it as array for serial access (not really expected by caching and large hash value, because rare and that is a cache - not a database that should always hold an entry).

I want implement that and post a PR, and can make it configurable (something like `fastcgi_cache_safe = on`) then you can compare the performance of it - would be faster as your md5/crc32 implementation.

Regards,
sebres.

_______________________________________________
nginx-devel mailing list
nginx-devel@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx-devel

Reply via email to