On Fri, 2010-06-25 at 08:45 -0400, michael.fe...@evonik.com wrote:
> I am actually more interested in finding reliable solution 
> instead of discussing mathematics and probabilities.

The discussion of probabilities affects whether it would be justifiable
to change Subversion to address hash collisions.

> 1. You are comparing apples and oranges. 
> 2. you can't balance the possibility of one error 
>    with the that of an other.

All systems have a probability of failure, resulting from both human and
mechanical elements within the system.  It may be difficult to estimate
precisely, but one can often establish a lower bound.  The question is
whether hash-indexed storage increases the probability of failure by a
non-negligible amount.

>    It often results in something like: 
>      square_root( a_1* (error_1 ^2) + a_2 * (error_2 ^2) + ...)

We're discussing failure rates, not margin of error propagation.
Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...)
if the failure probabilities are independent.

If your system has a probability of failure of one in a million from
other factors, and we add in an independent failure probability of one
in 2^32 from hash-indexed storage, then the overall system failure
probability is one in 999767--that is, it doesn't change much.

> 3. you over estimate the risk of undetected hardware faulty.

I think you over-estimate the risk of hash-index storage collisions.

>    There is no evidence that the hash vales are 
>    equally distributed on the data sets, which is import for 
>    the us of hashing method in data fetching.

A hash which had a substantially unequal distribution of hash values
among inputs would not be cryptographically sound.

> = 3,21*10^2427 sequences of Data of 1K size 
> represented by the same hash value.

First, SHA-1 is a 160-bit hash, not a 128-bit hash.  Second, the number
you calculated does not inform the probability of a collision.

If you have N samples, which are not specifically constructed as to
break SHA-1, then probability of a SHA-1 collision is roughly N^2 /
2^160 (see "birthday paradox" for more precision).  So, for example,
with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32
probability of a SHA-1 collision in the rep cache.  If you can construct
a system with close to a one in four billion probability of error from
other sources, kudos to you; if not, hash-indexed storage is not
perceptibly increasing your error rate.


Reply via email to