On Fri, 2010-06-25 at 08:45 -0400, michael.fe...@evonik.com wrote: > I am actually more interested in finding reliable solution > instead of discussing mathematics and probabilities.
The discussion of probabilities affects whether it would be justifiable to change Subversion to address hash collisions. > 1. You are comparing apples and oranges. > 2. you can't balance the possibility of one error > with the that of an other. All systems have a probability of failure, resulting from both human and mechanical elements within the system. It may be difficult to estimate precisely, but one can often establish a lower bound. The question is whether hash-indexed storage increases the probability of failure by a non-negligible amount. > It often results in something like: > square_root( a_1* (error_1 ^2) + a_2 * (error_2 ^2) + ...) We're discussing failure rates, not margin of error propagation. Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...) if the failure probabilities are independent. If your system has a probability of failure of one in a million from other factors, and we add in an independent failure probability of one in 2^32 from hash-indexed storage, then the overall system failure probability is one in 999767--that is, it doesn't change much. > 3. you over estimate the risk of undetected hardware faulty. I think you over-estimate the risk of hash-index storage collisions. > There is no evidence that the hash vales are > equally distributed on the data sets, which is import for > the us of hashing method in data fetching. A hash which had a substantially unequal distribution of hash values among inputs would not be cryptographically sound. > = 3,21*10^2427 sequences of Data of 1K size > represented by the same hash value. First, SHA-1 is a 160-bit hash, not a 128-bit hash. Second, the number you calculated does not inform the probability of a collision. If you have N samples, which are not specifically constructed as to break SHA-1, then probability of a SHA-1 collision is roughly N^2 / 2^160 (see "birthday paradox" for more precision). So, for example, with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32 probability of a SHA-1 collision in the rep cache. If you can construct a system with close to a one in four billion probability of error from other sources, kudos to you; if not, hash-indexed storage is not perceptibly increasing your error rate.