Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

2010-06-25 Thread Greg Hudson
On Fri, 2010-06-25 at 08:45 -0400, michael.fe...@evonik.com wrote:
> I am actually more interested in finding a reliable solution 
> than in discussing mathematics and probabilities.

The discussion of probabilities affects whether it would be justifiable
to change Subversion to address hash collisions.

> 1. You are comparing apples and oranges. 
> 2. You can't balance the probability of one error 
>    against that of another.

All systems have a probability of failure, resulting from both human and
mechanical elements within the system.  It may be difficult to estimate
precisely, but one can often establish a lower bound.  The question is
whether hash-indexed storage increases the probability of failure by a
non-negligible amount.

>It often results in something like: 
>  square_root( a_1* (error_1 ^2) + a_2 * (error_2 ^2) + ...)

We're discussing failure rates, not margin of error propagation.
Failure rates propagate as 1 - ((1 - failure_1) * (1 - failure_2) * ...)
if the failure probabilities are independent.

If your system has a probability of failure of one in a million from
other factors, and we add in an independent failure probability of one
in 2^32 from hash-indexed storage, then the overall system failure
probability is one in 999767--that is, it doesn't change much.
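
To make that concrete, here is the same arithmetic as a small Python
sketch (the one-in-a-million and one-in-2^32 figures are just the
assumptions from the paragraph above):

[[[
# Independent failure probabilities combine as
#   p_total = 1 - (1 - p_1) * (1 - p_2) * ...
p_other = 1e-6         # assumed failure rate from all other sources
p_hash  = 2.0 ** -32   # added failure rate from hash-indexed storage

p_total = 1 - (1 - p_other) * (1 - p_hash)
print(round(1 / p_total))   # ~999767, still about one in a million
]]]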

> 3. You overestimate the risk of undetected hardware faults.

I think you over-estimate the risk of hash-indexed storage collisions.

>    There is no evidence that the hash values are 
>    equally distributed over the data sets, which is important for 
>    the use of hashing methods in data fetching.

A hash which had a substantially unequal distribution of hash values
among inputs would not be cryptographically sound.

> = 3.21*10^2427 sequences of data of 1K size 
> represented by the same hash value.

First, SHA-1 is a 160-bit hash, not a 128-bit hash.  Second, the number
you calculated does not inform the probability of a collision.

If you have N samples, which are not specifically constructed so as to
break SHA-1, then the probability of a SHA-1 collision is roughly N^2 /
2^160 (see "birthday paradox" for more precision).  So, for example,
with 2^64 representations (1.8 * 10^19), there would be a roughly 2^-32
probability of a SHA-1 collision in the rep cache.  If you can construct
a system with close to a one in four billion probability of error from
other sources, kudos to you; if not, hash-indexed storage is not
perceptibly increasing your error rate.
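
For illustration, the same rough estimate as a Python sketch (N = 2^64
is just the example figure above, not a claim about any real
repository):

[[[
# Rough birthday-paradox bound used above:
#   P(collision among N samples) ~= N^2 / 2^160
# (the exact birthday formula is somewhat smaller, but the same order)
N = 2 ** 64                        # ~1.8 * 10^19 distinct representations
p_collision = N ** 2 / 2.0 ** 160
print(p_collision)                 # ~2.3e-10, i.e. roughly one in 2^32
]]]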




Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

2010-06-25 Thread michael.felke
Hello,

I am actually more interested in finding a reliable solution 
than in discussing mathematics and probabilities.
But to make it short, you are wrong!

1. You are comparing apples and oranges. 

   A corruption due to faulty hardware is a random error, 
   because it's not inherent to the method you use,
   in contrast to the data corruption by a SHA-1 collision 
   in the rep-sharing implementation, 
   which is an inherent bias of the implementation.

2. You can't balance the probability of one error 
   against that of another.

   The overall error of a system is the combination of 
   all possible errors, random and systematic ones.
   It often results in something like:
 square_root( a_1* (error_1 ^2) + a_2 * (error_2 ^2) + ...) 
   As the rep-sharing SHA-1 also depends on the hardware, 
   it wouldn't be trivial to calculate the overall system error 
   from these non-linearly combined probabilities.
   For details look at 
   http://en.wikipedia.org/wiki/Propagation_of_uncertainty,
   but in general the probability of an error enters 
   the calculation with its second power.
   (A small numeric sketch of this combination follows this list.)
 
3. You overestimate the risk of undetected hardware faults.

   Hardware faults are a long-known and 
   well-controllable problem. 
   Operating and network systems have a long tradition of 
   implementing methods to detect data corruption caused by 
   hardware faults. It's an essential part of their design.
   In addition, chemical firms and chemical software developers 
   are doing a lot to detect and prevent data corruption,
   whether due to hardware faults or any other source, 
   such as willfully acting humans: using checksums, 
   data replication, virtual machines, redundant systems, etc.

4. You underestimate the error introduced by misusing mathematical methods.

   As I already said in my first e-mail, SHA-1 was developed 
   to detect random and willful data manipulation.
   It's a cryptographic hash, so there is a low chance of 
   guessing or calculating a derived data sequence
   which generates the same hash value as the original data. 
   But this is the only thing it ensures.
   There is no evidence that the hash values are 
   equally distributed over the data sets, which is important for 
   the use of hashing methods in data fetching.
   In fact, as it's a cryptographic hash, 
   you should not be able to calculate that distribution, 
   because this would mean that you are able 
   to calculate sets of data resulting in the same hash value.
   So you can't conclude from the low chance of 
   guessing or calculating a derived data sequence that there is 
   a low chance of hash collisions in general.
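
To make the shape of that combination concrete, here is a small numeric
sketch (Python; the weights a_i and the individual error values are
invented purely for illustration):

[[[
# Quadratic ("propagation of uncertainty") combination from point 2.
from math import sqrt

a = [1.0, 1.0]                 # assumed weights a_1, a_2
errors = [1e-6, 2.0 ** -32]    # assumed individual error terms
combined = sqrt(sum(a_i * e ** 2 for a_i, e in zip(a, errors)))
print(combined)                # dominated by the largest single term
]]]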

At last, I want to give a short example calculation:
if we have a hash value with a size of 128 bits and we
assume the algorithm generates equally distributed hash values,
then there are 2^128 = 3.40*10^38 different hash values 
to represent data sequences. That sounds like a lot.

But how many different data sequences are there to represent?
Let us take short binary files of 1K, i.e. 1024 octets.
The 1st octet has 256 values, which combine 
with 256 values of the 2nd 
and 256 values of the 3rd, etc.
So there are 256^1024 = 1.09*10^2466 different data sequences 
of 1K size.
This means for every hash value there are 
(256^1024)/(2^128) 
= (2^(8*1024))/(2^128) 
= (2^8192)/(2^128) 
= 2^(8192-128) 
= 2^8064 
= 3.21*10^2427 sequences of data of 1K size 
represented by the same hash value. 
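
The arithmetic above, redone as a quick Python sketch (the 128-bit hash
size is just the assumption made in this example):

[[[
# Pigeonhole count from the example above: how many distinct 1K inputs
# map, on average, onto each value of an assumed 128-bit hash.
hash_values = 2 ** 128             # 128-bit hash, as assumed above
inputs_1k   = 256 ** 1024          # all possible 1024-octet files
print(inputs_1k // hash_values)    # == 2**8064, about 3.21*10^2427
]]]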

I hope this gives a clue about the problem 
we have with this implementation and 
why I am so interested in finding a reliable solution.
Now I hope to find one or two experienced Subversion developers
who are willing to assist me in solving the problem.

Greetings

Michael Felke
Telephone +49 2151 38-1453
Fax +49 2151 38-1094
michael.fe...@evonik.com
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com

Management: Gunther Wittmer (Spokesman), Willibrord Lampen

Registered office: Krefeld
Court of registration: Amtsgericht Krefeld; Commercial Register HRB 5791






Greg Hudson 
24.06.2010 18:41
 
To:     "michael.fe...@evonik.com" 
Cc:     "dev@subversion.apache.org" 
Subject: Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs


Re: Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

2010-06-24 Thread Greg Hudson
On Thu, 2010-06-24 at 11:29 -0400, michael.fe...@evonik.com wrote:
> We must ensure that the data in the repository is, beyond any doubt, 
> the data we once measured or wrote. 

You do realize that the probability of data corruption due to faulty
hardware is much, much more likely than the probability of corruption
due to a rep-sharing SHA-1 collision, right?




Antwort: Re: dangerous implementation of rep-sharing cache for fsfs

2010-06-24 Thread michael.felke
Hello,

sorry, but our e-mail system doesn't support the usual way of 
quoting the message being replied to.

First, we are using svn in a chemical laboratory to save, archive and 
version the data and methods of our measurements. 
We must ensure that the data in the repository is, beyond any doubt, 
the data we once measured or wrote. 
So only a totally reliable solution for the rep-sharing cache 
would be acceptable to us.

Yes, and I am interested in helping you to improve Subversion 
by writing the needed code. But I am not sure that I will be able to
compile Subversion completely here at work; I could try.
Perhaps someone is willing to help me test my code?

Thanks for the hint about svn_fs_fs__set_rep_reference; 
I didn't expect the additional checks to be there.
I looked there, but couldn't tell at first glance 
when this check is performed. I will dig deeper later.

I think it's better to add a check on the MD5 than on any part of the fulltext,
because it's calculated over the whole data, too.
But it isn't that important to me; 
it only reduces the risk, it does not eliminate it.

Greetings

P.S. I am also sorry for the signature, which we are advised to use.

Michael Felke
Telephone +49 2151 38-1453
Fax +49 2151 38-1094
michael.fe...@evonik.com
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com

Management: Gunther Wittmer (Spokesman), Willibrord Lampen

Registered office: Krefeld
Court of registration: Amtsgericht Krefeld; Commercial Register HRB 5791






Daniel Shahaf 
24.06.2010 16:38
 
To:     Julian Foad 
Cc:     michael.fe...@evonik.com, dev@subversion.apache.org
Subject: Re: dangerous implementation of rep-sharing cache for fsfs


Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file.  If it is - if it does not
> include an extra header and is not stored in a delta format - then the

That is not the case:

[[[
A representation begins with a line containing either "PLAIN\n" or
"DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
and <length> give the location of the delta base of the representation
and the amount of data it contains (not counting the header or
trailer).  If no base location is given for a delta, the base is the
empty stream.  After the initial line comes raw svndiff data, followed
by a cosmetic trailer "ENDREP\n".
]]]

So there is a header and a trailer, and the data is possibly deltified or 
self-deltified.
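
For illustration only, a small sketch (Python; parse_rep_header is a
hypothetical helper, not Subversion's parser) of how such a header line
decodes:

[[[
# Decode a representation header per the format quoted above:
# "PLAIN", "DELTA", or "DELTA <rev> <offset> <length>".
def parse_rep_header(line):
    parts = line.rstrip("\n").split()
    if parts == ["PLAIN"]:
        return {"kind": "plain"}
    if parts == ["DELTA"]:
        return {"kind": "delta", "base": "empty stream"}
    if len(parts) == 4 and parts[0] == "DELTA":
        rev, offset, length = (int(p) for p in parts[1:])
        return {"kind": "delta", "base": (rev, offset, length)}
    raise ValueError("unrecognized representation header: %r" % line)

print(parse_rep_header("DELTA 4 1234 9876\n"))
]]]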

> chance of collision would depend directly on the content of the user's
> files.  If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
> 
> We should find out the answer to that question before going further.
> 
> 
> > Indeed, the number of hash collisions is only finite for a given file 
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely 
> > satisfying solution.
> > 
> > The number of undetected hash collisions could be reduced easily by also 
> > checking the md5-checksum, the size and the expanded-size.
> 

Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert
that the size and expanded size match.

It's indeed possible to also use md5 there.  Another option is to use
practically any statistic about the fulltext: the first N bytes, the
number of '#' characters, ...
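
To illustrate the idea only (this is not the actual C code in
rep-cache.c; the function and field names below are made up for the
sketch), the checks being discussed amount to roughly:

[[[
# Hypothetical sketch of the checks discussed in this thread.  A cached
# rep is reused only if every cheap statistic matches; an optional full
# comparison of the stored fulltext would remove the residual risk.
def can_share_rep(new, cached, compare_fulltext=False):
    if new.sha1 != cached.sha1:
        return False
    # cheap extra checks: sizes (already asserted today) and md5
    if (new.size, new.expanded_size) != (cached.size, cached.expanded_size):
        return False
    if new.md5 != cached.md5:
        return False
    if compare_fulltext:
        # the "totally reliable" option: byte-for-byte comparison
        return new.read_fulltext() == cached.read_fulltext()
    return True
]]]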

> True.  This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
> 
> > To make this feature totally reliable, a complete comparison of the 
> > file's content with the content of the old representation found is 
> > necessary
> 
> Yes, it would be good if Subversion could do this extra check.  Would
> you be interested in helping to improve Subversion by writing code to do
> this?  If so, you will be very welcome and we will try to help you.
> 

+1 from me too.

> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions.  I can't find a reference to
> it now.)
> 
> 
> Regards,