On Sun, Feb 26, 2017 at 07:29:30PM +0100, Branko Čibej wrote:
> On 26.02.2017 18:26, Paul Hammant wrote:
> > Why don't y'all take the same tactic as Git does - SHA1 the contents of the
> > file *and a prepended a type/length field* ?.
>
> And when the hash-colliding files happen to have the same type and
> length, as in the published collision...
>
> Ah, of course, Git is immune to that because it uses magic and pixie
> dust as well.

As far as I understand, Google's SHA1 collision relies on the specific
320 byte prefix which is shared by both PDF files being fed to SHA1
before any other data.

Git calculates a hash over 'blob LEN content-PDF-1' and
'blob LEN content-PDF-2'. It is the identical 'blob LEN' parts which
prevent a collision of the hashes of the resulting git blob objects,
since they are prepended to the common 320 byte prefix.

If another collision were found which triggers when the content of two
files is prefixed with 'blob LEN', then git would have a problem.

> The bottom line is that any data storage system that uses lossy
> content-based indexing is vulnerable to hash collisions. And both
> Subversion and Git developers were well aware of that when the
> vulnerable features were designed. For normal, day-to-day usage, SHA-1
> collisions are no more likely now than they were a week ago.

Right. The problem we have is that none of us ever bothered to
instrument SVN's code to simulate a hash collision and test what
happens. Of course we would expect only one of the contents to be
stored. But the system should not break in the way it does today.
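
For reference, here is a minimal Python sketch of the git blob hashing
described above. It is an illustration only, not code taken from git or
SVN, and the function names are made up for this example. Note that git
separates the 'blob LEN' header from the content with a NUL byte.

```python
import hashlib


def plain_sha1(content: bytes) -> str:
    """SHA1 over the raw file content only."""
    return hashlib.sha1(content).hexdigest()


def git_blob_sha1(content: bytes) -> str:
    """SHA1 the way git computes a blob ID: header, NUL byte, content.

    The header is the ASCII string 'blob <length>' where <length> is the
    content size in bytes, written in decimal.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    data = b"hello\n"
    # The two digests differ because the header is fed to SHA1 first.
    print("plain sha1:", plain_sha1(data))
    print("git blob  :", git_blob_sha1(data))
```

The second value should match what 'git hash-object' prints for a file
with the same content. Since the header goes into SHA1 before the file
content, two files which collide under plain SHA1 because of a specific
leading prefix are not expected to collide as git blobs.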