On Thu, Aug 31, 2017 at 10:02 PM, Mike Small <sma...@sdf.org> wrote:
> John Abreau <abre...@gmail.com> writes:
>
>> I've heard of tools using MD5 or SHA1 hashes to identify duplicates, and
>> potential issues with hash collisions causing false positives.
>
> By accident or maliciously? The numbers seem off for accidental
> collisions. An md5 sum is a 32 digit hex number. That gives
> 340282366920938463463374607431768211456 potential hash sums (or does the
> algorithm offer only a smaller subset?). I'm not going to bother to
> compute the probability of a collision. It's a very remote possibility,
> yes? For the malicious case, if someone's able to mess with the hashes
> used by deduplication code in your file system or in your hopefully
> almost as good userland equivalent (which of course must use git in some
> way or another for reasons that are not clear to me) you have unsolvable
> problems.
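
For what it's worth, the accidental case is easy to estimate with the usual birthday-bound approximation. Here is a rough sketch in Python, assuming the hash behaves like a uniform random function over all 2^128 values; the function name and the file counts are only for illustration:

```python
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_files share a hash.

    Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming hash values are uniformly distributed (an idealization).
    """
    space = 2 ** hash_bits
    exponent = -n_files * (n_files - 1) / (2 * space)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)

if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>14} files, 128-bit hash: p ~= {collision_probability(n, 128):.3e}")
```

Under that assumption the chance stays negligible for any realistic number of files, and it only approaches even odds at around 2^64 files.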

Does git compare only the checksum, or does it also look at file size?
I would think that comparing file size as well would make a collision
even harder to get. The only duplicate checksum that I've ever seen in
practice was on zero-length files. Zero-length files are, of course, all
perfect duplicates of each other... :-)

Bill Bogstad
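
On the git question: as far as I understand it, git does not compare the size as a separate step. For a blob it hashes a short header containing the object type and the byte length, followed by the contents, so the length is already folded into the object ID. A minimal sketch of that, assuming a SHA-1 repository (`git_blob_id` is a made-up helper name, not a git API):

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the ID git assigns to a blob with these contents.

    Assumes the SHA-1 object format: the hash covers the header
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

if __name__ == "__main__":
    # Every zero-length file gets the same ID, as expected
    print(git_blob_id(b""))
    # A plain SHA-1 of the same bytes differs, because it has no header
    print(hashlib.sha1(b"").hexdigest())
```

The result should match what `git hash-object <file>` prints for a file with the same contents.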