On Tue, 23 Feb 2016, David Wright wrote:
> 1) I do what fdupes does, ie identify files (in a benevolent
> environment) using the MD5 signature to detect duplicate
> contents.
MD5 alone can be somewhat dangerous even in benevolent environments: if the data sets are large enough, or you are just unlucky, you are going to hit a collision and corrupt or lose data on dedup sooner or later. At least use data size + hash. But even that won't save you from collisions... the "full fix" is to use the hash (or size + hash) as a screen to detect possible matches: when it matches, compare the two data sets to ensure they are really equal before you trigger the dedup. I am not going to bother with the detail that you need to ensure neither data set can/did change under you between the comparison and the dedup getting committed to storage.

> 2) In view of your statement that faster hashes exist, I would
> like to explore replacing my use of MD5 by such a hash.

Any wide-enough hash will do if you use it just for screening, where you don't care about any security properties of the hash. And at that point, you might as well use a wide-enough CRC (ensure it is vectorizable and get the compiler to vectorize it!) if it proves to be faster than crypto hashes...

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
   them all and in the darkness grind them. In the Land of Redmond
   where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
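[A minimal sketch of the screen-then-verify approach described above, in Python. `find_duplicates` is a hypothetical helper for illustration, not how fdupes actually works; MD5 is kept purely as a screening hash, and a byte-for-byte comparison confirms each match before anything would be deduplicated.]

```python
import hashlib
import os
from collections import defaultdict
from filecmp import cmp

def find_duplicates(paths):
    """Group candidate files by (size, hash), then confirm with a byte compare."""
    # Screen 1: group by file size -- cheap, no reads of file contents.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Screen 2: within each size group, group by a hash of the contents.
    # Any wide-enough hash works here; no security property is needed.
    by_key = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size can have no duplicate
        for p in group:
            h = hashlib.md5()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_key[(size, h.digest())].append(p)

    # Verify: byte-for-byte comparison before declaring a duplicate,
    # so a hash collision alone can never cause data loss on dedup.
    dupes = []
    for group in by_key.values():
        keeper = group[0]
        for other in group[1:]:
            if cmp(keeper, other, shallow=False):
                dupes.append((keeper, other))
    return dupes
```

Note this sketch deliberately ignores the race mentioned above: a real tool would also have to ensure neither file changes between the comparison and the commit of the dedup.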