On quintidi 5 ventôse, an CCXXIV, David Wright wrote:
> 1) I do what fdupes does, i.e. identify files (in a benevolent
> environment) using the MD5 signature to detect duplicate
> contents.
You did not specify the average size of the files, nor how certain you want to be. If the files are large, I would suggest using a sparse hash function, i.e. a hash function that reads only small parts of each file, and then doing a full comparison, or computing a strong hash, only for the files that collide on it.

> >>> hashlib.algorithms_guaranteed
> {'md5', 'sha1', 'sha224', 'sha512', 'sha384', 'sha256'}
> >>> hashlib.algorithms_available
> {'MD4', 'md5', 'md4', 'sha1', 'MD5', 'dsaWithSHA', 'whirlpool', 'sha',
> 'SHA512', 'SHA256', 'ripemd160', 'sha512', 'SHA384', 'sha384',
> 'dsaEncryption', 'RIPEMD160', 'sha256', 'SHA224', 'SHA1',
> 'ecdsa-with-SHA1', 'DSA', 'SHA', 'sha224', 'DSA-SHA'}

These are all cryptographic hash functions: too strong for a preliminary test, yet insufficient for absolute certainty. Still, you can easily benchmark them.
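To illustrate the sparse-hash idea, here is a minimal sketch (my own, not from fdupes): hash only the file size plus a few kilobytes sampled from the start, middle, and end. The function name `sparse_hash` and the 4 KiB probe size are arbitrary choices; equal digests only mean "candidate duplicates" that still need a full comparison or a full-file hash.

```python
import hashlib
import os

CHUNK = 4096  # bytes read at each probe point; an arbitrary choice

def sparse_hash(path, chunk=CHUNK):
    """Cheap preliminary fingerprint of a file.

    Reads only the start, middle, and end of the file and mixes in
    its size. Collisions must be confirmed by a full comparison or
    a full-file strong hash before files are declared duplicates.
    """
    size = os.path.getsize(path)
    h = hashlib.md5(str(size).encode())  # mix the size into the digest
    with open(path, "rb") as f:
        # Probe offsets: beginning, middle, end (clamped for small files).
        for offset in (0, max(0, size // 2 - chunk // 2), max(0, size - chunk)):
            f.seek(offset)
            h.update(f.read(chunk))
    return h.hexdigest()
```

For small files the three probes overlap and the whole file is read anyway, so the function degrades gracefully; the win is only on large files, where it reads at most three chunks instead of gigabytes.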
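A benchmark along those lines can be sketched with `timeit` over a buffer of sample data (the 1 MiB size and the repetition count are arbitrary assumptions, adjust to your file sizes):

```python
import hashlib
import timeit

data = b"x" * (1 << 20)  # 1 MiB of sample data

# Time each guaranteed algorithm hashing the same buffer 100 times.
for name in ("md5", "sha1", "sha256", "sha512"):
    t = timeit.timeit(lambda: hashlib.new(name, data).hexdigest(), number=100)
    print(f"{name:8s} {t:.3f}s for 100 x 1 MiB")
```

On most machines this makes the speed differences between the algorithms obvious at a glance, which is all you need to pick one for the preliminary pass.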