[EP <[EMAIL PROTECTED]>]
> This inquiry may either turn out to be about the suitability of the
> SHA-1 (160 bit digest) for file identification, the sha function in
> Python ... or about some error in my script
It's your script.  Always open binary files in binary mode.  It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind of
sense in the mists of history, and has persisted in worship of Backward
Compatibility even though the original reason for it went away _long_
ago).

...

> I am trying to reduce duplicate files in storage at home - I have a
> large number of files (e.g. MP3s) which have been stored on disk
> multiple times under different names or on different paths.

...

> All seemed to be working until I examined my log files and found
> files with the same SHA digest had different sizes according to
> os.stat(fpath).st_size.  This is on Windows XP.
>
> - Am I expecting too much of SHA-1?

No.  SHA-1 should work well for this.

> - Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be
trusted ;-)

...

> def hashit(pth):
>     """Takes a file path and returns a SHA hash of its string"""
>     fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.  (A
corrected sketch appears at the end of this message.)

Do note that there are faster ways to go about this.  For example, if a
particular file has a unique size among the files you're looking at,
there's no reason to even open that file (let alone compute its hash).

Group the files by size, and within a size group you can find
duplicates by, e.g., first comparing their initial 16 bytes, then the
next 32, then the next 64 ... etc.  If duplicates are uncommon, this
can be a huge savings; a sketch of the idea also follows at the end.
For example, here's output from an old dup-finding Python program of
mine run over a directory tree which happens to contain no duplicates:

Files
-----
Total                10,718
Unique               10,718
Duplicate                 0
# w/ unique size     10,053
# w/ unique prefix      665

Bytes
-----
Total         1,401,668,015
Unique        1,401,668,015
Duplicate                 0
Read                 76,688
Excess               76,688

The last two lines mean it actually read a total of only about 77,000
file bytes on its way to proving there were no duplicates among about
11,000 files spanning about 1.4 gigabytes.

No hashes are computed by that program.  I'd use a hash if instead I
were going to save a persistent (across runs) set of known hashes, so
that answering "here's a new file -- same as one I already have?" could
be done by computing its hash.  While the program above generally reads
"amazingly" few file bytes, it hammers the OS file directory services,
and it would go faster to compute a hash of a new file than to run the
whole analysis again.
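To make the one-line fix concrete, here's a minimal corrected version
of the quoted function.  It's a sketch, not the OP's actual code: the
name hashit is kept from the post, 'rb' is the crucial change, and the
chunked reading is my own embellishment so a big MP3 needn't sit in
memory all at once:

import sha

def hashit(pth):
    """Takes a file path and returns a SHA-1 hex digest of its bytes."""
    digest = sha.new()
    f = open(pth, 'rb')             # binary mode -- the crucial change
    try:
        while True:
            chunk = f.read(65536)   # hash 64 KB at a time instead of
            if not chunk:           # slurping the whole file
                break
            digest.update(chunk)
    finally:
        f.close()
    return digest.hexdigest()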
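The dup-finding program itself wasn't posted, so the following is only
a guess at its shape: group paths by size, then split each same-size
group on progressively longer prefixes (16 bytes, then the next 32,
then the next 64, and so on), so a file with a unique size is never
opened and a file with a unique prefix is abandoned early:

import os

def groups_of_duplicates(paths):
    """Yield lists of paths whose contents are provably identical,
    reading as few file bytes as possible."""
    by_size = {}
    for pth in paths:
        by_size.setdefault(os.path.getsize(pth), []).append(pth)
    for size, group in by_size.items():
        if len(group) > 1:          # unique size: can't be a duplicate
            for dups in _refine(group, size, 0, 16):
                yield dups

def _refine(group, size, offset, chunk):
    """Split a group of same-size candidates on their next `chunk` bytes."""
    if offset >= size:              # every byte compared: all identical
        yield group
        return
    by_bytes = {}
    for pth in group:
        f = open(pth, 'rb')         # binary mode matters on Windows
        try:
            f.seek(offset)
            key = f.read(chunk)
        finally:
            f.close()
        by_bytes.setdefault(key, []).append(pth)
    for subgroup in by_bytes.values():
        if len(subgroup) > 1:       # singletons are proven unique
            for dups in _refine(subgroup, size, offset + chunk, chunk * 2):
                yield dups

The pruning of singletons at each step is what makes byte counts like
those above possible.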
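And for the persistent variant, a minimal sketch -- the file name
digests.pik and the dict-of-digests layout are assumptions of mine,
not anything from the original post, and it leans on the hashit
sketched earlier:

import os
import cPickle

def seen_before(pth, dbpath='digests.pik'):
    """Answer "same as a file I already have?" from a persistent set
    of known digests, recording the file's digest if it's new."""
    if os.path.exists(dbpath):
        known = cPickle.load(open(dbpath, 'rb'))    # {digest: a path}
    else:
        known = {}
    digest = hashit(pth)            # the chunked hasher sketched above
    if digest in known:
        return True
    known[digest] = pth
    cPickle.dump(known, open(dbpath, 'wb'), 2)
    return False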