This inquiry may turn out to be about the suitability of SHA-1 (160-bit digest) for file identification, about the sha function in Python, or about some error in my script. Any insight is appreciated in advance.

I am trying to reduce duplicate files in storage at home. I have a large number of files (e.g. MP3s) which have been stored on disk multiple times under different names or on different paths. The applications that use them search down from the top path and find the files, so I do not need to worry about keeping track of paths.

All seemed to be working until I examined my log files and found that files with the same SHA digest had different sizes according to os.stat(fpath).st_size. This is on Windows XP.

- Am I expecting too much of SHA-1?
- Is it that the os.stat data on Windows cannot be trusted?
- Or perhaps there is a silly error in my code I should have seen?

Thanks
- Eric

- - - - - - - - - - - - - - - - - -

Log file extract:

Dup: no   Path: F:\music\mp3s\01125.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 01125.mp3  Size: 63006
Dup: YES  Path: F:\music\mp3s\0791.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0791.mp3   Size: 50068
Dup: YES  Path: F:\music\mp3s\12136.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 12136.mp3  Size: 51827
Dup: YES  Path: F:\music\mp3s\11137.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 11137.mp3  Size: 56417
Dup: YES  Path: F:\music\mp3s\0991.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0991.mp3   Size: 59043
Dup: YES  Path: F:\music\mp3s\0591.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0591.mp3   Size: 59162
Dup: YES  Path: F:\music\mp3s\10140.mp3  Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 10140.mp3  Size: 59545
Dup: YES  Path: F:\music\mp3s\0491.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0491.mp3   Size: 63101
Dup: YES  Path: F:\music\mp3s\0392.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0392.mp3   Size: 63252
Dup: YES  Path: F:\music\mp3s\0891.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0891.mp3   Size: 65808
Dup: YES  Path: F:\music\mp3s\0691.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0691.mp3   Size: 67050
Dup: YES  Path: F:\music\mp3s\0294.mp3   Hash: 00b3acb529aae11df186ced8424cb189f062fa48  Name: 0294.mp3   Size: 67710

Code:

# Dedup_inplace.py
# vers .02
# Python 2.4.1
# Create a dictionary consisting of hash:path
# Look for 2nd same hash and delete path

testpath=r"F:\music\mp3s"
logpath=r"C:\testlog6.txt"

import os, sha

def hashit(pth):
    """Takes a file path and returns a SHA hash of its string"""
    fs=open(pth,'r').read()
    sh=sha.new(fs).hexdigest()
    return sh

def logData(d={}, logfile="c://filename999.txt", separator="\n"):
    """Takes a dictionary of values and writes them to the provided file path"""
    logstring=separator.join([str(key)+": "+d[key] for key in d.keys()])+"\n"
    f=open(logfile,'a')
    f.write(logstring)
    f.close()
    return

def walker(topPath):
    fDict={}
    logDict={}
    limit=1000
    freed_space=0
    for root, dirs, files in os.walk(topPath):
        for name in files:
            fpath=os.path.join(root,name)
            fsize=os.stat(fpath).st_size
            fkey=hashit(fpath)
            logDict["Name"]=name
            logDict["Path"]=fpath
            logDict["Hash"]=fkey
            logDict["Size"]=str(fsize)
            if fkey not in fDict.keys():
                fDict[fkey]=fpath
                logDict["Dup"]="no"
            else:
                #os.remove(fpath) --uncomment only when script proven
                logDict["Dup"]="YES"
                freed_space+=fsize
            logData(logDict, logpath, "\t")
            items=len(fDict.keys())
            print "Dict entry: ",items,
            print "Cum freed space: ",freed_space
            if items > limit:
                break
        if items > limit:
            break

def emptyNests(topPath):
    """Walks downward from the given path and deletes any empty directories"""
    for root, dirs, files in os.walk(topPath):
        for d in dirs:
            dpath=os.path.join(root,d)
            if len(os.listdir(dpath))==0:
                print "deleting: ", dpath
                os.rmdir(dpath)

walker(testpath)
emptyNests(testpath)

--
http://mail.python.org/mailman/listinfo/python-list
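
A possible explanation worth checking, offered as an assumption rather than a confirmed diagnosis: hashit() opens each file with mode 'r'. On Windows, Python 2 text mode is understood to treat a Ctrl-Z byte (0x1A) as end of file, so read() may return only the bytes before the first 0x1A. MP3 files that share the same leading bytes would then produce the same digest even though their sizes differ. The sketch below reads in binary mode and keys on both size and digest. It is written for a current Python using hashlib (the later replacement for the sha module); the names hash_file and find_duplicates are hypothetical and not part of the original script.

```python
import hashlib
import os


def hash_file(path, chunk_size=64 * 1024):
    """Return the SHA-1 hex digest of the file's complete byte content."""
    digest = hashlib.sha1()
    # 'rb' is the important part: binary mode reads every byte, whereas
    # Python 2 text mode on Windows is assumed to stop at a 0x1A byte.
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(top_path):
    """Return (duplicate_path, first_seen_path) pairs found under top_path."""
    seen = {}        # (size, digest) -> first path found with that content
    duplicates = []
    for root, dirs, files in os.walk(top_path):
        for name in files:
            path = os.path.join(root, name)
            # Including the size in the key is a cheap extra sanity check.
            key = (os.stat(path).st_size, hash_file(path))
            if key in seen:
                duplicates.append((path, seen[key]))
            else:
                seen[key] = path
    return duplicates
```

Nothing is deleted here; the function only reports candidate pairs, so the result can be compared against the log before any os.remove() call is enabled. Reading in chunks also avoids loading each whole file into memory.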