Re: duplicate file finder (was: how can I make this script shorter?)
On Wed, 23 Feb 2005 01:56:02 -0800, rumours say that Lowell Kirsh [EMAIL PROTECTED] might have written:

Good idea about hashing part of the file before comparing entire files. It will make the script longer but the speed increase will most likely make it worth it.

My dupefind module was one of the first modules I wrote in Python (it was a shell script once...), so it's not appropriate to post. However, rewriting it was on my RSN list; after your post, I said, what the heck :) I wouldn't describe it as /clean/ Python code, so any comments on style, methodology (and of course, bugs!) are most welcome. If you have any ideas how to make it more Pythonic in style, perhaps we can even send it to ASPN. On the upside, it's a good demo of Python dynamism and therefore power. Sorry for the assignment operators lacking spacing at the left; it helped me know what I meant in C, so I kept the habit in Python.

"""dupefinder.py -- a module to find duplicate files.

find_duplicate_files(*paths):
    A function returning an iterable of lists of duplicate files.
    To be used as
        for duplicate_files in dupefinder.find_duplicate_files(dir1, dir2...):
            # process every group of duplicates
"""

import os, md5, errno, sys

IGNORED_ERROR_CODES= frozenset([errno.EACCES, errno.ENOENT])

class File(object):
    """A wrapper for files to facilitate their comparison.

    Interface:
    - .get_hash(level) returns the appropriate key for the current cmp level
    - .filename
    - .size
    """

    __slots__= "filename", "size", "_hashes"

    def __init__(self, filename):
        self.filename= filename
        self.size= os.path.getsize(filename)
        if self.size == 0: # all zero-length files are identical
            self._hashes= [0, None, None]
        else:
            self._hashes= [self.size]

    def __repr__(self):
        return "%s('%s')" % (self.__class__.__name__, self.filename)

    def get_hash(self, level):
        """Return the appropriate key for this level of comparison.

        level == 0 returns the file size
        level == 1 returns the hash of the first few kiB
        level == 2 returns the md5 sum of the whole file
        """
        if level >= len(self._hashes):
            if len(self._hashes) <= 1:
                self._hashes.append(self._get_partial_hash())
            if level >= 2 and len(self._hashes) <= 2:
                self._hashes.append(self._get_full_hash())
        return self._hashes[level]

    def _get_partial_hash(self):
        fp= open(self.filename, "rb")
        try:
            return hash(fp.read(8192))
        finally:
            fp.close()

    def _get_full_hash(self):
        fp= open(self.filename, "rb")
        full_hash= md5.new()
        while 1:
            data= fp.read(65536)
            if not data:
                break
            full_hash.update(data)
        fp.close()
        return full_hash.digest()

class GenericFilesDict(dict):
    """A virtual wrapper for the dictionaries of file comparisons.
    Not to be used on its own, but through subclassing.

    Each subclass should define a _level class attribute to be used with the
    File.get_hash method, and a _next_class attribute pointing to the class
    managing the next level of comparisons.
    """

    __slots__= ()

    def add_file(self, fileobj):
        """Add a File object to self, keyed by the appropriate key based on
        self._level.  If another file object exists in the same slot, replace
        it by a _next_class instance containing the pre-existing and the new
        file objects."""
        this_hash= fileobj.get_hash(self._level)
        try:
            result= self[this_hash]
        except KeyError:
            self[this_hash]= fileobj
            return
        try: # there was something in slot [this_hash]
            result.add_file(fileobj) # so add fileobj to it
        except AttributeError: # it has no add_file method, so it's a File
            _= self[this_hash]= self._next_class() # make an instance
            _.add_file(result) # add the pre-existing object
            _.add_file(fileobj) # add the new one

    def __repr__(self):
        return "%s%s" % (self.__class__.__name__, dict.__repr__(self))

    def __iter__(self):
        """Return all instances of SameFiles (subclass of list)."""
        for item, value in self.iteritems():
            try:
                _next_class= value._next_class
            except AttributeError: # a lone File object: no duplicates
                continue
            if _next_class is None:
                yield value
            else:
                for item in value:
                    yield item

class SameFiles(list):
    """A list of duplicate files."""

    __slots__= ()
    _level= 3
    _next_class= None # the search stops here

    def add_file(self, fileobj):
        self.append(fileobj)

class FilesByFullHash(GenericFilesDict):
    """A dictionary keyed on the md5 hash of the whole file.

    The algorithm assures that all File objects in this dict
    have the same size and the same hash of the first few kiB.
    """

    __slots__= ()
    _level= 2
    _next_class= SameFiles
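The add_file method is the dynamism being advertised: a dict slot holds a bare File until a second file produces the same key, at which point the slot is promoted to an instance of the next comparison class. The idiom is easier to see stripped of the file machinery; a minimal standalone sketch in Python 3 (the Bucket/PromotingDict names are mine, not the module's):

```python
# A sketch of the collision-promotion idiom used by GenericFilesDict.add_file:
# a dict slot holds a single bare item until a second item hashes to the same
# key, at which point the slot is promoted to a container holding both.

class Bucket(list):
    """Container a slot is promoted to on the first collision."""
    def add(self, item):
        self.append(item)

class PromotingDict(dict):
    def add(self, key, item):
        try:
            existing = self[key]
        except KeyError:           # empty slot: store the bare item
            self[key] = item
            return
        try:                       # occupied slot: try delegating
            existing.add(item)
        except AttributeError:     # bare item has no .add -> promote it
            bucket = self[key] = Bucket()
            bucket.add(existing)   # re-add the pre-existing item
            bucket.add(item)       # then the new one

d = PromotingDict()
d.add(4, "spam.txt")               # slot 4 holds a bare string
d.add(4, "eggs.txt")               # collision: slot 4 becomes a Bucket
d.add(7, "ham.txt")
print(d[4])                        # ['spam.txt', 'eggs.txt']
print(d[7])                        # ham.txt
```

Delegating via duck typing (try .add, catch AttributeError) is what lets the same slot hold either a bare item or a container without any explicit type checks.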
Re: how can I make this script shorter?
Good idea about hashing part of the file before comparing entire files. It will make the script longer but the speed increase will most likely make it worth it.

Lowell

Christos TZOTZIOY Georgiou wrote:

On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh [EMAIL PROTECTED] might have written:

I have a script which I use to find all duplicates of files within a given directory and all its subdirectories. It seems like it's longer than it needs to be but I can't figure out how to shorten it. Perhaps there are some python features or libraries I'm not taking advantage of. The way it works is that it puts references to all the files in a dictionary with file size being the key. The dictionary can hold multiple values per key. Then it looks at each key and all the associated files (which are the same size). Then it uses filecmp to see if they are actually byte-for-byte copies. It's not 100% complete but it's pretty close.

I can't advise on code length; my dupefind.py script is 361 lines, but the algorithm is slightly more complex to speed things up (and it also optionally hardlinks identical files on POSIX and NTFS filesystems). If in your case there are lots of files of several MiB each, that often match on size, you could avoid lots of comparisons if you did match based on some hash (md5 or sha). You could also compare first on the hash of the first few kiB (I check 8 kiB) to see if you need to read the whole file or not. So:

    for every file:
        if other files exist with the same size:
            calculate hash for first few kiB
            if file has same initial hash with other files:
                calculate full hash
    return all files with same full hash

Something like that.

--
http://mail.python.org/mailman/listinfo/python-list
Re: how can I make this script shorter?
Thanks for the advice. There are definitely some performance issues I hadn't thought of before. I guess it's time to go lengthen, not shorten, the script.

Lowell

John Machin wrote:

Lowell Kirsh wrote:

I have a script which I use to find all duplicates of files within a given directory and all its subdirectories. It seems like it's longer than it needs to be but I can't figure out how to shorten it. Perhaps there are some python features or libraries I'm not taking advantage of. The way it works is that it puts references to all the files in a dictionary with file size being the key. The dictionary can hold multiple values per key. Then it looks at each key and all the associated files (which are the same size). Then it uses filecmp to see if they are actually byte-for-byte copies. It's not 100% complete but it's pretty close.

Lowell

To answer the question in the message subject: 1,$d

And that's not just the completely po-faced literal answer that the question was begging for: why write something when it's already been done? Try searching this newsgroup; there was a discussion on this very topic only a week ago, during which the effbot provided the URL of an existing python file duplicate detector. There seems to be a discussion every so often ...

However, if you persist in DIY, read the discussions in this newsgroup, search the net (people have implemented this functionality in other languages), and think about some general principles -- like whether you should use a hash (e.g. SHA-n where n is a suitably large number). If there are N files all of the same size, you have two options: (a) do O(N**2) file comparisons, or (b) do N hash calcs followed by O(N**2) hash comparisons; then, depending on your need/whim/costs-of-false-negatives/positives, you can stop there or do the file comparisons on the ones which match on hashes.

You do however need to consider that calculating the hash involves reading the whole file, whereas comparing two files can stop when a difference is detected. Also, do you understand and are you happy with using the (default) shallow option of filecmp.cmp()?
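That last question deserves emphasis. With the default shallow=True, filecmp.cmp() treats two regular files as equal whenever their os.stat() signatures (file type, size, mtime) match, without reading a single byte. A small Python 3 sketch (the temp-file names are illustrative):

```python
import filecmp, os, tempfile

# Two files with the same size but different contents.
d = tempfile.mkdtemp()
a = os.path.join(d, "a.bin")
b = os.path.join(d, "b.bin")
with open(a, "wb") as f:
    f.write(b"spam")
with open(b, "wb") as f:
    f.write(b"eggs")

# Force identical (atime, mtime) so the stat signatures match exactly.
os.utime(a, (1000000000, 1000000000))
os.utime(b, (1000000000, 1000000000))

print(filecmp.cmp(a, b))                 # True  -- stat signatures only
print(filecmp.cmp(a, b, shallow=False))  # False -- contents actually read
```

So for a duplicate finder, which deliberately feeds filecmp files of the same size, shallow=False (or a content hash) is the safe choice.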
Re: how can I make this script shorter?
On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh [EMAIL PROTECTED] might have written:

I have a script which I use to find all duplicates of files within a given directory and all its subdirectories. It seems like it's longer than it needs to be but I can't figure out how to shorten it. Perhaps there are some python features or libraries I'm not taking advantage of. The way it works is that it puts references to all the files in a dictionary with file size being the key. The dictionary can hold multiple values per key. Then it looks at each key and all the associated files (which are the same size). Then it uses filecmp to see if they are actually byte-for-byte copies. It's not 100% complete but it's pretty close.

I can't advise on code length; my dupefind.py script is 361 lines, but the algorithm is slightly more complex to speed things up (and it also optionally hardlinks identical files on POSIX and NTFS filesystems). If in your case there are lots of files of several MiB each, that often match on size, you could avoid lots of comparisons if you did match based on some hash (md5 or sha). You could also compare first on the hash of the first few kiB (I check 8 kiB) to see if you need to read the whole file or not. So:

    for every file:
        if other files exist with the same size:
            calculate hash for first few kiB
            if file has same initial hash with other files:
                calculate full hash
    return all files with same full hash

Something like that.

--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
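That sketch translates to quite little modern Python. Below is a hedged, self-contained take on it (Python 3, so hashlib instead of the old md5 module; the 8 KiB partial read follows the suggestion above, and the function name merely mirrors the pseudocode, not the poster's actual dupefind.py):

```python
import hashlib, os
from collections import defaultdict

PARTIAL = 8192  # hash only the first few kiB on the second pass

def _partial_hash(path):
    # Cheap filter: digest of the first PARTIAL bytes only.
    with open(path, "rb") as fp:
        return hashlib.md5(fp.read(PARTIAL)).digest()

def _full_hash(path):
    # Expensive confirmation: digest of the whole file, read in chunks.
    h = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            h.update(chunk)
    return h.digest()

def find_duplicate_files(*paths):
    """Yield lists of paths whose contents are identical."""
    by_size = defaultdict(list)
    for root in paths:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                by_size[os.path.getsize(p)].append(p)
    for group in by_size.values():
        if len(group) < 2:
            continue                      # unique size: cannot have a twin
        by_partial = defaultdict(list)
        for p in group:
            by_partial[_partial_hash(p)].append(p)
        for subgroup in by_partial.values():
            if len(subgroup) < 2:
                continue                  # unique initial hash: done
            by_full = defaultdict(list)
            for p in subgroup:
                by_full[_full_hash(p)].append(p)
            for dupes in by_full.values():
                if len(dupes) > 1:
                    yield dupes
```

Each level only runs on groups that survived the previous, cheaper one, so most files never get read at all.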
how can I make this script shorter?
Lowell Kirsh wrote:

I have a script which I use to find all duplicates of files within a given directory and all its subdirectories. It seems like it's longer than it needs to be but I can't figure out how to shorten it. Perhaps there are some python features or libraries I'm not taking advantage of. The way it works is that it puts references to all the files in a dictionary with file size being the key. The dictionary can hold multiple values per key. Then it looks at each key and all the associated files (which are the same size). Then it uses filecmp to see if they are actually byte-for-byte copies. It's not 100% complete but it's pretty close.

Lowell

To answer the question in the message subject: 1,$d

And that's not just the completely po-faced literal answer that the question was begging for: why write something when it's already been done? Try searching this newsgroup; there was a discussion on this very topic only a week ago, during which the effbot provided the URL of an existing python file duplicate detector. There seems to be a discussion every so often ...

However if you persist in DIY, read the discussions in this newsgroup, search the net (people have implemented this functionality in other languages); think about some general principles -- like should you use a hash (e.g. SHA-n where n is a suitably large number). If there are N files all of the same size, you have two options: (a) do O(N**2) file comparisons or (b) do N hash calcs followed by O(N**2) hash comparisons; then deciding on your need/whim/costs-of-false-negatives/positives you can stop there or you can do the file comparisons on the ones which match on hashes.

You do however need to consider that calculating the hash involves reading the whole file, whereas comparing two files can stop when a difference is detected. Also, do you understand and are you happy with using the (default) shallow option of filecmp.cmp()?
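The point about early termination is worth a concrete sketch: a direct chunk-by-chunk comparison can return as soon as one chunk differs, whereas any hash must read both files in full. A minimal Python 3 illustration (the helper name files_equal is mine):

```python
# Compare two files chunk by chunk, stopping at the first difference --
# the property that can make direct comparison cheaper than hashing
# when files of the same size usually differ early.
def files_equal(path_a, path_b, chunk=65536):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            ca = fa.read(chunk)
            cb = fb.read(chunk)
            if ca != cb:
                return False   # early exit: no need to read the rest
            if not ca:
                return True    # both files exhausted together
```

Hashing pays off instead when each file must be compared against many candidates: N reads replace O(N**2) pairwise reads.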