Re: duplicate file finder (was: how can I make this script shorter?)

2005-02-24 Thread TZOTZIOY
On Wed, 23 Feb 2005 01:56:02 -0800, rumours say that Lowell Kirsh
[EMAIL PROTECTED] might have written:

Good idea about hashing part of the file before comparing entire files. 
It will make the script longer but the speed increase will most likely 
make it worth it.

My dupefind module was one of the first modules I wrote in Python (it was a
shell script once...), so it's not really appropriate to post as-is.  However,
rewriting it was on my RSN (Real Soon Now) list; after your post, I said, what the heck :)

I wouldn't describe it as /clean/ Python code, so any comments on style,
methodology (and of course, bugs!) are most welcome.  If you have any ideas on
how to make it more Pythonic in style, perhaps we can even send it to ASPN.
On the upside, it's a good demo of Python's dynamism, and therefore of its power.

Apologies for the assignment operators lacking spacing at the left; that habit
helped me know what I meant in C, so I kept it in Python.



"""dupefinder.py -- a module to find duplicate files.

find_duplicate_files(*paths):
    A function returning an iterable of lists of duplicate files.
    To be used as

        for duplicate_files in dupefinder.find_duplicate_files(dir1, dir2...):
            # process every group of duplicates
"""

import os, md5, errno, sys

IGNORED_ERROR_CODES= frozenset( [errno.EACCES, errno.ENOENT] )

class File(object):
    """A wrapper for files to facilitate their comparison.

    Interface:
    -   .get_hash(level) returns the appropriate key for the current cmp level
    -   .filename
    -   .size
    """

    __slots__= 'filename', 'size', '_hashes'

    def __init__(self, filename):
        self.filename= filename
        self.size= os.path.getsize(filename)
        if self.size == 0:
            # all empty files are identical; their contents never need reading
            self._hashes= [0, None, None]
        else:
            self._hashes= [self.size]

    def __repr__(self):
        return "%s('%s')" % (self.__class__.__name__, self.filename)

    def get_hash(self, level):
        """Return the appropriate key for this level of comparison.
        level == 0 returns the file size
        level == 1 returns the hash of the first few kiB
        level == 2 returns the md5 digest of the whole file"""
        if level >= len(self._hashes):
            if len(self._hashes) == 1:
                self._hashes.append(self._get_partial_hash())
            if level >= 2 and len(self._hashes) == 2:
                self._hashes.append(self._get_full_hash())
        return self._hashes[level]

    def _get_partial_hash(self):
        # cheap first-pass discriminator: hash of the first 8 kiB only
        fp= open(self.filename, 'rb')
        try:
            return hash(fp.read(8192))
        finally:
            fp.close()

    def _get_full_hash(self):
        # md5 digest of the whole file, read in 64 kiB chunks
        fp= open(self.filename, 'rb')
        full_hash= md5.new()
        while 1:
            data= fp.read(65536)
            if not data: break
            full_hash.update(data)
        fp.close()
        return full_hash.digest()

class GenericFilesDict(dict):
    """A virtual wrapper for the dictionaries of file comparisons.
    Not to be used on its own, but through subclassing.

    Each subclass should define a _level class attribute to be used with the
    File.get_hash method, and a _next_class attribute pointing to the class
    managing the next level of comparisons."""
    __slots__= ()

    def add_file(self, fileobj):
        """Add a File object to self, keyed by the appropriate key based on
        self._level.  If another File object already exists in the same slot,
        replace it by a _next_class instance containing the pre-existing and
        the new file objects."""

        this_hash= fileobj.get_hash(self._level)

        try:
            result= self[this_hash]
        except KeyError:
            self[this_hash]= fileobj
            return

        try: # there was something in slot [this_hash]
            result.add_file(fileobj) # so add fileobj to it
        except AttributeError: # it has no add_file method, so it's a File
            _= self[this_hash]= self._next_class() # make an instance
            _.add_file(result) # add the pre-existing file
            _.add_file(fileobj) # add the new one

    def __repr__(self):
        return "%s%s" % (self.__class__.__name__, dict.__repr__(self))

    def __iter__(self):
        """Yield every SameFiles instance (i.e. every group of duplicates)."""
        for value in self.itervalues():
            try:
                _next_class= value._next_class
            except AttributeError: # a lone File object: no duplicates here
                continue
            if _next_class is None: # value is a SameFiles group
                yield value
            else: # value is a nested dict: recurse into it
                for group in value:
                    yield group

class SameFiles(list):
    """A list of duplicate files."""
    __slots__= ()
    _level= 3
    _next_class= None # search stops here

    def add_file(self, fileobj):
        self.append(fileobj)

class FilesByFullHash(GenericFilesDict):
    """A dictionary keyed on the md5 hash of the whole file.

    The algorithm assures that all File objects in this dict
    have the same size and the same hash of the first few kiB."""
    __slots__= ()
    _level= 2
    _next_class= SameFiles

class 
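
The archived message breaks off at this point.  Judging from the pattern of the
classes above and from the module docstring, the missing remainder presumably
defined the two lower comparison levels and the top-level find_duplicate_files()
function, roughly along the lines below.  This is a hedged reconstruction in the
same Python 2 style as the rest of the module: the class names FilesByPartialHash
and FilesBySize and the body of find_duplicate_files are my assumptions, not the
author's actual code.

class FilesByPartialHash(GenericFilesDict):
    """A dictionary keyed on the hash of the first few kiB of the file.

    The algorithm assures that all File objects in this dict have the
    same size."""
    __slots__= ()
    _level= 1
    _next_class= FilesByFullHash

class FilesBySize(GenericFilesDict):
    """A dictionary keyed on file size; the top level of the comparison."""
    __slots__= ()
    _level= 0
    _next_class= FilesByPartialHash

def find_duplicate_files(*paths):
    """Walk the given paths and yield every group of duplicate files."""
    files= FilesBySize()
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for name in filenames:
                fullname= os.path.join(dirpath, name)
                try:
                    files.add_file(File(fullname))
                except (IOError, OSError), exc: # Python 2 except syntax
                    if exc.errno not in IGNORED_ERROR_CODES:
                        raise
    return iter(files)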

Re: how can I make this script shorter?

2005-02-23 Thread Lowell Kirsh
Good idea about hashing part of the file before comparing entire files. 
It will make the script longer but the speed increase will most likely 
make it worth it.

Lowell
Christos TZOTZIOY Georgiou wrote:
On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh
[EMAIL PROTECTED] might have written:

I have a script which I use to find all duplicates of files within a 
given directory and all its subdirectories. It seems like it's longer 
than it needs to be but I can't figure out how to shorten it. Perhaps 
there are some python features or libraries I'm not taking advantage of.

The way it works is that it puts references to all the files in a 
dictionary with file size being the key. The dictionary can hold 
multiple values per key. Then it looks at each key and all the 
associated files (which are the same size). Then it uses filecmp to see 
if they are actually byte-for-byte copies.

It's not 100% complete but it's pretty close.

I can't advise on code length; my dupefind.py script is 361 lines, but the
algorithm is slightly more complex to speed things up (and it also optionally
hardlinks identical files on POSIX and NTFS filesystems).  If in your case there
are lots of files of several MiB each, that often match on size, you could avoid
lots of comparisons if you did match based on some hash (md5 or sha).
You could also compare first on the hash of the first few kiB (I check 8 kiB) to
see if you need to read the whole file or not.
So:
for every file:
  if other files exist with the same size:
calculate hash for first few kiB
if file has same initial hash with other files:
  calculate full hash
  return all files with same full hash
Something like that.
--
http://mail.python.org/mailman/listinfo/python-list


Re: how can I make this script shorter?

2005-02-23 Thread Lowell Kirsh
Thanks for the advice. There are definitely some performance issues I 
hadn't thought of before. I guess it's time to go lengthen, not shorten, 
the script.

Lowell
John Machin wrote:
Lowell Kirsh wrote:
I have a script which I use to find all duplicates of files within a
given directory and all its subdirectories. It seems like it's longer
than it needs to be but I can't figure out how to shorten it. Perhaps
there are some python features or libraries I'm not taking advantage of.

The way it works is that it puts references to all the files in a
dictionary with file size being the key. The dictionary can hold
multiple values per key. Then it looks at each key and all the
associated files (which are the same size). Then it uses filecmp to see
if they are actually byte-for-byte copies.

It's not 100% complete but it's pretty close.

Lowell

To answer the question in the message subject: 1,$d
And that's not just the completely po-faced literal answer that the
question was begging for: why write something when it's already been
done? Try searching this newsgroup; there was a discussion on this very
topic only a week ago, during which the effbot provided the URL of an
existing python file duplicate detector. There seems to be a discussion
every so often ...
However if you persist in DIY, read the discussions in this newsgroup,
search the net (people have implemented this functionality in other
languages); think about some general principles -- like should you use
a hash (e.g. SHA-n where n is a suitably large number). If there are N
files all of the same size, you have two options (a) do O(N**2) file
comparisons or (b) do N hash calcs followed by O(N**2) hash
comparisons; then deciding on your
need/whim/costs-of-false-negatives/positives you can stop there or you
can do the file comparisons on the ones which match on hashes. You do
however need to consider that calculating the hash involves reading the
whole file, whereas comparing two files can stop when a difference is
detected. Also, do you understand and are you happy with using the
(default) shallow option of filecmp.cmp()?
--
http://mail.python.org/mailman/listinfo/python-list


Re: how can I make this script shorter?

2005-02-22 Thread TZOTZIOY
On Tue, 22 Feb 2005 00:34:39 -0800, rumours say that Lowell Kirsh
[EMAIL PROTECTED] might have written:

I have a script which I use to find all duplicates of files within a 
given directory and all its subdirectories. It seems like it's longer 
than it needs to be but I can't figure out how to shorten it. Perhaps 
there are some python features or libraries I'm not taking advantage of.

The way it works is that it puts references to all the files in a 
dictionary with file size being the key. The dictionary can hold 
multiple values per key. Then it looks at each key and all the 
associated files (which are the same size). Then it uses filecmp to see 
if they are actually byte-for-byte copies.

It's not 100% complete but it's pretty close.

I can't advise on code length; my dupefind.py script is 361 lines, but the
algorithm is slightly more complex to speed things up (and it also optionally
hardlinks identical files on POSIX and NTFS filesystems).  If in your case there
are lots of files of several MiB each, that often match on size, you could avoid
lots of comparisons if you did match based on some hash (md5 or sha).

You could also compare first on the hash of the first few kiB (I check 8 kiB) to
see if you need to read the whole file or not.

So:

for every file:
  if other files exist with the same size:
calculate hash for first few kiB
if file has same initial hash with other files:
  calculate full hash
  return all files with same full hash

Something like that.
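
For what it's worth, here is a minimal, standalone sketch of that three-step
scheme: group by size first (no file contents read), then by a hash of the
first 8 kiB, then by a full-file hash.  The function and helper names are made
up for illustration, it uses hashlib (the modern successor of the md5 module
used elsewhere in this thread), and handling of unreadable files is mostly
omitted:

import os, hashlib
from collections import defaultdict

def _digest(path, limit=None):
    # md5 of the first `limit` bytes, or of the whole file when limit is None
    h = hashlib.md5()
    with open(path, 'rb') as fp:
        if limit is not None:
            h.update(fp.read(limit))
        else:
            while True:
                chunk = fp.read(65536)
                if not chunk:
                    break
                h.update(chunk)
    return h.digest()

def find_duplicates(root):
    # step 1: group candidate files by size (cheap, uses only os.stat data)
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # unreadable entry (e.g. broken symlink): skip it
    # step 2: within each size group, group by a hash of the first 8 kiB
    # step 3: within each of those groups, group by a hash of the whole file
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_partial = defaultdict(list)
        for path in paths:
            by_partial[_digest(path, 8192)].append(path)
        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for path in candidates:
                by_full[_digest(path)].append(path)
            for group in by_full.values():
                if len(group) > 1:
                    yield group

Grouping by the cheap keys first means the expensive whole-file hash is only
ever computed for files that already agree on size and on their first 8 kiB.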
-- 
TZOTZIOY, I speak England very best.
Be strict when sending and tolerant when receiving. (from RFC1958)
I really should keep that in mind when talking with people, actually...
-- 
http://mail.python.org/mailman/listinfo/python-list


how can I make this script shorter?

2005-02-22 Thread John Machin

Lowell Kirsh wrote:
 I have a script which I use to find all duplicates of files within a
 given directory and all its subdirectories. It seems like it's longer
 than it needs to be but I can't figure out how to shorten it. Perhaps
 there are some python features or libraries I'm not taking advantage of.

 The way it works is that it puts references to all the files in a
 dictionary with file size being the key. The dictionary can hold
 multiple values per key. Then it looks at each key and all the
 associated files (which are the same size). Then it uses filecmp to see
 if they are actually byte-for-byte copies.

 It's not 100% complete but it's pretty close.

 Lowell

To answer the question in the message subject: 1,$d

And that's not just the completely po-faced literal answer that the
question was begging for: why write something when it's already been
done? Try searching this newsgroup; there was a discussion on this very
topic only a week ago, during which the effbot provided the URL of an
existing python file duplicate detector. There seems to be a discussion
every so often ...

However if you persist in DIY, read the discussions in this newsgroup,
search the net (people have implemented this functionality in other
languages); think about some general principles -- like should you use
a hash (e.g. SHA-n where n is a suitably large number). If there are N
files all of the same size, you have two options (a) do O(N**2) file
comparisons or (b) do N hash calcs followed by O(N**2) hash
comparisons; then deciding on your
need/whim/costs-of-false-negatives/positives you can stop there or you
can do the file comparisons on the ones which match on hashes. You do
however need to consider that calculating the hash involves reading the
whole file, whereas comparing two files can stop when a difference is
detected. Also, do you understand and are you happy with using the
(default) shallow option of filecmp.cmp()?
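
On that last point: the shallow flag decides whether filecmp.cmp() trusts the
os.stat() signatures or actually reads the files.  A tiny illustration (the
file names are placeholders):

import filecmp

# shallow=True (the default): two files whose os.stat() signatures
# (type, size, mtime) match are reported equal without reading their contents.
maybe_equal = filecmp.cmp('a.dat', 'b.dat')

# shallow=False: do a byte-for-byte comparison of the contents, which can
# stop early at the first differing block.
really_equal = filecmp.cmp('a.dat', 'b.dat', shallow=False)

For a duplicate finder, only shallow=False (or an explicit hash or compare
step) actually guarantees byte-for-byte identity.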

-- 
http://mail.python.org/mailman/listinfo/python-list