Re: fdups: calling for beta testers

2005-02-27 Thread Patrick Useldinger
John Machin wrote:
>> I've tested it intensively
> Famous Last Words :-)

;-)

> (1) Manic s/w producing lots of files all the same size: the Borland
> C[++] compiler produces a debug symbol file (.tds) that's always
> 384KB; I have 144 of these on my HD, rarely more than 1 in the same
> directory.

Not sure what you want me to do about it. I've decreased the minimum
block size once more, to accommodate more files of the same length
without increasing the total amount of memory used.

> (2) There appears to be a flaw in your logic such that it will find
> duplicates only if they are in the *SAME* directory and only when
> there are no other directories with two or more files of the same
> size.

Ooops...
A really stupid mistake on my side. Corrected.

> (3) Your fdups-check gadget doesn't work on Windows; the commands
> module works only on Unix but is supplied with Python on all
> platforms. The results might just confuse a newbie:
> Why not use the Python filecmp module?

Done. It's also faster AND it works better. Thanks for the suggestion.
Please fetch the new version from http://www.homepages.lu/pu/fdups.html.
-pu
--
http://mail.python.org/mailman/listinfo/python-list
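
For readers wondering what the filecmp suggestion looks like in practice, here is a minimal sketch. It is not the code from fdups-check; the helper name same_content is made up for illustration, and it uses Python 3 syntax rather than the Python 2 of the thread.

```python
import filecmp


def same_content(path_a, path_b):
    """Return True if both files hold identical bytes (hypothetical helper)."""
    # shallow=False forces a comparison of the file contents instead of
    # trusting matching os.stat() signatures (type, size, mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Because filecmp is pure Python and part of the standard library, a check written this way should behave the same on Windows and Unix, which the commands module does not.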


Re: fdups: calling for beta testers

2005-02-26 Thread Peter Hansen
Patrick Useldinger wrote:
>> (9) Any good reason why the executables don't have .py extensions
>> on their names?
> (9) Because I am lazy and Linux doesn't care. I suppose Windows does?

Unfortunately, yes.  Windows has nothing like the x permission
bit, so you have to have an actual extension on the filename and
Windows (XP anyway) will check it against the list of extensions
in the PATHEXT environment variable to determine if it should be
treated like an executable.

Otherwise you must type python and the full filename.
-Peter
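
To make the PATHEXT check concrete, the sketch below tests whether a filename's extension appears in a PATHEXT-style list. The function name is hypothetical, and whether .PY is actually listed depends on how Python was installed on the machine, so the example value is only an assumption.

```python
import os


def is_listed_in_pathext(filename, pathext=None):
    """Return True if filename's extension appears in PATHEXT (illustrative)."""
    if pathext is None:
        # On non-Windows systems PATHEXT is normally unset, so this is empty.
        pathext = os.environ.get("PATHEXT", "")
    # PATHEXT is a semicolon-separated list such as ".COM;.EXE;.BAT".
    extensions = [ext.lower() for ext in pathext.split(";") if ext]
    return os.path.splitext(filename)[1].lower() in extensions
```

With an assumed value of ".COM;.EXE;.PY", the name fdups.py would qualify while the extensionless fdups would not, which matches the behaviour described above.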


Re: fdups: calling for beta testers

2005-02-26 Thread Serge Orlov
Peter Hansen wrote:
> Patrick Useldinger wrote:
>>> (9) Any good reason why the executables don't have .py
>>> extensions on their names?
>>
>> (9) Because I am lazy and Linux doesn't care. I suppose Windows does?
>
> Unfortunately, yes.  Windows has nothing like the x permission
> bit, so you have to have an actual extension on the filename and
> Windows (XP anyway) will check it against the list of extensions
> in the PATHEXT environment variable to determine if it should be
> treated like an executable.
>
> Otherwise you must type python and the full filename.

Or use exemaker, which IMHO is the best way to handle this
problem.

  Serge.




Re: fdups: calling for beta testers

2005-02-26 Thread Patrick Useldinger
Serge Orlov wrote:
> Or use exemaker, which IMHO is the best way to handle this
> problem.

Looks good, but I do not use Windows.
-pu


Re: fdups: calling for beta testers

2005-02-26 Thread John Machin
On Sat, 26 Feb 2005 23:53:10 +0100, Patrick Useldinger wrote:

> I've tested it intensively

Famous Last Words :-)

Thanks for your feedback!

Here's some more:

(1) Manic s/w producing lots of files all the same size: the Borland
C[++] compiler produces a debug symbol file (.tds) that's always
384KB; I have 144 of these on my HD, rarely more than 1 in the same
directory.

Here's a snippet from a duplicate detection run:

DUP|393216|2|\devel\delimited\build\lib.win32-1.5\delimited.tds|\devel\delimited\build\lib.win32-2.1\delimited.tds
DUP|393216|2|\devel\delimited\build\lib.win32-2.3\delimited.tds|\devel\delimited\build\lib.win32-2.4\delimited.tds

(2) There appears to be a flaw in your logic such that it will find
duplicates only if they are in the *SAME* directory and only when
there are no other directories with two or more files of the same
size. The above duplicates were detected only when I made the
following changes to your script:


--- fdups   Sat Feb 26 06:41:36 2005
+++ fdups_jm.py Sun Feb 27 12:18:04 2005
@@ -29,13 +29,14 @@
         self.count = self.totalsize = self.inodecount = self.slinkcount = 0
         self.gain  = self.bytescompared = self.bytesread  = self.inodecount = 0
         for toplevel in args:
-            os.path.walk(toplevel, self.buildList, None)
+            os.path.walk(toplevel, self.updateDict, None)
         if self.count > 0:
             self.compare()

-    def buildList(self,arg,dirpath,namelist):
-        """ build a dictionnary of files to be analysed, indexed by length """
-        files = {}
+    def updateDict(self,arg,dirpath,namelist):
+        """ update a dictionary of files to be analysed, indexed by length """
+        # files = {}
+        files = self.compfiles
         for filepath in namelist:
             fullpath = os.path.join(dirpath,filepath)
             if os.path.isfile(fullpath):
@@ -51,20 +52,23 @@
                 if  size >= MIN_FILESIZE:
                     self.count += 1
                     self.totalsize += size
+                    # is above totalling in the wrong place?
                     if size not in files:
                         files[size]=[fullpath]
                     else:
                         files[size].append(fullpath)
-        for size in files:
-            if len(files[size]) != 1:
-                self.compfiles[size]=files[size]
+        # for size in files:
+        #     if len(files[size]) != 1:
+        #         self.compfiles[size]=files[size]

     def compare(self):
         """ compare all files of the same size  - outer loop """
         sizes=self.compfiles.keys()
         sizes.sort()
         for size in sizes:
-            self.comparefiles(size,self.compfiles[size])
+            list_of_filenames = self.compfiles[size]
+            if len(list_of_filenames) > 1:
+                self.comparefiles(size, list_of_filenames)

     def comparefiles(self,size,filelist):
         """ compare all files of the same size  - inner loop """
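
The idea behind the patch above can be sketched on its own: keep one size-indexed dictionary for the entire walk, and discard single-entry sizes only after every directory has been visited. This is an illustrative Python 3 rewrite, not the fdups code; group_by_size and min_size are made-up names, and os.walk stands in for the os.path.walk callback used in the original.

```python
import os
from collections import defaultdict


def group_by_size(toplevels, min_size=1):
    """Map file size -> list of paths, keeping only sizes seen more than once."""
    by_size = defaultdict(list)  # one dict shared across every directory
    for top in toplevels:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Skip symbolic links and anything that is not a regular file.
                if os.path.islink(full) or not os.path.isfile(full):
                    continue
                size = os.path.getsize(full)
                if size >= min_size:
                    by_size[size].append(full)
    # Filter only after the whole walk, so that same-size files living in
    # different directories still end up in the same candidate group.
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

Filtering per directory, as the unpatched script did, is what hid duplicates that lived in different directories.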


(3) Your fdups-check gadget doesn't work on Windows; the commands
module works only on Unix but is supplied with Python on all
platforms. The results might just confuse a newbie:

(1, "'{' is not recognized as an internal or external
command,\noperable program or batch file.")

Why not use the Python filecmp module?

Cheers,
John


fdups: calling for beta testers

2005-02-25 Thread Patrick Useldinger
Hi all,
I am looking for beta-testers for fdups.
fdups is a program to detect duplicate files on locally mounted 
filesystems. Files are considered equal if their content is identical, 
regardless of their filename. Also, fdups ignores symbolic links and is 
able to detect and ignore hardlinks, where available.
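
One common way to recognise hardlinks, shown here only as a sketch and not necessarily what fdups does, is to compare the (device, inode) pair reported by os.stat. The helper name drop_hardlinks is hypothetical, and the approach assumes the filesystem reports meaningful inode numbers.

```python
import os


def drop_hardlinks(paths):
    """Keep one path per underlying file, identified by (st_dev, st_ino)."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # another name for a file we have already kept
        seen.add(key)
        unique.append(path)
    return unique
```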

In contrast to similar programs, fdups does not rely on md5 sums or 
other hash functions to detect potentially identical files. Instead, it 
does a direct blockwise comparison and stops reading as soon as 
possible, thus keeping file reads to a minimum.
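
A minimal two-file sketch of such a blockwise comparison follows. fdups itself compares whole groups of same-size files at once, so this is a simplification; the function name and the 8192-byte default block size are assumptions made for illustration.

```python
def files_equal_blockwise(path_a, path_b, blocksize=8192):
    """Compare two files block by block, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False  # no need to read any further
            if not block_a:
                return True   # both files ended at the same point
```

When two files differ in their first block, only that one block of each is read, which is the saving the announcement describes compared with hashing every file in full.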

fdups has been developed on Linux but should run on all platforms that 
support Python.

fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where 
you'll also find a link to download the tar.

I am primarily interested in getting feedback if it produces correct 
results. But as I haven't been programming in Python for a year or so, 
I'd also be interested in comments on code if you happen to look at it 
in detail.

Your help is much appreciated.
-pu


Re: fdups: calling for beta testers

2005-02-25 Thread John Machin

Patrick Useldinger wrote:

> fdups' homepage is at http://www.homepages.lu/pu/fdups.html, where
> you'll also find a link to download the tar.


fdups has no installation program. Just change into a temporary
directory, and type tar xfj fdups.tar.bz. You should also chown the
files according to your needs, and then copy the executables to your
PATH.

(1) It's actually .bz2, not .bz (2) Why annoy people with the
not-widely-known bzip2 format just to save a few % of a 12KB file?? (3)
Typing that on Windows command line doesn't produce a useful result (4)
Haven't you heard of distutils?

(5) if files[subgroup[j]]['flag'] and files[subgroup[i]]['buffer'] ==
files[subgroup[j]]['buffer']:

That's not the most readable code I've ever seen.

(6) You are keeping open handles for all files of a given size -- have
you actually considered the possibility of an exception like this:
IOError: [Errno 24] Too many open files: 'foo509'

Once upon a time, max 20 open files was considered as generous as 640KB
of memory. Looks like Bill thinks 512 (open files, that is) is about
right these days.
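
One possible way around the open-handle limit, sketched here under the assumption that slower I/O is acceptable, is to open each file only for the moment a block is needed. Neither function below is from fdups; both names are hypothetical.

```python
def read_block(path, offset, blocksize):
    """Open the file, read one block at the given offset, and close it again."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(blocksize)


def split_by_block(paths, offset, blocksize=8192):
    """Group paths whose block at `offset` is identical.

    Only one file handle is open at any time, however many paths there are.
    """
    groups = {}
    for path in paths:
        block = read_block(path, offset, blocksize)
        groups.setdefault(block, []).append(path)
    return list(groups.values())
```

The trade-off is an extra open and seek per block per file, so a real implementation might instead cap the number of handles kept open and fall back to this only above that cap.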

(7)

!    def compare(self):
!        """ compare all files of the same size  - outer loop """
!        sizes=self.compfiles.keys()
!        sizes.sort()
!        for size in sizes:
!            self.comparefiles(size,self.compfiles[size])

Why sort? What's wrong with just two lines:

!        for size, file_list in self.compfiles.iteritems():
!            self.comparefiles(size, file_list)
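
For anyone reading this with a current interpreter: iteritems() exists only in Python 2, and Python 3 spells it items(). The sketch below shows both the unsorted loop and a sorted variant; compare_all and the ordered flag are made-up names, and keeping the sort may still be reasonable if output ordered by file size is wanted.

```python
def compare_all(compfiles, comparefiles, ordered=False):
    """Call comparefiles(size, file_list) for every entry in compfiles."""
    # sorted() on the (size, list) pairs orders by size; sizes are unique keys.
    items = sorted(compfiles.items()) if ordered else compfiles.items()
    for size, file_list in items:
        comparefiles(size, file_list)
```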

(8) global
MIN_FILESIZE,MAX_ONEBUFFER,MAX_ALLBUFFERS,BLOCKSIZE,INODES

That doesn't sit very well with the 'everything must be in a class'
religion seemingly espoused by the following:

! class fDups:
!     """ encapsulates the whole logic """

(9) Any good reason why the executables don't have .py extensions
on their names?

All in all, a very poor out-of-the-box experience. Bear in mind that
very few Windows users would have even heard of bzip2, let alone have a
bzip2.exe on their machine. They wouldn't even be able to *open* the
box.
And what is chown -- any relation of Perl's chomp?
