Bug#656142: ITP: duff -- Duplicate file finder

2012-01-19 Thread Kamal Mostafa
 On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
  * Package name: duff
  * URL : http://duff.sourceforge.net/

On Tue, 2012-01-17 at 09:56 +0100, Simon Josefsson wrote:
 If there aren't warnings about use of SHA1 in the tool, there should
 be. While I don't recall any published SHA1 collisions, SHA1 is
 considered broken and shouldn't be used if you want to trust your
 comparisons.  I'm assuming the tool supports SHA256 and other SHA2
 hashes as well?  It might be useful to make sure the defaults are
 non-SHA1.

Duff supports SHA1, SHA256, SHA384 and SHA512 hashes.  The default is
SHA1.  For comparison, rdfind supports only MD5 and SHA1 hashes.  Thanks
for the note, Simon -- I'll bring it to the attention of the upstream
author, Camilla Berglund.
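
In the meantime, anyone who wants to avoid SHA1 can already ask for a
stronger digest on the command line.  A sketch based on my reading of
the duff manual (double-check duff(1) for the exact option spelling):

# recurse into a tree, selecting SHA-256 instead of the default digest
duff -d sha256 -r files

# any suspected pair can always be confirmed byte-for-byte before acting
# on it (fileA and fileB are placeholders)
cmp -s fileA fileB && echo identical || echo different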

On Tue, 2012-01-17 at 09:12 +, Lars Wirzenius wrote:
 rdfind seems to be quickest one, but duff compares well with hardlink,
 which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
 Debian so far.
 
 This was done using my benchmark-cmd utility in my extrautils
 collection (not in Debian): http://liw.fi/extrautils/ for source.

Thanks for the pointer to your benchmark-cmd tool, Lars.  Very handy!
My results with it mirrored yours -- of the similar tools, duff appears
to lag only rdfind in performance (for my particular dataset, at least).

I looked into duff's methods a bit and discovered a few easy performance
optimizations that may speed it up a bit more.  The author is reviewing
my proposed patch now, and seems very open to collaboration.

 Personally, I would be wary of using checksums for file comparisons,
 since comparing files byte-by-byte isn't slow (you only need to
 do it to files that are identical in size, and you need to read
 all the files anyway).

Byte-by-byte might well be slower than checksums, if you end up faced
with N very large (uncacheable) files of identical size but unique
contents.  They all need to be checked against each other, so each of
the N files would need to be read up to N-1 times.  Anyway, duff
actually *does* offer byte-by-byte comparison as an option (rdfind does
not).
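
To make the trade-off concrete with plain coreutils (an illustration
only, not how duff works internally): the checksum approach reads every
file exactly once, however many of them happen to share a size.

# hash every file once, then group identical digests together;
# each group of two or more lines is a set of duplicate candidates
find files -type f -print0 \
  | xargs -0 sha256sum \
  | sort \
  | uniq -w64 --all-repeated=separate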

 I also think we've now got enough of duplicate file finders in
 Debian that it's time to consider whether we need so many. It's
 too bad they all have incompatible command line syntaxes, or it
 would be possible to drop some. (We should accept a new one if
 it is better than the existing ones, of course. Evidence required.)

To me, the premise that a new package must be better than existing
similar ones (evidence required, no less) seems pretty questionable.
It may not be so easy to establish just what "better than" means, and it
puts us in a position of making value judgments for our users that they
should be able to make for themselves.

While I do think it is productive to compare the performance of these
similar tools, I don't see much value in pitting them against each other
in benchmark wars as a criterion for acceptance into Debian.

Here we have a good quality DFSG-compliant package with an active
upstream and a willing DD maintainer.  While similar tools do exist
already in Debian, they do not offer identical feature sets or user
interfaces, and only one of them has been shown to outperform duff in
quick spot checks.  Some users have expressed a preference for duff over
the others.  Does that make it better than the existing ones?  My
answer: Who cares? Nobody is making us choose only one.

In my view, it's not really a problem to carry multiple duplicate file
detectors in Debian; we will best serve our users by letting them choose
their preferred tool for the job.  And by allowing such packages into
Debian we encourage their improvement, to everyone's benefit.

 -Kamal





Bug#656142: ITP: duff -- Duplicate file finder

2012-01-17 Thread Simon Josefsson
Kamal Mostafa ka...@whence.com writes:

 Package: wnpp
 Severity: wishlist
 Owner: Kamal Mostafa ka...@whence.com


 * Package name: duff
   Version : 0.5
   Upstream Author : Camilla Berglund elmindr...@elmindreda.org
 * URL : http://duff.sourceforge.net/
 * License : Zlib
   Programming Lang: C
   Description : Duplicate file finder

 Duff is a command-line utility for identifying duplicates in a given set of
 files.  It attempts to be usably fast and uses the SHA family of message
 digests as a part of the comparisons.

If there aren't warnings about use of SHA1 in the tool, there should be.
While I don't recall any published SHA1 collisions, SHA1 is considered
broken and shouldn't be used if you want to trust your comparisons.  I'm
assuming the tool supports SHA256 and other SHA2 hashes as well?  It
might be useful to make sure the defaults are non-SHA1.

/Simon






Bug#656142: ITP: duff -- Duplicate file finder

2012-01-17 Thread Lars Wirzenius
On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
 * Package name: duff
 * URL : http://duff.sourceforge.net/

A quick speed comparison:

real (s)  user (s)  system (s)  max RSS (KiB)  elapsed (s)  cmd
 3.2       2.4        5.8           62784          5.8      hardlink --dry-run files > /dev/null
 1.1       0.4        1.6           15424          1.6      rdfind files > /dev/null
 1.9       0.2        2.2            9904          2.2      duff-0.5/src/duff -r files > /dev/null

rdfind seems to be quickest one, but duff compares well with hardlink,
which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
Debian so far.

This was done using my benchmark-cmd utility in my extrautils
collection (not in Debian): http://liw.fi/extrautils/ for source.
The exact command to generate the above table:

benchmark-cmd \
--setup='genbackupdata --create=100m files' \
--setup='cp -a files/0 files/copy' \
--cleanup='rm -rf files' \
--verbose \
--command='hardlink --dry-run files > /dev/null' \
--command='rdfind files > /dev/null' \
--command='duff-0.5/src/duff -r files > /dev/null'
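
Roughly the same comparison can be reproduced without benchmark-cmd,
using plain GNU time(1) (an untested sketch of the same setup):

genbackupdata --create=100m files
cp -a files/0 files/copy
for cmd in 'hardlink --dry-run files' 'rdfind files' 'duff-0.5/src/duff -r files'
do
    # -v reports elapsed time and maximum resident set size, among other things
    /usr/bin/time -v sh -c "$cmd > /dev/null"
done
rm -rf files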

Personally, I would be wary of using checksums for file comparisons,
since comparing files byte-by-byte isn't slow (you only need to
do it to files that are identical in size, and you need to read
all the files anyway).
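
The idea, sketched with standard GNU tools (an illustration only; none
of the packaged tools work exactly this way):

# only files sharing an exact size are worth comparing at all
find files -type f -printf '%s\t%p\n' | sort -n
# within one size group, the byte-by-byte check is just cmp(1)
# (candidate1 and candidate2 are placeholders for two same-sized files)
cmp -s candidate1 candidate2 && echo duplicates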

I also think we've now got enough of duplicate file finders in
Debian that it's time to consider whether we need so many. It's
too bad they all have incompatible command line syntaxes, or it
would be possible to drop some. (We should accept a new one if
it is better than the existing ones, of course. Evidence required.)

-- 
Freedom-based blog/wiki/web hosting: http://www.branchable.com/




Bug#656142: ITP: duff -- Duplicate file finder

2012-01-16 Thread Kamal Mostafa
Package: wnpp
Severity: wishlist
Owner: Kamal Mostafa ka...@whence.com


* Package name: duff
  Version : 0.5
  Upstream Author : Camilla Berglund elmindr...@elmindreda.org
* URL : http://duff.sourceforge.net/
* License : Zlib
  Programming Lang: C
  Description : Duplicate file finder

Duff is a command-line utility for identifying duplicates in a given set of
files.  It attempts to be usably fast and uses the SHA family of message
digests as a part of the comparisons.






Bug#656142: ITP: duff -- Duplicate file finder

2012-01-16 Thread Samuel Thibault
Kamal Mostafa wrote on Mon, 16 Jan 2012 12:58:13 -0800:
 Package: wnpp
 Severity: wishlist
 Owner: Kamal Mostafa ka...@whence.com
 
 
 * Package name: duff
   Version : 0.5
   Upstream Author : Camilla Berglund elmindr...@elmindreda.org
 * URL : http://duff.sourceforge.net/
 * License : Zlib
   Programming Lang: C
   Description : Duplicate file finder
 
 Duff is a command-line utility for identifying duplicates in a given set of
 files.  It attempts to be usably fast and uses the SHA family of message
 digests as a part of the comparisons.

What is the benefit over fdupes, rdfind, ...?

Samuel





Bug#656142: ITP: duff -- Duplicate file finder

2012-01-16 Thread Axel Beckert
Hi,

Samuel Thibault wrote:
  * Package name: duff
Version : 0.5
Upstream Author : Camilla Berglund elmindr...@elmindreda.org
  * URL : http://duff.sourceforge.net/
  * License : Zlib
Programming Lang: C
Description : Duplicate file finder
  
  Duff is a command-line utility for identifying duplicates in a given set of
  files.  It attempts to be usably fast and uses the SHA family of message
  digests as a part of the comparisons.
 
 What is the benefit over fdupes, rdfind, ...?

..., hardlink, ...

Some of my coworkers prefer duff over the tools available in Debian,
too.  I'm no longer sure why, though; it's possible that speed was one
argument, because they ran it over several TB of data.  I'll check what
exactly the reason was back then.

I was already thinking about packaging it myself, so I may also sponsor
Kamal's package when it's ready.

Regards, Axel
-- 
 ,''`.  |  Axel Beckert a...@debian.org, http://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
  `-|  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5






Bug#656142: ITP: duff -- Duplicate file finder

2012-01-16 Thread Joerg Jaspert
 What is the benefit over fdupes, rdfind, ...?
 ..., hardlink, ...

finddup from perforate

 I was already thinking about packaging it myself, so I may also sponsor
 Kamal's package when it's ready.

You just listed the third duplicate (and I've just named a fourth), and
still go blindly right on with "oh, I'll sponsor it".  Why?  I hope it's
conditional on it being vastly better than any of the others (speed,
functionality, ...) and not "just because".

Contrary to some common belief, Debian is not the dumping ground for
NIH, and even if a little redundancy can't hurt, too much is just waste:
of our time, of our mirrors (space and bandwidth), ...

-- 
bye, Joerg
Contrary to common belief, Arch:i386 is *not* the same as Arch: any.






Bug#656142: ITP: duff -- Duplicate file finder

2012-01-16 Thread Kamal Mostafa
On Mon, 2012-01-16 at 23:07 +0100, Joerg Jaspert wrote:
  What is the benefit over fdupes, rdfind, ...?
  ..., hardlink, ...
 finddup from perforate

After a quick evaluation of the various "find duplicate files" tools, I
was attracted to try duff because:

1. It looked easier to use than the others.
2. This quote from its website[1] was exactly what I was looking for:
"Note that duff itself never modifies any files, but it's designed to
play nice with tools that do."  The other dupe-cleaner utilities left me
worried that they might trash something important if I got my
command-line options wrong or forgot a --dry-run flag.
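
In practice that means the workflow stays read-only until I decide
otherwise, e.g. (a sketch; -r is duff's recursive flag as used elsewhere
in this thread, and the file name is a placeholder):

# duff only reports; nothing on disk is touched
duff -r files > duplicate-report.txt
# review the report, then remove chosen copies deliberately
# (extra-copy.jpg is a placeholder for a file I have decided to drop)
rm -i extra-copy.jpg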


  I was already thinking about packaging it myself, so I may also
  sponsor Kamal's package when it's ready.

Thanks Axel, but I'm a DD myself, so won't need a sponsor.


 You just listed the third duplicate (and I've just named a fourth), and
 still go blindly right on with "oh, I'll sponsor it".  Why?  I hope it's
 conditional on it being vastly better than any of the others (speed,
 functionality, ...)

In my humble opinion, that would be an unreasonable pre-condition for
inclusion in Debian.  Our standard for inclusion should not be that a
new package must be vastly better than other similar packages.  That
would deny a new package the opportunity to build a user base and
possibly someday evolve to become the vastly better alternative
itself.

 -Kamal

ka...@whence.com
ka...@debian.org

[1] http://duff.sourceforge.net/





Bug#656142: ITP: duff -- Duplicate file finder

2012-01-16 Thread martin f krafft
also sprach Kamal Mostafa ka...@debian.org [2012.01.17.0049 +0100]:
 In my humble opinion, that would be an unreasonable pre-condition for
 inclusion in Debian.  Our standard for inclusion should not be that a
 new package must be vastly better than other similar packages.  That
 would deny a new package the opportunity to build a user base and
 possibly someday evolve to become the vastly better alternative
 itself.

Right, but I'd say it needs to be better and the maintainer needs to
be able to argue how it is better.

-- 
 .''`.   martin f. krafft madduck@d.o  Related projects:
: :'  :  proud Debian developer   http://debiansystem.info
`. `'`   http://people.debian.org/~madduckhttp://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
 
the time for petty politics is over.
 the very next century will bring
 the struggle for dominion over the earth.
 - friedrich nietzsche


digital_signature_gpg.asc
Description: Digital signature (see http://martin-krafft.net/gpg/sig-policy/999bbcc4/current)