Re: Comparing large amounts of files

2009-12-29 Thread J.C. Roberts
Sorry for the additional post, but it's a fun problem when you consider
the possible messed up file/path names that could be created by a raw
database dump. I made some improvements to the example script to make
sure such names are handled correctly.


#!/bin/ksh

# There are plenty of caveats to be aware of to make sure you process
# strings (paths and file names) correctly.
#
# The first is that the built-in ksh 'echo' strips backslashes by
# default, unless it is used with the '-E' switch. This is the opposite
# of bash, where backslashes are not stripped by default and the '-e'
# switch forces stripping. The easy answer is to just use printf(1)
# rather than worry about whether you got the 'echo' right.
#
# Another caveat: when using 'while read', you need the '-r' flag to
# prevent stripping of backslashes.
#
# Also, quoting your variables is highly recommended. This prevents a
# number of issues including odd behavior due to spaces in strings.
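#
# A tiny illustration of the echo/printf difference (left commented
# out since it is not part of the real work below; the behavior shown
# is the ksh built-in echo as described above and may differ in other
# shells):
#
#   echo 'a\tb'             # ksh processes the backslash escape
#   echo -E 'a\tb'          # prints a\tb literally
#   printf '%s\n' 'a\tb'    # printf %s never touches the argument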

# --- test setup ---
# clean up
rm -rf d1 d2 d3
rm -rf d\ 4
rm -rf 'd 5'
rm -rf ./-d
rm -rf 'x\\ 2'
rm -rf '*x'
rm -rf 'xx'
rm -f all.txt unq.txt dup.txt

# We need some seriously messed up directory names.
mkdir d1 d2 d3 d\ 4 'd 5' ./-d 'x\\ 2' '*x' 'xx'

# And we need some seriously messed up file names in each directory.
for i in ../0[1,2,3,4,5].png; do
  cp "$i" ./d1/`printf %s "$i" | sed -e 's/0/1/' -e 's/^..\///'`;
  cp "$i" ./d2/`printf %s "$i" | sed -e 's/0/2/' -e 's/^..\///'`;
  cp "$i" ./d3/`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`;
  cp "$i" ./d\ 4/`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`;
  cp "$i" './d 5/'`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`;
  cp "$i" ./-d/`printf %s "$i" | sed -e 's/0/-/' -e 's/^..\///'`;
  cp "$i" './x\\ 2/'`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`;
  cp "$i" './*x/'`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`;
  cp "$i" './xx/'`printf %s "$i" | sed -e 's/0/?/' -e 's/^..\///'`;
done;

# --- Actually Useful ---
# Find all the files, and generate cksum for each
find ./d1 ./d2 ./d3 ./d\ 4/ './d 5/' ./-d/ './x\\ 2/' './*x' './xx' \
  -type f -print0 | xargs -0 cksum -r -a sha256 > all.txt

# For the sake of your sanity, you want the leading ./ or better,
# the fully qualified path to the directories where you want to
# run find(1). This will save you from a lot of possible mistakes
# caused by messed up directory and/or file names.

# After you have a cksum hash for every file, you want to make sure
# your list is sorted, or else other commands will fail since they
# typically expect lists to be pre-sorted.
sort -k 1,1 -o all.txt all.txt

# Generate a list of unique files based on the hash
sort -k 1,1 -u -o unq.txt all.txt 

# To get a list of just the duplicates using comm(1)
comm -2 -3 all.txt unq.txt > dup.txt

# --- Possibly Useful ---
# Sure, once you have your list of duplicates, you're pretty well set
# assuming you're not afraid of hash collisions. If the data is very
# valuable, and you don't want to risk a hash collision causing you to
# delete something important, you need to use cmp(1) to make sure the
# files really are *identical* duplicates.

IFS='
'
for UNQ_LINE in `cat unq.txt`; do
  # Grab the hash from the line.
  UNQ_HASH=`printf %s "$UNQ_LINE" | sed -E -e 's/ .+$//'`;
  # Grab the full path and file name.
  UNQ_FILE=`printf %s "$UNQ_LINE" | sed -E -e 's/^[a-f0-9]{64} //'`;
  printf "\n";
  printf "UNQ_HASH: %s\n" "$UNQ_HASH";
  printf "UNQ_FILE: %s\n" "$UNQ_FILE";
  # use the look(1) command to find matching hashes in the duplicates
  for DUP_LINE in `look "$UNQ_HASH" dup.txt`; do
    DUP_HASH=`printf %s "$DUP_LINE" | sed -E -e 's/ .+$//'`;
    DUP_FILE=`printf %s "$DUP_LINE" | sed -E -e 's/^[a-f0-9]{64} //'`;
    printf "DUP_HASH: %s\n" "$DUP_HASH";
    printf "DUP_FILE: %s\n" "$DUP_FILE";
    if cmp -s "$UNQ_FILE" "$DUP_FILE"; then
      printf "Binary Compare Matches\n";
      # rm "$DUP_FILE";
    else
      printf "ERROR: Hash Collision\n";
      exit 1;
    fi
  done;
done;

# Another way to do it, without using sed. 
# Set the IFS to a single space so read(1) splits the hash from the
# file name.
IFS=' ';
cat unq.txt | while read -r UNQ_HASH UNQ_FILE; do
  printf "\n";
  printf "UNQ_HASH: %s\n" "$UNQ_HASH";
  printf "UNQ_FILE: %s\n" "$UNQ_FILE";
  look "$UNQ_HASH" dup.txt | while read -r DUP_HASH DUP_FILE; do
    printf "DUP_HASH: %s\n" "$DUP_HASH";
    printf "DUP_FILE: %s\n" "$DUP_FILE";
    if cmp -s "$UNQ_FILE" "$DUP_FILE"; then
      printf "Binary Compare Matches\n";
      # rm "$DUP_FILE";
    else
      printf "ERROR: Hash Collision\n";
      exit 1;
    fi
  done;
done;


exit 0





-- 
J.C. Roberts



Re: Comparing large amounts of files

2009-12-28 Thread J.C. Roberts
On Fri, 11 Dec 2009 18:24:24 -0500 STeve Andre' and...@msu.edu
wrote:

I am wondering if there is a port or otherwise available
 code which is good at comparing large numbers of files in
 an arbitrary number of directories?  I always try avoid
 wheel re-creation when possible.  I'm trying to help some-
 one with large piles of data, most of which is identical
 across N directories.  Most.  Its the 'across dirs' part
 that involves the effort, hence my avoidance of thinking
 on it if I can help it. ;-)
 
 Thanks, STeve Andre'
 

Sorry for the late reply, STeve, but this might still be helpful for you,
or possibly for others through the list archives. Though it doesn't come up
too often, you've asked a fairly classic question. Most people resort
to finding some pre-made tool for this sort of thing, simply because
they don't know that everything they need is already in the base OpenBSD
operating system.

Considering your earlier post about moving *huge* files (greater than
10 GiB), the method of file comparison you use could be really taxing
on the system. Our cksum(1) command gives you plenty of possible
trade-offs between accuracy (uniqueness) and speed across its various
checksum (hash) algorithms.
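
If you want a feel for that trade-off on your own data before
committing to an algorithm, a rough sketch (purely illustrative;
'bigfile' is just a stand-in for one of your large dumps, and the
algorithm names are whatever your cksum(1) lists) is to time each
one:

for a in md5 sha1 sha256 sha512; do
  printf "=== %s ===\n" "$a"
  time cksum -a "$a" bigfile
done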

Assuming you don't want to accidentally delete the wrong data, SHA256
might be a good choice, but there is still the potential for hash
collisions. Using two or more hashes (such as the once typical
combination of MD5 and SHA256), only *reduces* the chance of collision,
rather than *eliminating* the chance of collision. If you absolutely
must be certain you're not mistakenly deleting something due to a hash
collision, then you need to use the cmp(1) command before deleting
anything.
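
In its simplest form that last check is just something like this
(a sketch; file1 and file2 are placeholders):

if cmp -s file1 file2; then
  printf "byte-for-byte identical\n"
else
  printf "different (or unreadable)\n"
fi

The -s flag keeps cmp quiet and lets the exit status do the talking,
which is all you need inside a script.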

Since you're dealing with a somewhat raw dump of a database, you could
have some potentially dangerous directory or file names mixed into
the resulting dump. Handling these properly is painful, requires some
thinking, and I may not have it completely correct.


#!/bin/ksh

# Create a test/ directory
mkdir ./test

# Copy some random image files into our new test/ directory
cp ../0[1,2,3,4,5].png test/.

# $ ls -lF test/
# total 1696
# -rw-r--r--  1 jcr  users  126473 Dec 28 00:02 01.png
# -rw-r--r--  1 jcr  users  157483 Dec 28 00:02 02.png
# -rw-r--r--  1 jcr  users  209387 Dec 28 00:02 03.png
# -rw-r--r--  1 jcr  users  163780 Dec 28 00:02 04.png
# -rw-r--r--  1 jcr  users  188546 Dec 28 00:02 05.png

# Now we need some messed up directory names.
mkdir d1 d2 d3 d\ 4 'd 5' ./-d

# And we need some messed up file names in each directory.
for i in test/*.png; do 
  cp $i ./d1/`echo $i | sed -e 's/0/1/' -e 's/^test\///'`; 
  cp $i ./d2/`echo $i | sed -e 's/0/2/' -e 's/^test\///'`; 
  cp $i ./d3/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`; 
  cp $i ./d\ 4/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`; 
  cp $i './d 5/'`echo $i | sed -e 's/0/3/' -e 's/^test\///'`; 
  cp $i ./-d/`echo $i | sed -e 's/0/-/' -e 's/^test\///'`; 
done;

# Find all the files, and generate cksum for each
find ./d1 ./d2 ./d3 ./d\ 4/ './d 5/' ./-d/ -type f -print0 | \
  xargs -0 cksum -r -a sha256 > all.txt

# For the sake of your sanity, you want the leading ./ or better,
# the fully qualified path to the directories where you want to
# run find(1). This will save you from a lot of possible mistakes
# caused by screwed up directory and/or file names.

# After you have a cksum hash for every file, you want to make sure
# your list is sorted, or else other commands will fail since they
# typically expect lists to be pre-sorted.
sort -k 1,1 -o all.txt all.txt

# Generate a list of unique files based on the hash
sort -k 1,1 -u -o unq.txt all.txt 

# To get a list of just the duplicates using comm(1)
comm -2 -3 all.txt unq.txt > dup.txt

# Sure, once you have your list of duplicates, you're pretty well set
# assuming you're not afraid of hash collisions. If the data is very
# valuable, and you don't want to risk a hash collision causing you to
# delete something important, you need to use cmp(1) to make sure the
# files really are *identical* duplicates.

# The `while read VAR;do ...; done  file.txt;` construct fails when
# a backslash followed by a space is present in the input file. In
# essence it seems to be escaping the space, and thereby dropping the
# leading backslash. i.e.
# ./d\ 4/file.png
#
# Since we can't trust `while read ...` to do the right thing, we have
# to resort to resetting the Internal Field Separator (IFS) to a new
# line, so we can grab entire lines.
IFS='
'

# As for doing a binary compare with cmp(1), it's fairly simple now.

for UNQ_LINE in `cat unq.txt`; do
  # Grab the hash from the line.
  UNQ_HASH=`echo $UNQ_LINE | sed -E -e 's/ .+$//'`;
  # Grab the full path and file name.
  UNQ_FILE=`echo $UNQ_LINE | sed -E -e 's/^[a-f0-9]{64,64} //'`;
  # use the look(1) command to find matching hashes in the duplicates
  # or you could use grep(1) or whatever else fits your fancy.
  for 

Re: Comparing large amounts of files

2009-12-28 Thread J.C. Roberts
On Mon, 28 Dec 2009 18:40:57 -0800 J.C. Roberts
list-...@designtools.org wrote:

 # The `while read VAR;do ...; done  file.txt;` construct fails when
 # a backslash followed by a space is present in the input file. In
 # essence it seems to be escaping the space, and thereby dropping the
 # leading backslash. i.e.
 # ./d\ 4/file.png
 #

The above is because I'm an idiot and forgot the '-r' flag for 'read'.
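
For anyone reading this in the archives, a quick illustration of the
difference (a sketch, using a made-up file name):

printf '%s\n' './d\ 4/file.png' | \
  while read LINE; do printf '%s\n' "$LINE"; done

printf '%s\n' './d\ 4/file.png' | \
  while read -r LINE; do printf '%s\n' "$LINE"; done

The first prints ./d 4/file.png because read treats the backslash as
an escape for the space; the second preserves the line exactly as
./d\ 4/file.png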

-- 
J.C. Roberts



Re: Comparing large amounts of files

2009-12-15 Thread Nick Bender
On Saturday, December 12, 2009, Andy Hayward a...@buteo.org wrote:
 On Fri, Dec 11, 2009 at 23:24, STeve Andre' and...@msu.edu wrote:
  I am wondering if there is a port or otherwise available
 code which is good at comparing large numbers of files in
 an arbitrary number of directories?  I always try avoid
 wheel re-creation when possible.  I'm trying to help some-
 one with large piles of data, most of which is identical
 across N directories.  Most.  Its the 'across dirs' part
 that involves the effort, hence my avoidance of thinking
 on it if I can help it. ;-)

 sysutils/fdupes

 -- ach


If you have a database available you can store file hashes and use SQL.
I used postgres for the job and had reasonable performance on a 10
million file collection. I stored directory paths in one table and
filename, size, and sha1 in another table. Scripting the table
creation was fairly easy...
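
For anyone curious, a minimal sketch of that kind of schema (the
database, table and column names here are made up for illustration,
not necessarily what I used):

psql -d filedb <<'SQL'
CREATE TABLE dirs  (dir_id serial PRIMARY KEY, path text NOT NULL);
CREATE TABLE files (file_id serial PRIMARY KEY,
                    dir_id integer REFERENCES dirs,
                    name text NOT NULL,
                    size bigint NOT NULL,
                    sha1 char(40) NOT NULL);
-- once the hashes are loaded, duplicates are one GROUP BY away:
SELECT sha1, count(*) FROM files GROUP BY sha1 HAVING count(*) > 1;
SQL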

-N



Re: Comparing large amounts of files

2009-12-12 Thread Paul M

On 12/12/2009, at 4:22 PM, Frank Bax wrote:


STeve Andre' wrote:

but am trying to come up with a reasonable way
of spotting duplicates, etc.


You mean like this...

$ cp /etc/firmware/zd1211-license /tmp/XX1
$ cp /var/www/icons/dir.gif /tmp/XX2
$ fdupes /etc/firmware/ /var/www/icons/ /tmp/
/tmp/XX2
/var/www/icons/dir.gif
/var/www/icons/folder.gif

snip



When comparing a very large number of files, the -1 flag to
fdupes makes it much easier to manage the output (IMO). Then each
line printed is a list of duplicate files.
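
For example, something along these lines (a rough sketch with made-up
paths; note it falls apart if any file names contain spaces, since -1
separates the names in each set with spaces):

fdupes -r -1 /path/one /path/two | while read -r keep dups; do
  printf 'keeping:   %s\n' "$keep"
  for f in $dups; do
    printf 'duplicate: %s\n' "$f"
  done
done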

But it depends on your own needs.


After seeing what you're trying to do, I'd say fdupes is a
perfect fit.


paulm



Re: Comparing large amounts of files

2009-12-12 Thread Liviu Daia
On 11 December 2009, STeve Andre' and...@msu.edu wrote:
I am wondering if there is a port or otherwise available
 code which is good at comparing large numbers of files in
 an arbitrary number of directories?  I always try avoid
 wheel re-creation when possible.  I'm trying to help some-
 one with large piles of data, most of which is identical
 across N directories.  Most.  Its the 'across dirs' part
 that involves the effort, hence my avoidance of thinking
 on it if I can help it. ;-)

Try this tiny Perl script:

http://hqbox.org/files/fdupe.pl

It's still faster than all of its competitors I'm aware of (most of
them written in C). :)

Regards,

Liviu Daia

-- 
Dr. Liviu Daia  http://www.imar.ro/~daia



Re: Comparing large amounts of files

2009-12-12 Thread Andy Hayward
On Fri, Dec 11, 2009 at 23:24, STeve Andre' and...@msu.edu wrote:
  I am wondering if there is a port or otherwise available
 code which is good at comparing large numbers of files in
 an arbitrary number of directories?  I always try avoid
 wheel re-creation when possible.  I'm trying to help some-
 one with large piles of data, most of which is identical
 across N directories.  Most.  Its the 'across dirs' part
 that involves the effort, hence my avoidance of thinking
 on it if I can help it. ;-)

sysutils/fdupes

-- ach



Comparing large amounts of files

2009-12-11 Thread STeve Andre'
   I am wondering if there is a port or otherwise available
code which is good at comparing large numbers of files in
an arbitrary number of directories?  I always try avoid
wheel re-creation when possible.  I'm trying to help some-
one with large piles of data, most of which is identical
across N directories.  Most.  Its the 'across dirs' part
that involves the effort, hence my avoidance of thinking
on it if I can help it. ;-)

Thanks, STeve Andre'



Re: Comparing large amounts of files

2009-12-11 Thread Martin Schröder
2009/12/12 STeve Andre' and...@msu.edu:
   I am wondering if there is a port or otherwise available
 code which is good at comparing large numbers of files in
 an arbitrary number of directories?  I always try avoid

Try rsync if you just want to know which files differ.
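
Something like this would list them without copying anything (a
sketch; dir1 and dir2 are placeholders, -n is a dry run and -c makes
rsync compare checksums rather than just size and mtime):

rsync -rnc --itemize-changes dir1/ dir2/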

Best
Martin



Re: Comparing large amounts of files

2009-12-11 Thread Noah Pugsley

STeve Andre' wrote:

   I am wondering if there is a port or otherwise available
code which is good at comparing large numbers of files in
an arbitrary number of directories?  I always try avoid
wheel re-creation when possible.  I'm trying to help some-
one with large piles of data, most of which is identical
across N directories.  Most.  Its the 'across dirs' part
that involves the effort, hence my avoidance of thinking
on it if I can help it. ;-)

Thanks, STeve Andre'



Compare how?




Re: Comparing large amounts of files

2009-12-11 Thread anonymous
On Fri, Dec 11, 2009 at 06:24:24PM -0500, STeve Andre' wrote:
I am wondering if there is a port or otherwise available
 code which is good at comparing large numbers of files in
 an arbitrary number of directories?  I always try avoid
 wheel re-creation when possible.  I'm trying to help some-
 one with large piles of data, most of which is identical
 across N directories.  Most.  Its the 'across dirs' part
 that involves the effort, hence my avoidance of thinking
 on it if I can help it. ;-)
 
 Thanks, STeve Andre'
 
What is wrong with diff (-r option)?



Re: Comparing large amounts of files

2009-12-11 Thread Paul M

diff(1), if you want to compare specific files or dirs, or
fdupes for searching for arbitrary files in arbitrary locations.


paulm


On 12/12/2009, at 12:24 PM, STeve Andre' wrote:


   I am wondering if there is a port or otherwise available
code which is good at comparing large numbers of files in
an arbitrary number of directories?  I always try avoid
wheel re-creation when possible.  I'm trying to help some-
one with large piles of data, most of which is identical
across N directories.  Most.  Its the 'across dirs' part
that involves the effort, hence my avoidance of thinking
on it if I can help it. ;-)

Thanks, STeve Andre'




Re: Comparing large amounts of files

2009-12-11 Thread STeve Andre'
On Friday 11 December 2009 18:36:33 Noah Pugsley wrote:
 STeve Andre' wrote:
 I am wondering if there is a port or otherwise available
  code which is good at comparing large numbers of files in
  an arbitrary number of directories?  I always try avoid
  wheel re-creation when possible.  I'm trying to help some-
  one with large piles of data, most of which is identical
  across N directories.  Most.  Its the 'across dirs' part
  that involves the effort, hence my avoidance of thinking
  on it if I can help it. ;-)
  
  Thanks, STeve Andre'
  
 
 Compare how?

I should have been more clear I suppose.  I'd like to know
the files that are identical, files that are of the same
name but different across directories, possibly several
directories.

What I have is a large clump of data in the form of some
huge number of relatively small files, which were extracted
out of a database as individual files.  I am not responsible
for this(!) but am trying to come up with a reasonable way
of spotting duplicates, etc.  Some files have the same
name (and even some with the same size) but are different.
It's a mess, but the original database died and all I have
are pieces, kind of like shards from a large piece of pottery
that just got smashed.  I'm not even sure what all the data
looks like at this point--I can only assume it's going to be
ugly; no thought was given to this when the files were created.

--STeve Andre'



Re: Comparing large amounts of files

2009-12-11 Thread Alexander Bochmann
Hi,

...on Fri, Dec 11, 2009 at 06:52:09PM -0500, STeve Andre' wrote:

   Compare how?
  I should have been more clear I suppose.  I'd like to know
  the files that are identical, files that are of the same
  name but different across directories, possibly several
  directories.

Maybe you could use something like this in the 
directory you're looking at:

find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums

You could now just sort the md5sums file to find 
all entries with the same md5... Or sort by filename 
(will need some more logic if files are distributed 
over several subdirectories) to weed out those with 
the same name and different checksums.
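
Picking out the entries that share an md5 is then a short awk filter
away, e.g. (a sketch, relying on md5 -r putting the checksum in the
first column):

sort md5sums | awk 'prev == $1 { print } { prev = $1 }'

which prints every line whose checksum matches the one above it,
i.e. every copy beyond the first.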

Alex.



Re: Comparing large amounts of files

2009-12-11 Thread STeve Andre'
On Friday 11 December 2009 20:31:54 Alexander Bochmann wrote:
 Hi,
 
 ...on Fri, Dec 11, 2009 at 06:52:09PM -0500, STeve Andre' wrote:
 
Compare how?
   I should have been more clear I suppose.  I'd like to know
   the files that are identical, files that are of the same
   name but different across directories, possibly several
   directories.
 
 Maybe you could use something like this in the 
 directory you're looking at:
 
 find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums
 
 You could now just sort the md5sums file to find 
 all entries with the same md5... Or sort by filename 
 (will need some more logic if files are distributed 
 over several subdirectories) to weed out those with 
 the same name and different checksums.
 
 Alex.
 
 
 

Yup!  I'm doing that for single directories now.  But the
added logic for N dirs was something I hoped to avoid.

--STeve



Re: Comparing large amounts of files

2009-12-11 Thread Brad Tilley
On Fri, Dec 11, 2009 at 8:31 PM, Alexander Bochmann a...@lists.gxis.de wrote:

 find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums

 You could now just sort the md5sums file to find
 all entries with the same md5... Or sort by filename
 (will need some more logic if files are distributed
 over several subdirectories) to weed out those with
 the same name and different checksums.

 Alex.

I do something similar, but more elaborate, using Python to back up
redundant pics scattered in various folders into one folder... it would
need to be modified for name clashes:

import hashlib
import os
import shutil

already = []
dst = os.getcwd()

paths = ["/usr/local", "/home", "/storage"]

for p in paths:

    for root, dirs, files in os.walk(p):
        for f in files:

            m = hashlib.md5()

            # Get file extension
            ext = os.path.splitext(os.path.join(root, f))[1]

            try:

                # Copy JPG files
                if ext.lower() == ".jpg":
                    fp = open(os.path.join(root, f), 'rb')
                    data = fp.read()
                    fp.close()
                    m.update(data)
                    if m.hexdigest() not in already:
                        already.append(m.hexdigest())
                        print "Copying", os.path.join(root, f)
                        shutil.copyfile(os.path.join(root, f),
                                        os.path.join(dst, f))
                    else:
                        print "Already Copied!!!"
...



Re: Comparing large amounts of files

2009-12-11 Thread Bret S. Lambert
On Sat, Dec 12, 2009 at 02:31:54AM +0100, Alexander Bochmann wrote:
 Hi,
 
 ...on Fri, Dec 11, 2009 at 06:52:09PM -0500, STeve Andre' wrote:
 
Compare how?
   I should have been more clear I suppose.  I'd like to know
   the files that are identical, files that are of the same
   name but different across directories, possibly several
   directories.
 
 Maybe you could use something like this in the 
 directory you're looking at:
 
 find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums
 
 You could now just sort the md5sums file to find 

or | to sort -u -k smizzizlezlackin' awesomenesss

 all entries with the same md5... Or sort by filename 
 (will need some more logic if files are distributed 
 over several subdirectories) to weed out those with 
 the same name and different checksums.
 
 Alex.



Re: Comparing large amounts of files

2009-12-11 Thread STeve Andre'
On Friday 11 December 2009 19:11:18 anonymous wrote:
 On Fri, Dec 11, 2009 at 06:24:24PM -0500, STeve Andre' wrote:
 I am wondering if there is a port or otherwise available
  code which is good at comparing large numbers of files in
  an arbitrary number of directories?  I always try avoid
  wheel re-creation when possible.  I'm trying to help some-
  one with large piles of data, most of which is identical
  across N directories.  Most.  Its the 'across dirs' part
  that involves the effort, hence my avoidance of thinking
  on it if I can help it. ;-)
  
  Thanks, STeve Andre'
  
 What is wrong with diff (-r option)?

Diff doesn't look at N directories at the same time, and I
don't think it deals with both cases: same data with different
names, and same names with different data.  It's a mess, which is
why I'm asking about a general tool for large piles of dreck.

--STeve Andre'



Re: Comparing large amounts of files

2009-12-11 Thread Frank Bax

STeve Andre' wrote:

but am trying to come up with a reasonable way
of spotting duplicates, etc.



You mean like this...

$ cp /etc/firmware/zd1211-license /tmp/XX1
$ cp /var/www/icons/dir.gif /tmp/XX2
$ fdupes /etc/firmware/ /var/www/icons/ /tmp/
/tmp/XX2
/var/www/icons/dir.gif
/var/www/icons/folder.gif

/tmp/XX1
/etc/firmware/zd1211-licence
/etc/firmware/zd1211-license

/var/www/icons/uuencoded.png
/var/www/icons/uu.png

/var/www/icons/uuencoded.gif
/var/www/icons/uu.gif

/var/www/icons/folder.png
/var/www/icons/dir.png



Re: Comparing large amounts of files

2009-12-11 Thread bofh
On 12/11/09, STeve Andre' and...@msu.edu wrote:
 I should have been more clear I suppose.  I'd like to know
 the files that are identical, files that are of the same
 name but different across directories, possibly several
 directories.

Unison is in ports.  Enjoy :)



-- 
http://www.glumbert.com/media/shift
http://www.youtube.com/watch?v=tGvHNNOLnCk
This officer's men seem to follow him merely out of idle curiosity.
-- Sandhurst officer cadet evaluation.
Securing an environment of Windows platforms from abuse - external or
internal - is akin to trying to install sprinklers in a fireworks
factory where smoking on the job is permitted.  -- Gene Spafford
learn french:  http://www.youtube.com/watch?v=30v_g83VHK4