On Fri, 11 Dec 2009 18:24:24 -0500 "STeve Andre'" <[email protected]>
wrote:
> I am wondering if there is a port or otherwise available
> code which is good at comparing large numbers of files in
> an arbitrary number of directories? I always try avoid
> wheel re-creation when possible. I'm trying to help some-
> one with large piles of data, most of which is identical
> across N directories. Most. Its the 'across dirs' part
> that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)
>
> Thanks, STeve Andre'
>
Sorry for the late reply, STeve, but this might still be helpful for
you, or possibly for others through the list archives. Though it
doesn't come up too often, you've asked a fairly classic question.
Most people resort to finding some pre-made "tool" for this sort of
thing, simply because they don't know that everything they need is
already in the base OpenBSD operating system.
Considering your earlier post about moving *huge* files (greater than
10 GiB), the method of file comparison you use could be really taxing
on the system. Our cksum(1) command gives you plenty of possible
trade-offs between accuracy (uniqueness) and speed for its various
checksum (hash) algorithms.
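If you want a rough feel for that trade-off on your own data, a quick
timing test along these lines should do; the file path below is just a
placeholder for one of your big files:

# Rough speed comparison of a few cksum(1) algorithms on one big file.
# /path/to/bigfile is a placeholder; point it at one of your own files.
for a in md5 sha1 sha256; do
echo "=== $a ==="
time cksum -a "$a" /path/to/bigfile
done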
Assuming you don't want to accidentally delete the wrong data, SHA256
might be a good choice, but there is still the potential for hash
collisions. Using two or more hashes (such as the once typical
combination of MD5 and SHA256) only *reduces* the chance of collision
rather than *eliminating* it. If you absolutely must be certain you're
not mistakenly deleting something due to a hash collision, then you
need to use the cmp(1) command before deleting anything.
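For example, a bare-bones cmp(1) check looks something like the
following; the two file names are only hypothetical placeholders:

# cmp -s is silent and exits 0 only when the two files are
# byte-for-byte identical.
if cmp -s /path/to/file_a /path/to/file_b; then
echo "identical"
else
echo "different (or one of the files is missing)"
fi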
Since you're dealing with a somewhat raw dump of a database, you could
have some potentially "dangerous" directory or file names mixed into
the resulting dump. Handling these properly is painful, requires some
thinking, and I may not have it completely correct.
********************************************************************
#!/bin/ksh
# Create a test/ directory
mkdir "./test"
# Copy some random image files into our new test/ directory
cp ../0[12345].png "test/."
# $ ls -lF test/
# total 1696
# -rw-r--r-- 1 jcr users 126473 Dec 28 00:02 01.png
# -rw-r--r-- 1 jcr users 157483 Dec 28 00:02 02.png
# -rw-r--r-- 1 jcr users 209387 Dec 28 00:02 03.png
# -rw-r--r-- 1 jcr users 163780 Dec 28 00:02 04.png
# -rw-r--r-- 1 jcr users 188546 Dec 28 00:02 05.png
# Now we need some messed up directory names.
mkdir d1 d2 d3 "d\ 4" "d 5" "./-d"
# And we need some messed up file names in each directory.
for i in test/*.png; do
cp "$i" "./d1/`echo $i | sed -e 's/0/1/' -e 's/^test\///'`";
cp "$i" "./d2/`echo $i | sed -e 's/0/2/' -e 's/^test\///'`";
cp "$i" "./d3/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`";
cp "$i" "./d\ 4/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`";
cp "$i" "./d 5/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`";
cp "$i" "./-d/`echo $i | sed -e 's/0/-/' -e 's/^test\///'`";
done;
# Find all the files, and generate cksum for each
find ./d1 ./d2 ./d3 "./d\ 4/" "./d 5/" "./-d/" -type f -print0 | \
xargs -0 cksum -r -a sha256 >all.txt
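# Note: the -r flag makes cksum print each result as "<hash> <file>"
# instead of the default "SHA256 (<file>) = <hash>" form, which makes
# the output far easier to sort and to parse with sed(1) later on.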
# For the sake of your sanity, you want the leading "./" or better,
# the fully qualified path to the directories where you want to
# run find(1). This will save you from a lot of possible mistakes
# caused by screwed up directory and/or file names.
# After you have a cksum hash for every file, you want to make sure
# your list is sorted, or else other commands will fail since they
# typically expect lists to be pre-sorted.
sort -k 1,1 -o all.txt all.txt
# Generate a list of unique files based on the hash
sort -k 1,1 -u -o unq.txt all.txt
# To get a list of just the duplicates using comm(1)
comm -2 -3 all.txt unq.txt >dup.txt
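# At this point every line of all.txt is "<hash> <path>", unq.txt
# keeps one line per distinct hash, and dup.txt holds the leftover
# lines whose hashes already appear in unq.txt, i.e. the candidate
# duplicates.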
# Sure, once you have your list of duplicates, you're pretty well set
# assuming you're not afraid of hash collisions. If the data is very
# valuable, and you don't want to risk a hash collision causing you to
# delete something important, you need to use cmp(1) to make sure the
# files really are *identical* duplicates.
# The `while read VAR; do ...; done < file.txt;` construct fails when
# a backslash followed by a space is present in the input file. In
# essence, read treats the backslash as an escape for the space, and
# thereby drops the leading backslash, e.g.
# ./d\ 4/file.png
#
# Since we can't trust `while read ...` to do the right thing, we have
# to resort to resetting the Internal Field Separator (IFS) to a new
# line, so we can grab entire lines.
IFS='
'
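# With IFS set to a lone newline, word splitting happens only at line
# boundaries, so paths containing spaces survive the loops below
# intact.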
# As for doing a binary compare with cmp(1), it's fairly simple now.
for UNQ_LINE in `cat unq.txt`; do
# Grab the hash from the line.
UNQ_HASH=`echo "$UNQ_LINE" | sed -E -e 's/ .+$//'`;
# Grab the full path and file name.
UNQ_FILE=`echo "$UNQ_LINE" | sed -E -e 's/^[a-f0-9]{64} //'`;
# use the look(1) command to find matching hashes in the duplicates
# or you could use grep(1) or whatever else fits your fancy.
for DUP_LINE in `look $UNQ_HASH dup.txt`; do
DUP_HASH=`echo "$DUP_LINE" | sed -E -e 's/ .+$//'`;
DUP_FILE=`echo "$DUP_LINE" | sed -E -e 's/^[a-f0-9]{64} //'`;
echo ""
echo "UNQ_LINE: " $UNQ_LINE
echo "DUP_LINE: " $DUP_LINE
# Do a binary compare on the two files; cmp -s is silent and only
# returns an exit status.
if cmp -s "$UNQ_FILE" "$DUP_FILE"; then
printf "They Match\n\n";
# rm "$DUP_FILE"
else
printf "Hash Collision\n\n";
exit 1
fi
done;
done;
exit 0
********************************************************************
Yep, the above is overly verbose and includes creating directories and
files for testing, but it explains a lot more than a one-liner. Now
that I think of it, I doubt I could do it correctly in a one-liner.
All you really need from the above are the find(1), the two sort(1),
and the comm(1) lines to generate the needed lists of files (all,
unique, duplicates). You didn't really say what you're after (moving
unique files, deleting duplicates, or whatever).
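Boiled down, and with the directory names as placeholders for wherever
your data actually lives, that working core is just:

find /path/to/dir1 /path/to/dir2 -type f -print0 | \
xargs -0 cksum -r -a sha256 >all.txt
sort -k 1,1 -o all.txt all.txt
sort -k 1,1 -u -o unq.txt all.txt
comm -2 -3 all.txt unq.txt >dup.txt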
If you're dealing with humongous files like the 10GiB monsters you
previously mentioned, you might want to use a "lighter" hash algorithm
with cksum (sum? sysvsum?) to increase speed since you're doing a binary
compare at the end to make certain you don't mistakenly delete anything
due to a hash collision.
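As a rough sketch of what I mean (same placeholder paths as above),
the only change is the -a argument to cksum, plus adjusting the {64}
in the two sed patterns to the new digest length (32 hex characters
for MD5):

find /path/to/dir1 /path/to/dir2 -type f -print0 | \
xargs -0 cksum -r -a md5 >all.txt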
There's a certain degree of "fun" in figuring out how to do things with
just the base install. :-)
--
J.C. Roberts