Re: Comparing large amounts of files
Sorry for the additional post, but it's a fun problem when you consider
the messed up file/path names that could be created by a raw database
dump. I made some improvements to the example script to make sure
messed up file/path names are handled correctly.

#!/bin/ksh

# There are plenty of caveats to be aware of to make sure you process
# strings (paths and file names) correctly.
#
# The first is that the built-in ksh 'echo' will strip backslashes by
# default, unless used with the '-E' switch. This is the opposite of
# bash, where backslashes are not stripped by default, but the '-e'
# switch forces stripping. The easy answer is to just use printf(1),
# rather than worry about whether you got the 'echo' correct.
#
# Another caveat is that when using 'while read', you need the '-r'
# flag to prevent stripping of backslashes.
#
# Also, quoting your variables is highly recommended. This prevents a
# number of issues, including odd behavior due to spaces in strings.

# --- test setup ---

# clean up
rm -rf d1 d2 d3
rm -rf d\ 4
rm -rf 'd 5'
rm -rf ./-d
rm -rf 'x\\ 2'
rm -rf '*x'
rm -rf 'xx'
rm -f all.txt unq.txt dup.txt

# We need some seriously messed up directory names.
mkdir d1 d2 d3 d\ 4 'd 5' ./-d 'x\\ 2' '*x' 'xx'

# And we need some seriously messed up file names in each directory.
for i in `ls -1 ../0[1,2,3,4,5].png`; do
	cp "$i" ./d1/`printf %s "$i" | sed -e 's/0/1/' -e 's/^..\///'`
	cp "$i" ./d2/`printf %s "$i" | sed -e 's/0/2/' -e 's/^..\///'`
	cp "$i" ./d3/`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`
	cp "$i" ./d\ 4/`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`
	cp "$i" './d 5/'`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`
	cp "$i" ./-d/`printf %s "$i" | sed -e 's/0/-/' -e 's/^..\///'`
	cp "$i" './x\\ 2/'`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`
	cp "$i" './*x/'`printf %s "$i" | sed -e 's/0/3/' -e 's/^..\///'`
	cp "$i" './xx/'`printf %s "$i" | sed -e 's/0/?/' -e 's/^..\///'`
done

# --- Actually Useful ---

# Find all the files, and generate a cksum for each.
find ./d1 ./d2 ./d3 ./d\ 4 './d 5' ./-d './x\\ 2' './*x' './xx' \
	-type f -print0 | xargs -0 cksum -r -a sha256 > all.txt

# For the sake of your sanity, you want the leading ./ or, better,
# the fully qualified path to the directories where you want to
# run find(1). This will save you from a lot of possible mistakes
# caused by messed up directory and/or file names.

# After you have a cksum hash for every file, you want to make sure
# your list is sorted, or else other commands will fail, since they
# typically expect lists to be pre-sorted.
sort -k 1,1 -o all.txt all.txt

# Generate a list of unique files based on the hash.
sort -k 1,1 -u -o unq.txt all.txt

# To get a list of just the duplicates, use comm(1).
comm -2 -3 all.txt unq.txt > dup.txt

# --- Possibly Useful ---

# Sure, once you have your list of duplicates, you're pretty well set,
# assuming you're not afraid of hash collisions. If the data is very
# valuable, and you don't want to risk a hash collision causing you to
# delete something important, you need to use cmp(1) to make sure the
# files really are *identical* duplicates.

# Set the IFS to a newline so we can grab entire lines.
IFS='
'
for UNQ_LINE in `cat unq.txt`; do
	# Grab the hash from the line.
	UNQ_HASH=`printf %s "$UNQ_LINE" | sed -E -e 's/ .+$//'`
	# Grab the full path and file name.
	UNQ_FILE=`printf %s "$UNQ_LINE" | sed -E -e 's/^[a-f0-9]{64} //'`
	printf '\n'
	printf 'UNQ_HASH: %s\n' "$UNQ_HASH"
	printf 'UNQ_FILE: %s\n' "$UNQ_FILE"
	# Use the look(1) command to find matching hashes in the duplicates.
	for DUP_LINE in `look "$UNQ_HASH" dup.txt`; do
		DUP_HASH=`printf %s "$DUP_LINE" | sed -E -e 's/ .+$//'`
		DUP_FILE=`printf %s "$DUP_LINE" | sed -E -e 's/^[a-f0-9]{64} //'`
		printf 'DUP_HASH: %s\n' "$DUP_HASH"
		printf 'DUP_FILE: %s\n' "$DUP_FILE"
		if cmp -s "$UNQ_FILE" "$DUP_FILE"; then
			printf 'Binary Compare Matches\n'
			# rm "$DUP_FILE"
		else
			printf 'ERROR: Hash Collision\n'
			exit 1
		fi
	done
done

# Another way to do it, without using sed.

# Reset the IFS back to the default.
unset IFS

cat unq.txt | while read -r UNQ_HASH UNQ_FILE; do
	printf '\n'
	printf 'UNQ_HASH: %s\n' "$UNQ_HASH"
	printf 'UNQ_FILE: %s\n' "$UNQ_FILE"
	look "$UNQ_HASH" dup.txt | while read -r DUP_HASH DUP_FILE; do
		printf 'DUP_HASH: %s\n' "$DUP_HASH"
		printf 'DUP_FILE: %s\n' "$DUP_FILE"
		if cmp -s "$UNQ_FILE" "$DUP_FILE"; then
			printf 'Binary Compare Matches\n'
			# rm "$DUP_FILE"
		else
			printf 'ERROR: Hash Collision\n'
			exit 1
		fi
	done
done

exit 0

-- 
J.C. Roberts
Re: Comparing large amounts of files
On Fri, 11 Dec 2009 18:24:24 -0500 STeve Andre' and...@msu.edu wrote:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid wheel re-creation when
> possible. I'm trying to help someone with large piles of data, most
> of which is identical across N directories. Most. It's the 'across
> dirs' part that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)
>
> Thanks, STeve Andre'

Sorry for the late reply STeve, but this might still be helpful for
you, or possibly others through the list archives. Though it doesn't
come up too often, you've asked a fairly classic question. Most people
resort to finding some pre-made tool for this sort of thing, simply
because they don't know that everything they need is already in the
base OpenBSD operating system.

Considering your earlier post about moving *huge* files (greater than
10 GiB), the method of file comparison you use could be really taxing
on the system. Our cksum(1) command gives you plenty of possible
trade-offs between accuracy (uniqueness) and speed for the various
checksum (hash) algorithms. Assuming you don't want to accidentally
delete the wrong data, SHA256 might be a good choice, but there is
still the potential for hash collisions. Using two or more hashes
(such as the once typical combination of MD5 and SHA256) only
*reduces* the chance of collision, rather than *eliminating* it. If
you absolutely must be certain you're not mistakenly deleting
something due to a hash collision, then you need to use the cmp(1)
command before deleting anything.

Since you're dealing with a somewhat raw dump of a database, you could
have some potentially dangerous directory or file names mixed into the
resulting dump. Handling these properly is painful, requires some
thinking, and I may not have it completely correct.
#!/bin/ksh

# Create a test/ directory.
mkdir ./test

# Copy some random image files into our new test/ directory.
cp ../0[1,2,3,4,5].png test/.

# $ ls -lF test/
# total 1696
# -rw-r--r--  1 jcr  users  126473 Dec 28 00:02 01.png
# -rw-r--r--  1 jcr  users  157483 Dec 28 00:02 02.png
# -rw-r--r--  1 jcr  users  209387 Dec 28 00:02 03.png
# -rw-r--r--  1 jcr  users  163780 Dec 28 00:02 04.png
# -rw-r--r--  1 jcr  users  188546 Dec 28 00:02 05.png

# Now we need some messed up directory names.
mkdir d1 d2 d3 d\ 4 'd 5' ./-d

# And we need some messed up file names in each directory.
for i in test/*.png; do
	cp "$i" ./d1/`echo $i | sed -e 's/0/1/' -e 's/^test\///'`
	cp "$i" ./d2/`echo $i | sed -e 's/0/2/' -e 's/^test\///'`
	cp "$i" ./d3/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`
	cp "$i" ./d\ 4/`echo $i | sed -e 's/0/3/' -e 's/^test\///'`
	cp "$i" './d 5/'`echo $i | sed -e 's/0/3/' -e 's/^test\///'`
	cp "$i" ./-d/`echo $i | sed -e 's/0/-/' -e 's/^test\///'`
done

# Find all the files, and generate a cksum for each.
find ./d1 ./d2 ./d3 ./d\ 4 './d 5' ./-d -type f -print0 | \
	xargs -0 cksum -r -a sha256 > all.txt

# For the sake of your sanity, you want the leading ./ or, better,
# the fully qualified path to the directories where you want to
# run find(1). This will save you from a lot of possible mistakes
# caused by screwed up directory and/or file names.

# After you have a cksum hash for every file, you want to make sure
# your list is sorted, or else other commands will fail, since they
# typically expect lists to be pre-sorted.
sort -k 1,1 -o all.txt all.txt

# Generate a list of unique files based on the hash.
sort -k 1,1 -u -o unq.txt all.txt

# To get a list of just the duplicates, use comm(1).
comm -2 -3 all.txt unq.txt > dup.txt

# Sure, once you have your list of duplicates, you're pretty well set,
# assuming you're not afraid of hash collisions. If the data is very
# valuable, and you don't want to risk a hash collision causing you to
# delete something important, you need to use cmp(1) to make sure the
# files really are *identical* duplicates.

# The `while read VAR; do ...; done < file.txt` construct fails when
# a backslash followed by a space is present in the input file. In
# essence it seems to be escaping the space, and thereby dropping the
# leading backslash, i.e.
#       ./d\ 4/file.png
#
# Since we can't trust `while read ...` to do the right thing, we have
# to resort to resetting the Internal Field Separator (IFS) to a new
# line, so we can grab entire lines.
IFS='
'

# As for doing a binary compare with cmp(1), it's fairly simple now.
for UNQ_LINE in `cat unq.txt`; do
	# Grab the hash from the line.
	UNQ_HASH=`echo $UNQ_LINE | sed -E -e 's/ .+$//'`
	# Grab the full path and file name.
	UNQ_FILE=`echo $UNQ_LINE | sed -E -e 's/^[a-f0-9]{64} //'`
	# Use the look(1) command to find matching hashes in the duplicates,
	# or you could use grep(1) or whatever else fits your fancy.
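The cmp(1) safeguard described in the script's comments can be sketched in isolation. The file names below are invented for illustration; the point is simply that cmp exits 0 only when two files are byte-for-byte identical, so a pair flagged by matching hashes can be double-checked before anything is deleted:

```shell
# A hash match is only a hint; cmp(1) proves the bytes really agree.
tmp=$(mktemp -d)
printf 'same bytes\n'  > "$tmp/a"
printf 'same bytes\n'  > "$tmp/b"   # true duplicate of a
printf 'other bytes\n' > "$tmp/c"   # different content

if cmp -s "$tmp/a" "$tmp/b"; then
	result_ab=identical             # safe to delete one of the pair
else
	result_ab=different
fi

if cmp -s "$tmp/a" "$tmp/c"; then
	result_ac=identical
else
	result_ac=different             # a hash match here would be a collision
fi
```

The -s flag suppresses cmp's normal output so only the exit status is used, which is all a script needs.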
Re: Comparing large amounts of files
On Mon, 28 Dec 2009 18:40:57 -0800 J.C. Roberts
list-...@designtools.org wrote:

> # The `while read VAR; do ...; done < file.txt` construct fails when
> # a backslash followed by a space is present in the input file. In
> # essence it seems to be escaping the space, and thereby dropping the
> # leading backslash, i.e.
> #       ./d\ 4/file.png

The above is because I'm an idiot and forgot the '-r' flag for 'read'.

-- 
J.C. Roberts
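The difference the '-r' flag makes can be shown in a couple of lines (standard POSIX sh read(1) behavior; the path is the same made-up example from the script):

```shell
# Without -r, read treats backslash as an escape character and drops it.
line='./d\ 4/file.png'

mangled=$(printf '%s\n' "$line" | { read v; printf '%s' "$v"; })
intact=$(printf '%s\n' "$line" | { read -r v; printf '%s' "$v"; })
# mangled is now "./d 4/file.png" (the backslash was eaten)
# intact is still "./d\ 4/file.png"
```

Note the read runs inside the pipeline's subshell, which is why the result is captured via command substitution rather than read directly into a variable.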
Re: Comparing large amounts of files
On Saturday, December 12, 2009, Andy Hayward a...@buteo.org wrote:

> On Fri, Dec 11, 2009 at 23:24, STeve Andre' and...@msu.edu wrote:
>> I am wondering if there is a port or otherwise available code which
>> is good at comparing large numbers of files in an arbitrary number
>> of directories? I always try to avoid wheel re-creation when
>> possible. I'm trying to help someone with large piles of data, most
>> of which is identical across N directories. Most. It's the 'across
>> dirs' part that involves the effort, hence my avoidance of thinking
>> on it if I can help it. ;-)
>
> sysutils/fdupes
>
> -- ach

If you have a database available you can store file hashes and use
SQL. I used postgres for the job and had reasonable performance on a
10 million file collection. I stored directory paths in one table and
the filename, size, and sha1 in another table. Scripting the table
creation was fairly easy...

-N
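A minimal sketch of the same idea using sqlite3(1) in place of postgres (assuming sqlite3 is installed; the one-table layout and file names here are invented, and real paths would need proper SQL quoting):

```shell
# Store one "hash, path" row per file, then let SQL find the duplicates.
tmp=$(mktemp -d)
printf 'alpha\n' > "$tmp/one.txt"
printf 'alpha\n' > "$tmp/two.txt"   # duplicate of one.txt
printf 'beta\n'  > "$tmp/three.txt"

db="$tmp/files.db"
sqlite3 "$db" 'CREATE TABLE files (hash TEXT, path TEXT);'

# md5sum prints "hash  path" per line, much like OpenBSD's "md5 -r".
find "$tmp" -type f -name '*.txt' -print0 | xargs -0 md5sum |
while read -r hash path; do
	sqlite3 "$db" "INSERT INTO files VALUES ('$hash', '$path');"
done

# Any hash held by more than one path is a duplicate group.
dup_groups=$(sqlite3 "$db" 'SELECT count(*) FROM
	(SELECT hash FROM files GROUP BY hash HAVING count(*) > 1);')
```

With millions of rows you would batch the inserts and add an index on the hash column, which is presumably what made the postgres approach above perform well.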
Re: Comparing large amounts of files
On 12/12/2009, at 4:22 PM, Frank Bax wrote:

> STeve Andre' wrote:
>> but am trying to come up with a reasonable way of spotting
>> duplicates, etc.
>
> You mean like this...
>
> $ cp /etc/firmware/zd1211-license /tmp/XX1
> $ cp /var/www/icons/dir.gif /tmp/XX2
> $ fdupes /etc/firmware/ /var/www/icons/ /tmp/
> /tmp/XX2
> /var/www/icons/dir.gif
> /var/www/icons/folder.gif
> [snip]

When comparing a very large number of files, the -1 flag to fdupes
makes it much easier to manage the output (IMO). Each line printed is
then a list of duplicate files. But it depends on your own needs.
After seeing what you're trying to do, I'd say fdupes is a perfect
fit.

paulm
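With -1, each line fdupes prints is one duplicate set, so selecting "everything but the first file" becomes a one-line awk job. The input line below is canned to stand in for real fdupes output, and note this style of parsing breaks on file names that contain spaces:

```shell
# Keep the first file of a duplicate set; list the rest for deletion.
fdupes_line='keep.png copy1.png copy2.png'   # simulated "fdupes -1" output

extras=$(printf '%s\n' "$fdupes_line" |
	awk '{ for (i = 2; i <= NF; i++) print $i }')
# extras now holds copy1.png and copy2.png, one per line
```

The same loop would normally read `fdupes -1 -r dir | while read -r ...` line by line instead of a canned variable.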
Re: Comparing large amounts of files
On 11 December 2009, STeve Andre' and...@msu.edu wrote:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid wheel re-creation when
> possible. I'm trying to help someone with large piles of data, most
> of which is identical across N directories. Most. It's the 'across
> dirs' part that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)

Try this tiny Perl script: http://hqbox.org/files/fdupe.pl

It's still faster than all of its competitors I'm aware of (most of
them written in C). :)

Regards,

Liviu Daia

-- 
Dr. Liviu Daia http://www.imar.ro/~daia
Re: Comparing large amounts of files
On Fri, Dec 11, 2009 at 23:24, STeve Andre' and...@msu.edu wrote:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid wheel re-creation when
> possible. I'm trying to help someone with large piles of data, most
> of which is identical across N directories. Most. It's the 'across
> dirs' part that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)

sysutils/fdupes

-- ach
Comparing large amounts of files
I am wondering if there is a port or otherwise available code which is
good at comparing large numbers of files in an arbitrary number of
directories? I always try to avoid wheel re-creation when possible.
I'm trying to help someone with large piles of data, most of which is
identical across N directories. Most. It's the 'across dirs' part that
involves the effort, hence my avoidance of thinking on it if I can
help it. ;-)

Thanks, STeve Andre'
Re: Comparing large amounts of files
2009/12/12 STeve Andre' and...@msu.edu:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid

Try rsync if you just want to know which files differ.

Best
Martin
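As a hedged sketch of that suggestion (standard rsync flags, invented directory names): a dry run with checksum comparison lists the files whose contents differ between two trees, without copying anything.

```shell
# -r recurse, -c compare file contents by checksum, -n dry run only.
tmp=$(mktemp -d)
mkdir "$tmp/src" "$tmp/dst"
printf 'same\n'     > "$tmp/src/a.txt"
printf 'same\n'     > "$tmp/dst/a.txt"   # identical; should not be listed
printf 'original\n' > "$tmp/src/b.txt"
printf 'changed\n'  > "$tmp/dst/b.txt"   # differs; should be listed

changed=$(rsync -rcn --out-format='%n' "$tmp/src/" "$tmp/dst/")
```

Note rsync answers "which files differ between these two trees", not "which files are duplicates of each other", so it fits the two-tree comparison case rather than N-way dedup.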
Re: Comparing large amounts of files
STeve Andre' wrote:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid wheel re-creation when
> possible. I'm trying to help someone with large piles of data, most
> of which is identical across N directories. Most. It's the 'across
> dirs' part that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)
>
> Thanks, STeve Andre'

Compare how?
Re: Comparing large amounts of files
On Fri, Dec 11, 2009 at 06:24:24PM -0500, STeve Andre' wrote:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid wheel re-creation when
> possible. I'm trying to help someone with large piles of data, most
> of which is identical across N directories. Most. It's the 'across
> dirs' part that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)
>
> Thanks, STeve Andre'

What is wrong with diff (-r option)?
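For exactly two directory trees, that looks like the sketch below (file names invented): -q reports only which files differ, -r recurses, and "Only in" lines flag files present on one side only.

```shell
# diff -rq: a quick report of differing and one-sided files.
tmp=$(mktemp -d)
mkdir "$tmp/d1" "$tmp/d2"
printf 'x\n' > "$tmp/d1/same.txt"
printf 'x\n' > "$tmp/d2/same.txt"
printf 'a\n' > "$tmp/d1/diff.txt"
printf 'b\n' > "$tmp/d2/diff.txt"
printf 'z\n' > "$tmp/d1/only-here.txt"

# diff exits non-zero when the trees differ, so guard it under set -e.
report=$(diff -rq "$tmp/d1" "$tmp/d2" || true)
```

As the follow-ups note, this only ever compares two trees at a time and pairs files by name, so it can't spot same-content files under different names.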
Re: Comparing large amounts of files
diff(1), if you want to compare specific files or dirs, or fdupes for
searching for arbitrary files in arbitrary locations.

paulm

On 12/12/2009, at 12:24 PM, STeve Andre' wrote:

> I am wondering if there is a port or otherwise available code which
> is good at comparing large numbers of files in an arbitrary number
> of directories? I always try to avoid wheel re-creation when
> possible. I'm trying to help someone with large piles of data, most
> of which is identical across N directories. Most. It's the 'across
> dirs' part that involves the effort, hence my avoidance of thinking
> on it if I can help it. ;-)
>
> Thanks, STeve Andre'
Re: Comparing large amounts of files
On Friday 11 December 2009 18:36:33 Noah Pugsley wrote:

> STeve Andre' wrote:
>> I am wondering if there is a port or otherwise available code which
>> is good at comparing large numbers of files in an arbitrary number
>> of directories? I always try to avoid wheel re-creation when
>> possible. I'm trying to help someone with large piles of data, most
>> of which is identical across N directories. Most. It's the 'across
>> dirs' part that involves the effort, hence my avoidance of thinking
>> on it if I can help it. ;-)
>>
>> Thanks, STeve Andre'
>
> Compare how?

I should have been more clear, I suppose. I'd like to know the files
that are identical, and the files that are of the same name but
different across directories, possibly several directories.

What I have is a large clump of data in the form of some huge number
of relatively small files, which were extracted out of a database as
individual files. I am not responsible for this(!) but am trying to
come up with a reasonable way of spotting duplicates, etc. Some files
have the same name (and even some with the same size) but are
different.

It's a mess, but the original database died and all I have are pieces,
kind of like shards from a large piece of pottery that just got
smashed. I'm not even sure what all the data looks like at this
point--I can only assume it's going to be ugly; no thought was given
to this when the files were created.

--STeve Andre'
Re: Comparing large amounts of files
Hi,

...on Fri, Dec 11, 2009 at 06:52:09PM -0500, STeve Andre' wrote:

>> Compare how?
>
> I should have been more clear I suppose. I'd like to know the files
> that are identical, files that are of the same name but different
> across directories, possibly several directories.

Maybe you could use something like this in the directory you're
looking at:

find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums

You could now just sort the md5sums file to find all entries with the
same md5... Or sort by filename (this will need some more logic if
files are distributed over several subdirectories) to weed out those
with the same name and different checksums.

Alex.
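Under a GNU userland, the "sort the md5sums file" step can be carried one step further. This is a sketch only: it swaps in md5sum for OpenBSD's "md5 -r" (same hash-then-path output shape) and uses GNU-only uniq flags.

```shell
# List every file whose checksum occurs more than once.
tmp=$(mktemp -d)
printf 'payload\n' > "$tmp/a.txt"
printf 'payload\n' > "$tmp/b.txt"   # duplicate of a.txt
printf 'unique\n'  > "$tmp/c.txt"

# md5sum prints "hash  path"; -w32 tells uniq to compare only the
# 32-character hash, and -D prints all members of each repeated group.
dups=$(find "$tmp" -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D)
```

Sorting first is what makes uniq work at all, the same point the longer ksh script makes before handing the lists to comm(1).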
Re: Comparing large amounts of files
On Friday 11 December 2009 20:31:54 Alexander Bochmann wrote:

> ...on Fri, Dec 11, 2009 at 06:52:09PM -0500, STeve Andre' wrote:
>>> Compare how?
>>
>> I should have been more clear I suppose. I'd like to know the files
>> that are identical, files that are of the same name but different
>> across directories, possibly several directories.
>
> Maybe you could use something like this in the directory you're
> looking at:
>
> find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums
>
> You could now just sort the md5sums file to find all entries with
> the same md5... Or sort by filename (will need some more logic if
> files are distributed over several subdirectories) to weed out those
> with the same name and different checksums.
>
> Alex.

Yup! I'm doing that for single directories now. But the added logic
for N dirs was something I hoped to avoid.

--STeve
Re: Comparing large amounts of files
On Fri, Dec 11, 2009 at 8:31 PM, Alexander Bochmann
a...@lists.gxis.de wrote:

> find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums
>
> You could now just sort the md5sums file to find all entries with
> the same md5... Or sort by filename (will need some more logic if
> files are distributed over several subdirectories) to weed out those
> with the same name and different checksums.
>
> Alex.

I do something similar, but more elaborate, using Python to back up
redundant pics scattered in various folders into one folder... it
would need to be modified for name clashes:

import hashlib
import os
import shutil

already = []
dst = os.getcwd()
paths = ["/usr/local", "/home", "/storage"]
for p in paths:
    for root, dirs, files in os.walk(p):
        for f in files:
            m = hashlib.md5()
            # Get file extension
            ext = os.path.splitext(os.path.join(root, f))[1]
            try:
                # Copy JPG files
                if ext.lower() == ".jpg":
                    fp = open(os.path.join(root, f), 'rb')
                    data = fp.read()
                    fp.close()
                    m.update(data)
                    if m.hexdigest() not in already:
                        already.append(m.hexdigest())
                        print "Copying", os.path.join(root, f)
                        shutil.copyfile(os.path.join(root, f),
                                        os.path.join(dst, f))
                    else:
                        print "Already Copied!!!"
            ...
Re: Comparing large amounts of files
On Sat, Dec 12, 2009 at 02:31:54AM +0100, Alexander Bochmann wrote:

> Hi,
>
> ...on Fri, Dec 11, 2009 at 06:52:09PM -0500, STeve Andre' wrote:
>>> Compare how?
>>
>> I should have been more clear I suppose. I'd like to know the files
>> that are identical, files that are of the same name but different
>> across directories, possibly several directories.
>
> Maybe you could use something like this in the directory you're
> looking at:
>
> find . -type f -print0 | xargs -0 -r -n 100 md5 -r > md5sums
>
> You could now just sort the md5sums file to find

or | to sort -u -k ... smizzizlezlackin' awesomenesss

> all entries with the same md5... Or sort by filename (will need some
> more logic if files are distributed over several subdirectories) to
> weed out those with the same name and different checksums.
>
> Alex.
Re: Comparing large amounts of files
On Friday 11 December 2009 19:11:18 anonymous wrote:

> On Fri, Dec 11, 2009 at 06:24:24PM -0500, STeve Andre' wrote:
>> I am wondering if there is a port or otherwise available code which
>> is good at comparing large numbers of files in an arbitrary number
>> of directories? I always try to avoid wheel re-creation when
>> possible. I'm trying to help someone with large piles of data, most
>> of which is identical across N directories. Most. It's the 'across
>> dirs' part that involves the effort, hence my avoidance of thinking
>> on it if I can help it. ;-)
>>
>> Thanks, STeve Andre'
>
> What is wrong with diff (-r option)?

Diff doesn't look at N directories at the same time, and I don't think
it deals with both same data/different names and same names/different
data. It's a mess, which is why I'm asking about a general tool for
large piles of dreck.

--STeve Andre'
Re: Comparing large amounts of files
STeve Andre' wrote:

> but am trying to come up with a reasonable way of spotting
> duplicates, etc.

You mean like this...

$ cp /etc/firmware/zd1211-license /tmp/XX1
$ cp /var/www/icons/dir.gif /tmp/XX2
$ fdupes /etc/firmware/ /var/www/icons/ /tmp/
/tmp/XX2
/var/www/icons/dir.gif
/var/www/icons/folder.gif

/tmp/XX1
/etc/firmware/zd1211-licence
/etc/firmware/zd1211-license

/var/www/icons/uuencoded.png
/var/www/icons/uu.png

/var/www/icons/uuencoded.gif
/var/www/icons/uu.gif

/var/www/icons/folder.png
/var/www/icons/dir.png
Re: Comparing large amounts of files
On 12/11/09, STeve Andre' and...@msu.edu wrote:

> I should have been more clear I suppose. I'd like to know the files
> that are identical, files that are of the same name but different
> across directories, possibly several directories.

Unison is in ports. Enjoy :)

-- 
http://www.glumbert.com/media/shift
http://www.youtube.com/watch?v=tGvHNNOLnCk

"This officer's men seem to follow him merely out of idle curiosity."
-- Sandhurst officer cadet evaluation.

"Securing an environment of Windows platforms from abuse - external or
internal - is akin to trying to install sprinklers in a fireworks
factory where smoking on the job is permitted." -- Gene Spafford

learn french: http://www.youtube.com/watch?v=30v_g83VHK4