Here's a large exercise that uses what we built before.

Suppose you have tens of thousands of files in various directories. Some of these files are identical, but you don't know which ones are identical to which. Write a program that prints out which files are redundant copies.
Here's the spec.
--------------------------
The program is to be used on the command line. Its arguments are one or more full paths of directories.

perl del_dup.pl dir1

prints the full paths of all files in dir1 (including files in subdirectories) that are duplicates. More specifically, if file A has duplicates, A's full path is printed on a line, immediately followed by the full paths of all other files that are copies of A. The full paths of these duplicates are prefixed with the string "rm ". An empty line follows each group of duplicates.

Here's a sample output.

inPath/a.jpg
rm inPath/b.jpg
rm inPath/3/a.jpg
rm inPath/hh/eu.jpg

inPath/ou.jpg
rm inPath/23/a.jpg
rm inPath/hh33/eu.jpg

Order does not matter. (i.e. which file in a group does not get the "rm " prefix does not matter.)

------------------------
perl del_dup.pl dir1 dir2

does the same as above, except that duplicates within dir1 or within dir2 themselves are not considered. That is, all files in dir1 are compared to all files in dir2 (including subdirectories), and only files in dir2 will have the "rm " prefix.

One way to understand this is to imagine lots of image files in both dirs. One is certain that there are no duplicates within each dir itself. (Imagine that del_dup.pl has been run on each already.) Files in dir1 have already been sorted into subdirectories by a human. So when there are duplicates between dir1 and dir2, one wants the version in dir2 to be deleted, leaving the organization in dir1 intact.

perl del_dup.pl dir1 dir2 dir3 ...

does the same as above, except that files in later dirs get the "rm " prefix first. So, if there are these identical files:

dir2/a
dir2/b
dir4/c
dir4/d

then c and d will both have the "rm " prefix for sure. (Which one in dir2 has "rm " does not matter.)

Note: although files inside dir2 are not compared with each other, duplicates may still be found implicitly by indirect comparison. i.e. a==c, b==c, therefore a==b, even though a and b are never compared.
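To make the spec above more concrete, here is a rough Python sketch of one possible approach. It is not a reference solution, and the function names (find_files, duplicate_groups, format_group) are made up for illustration. It buckets files by size so that files of different sizes are never compared, and inside a bucket it compares each file against one representative of each known set of identical files, relying on the a==c, b==c, therefore a==b reasoning. One simplification to be aware of: it pools all directories together, so it would also report duplicates that live within a single directory. The rule that files inside the same directory are not compared with each other is not implemented here.

```python
import filecmp
import os
from collections import defaultdict


def find_files(dirs):
    """Yield (dir_index, path) for every regular file under each directory."""
    for index, top in enumerate(dirs):
        for root, _subdirs, names in os.walk(top):
            for name in names:
                path = os.path.join(root, name)
                if os.path.isfile(path):
                    yield index, path


def duplicate_groups(dirs):
    """Return lists of (dir_index, path) whose contents are byte-identical.

    Assumption: all directories are pooled; see the note above the code.
    """
    by_size = defaultdict(list)
    for index, path in find_files(dirs):
        by_size[os.path.getsize(path)].append((index, path))

    groups = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a file with a unique size cannot have a duplicate
        classes = []  # each class holds files already known to be identical
        for item in candidates:
            for cls in classes:
                # Compare against one representative per class only.
                if filecmp.cmp(cls[0][1], item[1], shallow=False):
                    cls.append(item)
                    break
            else:
                classes.append([item])  # no match: start a new class
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups


def format_group(group):
    """Keep the file from the earliest directory; prefix the rest with 'rm '."""
    ordered = sorted(group)  # sorts by dir_index first, then by path
    keep, rest = ordered[0], ordered[1:]
    return [keep[1]] + ["rm " + path for _index, path in rest]
```

Printing the lines of format_group(g) for each g in duplicate_groups(sys.argv[1:]), followed by an empty line, would give output in the shape of the sample above. Sorting by directory index is what makes copies in later directories get the "rm " prefix first.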
--------------------------
Write a Perl or Python version of the program.

An absolute requirement in this problem is to minimize the number of comparisons made between files. This is a part of the spec.

Feel free to write it however you want. I'll post my version in a few days.

http://www.xahlee.org/perl-python/python.html

 Xah
 [EMAIL PROTECTED]
 http://xahlee.org/PageTwo_dir/more.html