Re: diff 20 files that are mostly equal

Bob Proulx Tue, 14 Apr 2009 09:29:20 -0700

avilella wrote:
> I would like to compare ~20 files that are mostly the same, but some
> of them have 2-3 different lines in a couple of places. I can do a
> diff for every pair, but I would like to have one representation for
> all files that is a consensus file then with extra tagged lines for
> the differences. Is there any tool that does that? What would people
> recommend?


I don't know of any tool that does that directly.  And I think
diff'ing every pair would generate a lot of messy output, since 20
files make 190 pairs.

What I tend to do in those situations is run md5sum (or any of the
other *sum utilities) over the entire list of files and then sort by
checksum.  Identical files have identical checksums, so they end up
grouped together, and files that differ are listed apart from them.
Piping just the checksums through 'uniq -c' counts how many files are
in each group.  Sorting that output numerically then puts the most
common version at the top and the rarer variants below it.

  $ md5sum ./* | sort -k1,1
  118721e880107e6bac4d8b6f42c472d4  ./5
  118721e880107e6bac4d8b6f42c472d4  ./6
  29c450ee7a45cf7aa4e8ebe165925fd5  ./7
  3e234925eeb1b48960dcbf43050f4b23  ./1
  3e234925eeb1b48960dcbf43050f4b23  ./2
  3e234925eeb1b48960dcbf43050f4b23  ./3
  3e234925eeb1b48960dcbf43050f4b23  ./4

  $ md5sum ./* | sort -k1,1 | awk '{print$1}' | uniq -c
  2 118721e880107e6bac4d8b6f42c472d4
  1 29c450ee7a45cf7aa4e8ebe165925fd5
  4 3e234925eeb1b48960dcbf43050f4b23

  $ md5sum ./* | sort -k1,1 | awk '{print$1}' | uniq -c | sort -nr
  4 3e234925eeb1b48960dcbf43050f4b23
  2 118721e880107e6bac4d8b6f42c472d4
  1 29c450ee7a45cf7aa4e8ebe165925fd5
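
Once the groups are known, it is only necessary to diff one file from
each minority group against one file from the largest group.  The
following is a rough sketch of that idea, not a polished tool.  The
function name diff_against_majority is made up for illustration, and
the sketch assumes a directory holding only regular files whose names
contain no whitespace or backslashes.

```shell
# Sketch only: diff one representative of each checksum group against
# the most common version.  Assumes file names without whitespace or
# backslashes.  The body runs in a subshell so its variables stay local.
diff_against_majority() (
    dir=${1:-.}

    # Checksum and name of every file, grouped by checksum.
    sums=$(md5sum "$dir"/* | sort -k1,1)

    # Checksum shared by the largest number of files.
    top=$(printf '%s\n' "$sums" | awk '{print $1}' | uniq -c |
          sort -nr | awk 'NR == 1 {print $2}')

    # First file with that checksum serves as the reference copy.
    ref=$(printf '%s\n' "$sums" |
          awk -v t="$top" '$1 == t {print $2; exit}')

    # First file of every other checksum group, diffed against the
    # reference copy.
    printf '%s\n' "$sums" |
    awk -v t="$top" '$1 != t && !seen[$1]++ {print $2}' |
    while read -r f; do
        echo "== $f vs $ref"
        diff -u "$ref" "$f"
    done
)
```

Calling 'diff_against_majority .' in the directory from the listing
above would compare ./5 and ./7 against ./1 and skip the duplicates,
so there are two diffs to read instead of one for every pair.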

Perhaps something like that might be useful for you as well?

Bob
