Hi,

I want to use difflib to compare a lot (tens of thousands) of text files. I 
know that many files are quite similar as they are subsequent versions of the 
same document (a primitive kind of version control). What would be a good 
approach to cluster the files based on their likeness? I want to be able to say 
something like: the number of files could be reduced by a factor of ten when 
the number of (near-)duplicates is taken into account.

So let's say I have ten versions of a txt file: 'file0.txt', 'file1.txt', 
'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt', 'file6.txt', 'file7.txt', 
'file8.txt', 'file9.txt'. How could I say, with some degree of certainty, that 
they are related? (I can't rely on the file names, I'm afraid.) file0 may be 
very similar to file1, but no longer to file9. Still, their likeness is 
"chained": each version resembles the one that follows it. The situation is 
easier with perfectly identical files.
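
For the perfectly identical case, here is a minimal sketch of what I have in 
mind: group files by a hash of their bytes, so that only one representative per 
group needs to go through difflib at all. The function name group_identical is 
something I made up for illustration, and SHA-256 is just one reasonable choice 
of hash.

```python
import hashlib
from collections import defaultdict


def group_identical(paths):
    """Group file paths whose bytes are exactly the same.

    Returns a list of lists; each inner list holds the paths that share
    one content digest.
    """
    by_digest = defaultdict(list)
    for p in paths:
        # Read as bytes so encodings and line endings are compared as-is.
        with open(p, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        by_digest[digest].append(p)
    return list(by_digest.values())
```

This reads every file once, so it scales linearly with the number of files, 
unlike the pairwise comparison needed for near-duplicates.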


The crude code below illustrates what I'd like to do, but it's too simplistic. 
I'd appreciate some thoughts or references to theoretical approaches to this 
kind of stuff.


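import difflib, glob, os

path = "/home/aj/Destkop/someDir"
extension = ".txt"
cut_off = 0.95

allTheFiles = sorted(glob.glob(os.path.join(path, "*" + extension)))

# Read every file once instead of re-opening it inside the nested loop.
contents = {}
for f in allTheFiles:
    with open(f) as fh:
        contents[f] = fh.readlines()

# Maps a file name to the list of files that are at least cut_off similar.
clusters = {}
for f_a in allTheFiles:
    for f_b in allTheFiles:
        if f_a != f_b:
            # The sequence elements are whole lines, so a junk test for a
            # single space would never match; pass None instead.
            likeness = difflib.SequenceMatcher(
                None, contents[f_a], contents[f_b]).ratio()
            if likeness >= cut_off:
                clusters.setdefault(f_a, []).append(f_b)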
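
To handle the "chained" likeness, one idea I am considering is to treat every 
pair above the cut-off as a link and merge linked files into one group (single 
linkage, done here with a small union-find). This is only a sketch under my own 
assumptions: the names similar and chained_clusters are hypothetical, and the 
input is assumed to be a dict mapping a file name to its list of lines. It also 
visits each unordered pair only once and tries difflib's cheaper upper bounds, 
real_quick_ratio() and quick_ratio(), before the expensive ratio().

```python
import difflib
from itertools import combinations


def similar(lines_a, lines_b, cut_off=0.95):
    """True if the two line lists reach cut_off.

    real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
    so a pair that fails either one can be rejected cheaply.
    """
    sm = difflib.SequenceMatcher(None, lines_a, lines_b)
    return (sm.real_quick_ratio() >= cut_off
            and sm.quick_ratio() >= cut_off
            and sm.ratio() >= cut_off)


def chained_clusters(contents, cut_off=0.95):
    """contents maps a name to its list of lines; returns a list of sets."""
    parent = {name: name for name in contents}

    def find(x):
        # Follow parent links to the group's root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(contents, 2):
        # Skip pairs already in one group; otherwise merge if similar.
        if find(a) != find(b) and similar(contents[a], contents[b], cut_off):
            parent[find(a)] = find(b)

    groups = {}
    for name in contents:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

With this, len(contents) / len(chained_clusters(contents)) would give the 
reduction factor I am after. The catch with single linkage is that a long chain 
can join files that are no longer similar to each other at all, and the number 
of comparisons is still quadratic in the worst case.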

 
Thank you in advance!


Regards,
Albert-Jan


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All right, but apart from the sanitation, the medicine, education, wine, public 
order, irrigation, roads, a fresh water system, and public health, what have 
the Romans ever done for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
