Re: [Tutor] How to identify clusters of similar files

Albert-Jan Roskam Sun, 03 Jun 2012 03:41:17 -0700



From: Steven D'Aprano <st...@pearwood.info>
>To: Python Mailing List <tutor@python.org> 
>Sent: Sunday, June 3, 2012 4:00 AM
>Subject: Re: [Tutor] How to identify clusters of similar files
> 
>Albert-Jan Roskam wrote:
>> Hi,
>> 
>> I want to use difflib to compare a lot (tens of thousands) of text files. I
>> know that many files are quite similar as they are subsequent versions of
>> the same document (a primitive kind of version control). What would be a
>> good approach to cluster the files based on their likeness?
>
>You have already identified the basic tool: difflib. But your question is not 
>really about Python, it is more about the algorithm used for clustering data 
>according to goodness of fit. That's a hard problem, and you should consider 
>asking it on the main Python mailing list or newsgroup too.
>
>Some search terms to get you started:
>
>biopython
>nltk  (the Natural Language Tool Kit)
>unrooted phylogram
>
>
>Good luck!
>
>
>-- Steven
>
Hi Steven,

Thanks! Biopython looks very interesting. While browsing I was thinking this
problem could also be considered as a probabilistic/fuzzy linkage problem
(Fellegi & Sunter). Instead of linking records, I am trying to 'link' files.

Best wishes,
Albert-Jan
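
A minimal sketch of the difflib-based idea discussed in this thread, not a
definitive method: files whose difflib similarity ratio reaches a threshold
are linked, and linked files are merged into one cluster (single linkage, via
a small union-find). The function name cluster_texts, the 0.8 threshold, and
the in-memory dict of file contents are assumptions made for illustration.
Comparing every pair is quadratic in the number of files, so for tens of
thousands of files some cheaper pre-filtering would probably be needed.

```python
import difflib
from itertools import combinations


def cluster_texts(texts, threshold=0.8):
    """Group texts whose difflib ratio is >= threshold (single linkage).

    texts maps a (hypothetical) file name to that file's content.
    Returns a sorted list of clusters, each a sorted list of names.
    """
    # Union-find forest: every name starts as its own cluster root.
    parent = {name: name for name in texts}

    def find(name):
        # Follow parent links to the root, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(texts, 2):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already linked through some other file
        matcher = difflib.SequenceMatcher(None, texts[a], texts[b],
                                          autojunk=False)
        # real_quick_ratio() and quick_ratio() are cheap upper bounds on
        # ratio(), so a pair that fails them cannot reach the threshold.
        if matcher.real_quick_ratio() < threshold:
            continue
        if matcher.quick_ratio() < threshold:
            continue
        if matcher.ratio() >= threshold:
            parent[root_a] = root_b  # link the two clusters

    clusters = {}
    for name in texts:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Made-up sample data: two versions of one document plus an unrelated one.
    sample = {
        "report_v1.txt": "the quick brown fox jumps over the lazy dog",
        "report_v2.txt": "the quick brown fox jumped over the lazy dog",
        "query.sql": "SELECT * FROM table WHERE id = 42;",
    }
    for cluster in cluster_texts(sample):
        print(cluster)
```

Because linking is transitive, a chain of successive versions ends up in one
cluster even if the first and last versions differ by more than the
threshold, which seems to suit the "subsequent versions of the same
document" case. A stricter grouping would need a different linkage rule.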

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor