On 2009-02-05 02:20, Nick Matzke wrote:
> Hi all,
>
> So I have an interesting challenge. I want to compare two book
> chapters, which I have in plain text format, and find out (a) percentage
> similarity and (b) what has changed.
>
> Some features make this problem different than what seems to be the
> standard text-matching problem solvable with e.g. difflib. Here is what
> I mean:
>
> * there is no guarantee that single lines from each file will be
> directly comparable -- e.g., if a few words are inserted into a
> sentence, then a chunk of the sentence will be moved to the next line,
> then a chunk of that line moved to the next, etc.
>
> * Also, there are cases where paragraphs have been moved around,
> sections re-ordered, etc. So it can't just be a "linear" match.
>
> I imagine this kind of thing can't be all that hard in the grand scheme
> of things, but I couldn't find an easily applicable solution readily
> available. I have advanced beginner python skills but am not quite
> where I could do this kind of thing from scratch without some guidance
> about the likely functions, libraries etc. to use.
>
> PS: I am going to have to do this for multiple book chapters so various
> software packages, e.g. for windows, are not really usable.
>
> Any help is much appreciated!!

difflib is in the Python stdlib and provides many ways to implement
difference detection:

http://docs.python.org/library/difflib.html

Here's a script that I use for diff'ing text files on a word
basis, called tdiff.py:

http://downloads.egenix.com/python/tdiff.py

It helps a lot with text that gets word wrapped or reformatted.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Feb 05 2009)
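
To illustrate the word-based idea, here is a minimal sketch using only difflib.SequenceMatcher from the stdlib. It is not the contents of tdiff.py; the function names (words, word_similarity, word_changes) are made up for this example. Splitting on whitespace before comparing means re-wrapped lines no longer register as changes, which covers the first bullet in the question. It does not cover the second: SequenceMatcher aligns the two sequences in order, so a moved paragraph will most likely show up as a delete in one place and an insert in another.

```python
import difflib


def words(text):
    # Split on any run of whitespace, so line breaks and re-wrapping
    # have no effect on the comparison.
    return text.split()


def word_similarity(old_text, new_text):
    # ratio() returns a float in [0, 1]; 1.0 means the two word
    # sequences are identical. autojunk=False switches off the
    # popularity heuristic, which can skew results on long texts.
    matcher = difflib.SequenceMatcher(
        None, words(old_text), words(new_text), autojunk=False
    )
    return matcher.ratio()


def word_changes(old_text, new_text):
    # Return (tag, old_words, new_words) for every non-equal region,
    # where tag is 'replace', 'delete' or 'insert'.
    a, b = words(old_text), words(new_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


if __name__ == "__main__":
    old = "The quick brown fox\njumps over the lazy dog."
    new = "The quick brown fox jumps\nover the very lazy dog."
    print("similarity: %.1f%%" % (100 * word_similarity(old, new)))
    for change in word_changes(old, new):
        print(change)
```

Multiply the ratio by 100 to get the percentage similarity asked for in (a); the list from word_changes answers (b). For whole chapters, read each file into a string and pass the two strings in.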