On Thu, 16 Nov 2006 22:59:41 -0800, John Henry wrote: > I encountered a case where I am trying to match "HIDESST1" and > "HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"] > > Well, they both hit "HIDEDST1" as the first match which is not exactly > the result I was looking for. I don't understand why "HIDESCT1" would > not hit "HIDEDCT1" as a first choice.
All those strings are so similar that I think the exact order of "best match" will depend on the precise details of the algorithm. I note that the difflib specifically says: [quote] This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people. [end quote] There are many ways that one can calculate the difference between two strings, e.g. is the string "acab" an "ab" with "ac" inserted at the front, or "ab" with "ca" inserted in the middle? The Unix diff utility may return minimal diffs, but difflib does not make that claim. You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1": HIDEDCT1 -- John's "best match" target string HIDEDST1 -- difflib's "best match" target string HIDESCT1 -- source string John's best match matches in seven of eight positions, compared to six of eight for the difflib best match. Disregarding order, both have seven matching characters. That's a pretty slim difference between the two: >>> "".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1")) ' H I D E- D+ S C T 1' >>> "".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1")) ' H I D E- D S+ C T 1' I honestly don't know how to interpret those results :( >>> s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1") >>> t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1") >>> >>> for block in s.get_matching_blocks(): ... print "a[%d] and b[%d] match for %d elements" % block ... a[0] and b[0] match for 4 elements a[5] and b[5] match for 3 elements a[8] and b[8] match for 0 elements >>> >>> for block in t.get_matching_blocks(): ... print "a[%d] and b[%d] match for %d elements" % block ... a[0] and b[0] match for 4 elements a[5] and b[4] match for 1 elements a[6] and b[6] match for 2 elements a[8] and b[8] match for 0 elements >>> I think what you are seeing is an artifact of the precise details of the pattern matcher. Feel free to subclass the SequenceMatcher or Differ classes to get the results you expect :-) -- Steven D'Aprano -- http://mail.python.org/mailman/listinfo/python-list