New submission from Lewis Haley:

Consider the following snippet:

import difflib

first = u'location,location,location'

for second in (
    u'location.location.location',  # two periods (no commas)
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
    u'location,location,location',  # perfect match
):
    edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
    print("comparing %r vs. %r gives edit dist: %g" % (first, second, edit_dist))

I would expect the second and third tests to give the same result, but in reality:

comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077
comparing u'location,location,location' vs. u'location.location,location' gives edit dist: 0.653846
comparing u'location,location,location' vs. u'location,location.location' gives edit dist: 0.961538
comparing u'location,location,location' vs. u'location,location,location' gives edit dist: 1

Python 3.4 gives the same results.

From experimenting, it seems that when the period comes after the first
"location", the longest match found is the first two "locations" of the
first string against the final two "locations" of the second string.

In [31]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').ratio()
Out[31]: 0.6538461538461539

In [32]: difflib.SequenceMatcher(None, u'location,location,location', u'location.location,location').get_matching_blocks()
Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

In [33]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').ratio()
Out[33]: 0.9615384615384616

In [34]: difflib.SequenceMatcher(None, u'location,location,location', u'location,location.location').get_matching_blocks()
Out[34]: [Match(a=0, b=0, size=17), Match(a=18, b=18, size=8), Match(a=26, b=26, size=0)]

Using `quick_ratio` instead of `ratio` gives (what I consider to be) the correct result.
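
To show where 0.653846 comes from, here is a minimal sketch that recomputes the ratio by hand from the matching blocks. It relies on the documented definition of `ratio()` as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The variable names (`matcher`, `matched`, `total`, `manual_ratio`) are only illustrative and are not part of the attached test.py.

```python
import difflib

first = "location,location,location"
second = "location.location,location"  # period after the first "location"

matcher = difflib.SequenceMatcher(None, first, second)
blocks = matcher.get_matching_blocks()

# M: total size of all matching blocks (the final block is a size-0 sentinel).
matched = sum(block.size for block in blocks)
# T: combined length of both sequences.
total = len(first) + len(second)

# ratio() is documented as 2.0 * M / T.
manual_ratio = 2.0 * matched / total

print(blocks)                  # one 17-character block at a=0, b=9, plus the sentinel
print(matched, total)          # 17 52
print(manual_ratio)            # about 0.653846, same as matcher.ratio()
print(matcher.quick_ratio())   # about 0.961538, an upper bound on ratio()
```

Only 17 of the 26 characters end up matched because the single longest block aligns a[0:17] with b[9:26], which leaves nothing on either side for the remaining "location" to match against. `quick_ratio` only compares how often each character occurs and ignores order, so it counts 25 common characters.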

----------
components: Library (Lib)
files: test.py
messages: 252925
nosy: Lewis Haley
priority: normal
severity: normal
status: open
title: difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitous strings
versions: Python 2.7, Python 3.4
Added file: http://bugs.python.org/file40767/test.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25391>
_______________________________________