New submission from Lewis Haley:

Consider the following snippet:

import difflib

first = u'location,location,location'
for second in (
    u'location.location.location',  # two periods (no commas)
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
    u'location,location,location',  # perfect match
):
    edit_dist = difflib.SequenceMatcher(None, first, second).ratio()
    print("comparing %r vs. %r gives edit dist: %g"
          % (first, second, edit_dist))

I would expect the second and third tests to give the same result, but in 
reality:

comparing u'location,location,location' vs. u'location.location.location' gives edit dist: 0.923077
comparing u'location,location,location' vs. u'location.location,location' gives edit dist: 0.653846
comparing u'location,location,location' vs. u'location,location.location' gives edit dist: 0.961538
comparing u'location,location,location' vs. u'location,location,location' gives edit dist: 1

Python 3.4 gives the same ratios.

From experimenting, it seems that when the period comes after the first 
"location", the longest match found is the first two "locations" from the 
first string against the final two "locations" from the second string. The 
leftover "location" in each string then falls on opposite sides of that match, 
so it is never matched.

In [31]: difflib.SequenceMatcher(None, u'location,location,location', 
u'location.location,location').ratio()
Out[31]: 0.6538461538461539

In [32]: difflib.SequenceMatcher(None, u'location,location,location', 
u'location.location,location').get_matching_blocks()
Out[32]: [Match(a=0, b=9, size=17), Match(a=26, b=26, size=0)]

In [33]: difflib.SequenceMatcher(None, u'location,location,location', 
u'location,location.location').ratio()
Out[33]: 0.9615384615384616

In [34]: difflib.SequenceMatcher(None, u'location,location,location', 
u'location,location.location').get_matching_blocks()
Out[34]: 
[Match(a=0, b=0, size=17),
 Match(a=18, b=18, size=8),
 Match(a=26, b=26, size=0)]
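
The matching blocks above seem to account for both ratios. The following is only a rough sketch, assuming the documented definition of ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. The helper name ratio_from_blocks is made up for illustration and is not part of difflib.

```python
import difflib

first = u'location,location,location'
period_after_first = u'location.location,location'
period_after_second = u'location,location.location'


def ratio_from_blocks(a, b):
    # Hypothetical helper: recompute ratio() as 2.0 * M / T, where M is the
    # total size of the matching blocks and T is len(a) + len(b).
    sm = difflib.SequenceMatcher(None, a, b)
    matched = sum(block.size for block in sm.get_matching_blocks())
    return 2.0 * matched / (len(a) + len(b))


# The single longest match, searched over the full range of both strings.
sm = difflib.SequenceMatcher(None, first, period_after_first)
longest = sm.find_longest_match(0, len(first), 0, len(period_after_first))
print(longest)  # Match(a=0, b=9, size=17)

# 17 matched characters out of 52 total vs. 17 + 8 = 25 out of 52.
print(ratio_from_blocks(first, period_after_first))   # 2.0 * 17 / 52
print(ratio_from_blocks(first, period_after_second))  # 2.0 * 25 / 52
```

In the low-scoring case the 17-character match pairs a[0:17] with b[9:26], which leaves nothing to the left of it in the first string and nothing to the right of it in the second, so no further blocks can be found.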

Using `quick_ratio` instead of `ratio` gives (what I consider to be) the 
correct result.
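
For comparison, here is a small sketch of the quick_ratio check. As far as I understand the documentation, quick_ratio() is only an upper bound on ratio() that compares character counts and ignores their order, which would explain why it is not affected by where the period sits. The results list is just a name chosen for this example.

```python
import difflib

first = u'location,location,location'
results = []
for second in (
    u'location.location,location',  # period after first
    u'location,location.location',  # period after second
):
    sm = difflib.SequenceMatcher(None, first, second)
    # ratio() depends on which blocks are matched; quick_ratio() only
    # counts characters the two strings have in common.
    results.append((second, sm.ratio(), sm.quick_ratio()))

for second, ratio, quick in results:
    print("%r: ratio=%g quick_ratio=%g" % (second, ratio, quick))
```

Both strings share 25 of their 26 characters with the first string, so quick_ratio is the same for the two cases even though ratio differs.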

----------
components: Library (Lib)
files: test.py
messages: 252925
nosy: Lewis Haley
priority: normal
severity: normal
status: open
title: difflib.SequenceMatcher(...).ratio gives bad/wrong/unexpected low value with repetitious strings
versions: Python 2.7, Python 3.4
Added file: http://bugs.python.org/file40767/test.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25391>
_______________________________________