Tim Peters <t...@python.org> added the comment:

Pass "autojunk=False" to your SequenceMatcher constructor and the ratio you get 
back will continue to increase as `i` increases.

The docs:

"""
Automatic junk heuristic: SequenceMatcher supports a heuristic that 
automatically treats certain sequence items as junk. The heuristic counts how 
many times each individual item appears in the sequence. If an item’s 
duplicates (after the first one) account for more than 1% of the sequence and 
the sequence is at least 200 items long, this item is marked as “popular” and 
is treated as junk for the purpose of sequence matching. This heuristic can be 
turned off by setting the autojunk argument to False when creating the 
SequenceMatcher.
"""

Note in particular the "at least 200 items long" there.  That's why you see a 
change in behavior at i == 200.  Before then, the autojunk heuristic is not 
triggered.

If we had it all to do over again, I'd make autojunk=False the default.  But - 
for backward compatibility - it's too late to change that now.

As to quick_ratio() and real_quick_ratio(), NOTHING about them is defined - or 
intended to be defined - beyond what the docs already say:

"""
quick_ratio()
Return an upper bound on ratio() relatively quickly.

real_quick_ratio()
Return an upper bound on ratio() very quickly.
"""

Nothing in the code you posted violates those promises (e.g., 0.99 is certainly 
an upper bound with respect to 0.0).  If they always returned 1.0, regardless 
of inputs, that would meet the docs' promises too.  Only the result of ratio() 
is defined.
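To illustrate the only relationship the docs do promise, here's a small check (arbitrary example strings): the quicker methods never return less than ratio() does.

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "abcde", "abxce", autojunk=False)

# real_quick_ratio() only looks at the two lengths; quick_ratio() looks
# at a multiset intersection; ratio() does the full matching work.  Each
# successively cheaper method is an upper bound on the more expensive ones.
assert sm.real_quick_ratio() >= sm.quick_ratio() >= sm.ratio()
```

Beyond that chain of inequalities, nothing about the quick variants' exact values is specified.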

As to why they exist, the docs explain that:

"""
ratio()
Return a measure of the sequences’ similarity as a float in the range [0, 1].

...

This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t 
already been called, in which case you may want to try quick_ratio() or 
real_quick_ratio() first to get an upper bound.
"""

So if the expense of calling ratio() isn't a concern in your context, there's 
no reason to call quick_ratio() or real_quick_ratio().  They exist so it's 
possible to code a cheap "early out" path in contexts where calling ratio() all 
the time may be prohibitively expensive.

For example, consider an app that has no use at all for a ratio < 0.5.  If 
quick_ratio() (or real_quick_ratio()) returns something less than 0.5, then it's 
guaranteed that ratio() would too, since the quicker versions return upper 
bounds on what ratio() would return.
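That early-out pattern might be sketched like this (the function name and cutoff are illustrative, not anything in difflib):

```python
from difflib import SequenceMatcher

def similar_enough(a, b, cutoff=0.5):
    """Return True iff ratio() >= cutoff, trying cheap upper bounds first."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    # Cheapest bound first: based only on the lengths of a and b.
    if sm.real_quick_ratio() < cutoff:
        return False
    # Still cheap: based on a multiset intersection, no real matching.
    if sm.quick_ratio() < cutoff:
        return False
    # Only now pay for the full quadratic-ish matching work.
    return sm.ratio() >= cutoff
```

In a loop comparing one string against many candidates, most non-matches are rejected by the two cheap tests, and ratio() runs only for the plausible ones.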

This is valuable where it matters, and completely useless where it doesn't 
matter ;-)

In any case, since `autojunk=False` repairs the behavior you pointed out, I'm 
closing this as Won't-Fix.

----------
resolution:  -> wont fix
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue31889>
_______________________________________