[issue2986] difflib.SequenceMatcher not matching long sequences

Terry J. Reedy Wed, 01 Sep 2010 14:37:36 -0700

Terry J. Reedy <[email protected]> added the comment:

While refactoring the code for 2.7, I discovered that the description of the 
heuristic in the 2.6 doc and in the code comments is off by 1. "items that 
appear more than 1% of the time" should actually be "items whose duplicates 
(after the first) appear more than 1% of the time". The discrepancy arises 
because in the following code


        for i, elt in enumerate(b):
            if elt in b2j:
                indices = b2j[elt]
                if n >= 200 and len(indices) * 100 > n:
                    populardict[elt] = 1
                    del indices[:]
                else:
                    indices.append(i)
            else:
                b2j[elt] = [i]

len(indices) is retrieved *before* the index i of the current elt is added, 
so the test sees one occurrence fewer than have actually been encountered. 
Whatever one might think the heuristic 'should' have been (and by the nature of 
heuristics, there is no right answer), the default behavior must remain as it 
is, so we adjusted the code and doc to match the existing behavior.
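
To make the off-by-1 concrete, here is a minimal sketch that wraps the same 
loop in a standalone function. The names popular_elements and make_sequence 
are hypothetical helpers written for this illustration, not part of difflib, 
and a set stands in for populardict. With n = 200, an element that appears 3 
times makes up 1.5% of the sequence, yet it is not flagged; it takes a 4th 
occurrence (3 duplicates after the first) to cross the threshold.

```python
def popular_elements(b):
    """Return the elements the loop quoted above would mark as popular.

    Hypothetical helper for illustration only; it mirrors the quoted
    loop but is not the difflib implementation itself.
    """
    n = len(b)
    b2j = {}
    popular = set()  # stands in for populardict
    for i, elt in enumerate(b):
        if elt in b2j:
            indices = b2j[elt]
            # len(indices) does not yet include the current index i,
            # so it counts one occurrence fewer than seen so far.
            if n >= 200 and len(indices) * 100 > n:
                popular.add(elt)
                del indices[:]
            else:
                indices.append(i)
        else:
            b2j[elt] = [i]
    return popular


def make_sequence(copies, n=200):
    """Build a length-n list: `copies` copies of "x", the rest unique ints."""
    return ["x"] * copies + list(range(n - copies))


print(popular_elements(make_sequence(3)))  # set()
print(popular_elements(make_sequence(4)))  # {'x'}
```

In other words, under these assumptions an element is flagged only when 
(count - 1) * 100 > n, rather than count * 100 > n as the old description 
suggested.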

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue2986>
_______________________________________