Mike Rotondo <mroto...@gmail.com> added the comment: >From the source, it seems that there is undocumented behavior to SequenceMatcher which is causing this error. If b is longer than 200 characters, it will consider any element x in b that takes up more than 1% of it's contents as "popular", and thus junk.
So, in this case, difflib is treating each individual digit as an element of your sequences, and each one takes up more than 1% of the complete sequence b. Therefore, each one is "popular", and therefore ignored. A snippet which demonstrates this: from difflib import SequenceMatcher for i in range(1, 202)[::10]: a = "a" * i b = "b" + "a" * i s = SequenceMatcher(None, a, b) print s.find_longest_match(0, len(a), 0, len(b)) Up til i=200, the strings match, but afterwards they do not because "a" is "popular". Strangely, if you get rid of the "b" at the beginning of b, they continue to match at lengths greater than 200. This may be a bug, I'll keep looking into it but someone who knows more should probably take a look too. The comments from difflib.py say some interesting things: # b2j also does not contain entries for "popular" elements, meaning # elements that account for more than 1% of the total elements, and # when the sequence is reasonably large (>= 200 elements); this can # be viewed as an adaptive notion of semi-junk, and yields an enormous # speedup when, e.g., comparing program files with hundreds of # instances of "return NULL;" This seems to mean that you won't actually get an accurate diff in certain cases, which seems odd. At the very least, this behavior should probably be documented. Do people think it should be changed to get rid of the "popularity" heuristic? ---------- nosy: +mrotondo versions: +Python 2.7 _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue2986> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com