Referred here from the tutor list.

> I'm trying to write a program to test someone's typing speed and show
> them their mistakes. However I'm getting weird results when looking
> for the differences in longer (than 100 chars) strings:
>
> import difflib
>
> # a tape measure string (just makes it easier to locate a given index)
> a = ('1-3-5-7-9-12-15-18-21-24-27-30-33-36-39-42-45-48-51-54-57-60-63-66-69'
>      '-72-75-78-81-84-87-90-93-96-99-103-107-111-115-119-123-127-131-135-139'
>      '-143-147-151-155-159-163-167-171-175-179-183-187-191-195--200')
>
> # now with a few mistakes
> b = ('1-3-5-7-'
>      'l-12-15-18-21-24-27-30-33-36-39o42-45-48-51-54-57-60-63-66-69-72-75-78'
>      '-81-84-8k-90-93-96-9l-103-107-111-115-119-12b-1v7-131-135-139-143-147-'
>      '151-m55-159-163-167-a71-175j179-183-187-191-195--200')
>
> s = difflib.SequenceMatcher(None, a, b)
> ms = s.get_matching_blocks()
>
> print ms
> >>>> [(0, 0, 8), (200, 200, 0)]
>
> Have I made a mistake or is this function designed to give up when the
> input strings get too long? If so what could I use instead to compute
> the mistakes in a typed text?
---------- Forwarded message ----------
From: Evert Rol

Hi Tom,

OK, I wasn't on the list last year, but I was a few days ago, so
persistence pays off; partly, as I don't have a full answer.

I got curious and looked at the source of difflib. There's a method
__chain_b() which sets up the b2j variable, which contains the
occurrences of characters in string b. So cutting b to 199 characters,
it looks like this:

b2j= 19
{'a': [168], 'b': [122], 'm': [152], 'k': [86], 'v': [125],
 '-': [1, 3, 5, 7, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 42, 45, 48,
       51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96,
       99, 103, 107, 111, 115, 119, 123, 127, 131, 135, 139, 143, 147,
       151, 155, 159, 163, 167, 171, 179, 183, 187, 191, 195, 196],
 'l': [8, 98], 'o': [39], 'j': [175],
 '1': [0, 10, 13, 16, 20, 50, 80, 100, 104, 108, 109, 110, 112, 113,
       116, 117, 120, 124, 128, 130, 132, 136, 140, 144, 148, 150, 156,
       160, 164, 170, 172, 176, 180, 184, 188, 190, 192],
 '0': [29, 59, 89, 101, 105, 198],
 '3': [2, 28, 31, 32, 34, 37, 62, 92, 102, 129, 133, 137, 142, 162, 182],
 '2': [11, 19, 22, 25, 41, 71, 121, 197],
 '5': [4, 14, 44, 49, 52, 55, 74, 114, 134, 149, 153, 154, 157, 174, 194],
 '4': [23, 40, 43, 46, 53, 83, 141, 145],
 '7': [6, 26, 56, 70, 73, 76, 106, 126, 146, 166, 169, 173, 177, 186],
 '6': [35, 58, 61, 64, 65, 67, 95, 161, 165],
 '9': [38, 68, 88, 91, 94, 97, 118, 138, 158, 178, 189, 193],
 '8': [17, 47, 77, 79, 82, 85, 181, 185]}

This little detour is because of how b2j is built. Here's a part from
the comments of __chain_b():

    # Before the tricks described here, __chain_b was by far the most
    # time-consuming routine in the whole module!  If anyone sees
    # Jim Roskind, thank him again for profile.py -- I never would
    # have guessed that.

And the part of the actual code reads:

    b = self.b
    n = len(b)
    self.b2j = b2j = {}
    populardict = {}
    for i, elt in enumerate(b):
        if elt in b2j:
            indices = b2j[elt]
            if n >= 200 and len(indices) * 100 > n:    # <--- !!
                populardict[elt] = 1
                del indices[:]
            else:
                indices.append(i)
        else:
            b2j[elt] = [i]

So you're right: it has a stop at the (somewhat arbitrary) limit of 200
characters. How exactly that works, I don't know (it needs more delving
into the code), though it looks like there also need to be a lot of
indices (len(indices) * 100 > n); I guess that's caused in your strings
by the dashes, '1's and '0's (that's why I printed the b2j variable).

If you feel safe enough and are on a fast platform, you can probably up
that limit (or even make it an optional variable somewhere in the code,
which I would think is generally better).

Not sure who the author of the module is (it isn't listed in the file
itself), but perhaps you can find out and email him/her, to see what
can be altered.

Hope that helps.

  Evert

--
http://mail.python.org/mailman/listinfo/python-list
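
For readers on Python 3.2 or later, here is a rough sketch of how the effect described above can be observed and switched off without editing difflib. It assumes the documented `autojunk` keyword and `bpopular` attribute of `SequenceMatcher`. The variable names (`default`, `full`, `matched_default`, `matched_full`) are made up for illustration, and the strings are the ones from the question.

```python
import difflib

# The two 200-character strings from the original question.
a = ('1-3-5-7-9-12-15-18-21-24-27-30-33-36-39-42-45-48-51-54-57-60-63-66-69'
     '-72-75-78-81-84-87-90-93-96-99-103-107-111-115-119-123-127-131-135-139'
     '-143-147-151-155-159-163-167-171-175-179-183-187-191-195--200')
b = ('1-3-5-7-'
     'l-12-15-18-21-24-27-30-33-36-39o42-45-48-51-54-57-60-63-66-69-72-75-78'
     '-81-84-8k-90-93-96-9l-103-107-111-115-119-12b-1v7-131-135-139-143-147-'
     '151-m55-159-163-167-a71-175j179-183-187-191-195--200')

# Default behaviour: once b has 200 or more items, frequently occurring
# elements are treated as "popular" and dropped from the b2j index.
default = difflib.SequenceMatcher(None, a, b)
default_blocks = list(default.get_matching_blocks())
popular = set(default.bpopular)

# Same comparison with the heuristic disabled (assumes Python 3.2+).
full = difflib.SequenceMatcher(None, a, b, autojunk=False)
full_blocks = list(full.get_matching_blocks())

# Total number of matched characters in each run.
matched_default = sum(block.size for block in default_blocks)
matched_full = sum(block.size for block in full_blocks)

print("popular elements:", sorted(popular))
print("matched with autojunk on: ", matched_default)
print("matched with autojunk off:", matched_full)

# Report the typing mistakes as (position in a, expected, typed).
for tag, i1, i2, j1, j2 in full.get_opcodes():
    if tag != 'equal':
        print(tag, i1, repr(a[i1:i2]), '->', repr(b[j1:j2]))
```

With the default settings every digit and the dash end up in `bpopular`, so only the mistyped letters remain indexed and the matcher finds nothing beyond the leading run of 8 characters. With `autojunk=False` the whole of b stays indexed, at the cost of more work on long inputs.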