Referred here from the tutor list.

> I'm trying to write a program to test someone's typing speed and show
> them their mistakes. However I'm getting weird results when looking
> for the differences in longer (than 100 chars) strings:
>
> import difflib
>
> # a tape measure string (just makes it easier to locate a given index)
> a = ('1-3-5-7-9-12-15-18-21-24-27-30-33-36-39-42-45-48-51-54-57-60-63-66-69'
>      '-72-75-78-81-84-87-90-93-96-99-103-107-111-115-119-123-127-131-135-139'
>      '-143-147-151-155-159-163-167-171-175-179-183-187-191-195--200')
>
> # now with a few mistakes
> b = ('1-3-5-7-'
>      'l-12-15-18-21-24-27-30-33-36-39o42-45-48-51-54-57-60-63-66-69-72-75-78'
>      '-81-84-8k-90-93-96-9l-103-107-111-115-119-12b-1v7-131-135-139-143-147-'
>      '151-m55-159-163-167-a71-175j179-183-187-191-195--200')
>
> s = difflib.SequenceMatcher(None, a, b)
> ms = s.get_matching_blocks()
>
> print ms
>
>>>> [(0, 0, 8), (200, 200, 0)]
>
> Have I made a mistake or is this function designed to give up when the
> input strings get too long? If so what could I use instead to compute
> the mistakes in a typed text?

---------- Forwarded message ----------
From: Evert Rol

Hi Tom,

Ok, I wasn't on the list last year, but I was a few days ago, so
persistence pays off; partly, as I don't have a full answer.

I got curious and looked at the source of difflib. There's a method
__chain_b() which sets up the b2j variable, a dict that maps each
character of string b to the list of indices where it occurs. With b
cut to 199 characters, it looks like this:
    b2j= 19 {'a': [168], 'b': [122], 'm': [152], 'k': [86], 'v':
[125], '-': [1, 3, 5, 7, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 42,
45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93,
96, 99, 103, 107, 111, 115, 119, 123, 127, 131, 135, 139, 143, 147,
151, 155, 159, 163, 167, 171, 179, 183, 187, 191, 195, 196], 'l': [8,
98], 'o': [39], 'j': [175], '1': [0, 10, 13, 16, 20, 50, 80, 100,
104, 108, 109, 110, 112, 113, 116, 117, 120, 124, 128, 130, 132, 136,
140, 144, 148, 150, 156, 160, 164, 170, 172, 176, 180, 184, 188, 190,
192], '0': [29, 59, 89, 101, 105, 198], '3': [2, 28, 31, 32, 34, 37,
62, 92, 102, 129, 133, 137, 142, 162, 182], '2': [11, 19, 22, 25, 41,
71, 121, 197], '5': [4, 14, 44, 49, 52, 55, 74, 114, 134, 149, 153,
154, 157, 174, 194], '4': [23, 40, 43, 46, 53, 83, 141, 145], '7':
[6, 26, 56, 70, 73, 76, 106, 126, 146, 166, 169, 173, 177, 186], '6':
[35, 58, 61, 64, 65, 67, 95, 161, 165], '9': [38, 68, 88, 91, 94, 97,
118, 138, 158, 178, 189, 193], '8': [17, 47, 77, 79, 82, 85, 181,
185]}

This little detour is because of how b2j is built. Here's a part from
the comments of __chain_b():

    # Before the tricks described here, __chain_b was by far the most
    # time-consuming routine in the whole module!  If anyone sees
    # Jim Roskind, thank him again for profile.py -- I never would
    # have guessed that.

And the part of the actual code reads:
         b = self.b
         n = len(b)
         self.b2j = b2j = {}
         populardict = {}
         for i, elt in enumerate(b):
             if elt in b2j:
                 indices = b2j[elt]
                 if n >= 200 and len(indices) * 100 > n:     # <--- !!
                     populardict[elt] = 1
                     del indices[:]
                 else:
                     indices.append(i)
             else:
                 b2j[elt] = [i]
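
To see which characters that test catches, here is a minimal standalone
sketch of the same loop. It is not difflib's own code, only an imitation
of the excerpt above; flag_popular(), tape_measure(), with_typos() and
TYPOS are made-up names that rebuild the 200-character string b from the
question:

```python
def tape_measure():
    """Rebuild the 200-character 'tape measure' string from the question."""
    marks = [1, 3, 5, 7, 9] + list(range(12, 100, 3)) + list(range(103, 196, 4))
    return "-".join(str(m) for m in marks) + "--200"


# The typos in string b from the question, as {index: wrong character}.
TYPOS = {8: "l", 39: "o", 86: "k", 98: "l", 122: "b",
         125: "v", 152: "m", 168: "a", 175: "j"}


def with_typos(text, typos):
    """Return a copy of text with the given characters substituted."""
    chars = list(text)
    for index, wrong in typos.items():
        chars[index] = wrong
    return "".join(chars)


def flag_popular(b):
    """Imitate the quoted loop and return the elements it marks as popular."""
    n = len(b)
    b2j = {}
    popular = set()
    for i, elt in enumerate(b):
        if elt in b2j:
            indices = b2j[elt]
            # Same test as in the excerpt: only for n >= 200, and only for
            # elements that already have more than n/100 recorded indices.
            if n >= 200 and len(indices) * 100 > n:
                popular.add(elt)
                del indices[:]
            else:
                indices.append(i)
        else:
            b2j[elt] = [i]
    return popular


b = with_typos(tape_measure(), TYPOS)
popular = flag_popular(b)
kept = set(b) - popular
print("flagged as popular:", "".join(sorted(popular)))
print("still indexed:     ", "".join(sorted(kept)))
```

For this b, the dash and every digit end up flagged, so only the
mistyped letters would stay indexed.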

So you're right: there is a cutoff at the (somewhat arbitrary) limit of
200 characters. How exactly that works, I don't know (it needs more
delving into the code), though it looks like a character also needs to
have a lot of indices (len(indices) * 100 > n, i.e. it occurs at more
than about 1% of the positions in b); I guess that's triggered in your
strings by the dashes, '1's and '0's (that's why I printed the b2j
dict).
If you feel safe enough and are on a fast platform, you can probably
raise that limit (or even expose it as an optional variable in the
code, which I would think is generally better).
I'm not sure who the author of the module is (it isn't listed in the
file itself), but perhaps you can find out and email him/her to see
what can be altered.
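
For reference, later Python releases (2.7.1 and 3.2 onward) did make
this switchable: SequenceMatcher accepts an autojunk argument, and
passing autojunk=False turns the popularity heuristic off. Below is a
rough sketch of how that could serve the typing test, assuming such a
Python version. tape_measure(), with_typos(), matched_chars() and TYPOS
are made-up helpers that just rebuild the two strings from the question
and sum up the matches:

```python
import difflib


def tape_measure():
    """Rebuild the 200-character 'tape measure' string from the question."""
    marks = [1, 3, 5, 7, 9] + list(range(12, 100, 3)) + list(range(103, 196, 4))
    return "-".join(str(m) for m in marks) + "--200"


# The typos in string b from the question, as {index: wrong character}.
TYPOS = {8: "l", 39: "o", 86: "k", 98: "l", 122: "b",
         125: "v", 152: "m", 168: "a", 175: "j"}


def with_typos(text, typos):
    """Return a copy of text with the given characters substituted."""
    chars = list(text)
    for index, wrong in typos.items():
        chars[index] = wrong
    return "".join(chars)


def matched_chars(matcher):
    """Total number of characters covered by the matching blocks."""
    return sum(block.size for block in matcher.get_matching_blocks())


a = tape_measure()
b = with_typos(a, TYPOS)

default = difflib.SequenceMatcher(None, a, b)
no_autojunk = difflib.SequenceMatcher(None, a, b, autojunk=False)

print("matched with default settings:", matched_chars(default))
print("matched with autojunk=False:  ", matched_chars(no_autojunk))

# Every opcode that is not 'equal' marks a stretch the typist got wrong.
mistakes = [(i1, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in no_autojunk.get_opcodes()
            if tag != "equal"]
for position, expected, typed in mistakes:
    print("at index %d: expected %r, typed %r" % (position, expected, typed))
```

On this input the default matcher should stop after the first 8
characters, as in the question, while the second one walks the whole
string and reports each substituted character.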

Hope that helps.

   Evert
