> Like I said I know I can use the module Similarity. But in
> order to do this I would need bot the query and the subject
> string. And to get the subject string I would need to 'slide'
> down the larger string and pull out all combinations 1 by 1.
> This is very slow with a 4.5 million character string. I'm just
> looking for a way to speed things up.

You might want to use a variant of Boyer-Moore's algorithm.

Dunno how best BM could be fitted to your case, but I am
pretty sure some variant would work, and could, in principal
at least, give you something like a 100 to, what, 10,000?,
fold speed up.

BM basically has you match the substrings backwards and
take advantage of what you know when this fails. Some
variants add a twist that focuses on elements of the
substring that are (likely to be) infrequent in the large
string.

Here's a simplified version of BM that ignores the fuzzy
aspects of your search:

    search for 'efghmnop'
    in 'abcdefghijklmnopqrstuvwxyzabcdefghmnop'

Take the last letter of the searched for substring, p.

Pick a possible substring endpoint in the large string.
This starts out at an offset from the beginning of the
large string. The offset is the length of the substring.

So, in the above case, you start by comparing p and h.

That fails. Next, given that there is no l in the substring,
you move the end point in the large string forward to the
length of the substring past the l. Then, because you
also know that that is followed by mnop, and the first
4 characters of the substring are not mnop, you move
the endpoint forward by another 4 characters.

Now you compare p and x. That fails. There's no x in
the substring, so compare p and f. That fails. Now, f
is the second character in the substring. So move the
possible end point forward by the length of the substring
minus 2.

Now compare p and p, then all the way backwards for
a match.

Modifications of BM do things like instead of picking
the last character, picking a character that is weighted
according to how far along the substring it is and how
infrequently it occurs in the large string.


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to