Hello again,

I would suggest you look at the Algorithm::Diff module (available on CPAN).
Its LCS function, given 2 sequences (it takes array references, so split
each string into characters first), gives you the "longest common
subsequence" of the 2.  Once you have the longest common subsequence,
you can compare its length with the length of your query string and
decide whether it meets the 80% criterion you set or not.
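
Here is a minimal sketch of that idea, assuming Algorithm::Diff is
installed from CPAN.  The helper name lcs_similarity and the two short
test strings are made up for illustration, and dividing the LCS length by
the query length is only one possible way to read "80% similar".

```perl
use strict;
use warnings;
use Algorithm::Diff qw(LCS);

# Hypothetical helper: fraction of the query string that is covered by
# the longest common subsequence it shares with the subject string.
sub lcs_similarity {
    my ($query, $subject) = @_;
    return 0 unless length $query;

    # LCS works on array references, so split each string into characters.
    my @q = split //, $query;
    my @s = split //, $subject;

    # In list context LCS returns the elements of the common subsequence.
    my @lcs = LCS(\@q, \@s);
    return scalar(@lcs) / scalar(@q);
}

# Two made-up 10-character strings that differ in a single position.
my $score = lcs_similarity('ACGTACGTAC', 'ACGTTCGTAC');
printf "similarity: %.2f\n", $score;
print "meets the 80% criterion\n" if $score >= 0.80;
```

Note that this still needs a candidate subject string to compare against,
so it does not remove the need to pick windows out of the larger string.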

I don't know anything about the efficiency of this module when it comes
to big data but give it a shot and see if it helps you.  If it's doing
what you need but is too slow, look at the code to see how it works, and
you might find it easy to reimplement in C using the Inline::C module.
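
If you do end up porting something to C, the part worth porting is the
inner loop.  As a rough starting point, here is the textbook
dynamic-programming way of computing the LCS length in plain Perl.  This
is not necessarily how Algorithm::Diff does it internally, and the name
lcs_length is my own; I chose this version only because two nested
loops over plain arrays translate to C almost line for line.

```perl
use strict;
use warnings;

# Textbook dynamic-programming LCS length.  Only two rows of the table are
# kept, so memory grows with the length of the second string, while the
# running time grows with the product of the two lengths.
sub lcs_length {
    my ($x, $y) = @_;
    my @a = split //, $x;
    my @b = split //, $y;

    # prev[j] holds the LCS length of the processed prefix of @a
    # and the first j characters of @b.
    my @prev = (0) x (scalar(@b) + 1);

    for my $i (1 .. scalar(@a)) {
        my @curr = (0);
        for my $j (1 .. scalar(@b)) {
            if ($a[$i - 1] eq $b[$j - 1]) {
                # Matching characters extend the diagonal neighbour.
                $curr[$j] = $prev[$j - 1] + 1;
            }
            else {
                # Otherwise carry over the better of the two neighbours.
                $curr[$j] = $prev[$j] > $curr[$j - 1]
                          ? $prev[$j]
                          : $curr[$j - 1];
            }
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print lcs_length('ACGTACGTAC', 'ACGTTCGTAC'), "\n";
```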

Hope this helps,,,

Aziz,,,

In article <[EMAIL PROTECTED]>, "Bob
Mangold" <[EMAIL PROTECTED]> wrote:

> Aziz,
> 
> I guess I hadn't thought about it that way, so here is more info.
> 
> What I'm basically doing is randomly pulling a string of 500 from one
> string and looking for it in another string. So I'm looking for a
> substring of the larger string that matches my query string. In terms of
> how it matches the answer and to your questions, all of the above. I
> don't care if there are insertions, deletions or just character changes,
> as long as the query string is 80% similar to the subject string.
> 
> Like I said I know I can use the module Similarity. But in order to do
> this I would need both the query and the subject string. And to get the
> subject string I would need to 'slide' down the larger string and pull
> out all combinations 1 by 1. This is very slow with a 4.5 million
> character string. I'm just looking for a way to speed things up.
> 
> BTW, if it helps at all, I'm doing genetic analysis of whole genomes,
> hence the 4.5 million long string.
> 
> -Bob
