loose matching with regex

2001-07-28 Thread Bob Mangold
Hello, I'm working on a program where I am searching for a short string within a longer string. The catch is that the long string is about 4.5 million chars long and the short string is about 500. Using a regex to do an exact match is simple, but what if I want just a close match, like 80% or

Re: loose matching with regex

2001-07-28 Thread Abdulaziz Ghuloum
Hello, I don't have a direct answer for your question since your question is a little bit ambiguous; let me explain: Do you want to search for a substring in a long string, or do you want a true regexp match? If you want a true regexp match, then the question is even more ambiguous. For example,

Re: (MORE INFO) loose matching with regex

2001-07-28 Thread Bob Mangold
Aziz, I guess I hadn't thought about it that way, so here is more info. What I'm basically doing is randomly pulling a string of 500 characters from one string and looking for it in another string. So I'm looking for a substring of the larger string that matches my query string. In terms of how it matches

Re: (MORE INFO) loose matching with regex

2001-07-28 Thread Abdulaziz Ghuloum
Hello again, I would suggest you look at the Algorithm::Diff module (available on CPAN). The function LCS, given 2 sequences (array references, so each string is first split into characters), gives you the longest common subsequence between the 2. Once you have the longest common subsequence, you can probably decide whether it meets the 80% criterion you set or
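
To make the LCS idea concrete, here is a minimal sketch in Python (not Perl, and not the Algorithm::Diff implementation). The names `lcs_length` and `meets_threshold` are made up for illustration, and measuring the LCS length against the query length is only one assumed reading of the 80% criterion.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming, O(len(a) * len(b)) time, keeping
    only one previous row of the table in memory.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Characters agree: extend the subsequence found so far.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def meets_threshold(query: str, candidate: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: is the LCS at least `threshold` of the query length?"""
    if not query:
        return False
    return lcs_length(query, candidate) / len(query) >= threshold


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
    print(meets_threshold("abcde", "abcdx"))
```

The table costs time proportional to the product of the two lengths, so this is practical for comparing a 500-character query with one candidate substring, not for comparing it with the whole 4.5 million character string at once.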

Re: (MORE INFO) loose matching with regex

2001-07-28 Thread Abdulaziz Ghuloum
Hello again, I have no background in genetic analysis, but it looks like there is a lot of effort going into the Bio:: modules. There is a module called Bio::SeqFeature::Similarity that might be doing just what you want. But then again, it may not :-) Hope this helps,,, Aziz,,, In article

Re: (MORE INFO) loose matching with regex

2001-07-28 Thread Me
Like I said, I know I can use the module Similarity. But in order to do this I would need both the query and the subject string. And to get the subject string I would need to 'slide' down the larger string and pull out all combinations one by one. This is very slow with a 4.5 million character
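
The 'slide' approach described above can be sketched roughly as follows, in Python rather than Perl. It does not use the Similarity module. The function name `best_window` is hypothetical, and scoring a window by counting positions where the characters agree is an assumption about what a "close match" means here.

```python
def best_window(query: str, subject: str) -> tuple[int, int]:
    """Slide `query` along `subject`, one offset at a time.

    Returns (offset, matches) for the window that agrees with the
    query in the most positions, or (-1, -1) if the subject is
    shorter than the query.
    """
    m = len(query)
    best_offset, best_matches = -1, -1
    for offset in range(len(subject) - m + 1):
        window = subject[offset:offset + m]
        # Count the positions where query and window have the same character.
        matches = sum(1 for q, s in zip(query, window) if q == s)
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


if __name__ == "__main__":
    offset, matches = best_window("ACGT", "TTACGATT")
    print(offset, matches, matches / len("ACGT"))
```

Every offset costs one pass over the query, so a 500-character query against 4.5 million characters means on the order of 2 billion character comparisons, which is why the naive version is slow.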

Re: (MORE INFO) loose matching with regex

2001-07-28 Thread Me
Search for 'efghmnop' in 'abcdefghijklmnopqrstuvwxyzabcdefghmnop'. Take the last letter of the searched-for substring, p. Pick a possible substring endpoint in the large string. This starts out at an offset from the beginning of the large string. The offset is the length of the
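
The message is cut off here, so the following is not necessarily the poster's algorithm. It is a sketch, in Python, of the well-known Boyer-Moore-Horspool technique, which likewise works from candidate endpoints in the large string and skips ahead based on the character found there. It finds exact matches only, and the name `horspool_find` is made up for this example.

```python
def horspool_find(needle: str, haystack: str) -> int:
    """Return the index of the first exact occurrence of needle, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For every character except the needle's last one, record how far
    # its last occurrence is from the end of the needle.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    end = m - 1  # candidate endpoint in the haystack
    while end < n:
        start = end - m + 1
        if haystack[start:end + 1] == needle:
            return start
        # Skip ahead based on the character at the current endpoint;
        # characters not in the table allow a full-length jump.
        end += shift.get(haystack[end], m)
    return -1


if __name__ == "__main__":
    text = "abcdefghijklmnopqrstuvwxyzabcdefghmnop"
    print(horspool_find("efghmnop", text))
```

On the example strings from the message, the endpoint moves 7, 11, 19, 27, 35, 37, so only six windows are inspected before the match at index 30.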