I probably shouldn't have added that last bit. Our app isn't a DNA searcher, but
DASG+Lev does look interesting.
Our app is a linguistic application. We want to search for sentences that have many
n-grams in common and rank them by the score described below. It is similar to the
TELLTALE system (do a Google search for TELLTALE + ngrams), but we are not interested
in IR per se: we want to compute a score based on pure string similarity. Sentences
are docs, n-grams are terms.
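
For illustration, here is a minimal Python sketch of that score, treating a sentence as a set of unique n-grams. It assumes character trigrams and lowercasing, neither of which is fixed by the description here, and the function names char_ngrams and dice are made up for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of unique character n-grams in text (assumed tokenization)."""
    text = text.lower()
    # Strings shorter than n yield an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(query, doc, n=3):
    """Dice's coefficient 2C / (Q + D) over unique n-grams."""
    q = char_ngrams(query, n)
    d = char_ngrams(doc, n)
    if not q and not d:
        return 0.0  # nothing to compare
    c = len(q & d)  # C: n-grams common to query and document
    return 2.0 * c / (len(q) + len(d))
```

With bigrams, "night" and "nacht" share only "ht" out of four unique bigrams each, so dice("night", "nacht", n=2) gives 2*1/(4+4) = 0.25.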

>>> [EMAIL PROTECTED] 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look up DNA strings effectively. You would
use DASG+Lev instead (see my previous post - 05/30/2003 1916CEST).


Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input string 
>and we want to find all "fuzzy" variants of the input string in the DB. The score is
>basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in
>common, Q is the number of unique query terms and D is the number of unique document
>terms. Our documents will be sentences.
>I know Lucene has a fuzzy search capability - but I assume this would be very slow 
>since it must search through the entire term list to find candidates.
>In order to do the calculation I will need to have 'C' - the number of terms in 
>common between query and document. Is there an API that I can call to get this info? 
>Any hints on what it will take to modify Lucene to handle these kinds of queries? 
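
One way to get C without scanning every sentence is to count overlaps through an inverted index from n-grams to sentence ids. The sketch below is plain Python and does not use or describe any Lucene API. It assumes character trigrams, and build_index and search are hypothetical names chosen for this example.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the set of unique character n-grams in text (assumed tokenization)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(sentences, n=3):
    """Map each n-gram to the ids of the sentences containing it, and record D per sentence."""
    postings = defaultdict(set)
    doc_sizes = []
    for doc_id, sentence in enumerate(sentences):
        grams = char_ngrams(sentence, n)
        doc_sizes.append(len(grams))  # D: unique n-grams in this sentence
        for g in grams:
            postings[g].add(doc_id)
    return postings, doc_sizes


def search(query, postings, doc_sizes, n=3):
    """Return (score, doc_id) pairs for sentences sharing at least one n-gram, best first."""
    q_grams = char_ngrams(query, n)
    common = Counter()  # C per candidate sentence
    for g in q_grams:
        for doc_id in postings.get(g, ()):
            common[doc_id] += 1
    q = len(q_grams)  # Q: unique n-grams in the query
    scored = [(2.0 * c / (q + doc_sizes[doc_id]), doc_id)
              for doc_id, c in common.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Only sentences that share at least one n-gram with the query are ever touched, so the cost depends on the posting lists of the query's n-grams rather than on the size of the whole collection.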
