String similarity search vs. typcial IR application...

Jim Hargrave Fri, 06 Jun 2003 07:20:56 -0700

Our application is a string similarity searcher where the query is an input string and 
we want to find all "fuzzy" variants of the input string in the DB.  The Score is 
basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in 
common, Q is the number of unique query terms and D is the number of unique document 
terms. Our documents will be sentences.
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow 
since it must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in common 
between query and document. Is there an API that I can call to get this info? Any 
hints on what it will take to modify Lucene to handle these kinds of queries? 
 
BTW: 
Ever consider using Lucene for DNA searching? - this technique could also be used to 
search large DNA databases.
 
Thanks!
 
Jim Hargrave



------------------------------------------------------------------------------
This message may contain confidential information, and is intended only for the use of 
the individual(s) to whom it is addressed.


==============================================================================

String similarity search vs. typcial IR application...

Reply via email to