Re: Partial token matches

Paul . Illingworth Thu, 27 Apr 2006 01:06:07 -0700



Another approach may be to use n-grams. Index each word (here "informat")
as follows:

2 gram field
in  nf  fo  or rm  ma  at

3 gram field
inf  nfo  for  orm  rma  mat

4 gram field
info  nfor  form  orma  rmat

and so on.

To search for the term "form", simply search the 4 gram field for it.
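
A minimal sketch of the n-gram scheme described above, in plain Java with no
Lucene dependency. The class name NGramSketch, the method names, and the
map-of-lists "fields" layout are hypothetical and only illustrate the idea;
a real index would store each gram size in its own Lucene field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NGramSketch {
    /** Returns every contiguous substring of length n, left to right. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Builds one list of grams per gram size (hypothetical stand-in for index fields). */
    static Map<Integer, List<String>> gramFields(String word, int minN, int maxN) {
        Map<Integer, List<String>> fields = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            fields.put(n, ngrams(word, n));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> fields = gramFields("informat", 2, 4);
        String query = "form";
        // Search the field whose gram size equals the length of the query.
        boolean hit = fields.get(query.length()).contains(query);
        System.out.println(fields.get(4) + " contains " + query + ": " + hit);
    }
}
```

Note that this sketch only answers queries whose length matches one of the
indexed gram sizes, so the range of sizes has to cover the partial keywords
users are expected to type.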

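The prefix query approach may hit the "too many clauses" exception if the
index contains many terms beginning with the prefix ("form" in your
example), because the prefix is expanded into one clause per matching term.
The n-gram approach avoids this, since each search is a single term lookup,
but it will obviously produce a much bigger index.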
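
To make the clause-count concern concrete, here is a rough sketch of how a
prefix expands against a sorted term dictionary. The dictionary contents and
the class PrefixExpansionSketch are made up for illustration. As far as I
know, Lucene's BooleanQuery limits the number of clauses (1024 by default),
and a prefix that expands past the limit raises the exception.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

final class PrefixExpansionSketch {
    /** Returns the terms that start with prefix; a prefix query needs one clause per term. */
    static SortedSet<String> expand(TreeSet<String> terms, String prefix) {
        // Every string starting with prefix sorts in [prefix, prefix + '\uffff').
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Hypothetical term dictionary, for illustration only.
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "form", "formal", "format", "former", "formula", "fork", "fort"));
        SortedSet<String> clauses = expand(terms, "form");
        System.out.println("form* expands to " + clauses.size() + " clauses: " + clauses);
        // With a 4 gram field the same search is one exact term lookup instead.
    }
}
```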

Regards

Paul I.



From: "Eric Isakson" <[EMAIL PROTECTED].com>
To: <[email protected]>
Date: 26/04/2006 17:20
Subject: Partial token matches
Please respond to: [EMAIL PROTECTED] apache.org




Hi All,

Just wanted to throw out something I'm working on. It is working well for
me, but I wanted to see if anyone can suggest any other alternatives that
might perform better than what I'm doing now.

I have a field in my index that contains keywords (back-of-the-book index
terms) and a UI feature that lets the user find documents containing a
partial keyword. For example, a document in my index might have the token
"informat" in the keywords field; if the user supplies "form" in the UI,
that document should match.

My old implementation does not use Lucene; it just calls String.matches
with a regular expression that looks like ".*form.*". I reimplemented it
using Lucene, tokenizing the field so that for "informat" I get the tokens:

informat
nformat
format
ormat
rmat
mat
at
t

Then I use a prefix query to find hits. Both implementations ignore case in
the search, and the hit order is controlled by another field that I'm
sorting on, so relevance ranking is not important in this use case. Search
time performance is crucial; index creation time and index size are not
really important. The index is created statically at application startup,
or possibly delivered to the application, and is not updated while the
application is using it.
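
The suffix-token scheme described above can be sketched in plain Java as
follows. This is only an illustration under assumptions: a TreeSet stands in
for the Lucene term dictionary, and the class SuffixPrefixSketch and its
methods are hypothetical names, not part of the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

final class SuffixPrefixSketch {
    /** Returns every suffix of the keyword, lower-cased: informat, nformat, ..., t. */
    static List<String> suffixes(String keyword) {
        String k = keyword.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < k.length(); i++) {
            out.add(k.substring(i));
        }
        return out;
    }

    /** True if any indexed suffix starts with the lower-cased partial keyword. */
    static boolean matches(TreeSet<String> suffixIndex, String partial) {
        String p = partial.toLowerCase(Locale.ROOT);
        // ceiling(p) is the smallest token >= p; if any token starts with p, this one does.
        String candidate = suffixIndex.ceiling(p);
        return candidate != null && candidate.startsWith(p);
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(suffixes("informat"));
        // The prefix lookup should agree with the old regular-expression check.
        System.out.println(matches(index, "FORM"));
        System.out.println("informat".matches(".*form.*"));
    }
}
```

Because every substring of a keyword is a prefix of one of its suffixes, a
prefix lookup over the suffix tokens gives the same answer as the ".*form.*"
regular expression, without scanning every keyword.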

Thanks for any suggestions,
Eric

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



