Hi again. You can find additional info regarding this Bigram index here: http://asbjorn.fellinghaug.com/blog/master-thesis/
The source code was available, from the same site but it has disappeared. However, it can be downloaded from the computer science department at NTNU in Norway: http://daim.idi.ntnu.no/show.php?type=vedlegg&id=3429 Hope this helps. Hayes, Peter: > Thanks for your input. I will try and apply your suggestion. > > Thanks, > Peter > > -----Original Message----- > From: Asbjørn A. Fellinghaug [mailto:asbj...@fellinghaug.com] > Sent: Thursday, January 15, 2009 3:25 AM > To: java-user@lucene.apache.org > Subject: Re: Google finance-like suggestible search field > > > Hi. > > Such 'autocompletion' features with Lucene could be provided with n-gram > tokenizers, as Erick states. I made a 'Bigram' analyzer for my master > thesis, when I was doing some research on how to enhance phrase > searching. This Analyzer considers pair of words as single terms. > > Basically, what the Bigram analyzer does is to index stopwords combined > with the "previous" word, and with the "next" word. Single stopwords > would not be indexed, as they demand a lot of resources during searches. > Only combination of prev+stopword and stopword+nextword would be > indexed. This saves a lot during searching. > > Consider this sentence: "fetch me a beer honey" (where 'a' and 'me' is > stopwords). The Bigram analyzer would index these 'Tokens': > 'fetch', 'fetch me', 'me a', 'a beer', 'honey'. > > Erick Erickson: > > You could look at the n-gram tokenizers (I confess I haven't used them > > so I'm not all *that* familiar with them). Or you could make a rule like > > "no autocomplete until the user types 3 characters" if that would work. > > > > Instead of forming a query, you might try using TermEnum, or > > WildCardTermEnum > > or even RegexTermEnum to quickly get the list of terms for your > > autocomplete. The > > nice part about this approach is that you could quit after a suitable number > > of > > terms were found rather than get them all. As I remember, WildCardTermEnum > > is > > faster than RegexTermEnum, but don't hold me to that. So I'd try > > WildCardTermEnum > > first, I think you'll find it much more suitable than forming > > > > Best > > Erick > > -- > Asbjørn A. Fellinghaug > asbj...@fellinghaug.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Asbjørn A. Fellinghaug asbj...@fellinghaug.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org