Rob, look at the third hit: http://www.lucenebook.com/search?query=bi-grams
Otis

----- Original Message ----
From: Rob Young <[EMAIL PROTECTED]>

> That sounds like just what I'm looking for. Do you know if this is
> covered in Lucene in Action or where I can find more information about it?

Eric Isakson wrote:
>You might consider using overlapping bi-gram tokenization with stripped out
>whitespace and a PhraseQuery.
>
>So your tokenized content, "spongebob squarepants", would look like:
>
>sp po on ng ge eb bo ob bs sq qu ua ar re ep pa an nt ts
>
>and your tokens for your query, "sponge bob", would look like
>
>sp po on ng ge eb bo ob
>
>Add each token to the PhraseQuery and you should match.
>
>This is very similar to the techniques used for searching in Asian languages
>which do not separate words with spaces. There are probably some side effects
>for compound words that you didn't mean to do this to, but without knowing
>the exact domain of compound words that you wish to support, this is probably
>the best you will be able to do.
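
For anyone who wants to see the bi-gram idea in action before wiring it into an analyzer, here is a minimal sketch in plain Python rather than Lucene. The helper names `bigrams` and `phrase_match` are hypothetical, lowercasing the input is an added assumption, and the in-memory contiguous-run check only stands in for what a PhraseQuery with no slop would do against indexed token positions.

```python
def bigrams(text):
    """Return overlapping character bi-grams of text, whitespace stripped.

    Lowercasing is an assumption added here; the original suggestion
    only mentions stripping whitespace.
    """
    stripped = "".join(text.lower().split())
    return [stripped[i:i + 2] for i in range(len(stripped) - 1)]


def phrase_match(query_tokens, content_tokens):
    """True if query_tokens appear as one contiguous run in content_tokens.

    This is a stand-in for an exact phrase match over token positions,
    not Lucene's actual implementation.
    """
    n = len(query_tokens)
    if n == 0:
        return False
    return any(
        content_tokens[i:i + n] == query_tokens
        for i in range(len(content_tokens) - n + 1)
    )


content = bigrams("spongebob squarepants")
query = bigrams("sponge bob")

print(" ".join(content))  # sp po on ng ge eb bo ob bs sq qu ua ar re ep pa an nt ts
print(" ".join(query))    # sp po on ng ge eb bo ob
print(phrase_match(query, content))  # True
```

Because the whitespace is removed before tokenizing, "sponge bob" and "spongebob" produce the same token run, which is why the phrase match succeeds. The same property is the source of the side effect mentioned above: any query whose letters happen to run together the same way as the content will also match.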
