That sounds like just what I'm looking for. Do you know if this is
covered in Lucene in Action or where I can find more information about it.
Eric Isakson wrote:
You might consider using overlapping bi-gram tokenization with stripped out
whitespace and a PhraseQuery.
So your tokenized content, "spongebob squarepants", would look like:
sp po on ng ge eb bo ob bs sq qu ua ar re ep pa an nt ts
and your tokens for your query, "sponge bob", would look like
sp po on ng ge eb bo ob
Add each token to the PhraseQuery and you should match.
This is very similar to the techniques used for searching in Asian languages
which do not seperate words with spaces. There are probably some side effects
for compound words that you didn't mean to do this too, but without knowing the
exact domain of compound words that you wish to support, this is probably the
best you will be able to do.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]