Hi there,

I have been recently trying to build a lucene index out of ngrams and seem to have stumbled on to a number of issues. I first tried to use the NGramTokenizer, but that thing apparently only takes the first 1024 characters to tokenize. Having searched around the web, I came upon this issue being discussed a couple of years ago and the proposed solution there has been using the NGramTokenFilter. Now that filter certainly works, but it needs an underlying tokenizer to work with, and I'm just wondering if there is a tokenizer that would return me the whole text. The reason I can't use something like the StandardTokenizer is that ngrams should really include spaces and pretty much every tokenizer gets rid of them.

Thank you very much in advance for any suggestions.

Regards,
Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to