A full-text tokenizer for the NGramTokenFilter

Martin Sat, 17 Jul 2010 13:30:26 -0700

Hi there,

I have recently been trying to build a Lucene index out of ngrams and seem to have stumbled onto a number of issues. I first tried to use the NGramTokenizer, but it apparently only tokenizes the first 1024 characters of the input. Having searched around the web, I found this issue being discussed a couple of years ago, and the solution proposed there was to use the NGramTokenFilter instead. That filter certainly works, but it needs an underlying tokenizer to work with, and I'm wondering if there is a tokenizer that would return the whole text as a single token. The reason I can't use something like the StandardTokenizer is that the ngrams should really include spaces, and pretty much every tokenizer gets rid of them.
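To make the desired output concrete, here is a minimal sketch in plain Java, without any Lucene classes, of what I mean by ngrams over the whole text with spaces kept. The class name CharNGrams and the method ngrams are made up for illustration, and the order in which the grams come out is just my choice here, not necessarily what the NGramTokenFilter produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: character n-grams over the
// entire input string, with whitespace treated like any other character.
public class CharNGrams {

    // Returns every substring of length minGram..maxGram, shortest first.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of width n across the full text; no length cap.
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Grams such as "o " and " b" span the word boundary.
        System.out.println(ngrams("to be", 2, 3));
    }
}
```

The point is that grams such as "o b" cross the word boundary, which is exactly what gets lost once a tokenizer has split the text on whitespace.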


Thank you very much in advance for any suggestions.

Regards,
Martin

