Hey,

Thanks for the tips. The folks on the java-user list pointed me towards KeywordTokenizer, which returns the entire input as a single token (not a very intuitive name in my opinion, but anyway). I may still need to extend it for some customizations, so I'll look into the PythonAnalyzer samples.
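
For the archives, here's roughly what I've ended up with. It's only a minimal sketch, assuming PyLucene 3.x with its flat lucene module; the NGramAnalyzer name and the 2/3 gram sizes are just my choices, not anything official:

    import lucene
    lucene.initVM()
    from lucene import (PythonAnalyzer, KeywordTokenizer,
                        NGramTokenFilter, TermAttribute, StringReader)

    class NGramAnalyzer(PythonAnalyzer):
        # hypothetical name; KeywordTokenizer emits the whole input as
        # one token, so the ngrams below can span whitespace
        def tokenStream(self, fieldName, reader):
            return NGramTokenFilter(KeywordTokenizer(reader), 2, 3)

    stream = NGramAnalyzer().tokenStream("content", StringReader("a b"))
    term = stream.addAttribute(TermAttribute.class_)
    while stream.incrementToken():
        print(term.term())    # expected: 'a ', ' b', 'a b'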

Thanks again,
Martin

On Jul 17, 2010, at 22:30, Andi Vajda <va...@apache.org> wrote:


On Jul 17, 2010, at 22:23, Martin <mar...@webscio.net> wrote:

Hi there,

I'm trying to extend the PythonTokenizer class to build my own custom tokenizer, but I get stuck pretty much right away. I know I'm supposed to override the incrementToken() method, but what exactly am I dealing with in there, and what should it return? My goal is to construct a tokenizer that returns very large tokens, maybe sentences or even the whole content. The reason I need this is that NGramTokenFilter needs a TokenStream to run on, but every other tokenizer strips whitespace from the text... and I need ngrams that span over spaces :(
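
For what it's worth, here is the skeleton I've pieced together from the javadocs so far, though I'm not sure it's right. The class name is mine, and I'm assuming the 3.x attribute API (TermAttribute etc.):

    import lucene
    lucene.initVM()
    from lucene import PythonTokenizer, TermAttribute

    class WholeContentTokenizer(PythonTokenizer):
        # hypothetical name; tries to emit the entire input as one token
        def __init__(self, reader):
            super(WholeContentTokenizer, self).__init__(reader)
            self.reader = reader
            self.termAtt = self.addAttribute(TermAttribute.class_)
            self.done = False

        def incrementToken(self):
            # my understanding of the contract: fill in the attributes
            # for the next token and return True, or return False once
            # the stream is exhausted
            if self.done:
                return False
            self.done = True
            chars = []
            while True:
                c = self.reader.read()   # java.io.Reader: -1 at EOF
                if c == -1:
                    break
                chars.append(unichr(c))
            self.termAtt.setTermBuffer(u''.join(chars))
            return len(chars) > 0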

Thanks in advance for any hints!

Check out the Java Lucene javadocs and ask again on java-u...@lucene.apache.org, where many more Lucene expert users hang out. Subscribe first by sending mail to java-user-subscribe and following the instructions in the response.

I forgot to mention that there are a number of PyLucene tests and samples doing this by extending PythonAnalyzer. Look for these under the tests and samples/LuceneInAction directories.

Andi..

Regards,
Martin


