Hey,

Thanks for the tips. The folks on the java-user list pointed me towards KeywordTokenizer, which returns the entire input as a single token (not a very intuitive name in my opinion, but anyway). I may still need to extend it for some customizations, so I'll look into the PythonAnalyzer samples.
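
For the archives, here's roughly what I've ended up with. It's only a minimal sketch, assuming PyLucene 3.x with its flat lucene module; the NGramAnalyzer name and the 2/3 gram sizes are just my choices, not anything official:

    import lucene
    lucene.initVM()
    from lucene import (PythonAnalyzer, KeywordTokenizer,
                        NGramTokenFilter, TermAttribute, StringReader)

    class NGramAnalyzer(PythonAnalyzer):
        # hypothetical name; KeywordTokenizer emits the whole input as
        # one token, so the ngrams below can span whitespace
        def tokenStream(self, fieldName, reader):
            return NGramTokenFilter(KeywordTokenizer(reader), 2, 3)

    stream = NGramAnalyzer().tokenStream("content", StringReader("a b"))
    term = stream.addAttribute(TermAttribute.class_)
    while stream.incrementToken():
        print(term.term())    # expected: 'a ', ' b', 'a b'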

Thanks again,
Martin

On Jul 17, 2010, at 22:30, Andi Vajda <va...@apache.org> wrote:


On Jul 17, 2010, at 22:23, Martin <mar...@webscio.net> wrote:

Hi there,

I'm trying to extend the PythonTokenizer class to build my own custom tokenizer, but I get stuck pretty much right away. I know I'm supposed to override the incrementToken() method, but what exactly am I dealing with in there, and what should it return? My goal is to construct a tokenizer that returns very large tokens, maybe sentences or even the whole content. The reason I need this is that NGramTokenFilter needs a TokenStream to run on, but every other tokenizer strips whitespace from the text... and I need ngrams that span over spaces :(
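
For what it's worth, here is the skeleton I've pieced together from the javadocs so far, though I'm not sure it's right. The class name is mine, and I'm assuming the 3.x attribute API (TermAttribute etc.):

    import lucene
    lucene.initVM()
    from lucene import PythonTokenizer, TermAttribute

    class WholeContentTokenizer(PythonTokenizer):
        # hypothetical name; tries to emit the entire input as one token
        def __init__(self, reader):
            super(WholeContentTokenizer, self).__init__(reader)
            self.reader = reader
            self.termAtt = self.addAttribute(TermAttribute.class_)
            self.done = False

        def incrementToken(self):
            # my understanding of the contract: fill in the attributes
            # for the next token and return True, or return False once
            # the stream is exhausted
            if self.done:
                return False
            self.done = True
            chars = []
            while True:
                c = self.reader.read()   # java.io.Reader: -1 at EOF
                if c == -1:
                    break
                chars.append(unichr(c))
            self.termAtt.setTermBuffer(u''.join(chars))
            return len(chars) > 0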

Thanks in advance for any hints!

Check out the Java Lucene javadocs and ask again on java-u...@lucene.apache.org, where many more Lucene expert users hang out. Subscribe first by sending mail to java-user-subscribe and following the instructions in the response.

I forgot to mention that there are a number of PyLucene tests and samples doing this by extending PythonAnalyzer. Look for these under the tests and samples/LuceneInAction directories.

Andi..

Regards,
Martin


