Hey,
Thanks for the tips. I was pointed towards the KeywordTokenizer by the
java people, which returns the full input as a single token (not a very
intuitive name, in my opinion, but anyway). I might still need to extend
this to do some customizations, so I'll look into the PythonAnalyzer
samples.
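For anyone following along, the effect being described can be sketched in plain Python (this is only an illustration of the behaviour, not Lucene code; `keyword_token` and `char_ngrams` are made-up names standing in for KeywordTokenizer and NGramTokenFilter):

```python
def keyword_token(text):
    """Mimic KeywordTokenizer: emit the entire input as one token."""
    return [text]

def char_ngrams(token, n):
    """Mimic NGramTokenFilter: all character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

tokens = keyword_token("new york")   # one token, spaces included
grams = char_ngrams(tokens[0], 3)    # trigrams such as "w y" span the space
```

Because the whole input survives as one token, the n-gram filter can produce grams that cross word boundaries.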
Thanks again,
Martin
On Jul 17, 2010, at 22:30, Andi Vajda <va...@apache.org> wrote:
On Jul 17, 2010, at 22:23, Martin <mar...@webscio.net> wrote:
Hi there,
I'm trying to extend the PythonTokenizer class to build my own
custom tokenizer, but seem to get stuck pretty soon after that.
I know that I'm supposed to override the incrementToken() method, but
what exactly am I dealing with in there and what should it return?
My goal is to construct a tokenizer that returns pretty large
tokens, maybe sentences or even the whole content. The reason I need
this is that the NGramTokenFilter needs a TokenStream to run on, but
any other tokenizer removes whitespace from the text... and I need
n-grams that span across spaces :(
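The problem described above can be demonstrated with a small pure-Python sketch (illustrative only, not Lucene API calls): once the text is split on whitespace, no character n-gram can contain a space, so cross-word grams are lost.

```python
def char_ngrams(token, n):
    """All character n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

text = "new york"

# n-grams after whitespace tokenization: each word is filtered separately
per_word = [g for tok in text.split() for g in char_ngrams(tok, 3)]

# n-grams over the whole content as one token
whole = char_ngrams(text, 3)

# "w y" spans the space: absent per word, present over the whole text
```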
Thanks in advance for any hints!
Check out the Java Lucene javadocs and ask again on
java-u...@lucene.apache.org where many more lucene expert users hang
out. Subscribe first by sending mail to java-user-subscribe and
following the instructions in the response.
I forgot to mention that there are a number of PyLucene tests and samples
doing this by extending PythonAnalyzer. Look for these under the tests
and samples/LuceneInAction directories.
Andi..
Regards,
Martin