Hi there,
I'm trying to extend the PythonTokenizer class to build my own custom
tokenizer, but I get stuck almost immediately. I know that I'm supposed
to override the incrementToken() method, but what exactly am I dealing
with in there, and what should it return? My goal is
to construct a tokenizer that returns pretty large tokens, maybe
sentences or even the whole content. The reason I need this is that
NGramTokenFilter needs a TokenStream to run on, but every other
tokenizer strips the whitespace from the text... and I need n-grams
that span across spaces :(
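To make the output I'm after concrete, here is a tiny plain-Python sketch (nothing to do with Lucene; char_ngrams is just an illustrative helper I made up) of the kind of n-grams I want:

```python
def char_ngrams(text, n):
    """Return all character n-grams of text, including ones spanning spaces."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Bigrams over a two-word phrase; note the ' b' gram crossing the space.
print(char_ngrams("a bc", 2))  # ['a ', ' b', 'bc']
```

That middle gram, the one containing the space, is exactly what gets lost when the tokenizer splits on whitespace before NGramTokenFilter ever sees the text.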
Thanks in advance for any hints!
Regards,
Martin