Hi there,

I'm trying to subclass the PythonTokenizer class to build my own custom tokenizer, but I get stuck almost immediately. I know that I'm supposed to override the incrementToken() method, but what exactly am I dealing with in there, and what should it return? My goal is a tokenizer that emits very large tokens, maybe whole sentences or even the entire content. The reason I need this is that NGramTokenFilter needs a TokenStream to run on, but every other tokenizer strips the whitespace from the text... and I need ngrams that span across spaces :(
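To make the goal concrete, here's a plain-Python sketch (no Lucene involved, just illustrating the output I'm after) of character ngrams computed over the raw text, spaces included:

```python
def char_ngrams(text, n):
    # Slide a window of size n over the raw text, keeping spaces,
    # so ngrams can span word boundaries.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("to be", 3))
# → ['to ', 'o b', ' be']
```

Note the middle ngram "o b" crosses the space — that's exactly what gets lost when a tokenizer splits on whitespace before NGramTokenFilter ever sees the text.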

Thanks in advance for any hints!

Regards,
Martin