Hi there,
I'm trying to extend the PythonTokenizer class to build my own custom
tokenizer, but I get stuck almost immediately. I know that I'm supposed
to override the incrementToken() method, but what exactly am I dealing
with in there, and what should it return? My goal is
to construct a tokenizer that returns pretty large tokens, maybe
sentences or even the whole content. The reason I need this is that
NGramTokenFilter needs a TokenStream to run on, but every other
tokenizer strips the whitespace from the text... and I need n-grams
that span across spaces :(
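To make the output I'm after concrete, here is a tiny plain-Python sketch (nothing to do with Lucene; char_ngrams is just an illustrative helper I made up) of the kind of n-grams I want:

```python
def char_ngrams(text, n):
    """Return all character n-grams of text, including ones spanning spaces."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Bigrams over a two-word phrase; note the ' b' gram crossing the space.
print(char_ngrams("a bc", 2))  # ['a ', ' b', 'bc']
```

That middle gram, the one containing the space, is exactly what gets lost when the tokenizer splits on whitespace before NGramTokenFilter ever sees the text.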
Thanks in advance for any hints!
Regards,
Martin