tokenizer to strip a set of characters

Stephane Nicoll Thu, 21 Nov 2013 10:58:27 -0800

Hi,

I am using Lucene 3.6 and I am looking for a tokenizer that would remove
certain characters when they are present at the beginning or at the end of
a token.


I initially used the StandardAnalyzer and switched to the
WhitespaceAnalyzer because the former was too aggressive for my use case.

A few examples:

   - foo, -> foo (comma at the end)
   - foo. -> foo (period at the end)
   - foo!!!! -> foo
   - foo?! -> foo
   - ,foo -> foo (a comma at the beginning of a word is a typo but
   should be handled)
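
To make the expected behaviour concrete, here is a minimal sketch in plain Java of what the examples above describe. It is not a Lucene API: the class name EdgeStripper, the methods strip and tokenize, and the character set ",.!?" are all hypothetical and chosen only for illustration. It assumes tokens are split on whitespace and that only leading and trailing characters from the configured set should go, leaving characters inside a token alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates the desired stripping only.
public class EdgeStripper {

    // Removes any character found in `chars` from the start and end of `token`.
    // Characters in the middle of the token are left untouched.
    static String strip(String token, String chars) {
        int start = 0;
        int end = token.length();
        while (start < end && chars.indexOf(token.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && chars.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(start, end);
    }

    // Splits on whitespace, strips each token, and drops tokens that end up empty.
    static List<String> tokenize(String text, String chars) {
        List<String> out = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            String stripped = strip(raw, chars);
            if (!stripped.isEmpty()) {
                out.add(stripped);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [foo, foo, foo, foo, foo]
        System.out.println(tokenize("foo, foo. foo!!!! foo?! ,foo", ",.!?"));
    }
}
```

Presumably the same loop could run inside a custom TokenFilter placed after a whitespace tokenizer, but that wiring is an assumption and is not shown here.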

Is there a configurable tokenizer I could use for this?

Thanks,
S.
