Re: tokenizer to strip a set of characters

Jack Krupansky Thu, 21 Nov 2013 11:18:44 -0800

The word delimiter filter lets you pass in a table that specifies the type of each character:


http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html#WordDelimiterFilter(org.apache.lucene.analysis.TokenStream,byte[], int, org.apache.lucene.analysis.util.CharArraySet)

There is also a regex token filter that you could use to make fine adjustments, such as characters that are allowed within tokens but ignored at the start or end.
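
As a rough sketch of the regex approach, here is the kind of pattern that strips a set of characters only at the start or end of a token. It uses plain java.util.regex so it can be tried without Lucene on the classpath. The class name, the method name, and the character set [,.!?] are assumptions made for illustration, taken from the examples in the original question. The same Pattern could presumably be handed to a regex-based token filter such as PatternReplaceFilter, but check the constructor and package against the Lucene version in use.

```java
import java.util.regex.Pattern;

public class EdgePunctuationStripper {

    // Hypothetical character set: comma, period, exclamation mark, question mark.
    // The first alternative matches a run of them at the start of the token,
    // the second a run at the end. Occurrences inside the token are untouched.
    private static final Pattern EDGES = Pattern.compile("^[,.!?]+|[,.!?]+$");

    // Removes leading and trailing characters from the set above.
    static String strip(String token) {
        return EDGES.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {"foo,", "foo.", "foo!!!!", "foo?!", ",foo", "foo.bar"};
        for (String s : samples) {
            System.out.println(s + " -> " + strip(s));
        }
    }
}
```

A token made up only of these characters becomes an empty string, so a real analysis chain would probably also need a step that drops empty tokens.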


-- Jack Krupansky

-----Original Message-----
From: Stephane Nicoll

Sent: Thursday, November 21, 2013 9:42 AM
To: java-user@lucene.apache.org
Subject: tokenizer to strip a set of characters

Hi,

I am using Lucene 3.6 and I am looking for a tokenizer that would remove
certain characters when they are present at the beginning or at the end of
a token.

I initially used the StandardAnalyzer and switched to the
WhitespaceAnalyzer because the former was too aggressive for my use case.

A few examples:

  - foo, -> foo (comma at the end)
  - foo. -> foo (period at the end)
  - foo!!!! -> foo
  - foo?! -> foo
  - ,foo -> foo (a comma at the beginning of a word is a typo but
  should be handled)

Is there a configurable tokenizer I could use for this?

Thanks,

S.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
