International filters/tokenizers doing too much

Shawn Heisey Tue, 14 Jun 2011 16:08:36 -0700

Because the text in my index comes in many different languages with no ability to know the language ahead of time, I have a need to use ICUTokenizer and/or the CJK filters, but I have a problem with them as they are implemented currently. They do extra things like handle email addresses, tokenize on non-alphanumeric characters, etc. I need them to not do these things. This is my current index analyzer chain:


http://pastebin.com/dNBGmeeW

My current idea for how to change this is to use the ICUTokenizer instead of the WhitespaceTokenizer, then as one of the later steps, run it through CJK so that it outputs bigrams for the CJK characters. The reason I can't do this now is that I must let WordDelimiterFilter handle punctuation, case changes, and numbers, because of the magic of the preserveOriginal flag.

Is it possible to turn off these extra features in these analyzer components as they are written now? If not, is it a painful process for someone with Java experience to customize the code so it IS possible? I have not yet looked at the code, but I will do so in the next couple of days. Ideally, I would also like to have a WordDelimiterFilter that is fully aware of international capitalization via ICU. Does any such thing exist?

In the current chain, you'll notice a pattern filter. What this does is remove leading and trailing punctuation from tokens. Punctuation inside the token is preserved, for later handling with WordDelimiterFilter.
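
To illustrate what that pattern filter does to a single token, here is a minimal standalone sketch in plain Java. It is not the filter from the chain above: the class name PunctuationTrim, the method name trim, and the regular expression are all assumptions made for illustration, and the pattern actually used in the pastebin config may differ. Note that \p{Punct} in java.util.regex covers only ASCII punctuation by default, so a multilingual index might need a broader character class.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr or Lucene.
final class PunctuationTrim {
    // Assumed pattern: leading punctuation, a lazily matched body (group 1),
    // then trailing punctuation. \p{Punct} is ASCII-only by default.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");

    // Strips punctuation from both ends of a token; punctuation inside the
    // token is kept so a later filter can still split on it.
    static String trim(String token) {
        return EDGE_PUNCT.matcher(token).replaceFirst("$1");
    }

    public static void main(String[] args) {
        System.out.println(trim("(wi-fi),")); // inner hyphen is preserved
        System.out.println(trim("hello"));    // no edge punctuation, unchanged
    }
}
```

A token made up entirely of punctuation comes back as an empty string in this sketch, so a real chain would need to decide whether to drop or keep such tokens.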


Thanks,
Shawn
