The text in my index comes in many different languages, and I have no way to know the language ahead of time, so I need to use ICUTokenizer and/or the CJK filters. My problem is with the way they are currently implemented: they do extra things like handle email addresses, tokenize on non-alphanumeric characters, etc. I need them to not do these things. This is my current index analyzer chain:

http://pastebin.com/dNBGmeeW

My current idea is to replace the WhitespaceTokenizer with the ICUTokenizer, then as one of the later steps, run the tokens through the CJK filter so that it outputs bigrams for the CJK characters. The reason I can't do this now is that I must let WordDelimiterFilter handle punctuation, case changes, and numbers, because of the magic of the preserveOriginal flag. If the tokenizer has already split on punctuation, WordDelimiterFilter never sees the original token and has nothing to preserve.

Is it possible to turn off these extra features in these analyzer components as they are written now? If not, is it a painful process for someone with Java experience to customize the code so it IS possible? I have not yet looked at the code, but I will do so in the next couple of days. Ideally, I would also like to have a WordDelimiterFilter that is fully aware of international capitalization via ICU. Does any such thing exist?

In the current chain, you'll notice a pattern filter. It removes leading and trailing punctuation from each token. Punctuation inside the token is preserved for later handling by WordDelimiterFilter.
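
To illustrate what that pattern filter does, here is a minimal standalone Java sketch using only java.util.regex. The regex itself is an assumption for illustration (the real pattern is in the pastebin link above and may differ), and the class and method names are hypothetical. It uses the Unicode punctuation category \p{P}, which does not cover symbol characters such as "$" or "+".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr or Lucene. It mimics a pattern
// filter that trims punctuation from the edges of a single token.
final class PunctuationTrim {
    // Assumed pattern: a run of Unicode punctuation (\p{P}) anchored at the
    // start of the token, or a run anchored at the end of the token.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{P}+|\\p{P}+$");

    // Removes leading and trailing punctuation; punctuation inside the
    // token is left alone for a later filter to deal with.
    static String trim(String token) {
        return EDGE_PUNCT.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {"(wi-fi),", "\"O'Neil's!\"", "user@example.com", "..."};
        for (String s : samples) {
            System.out.println("[" + s + "] -> [" + trim(s) + "]");
        }
    }
}
```

A token made up entirely of punctuation comes out empty, so a real chain would likely also need something that drops empty tokens afterwards.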

Thanks,
Shawn
