The text in my index comes in many different languages, and I have no
way to know the language ahead of time, so I need to use ICUTokenizer
and/or the CJK filters. The problem is that, as currently implemented,
they do extra things such as handling email addresses and tokenizing on
non-alphanumeric characters. I need them not to do these things. This is
my current index analyzer chain:
http://pastebin.com/dNBGmeeW
My current idea for changing this is to use ICUTokenizer instead of
WhitespaceTokenizer, then, as one of the later steps, run the tokens
through the CJK bigram filter so that it outputs bigrams for the CJK
characters. The reason I can't do this now is that I must let
WordDelimiterFilter handle punctuation, case changes, and numbers,
because I depend on its preserveOriginal flag, and ICUTokenizer would
already have split on the punctuation before WordDelimiterFilter sees
the token.
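
As a rough illustration of the output described above, here is a sketch
in plain Java. It does not use the Lucene API and is not how the CJK
filter is implemented; the names CjkBigramSketch, tokens, and isCjk are
hypothetical, and treating Han, Hiragana, Katakana, and Hangul as "CJK"
is an assumption. Non-CJK text is split on whitespace only, so inner
punctuation survives for a later WordDelimiterFilter-style step.

```java
import java.util.ArrayList;
import java.util.List;

final class CjkBigramSketch {
    // Assumption: these four scripts are what "CJK" means for this sketch.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Splits on whitespace, keeps non-CJK runs whole, and emits
    // overlapping bigrams for each run of CJK code points.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (Character.isWhitespace(cps[i])) {
                i++;
            } else if (isCjk(cps[i])) {
                int start = i;
                while (i < cps.length && isCjk(cps[i])) {
                    i++;
                }
                if (i - start == 1) {
                    // A lone CJK character is emitted as a unigram.
                    out.add(new String(cps, start, 1));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        out.add(new String(cps, j, 2));
                    }
                }
            } else {
                int start = i;
                while (i < cps.length && !isCjk(cps[i])
                        && !Character.isWhitespace(cps[i])) {
                    i++;
                }
                // Punctuation inside the run is left untouched.
                out.add(new String(cps, start, i - start));
            }
        }
        return out;
    }
}
```

With this sketch, the input "abc 日本語 x-y" yields the tokens abc, 日本,
本語, and x-y.
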
Is it possible to turn off these extra features in these analyzer
components as they are written now? If not, how painful would it be for
someone with Java experience to customize the code so that it IS
possible? I have not yet looked at the code, but I will do so in the
next couple of days. Ideally, I would also like a WordDelimiterFilter
that is fully aware of international capitalization via ICU. Does any
such thing exist?
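
To show what splitting on "international" case changes means in
practice, here is a minimal sketch in plain Java. It uses
java.lang.Character rather than ICU, and it is not WordDelimiterFilter
code; the class and method names are hypothetical. It only splits at a
lowercase-to-uppercase transition and ignores everything else the real
filter does (numbers, punctuation, preserveOriginal).

```java
import java.util.ArrayList;
import java.util.List;

final class CaseChangeSplitSketch {
    // Splits a token wherever a lowercase code point is followed by an
    // uppercase one, using the JDK's Unicode case properties.
    static List<String> splitOnCaseChange(String token) {
        List<String> parts = new ArrayList<>();
        int[] cps = token.codePoints().toArray();
        int start = 0;
        for (int i = 1; i < cps.length; i++) {
            if (Character.isLowerCase(cps[i - 1])
                    && Character.isUpperCase(cps[i])) {
                parts.add(new String(cps, start, i - start));
                start = i;
            }
        }
        if (cps.length > 0) {
            parts.add(new String(cps, start, cps.length - start));
        }
        return parts;
    }
}
```

With this sketch, "PowerShot" splits into Power and Shot, and the
Cyrillic token "МоскваРека" splits into Москва and Река.
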
In the current chain, you'll notice a pattern filter. What this does is
remove leading and trailing punctuation from tokens. Punctuation inside
the token is preserved, for later handling with WordDelimiterFilter.
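
For illustration, the effect of that pattern filter can be sketched in
plain Java. The regex below is an assumption and not the pattern from
the pastebin config; it uses the Unicode punctuation category \p{P}, so
symbols such as "$" or "+" are not treated as punctuation. The class and
method names are hypothetical.

```java
import java.util.regex.Pattern;

final class EdgePunctuationTrimmer {
    // Matches a run of punctuation at the start or at the end of the token.
    private static final Pattern EDGES =
        Pattern.compile("^\\p{P}+|\\p{P}+$");

    // Removes leading and trailing punctuation; punctuation inside the
    // token is kept for later handling.
    static String trim(String token) {
        return EDGES.matcher(token).replaceAll("");
    }
}
```

With this sketch, "(wi-fi)," becomes "wi-fi", while "O'Neil" is left
unchanged.
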
Thanks,
Shawn