On 11/29/2010 3:15 PM, Jacob Elder wrote:
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

What I'd like to see is a CJK filter that runs after tokenization (whitespace in my case) and doesn't do anything but handle the CJK characters. If there are no CJK characters in the token, it should do nothing at all. The CJK tokenizer does a whole host of other things that I want to handle myself.
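
To make the idea concrete, here is a rough sketch in plain Python of what such a filter might do. It is not a Solr or Lucene filter and uses no Solr API. The names `is_cjk` and `cjk_bigram_filter` are made up for illustration, the character ranges are deliberately incomplete, and splitting CJK runs into overlapping bigrams is just one assumed way of "handling" them.

```python
import itertools


def is_cjk(ch):
    """Rough check (assumption: Hiragana, Katakana and CJK Unified Ideographs only)."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x30FF       # Hiragana and Katakana blocks
        or 0x4E00 <= cp <= 0x9FFF    # CJK Unified Ideographs block
    )


def cjk_bigram_filter(tokens):
    """Hypothetical post-tokenization filter.

    Tokens without CJK characters pass through unchanged. In tokens that
    contain CJK characters, each CJK run is replaced by overlapping bigrams
    and the non-CJK runs are emitted as they are.
    """
    out = []
    for token in tokens:
        if not any(is_cjk(ch) for ch in token):
            out.append(token)  # no CJK: do nothing at all
            continue
        # Split the token into alternating CJK / non-CJK runs.
        for cjk_run, group in itertools.groupby(token, key=is_cjk):
            run = "".join(group)
            if not cjk_run or len(run) == 1:
                out.append(run)
            else:
                out.extend(run[i:i + 2] for i in range(len(run) - 1))
    return out


# Whitespace tokenization first, then the filter.
tokens = "Hello 日本語 world".split()
print(cjk_bigram_filter(tokens))  # ['Hello', '日本', '本語', 'world']
```

The point of the sketch is the pass-through branch: anything the whitespace tokenizer produced that has no CJK characters comes out exactly as it went in, so the rest of the analysis chain stays under my control.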

Shawn
