All,

In one index I'm working with, the setup is the typical langid mapping to language-specific fields. There is also a text_all field that everything is copied to. The documents can contain a wide variety of languages, including non-whitespace languages such as Chinese and Japanese. We'll be using the ICU token filters in the analysis chain, but what should we use as the tokenizer for the text_all field? My inclination is to go with the ICUTokenizer, since it does script-aware segmentation. Are there any reasons to prefer the StandardTokenizer or another tokenizer for this field?
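For context, the relevant part of the schema looks roughly like this (field names, the copyField pattern, and the filter choice are illustrative, not our exact config):

```xml
<!-- Catch-all field; everything is copied here via copyField -->
<field name="text_all" type="text_icu" indexed="true" stored="false" multiValued="true"/>
<copyField source="*_txt" dest="text_all"/>

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- This is the open question: ICUTokenizer vs. StandardTokenizer -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```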
Thank you.

Best,
Tim