All,

In one index I’m working with, the setup is the typical langid mapping to 
language-specific fields. There is also a text_all field that everything is 
copied to. The documents can contain a wide variety of languages, including 
non-whitespace-delimited languages. We’ll be using the ICUTokenFilter in the 
analysis chain, but which tokenizer should we use for the “text_all” field? 
My inclination is to go with the ICUTokenizer. Are there any reasons to 
prefer the StandardTokenizer or another tokenizer for this field?
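For concreteness, here is a hedged sketch of how such a text_all field might be wired up in a Solr managed-schema, assuming ICUTokenizer plus ICUFoldingFilter (the field names, copyField pattern, and the specific ICU filter are illustrative assumptions, not details from the actual index):

```xml
<!-- Illustrative sketch only: names and copyField pattern are assumptions -->
<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICUTokenizer segments per UAX #29, with dictionary-based word
         breaking for non-whitespace scripts such as CJK and Thai -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- ICUFoldingFilter applies Unicode normalization and case folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_all" type="text_all" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="*_txt" dest="text_all"/>
```

Note that the ICU analysis components ship in Solr’s analysis-extras module, so that module has to be enabled for these factories to load.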

Thank you.

       Best,

              Tim
