All,

In one index I'm working with, the setup is the typical langid mapping to language-specific fields. There is also a text_all field that everything is copied to. The documents can contain a wide variety of languages, including non-whitespace languages such as Chinese and Japanese. We'll be using the ICU token filters in the analysis chain, but what should we use as the tokenizer for the text_all field? My inclination is to go with the ICUTokenizer, since it does script-aware segmentation. Are there any reasons to prefer the StandardTokenizer or another tokenizer for this field?
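For context, the relevant part of the schema looks roughly like this (field names, the copyField pattern, and the filter choice are illustrative, not our exact config):

```xml
<!-- Catch-all field; everything is copied here via copyField -->
<field name="text_all" type="text_icu" indexed="true" stored="false" multiValued="true"/>
<copyField source="*_txt" dest="text_all"/>

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- This is the open question: ICUTokenizer vs. StandardTokenizer -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```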
Thank you.

Best,
Tim