Alex,

Thank you for the quick response, and apologies for my delay. Yes, we'll use edismax. That won't solve the issue of multilingual documents, though...I don't think...unless we index every document as every language. Say a predominantly English document contains a Chinese sentence. If the English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, the Chinese sentence could be tokenized as one big token (if it doesn't contain any punctuation, of course) and will be effectively unsearchable, barring the use of wildcards.

So what we're looking for is a basic, reasonably reliable field configuration to handle all languages as a fallback. We were thinking, perhaps, the ICUTokenizer with the ICUFoldingFilter and maybe a multilingual stopword list. We do want the language-specific handling for most cases, and the basic langid + field-per-language setup with edismax will get us that. Any thoughts?
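For the record, a fallback fieldType along these lines might look like the sketch below. The ICU factories require Solr's analysis-extras module on the classpath; the field type name and the stopword file name (`stopwords_multilingual.txt`) are just placeholders for illustration, not anything agreed on yet:

```xml
<!-- Sketch of a catch-all multilingual field: ICU word segmentation
     (handles non-whitespace scripts) plus ICU case/accent folding,
     with an optional combined multilingual stopword list. -->
<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_multilingual.txt"/>
  </analyzer>
</fieldType>
```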
Thank you, again.

Best,
Tim

I don't think the text_all field would work too well for a multilingual setup. Any reason you cannot use edismax to search over a bunch of fields instead?

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

________________________________
From: Allison, Timothy B.
Sent: Wednesday, June 18, 2014 9:31 PM
To: solr-user@lucene.apache.org
Subject: ICUTokenizer or StandardTokenizer or ??? for "text_all" type field that might include non-whitespace langs

All,

In one index I’m working with, the setup is the typical langid mapping to language-specific fields. There is also a text_all field that everything is copied to. The documents can contain a wide variety of languages, including non-whitespace languages. We’ll be using the ICUFoldingFilter in the analysis chain, but what should we use as the tokenizer for the “text_all” field? My inclination is to go with the ICUTokenizer. Are there any reasons to prefer the StandardTokenizer or another tokenizer for this field?

Thank you.

Best,
Tim
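The multi-field edismax setup Alex suggests could be baked into a request handler roughly as follows. This is only a sketch: the per-language field names (text_en, text_fr, text_zh) and the tie value are illustrative assumptions, not part of the thread:

```xml
<!-- Sketch: edismax searching the per-language fields populated by
     langid, with text_all as the catch-all fallback. Field names and
     the tie-breaker value are placeholders. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">text_en text_fr text_zh text_all</str>
    <str name="tie">0.1</str>
  </lst>
</requestHandler>
```

With a setup like this, a query is scored against whichever language-specific field matches best, while text_all catches terms (such as an embedded Chinese sentence in an English document) that the language-specific analysis would otherwise mangle.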