Hello. We have documents containing multilingual words whose parts come from different languages, and search queries of the same complexity. It is an online application used worldwide, so users generate content in every possible language.
For example: 言語-aware, Løgismose-alike, ຄໍາຮ້ອງສະຫມັກ-dependent.

So I guess our schema requires a single field with universal analyzers. Luckily, ICUTokenizer and ICUFoldingFilter exist for that. But we also need stemming and lemmatization. How can we implement a schema with universal stemming/lemmatization, presumably one that utilizes the script attribute that ICUTokenizer attaches to each token?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the schema shipped with the commercial Basistech plugins: it defines the tokenizer/filter language per field type, which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely,
Ilia Sretenskii.
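For reference, the single universal field type I have in mind so far looks roughly like this. It is a minimal sketch for a Solr 4.x schema.xml, assuming the analysis-extras contrib (the ICU jars) is on the classpath; the "text_universal" and "content" names are just illustrative:

```xml
<!-- Sketch: one field type for all scripts/languages.
     ICUTokenizerFactory segments text per script (Unicode rules),
     ICUFoldingFilterFactory applies case/accent/width folding. -->
<fieldType name="text_universal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Missing piece: a filter here that would read each token's
         ScriptAttribute and delegate to a script-appropriate
         stemmer/lemmatizer. Nothing like that ships out of the box,
         as far as I can tell. -->
  </analyzer>
</fieldType>

<field name="content" type="text_universal" indexed="true" stored="true"/>
```

This gets tokenization and folding right for mixed-script input, but the comment marks exactly where the universal stemming/lemmatization question remains open.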