You're hitting a problem that we've all tried to solve: multilingual indexes.
You're forgetting one important point: when the user enters a query, you have to choose which language field to search against, either by language detection (out of band, by some other software?) or by a select box in your GUI. You want to prevent Solr from performing an incorrect field analysis. Copying the language-specific field to another field after analysis (if that were possible) therefore does not solve the problem, because the analysis at query time needs to match the analysis at index time.

I would focus on language detection of your input query, and then search on that language-specific field.

On 8 Oct 2021, at 15:17, Aleksandar Kanchev <[email protected]> wrote:

Hi,

My name is Aleksandar Kanchev. I am a web developer and have been using Apache Solr for ten years now. We have been having issues handling multilingual content for quite some time now and we are sure it's not an Apache Solr deficiency but a knowledge deficiency on our side.

*What we are trying to do*
Index and query content in different languages (English, Persian, Chinese, Japanese, etc.) in one Solr field.

*The problem*
Different language groups need different tokenizers and filters. We can't apply multiple tokenizers to the same Solr field type.

*How we tried to solve the problem*
- Create a separate Solr field with a specific field type (different tokenizer and filter configuration) for each particular language group. Example:
<field name="content_cjk" type="text_cjk" multiValued="false" indexed="true" stored="false" useDocValuesAsStored="false"/>
<field name="content_pe" type="text_pe" multiValued="false" indexed="true" stored="false" useDocValuesAsStored="false"/>
...
- Then create a copy field. Example:
<field name="content" type="text_general" multiValued="true" indexed="true" stored="false" useDocValuesAsStored="false"/>
<copyField source="content_cjk" dest="content"/>
<copyField source="content_pe" dest="content"/>
...

*The issue*
Copy fields are copied before the analysis, so we lose the tokenization and filtering from the language-specific fields.

*Questions*
- Can we use copy fields and preserve the generated tokens from the source fields?
- Is there a better way to apply multiple language-specific tokenizers and filters on a single Solr field?

Thanks in advance!

Best,
Alex
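
To illustrate the suggestion at the top of this thread (detect the language of the query, then search the matching language-specific field), here is a minimal Python sketch. It is only an illustration under stated assumptions: `FIELD_BY_LANGUAGE`, `detect_language` and `build_select_params` are hypothetical names, `content_en` is an assumed field that does not appear in the schema above, and the script-range check is a crude stand-in for a real language detector. `q` and `df` are the standard Solr query and default-field parameters.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a language code to a language-specific field.
# content_cjk and content_pe come from the schema in this thread;
# content_en is an assumption made for this example.
FIELD_BY_LANGUAGE = {
    "zh": "content_cjk",
    "ja": "content_cjk",
    "fa": "content_pe",
    "en": "content_en",
}
DEFAULT_FIELD = "content_en"


def detect_language(text):
    """Crude placeholder: guess a language from the Unicode ranges present.

    A real setup would use a proper language-detection library or ask the
    user via a select box instead.
    """
    codes = [ord(ch) for ch in text]
    if any(0x3040 <= c <= 0x30FF for c in codes):  # Hiragana / Katakana
        return "ja"
    if any(0x4E00 <= c <= 0x9FFF for c in codes):  # CJK unified ideographs
        return "zh"
    if any(0x0600 <= c <= 0x06FF for c in codes):  # Arabic script (Persian)
        return "fa"
    return "en"


def build_select_params(user_query, language=None):
    """Build request parameters that search only the matching field.

    `language` lets the caller pass an explicit choice (for example from a
    select box); otherwise the placeholder detector is used.
    """
    lang = language or detect_language(user_query)
    field = FIELD_BY_LANGUAGE.get(lang, DEFAULT_FIELD)
    # df makes Solr analyze the query with the same field type that was
    # used at index time for this language.
    return {"q": user_query, "df": field}


if __name__ == "__main__":
    params = build_select_params("hello world")
    print(urlencode(params))
```

The point of the design is that the query is analyzed by the same field type that indexed the content, which is what a post-analysis copy into a single `content` field could not guarantee.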
