Hi, my name is Aleksandar Kanchev.
I am a web developer and have been using Apache Solr for ten years. We have been struggling to handle multilingual content for quite some time, and we are sure the cause is a knowledge gap on our side rather than a deficiency in Apache Solr.

*What we are trying to do*

Index and query content in different languages (English, Persian, Chinese, Japanese, etc.) in one Solr field.

*The problem*

Different language groups need different tokenizers and filters. We can't apply multiple tokenizers to the same Solr field type.

*How we tried to solve the problem*

- Create a separate Solr field with a specific field type (different tokenizer and filter configuration) for each language group. Example:

<field name="content_cjk" type="text_cjk" multiValued="false" indexed="true" stored="false" useDocValuesAsStored="false"/>
<field name="content_pe" type="text_pe" multiValued="false" indexed="true" stored="false" useDocValuesAsStored="false"/>
...

- Then create a copy field. Example:

<field name="content" type="text_general" multiValued="true" indexed="true" stored="false" useDocValuesAsStored="false"/>
<copyField source="content_cjk" dest="content"/>
<copyField source="content_pe" dest="content"/>
...

*The issue*

Copy fields are populated from the original input before analysis, so the content field is analyzed only by its own type (text_general) and we lose the tokenization and filtering from the language-specific fields.

*Questions*

- Can we use copy fields and preserve the generated tokens from the source fields?
- Is there a better way to apply multiple language-specific tokenizers and filters to a single Solr field?

Thanks in advance!

Best,
Alex
