Hi,

My name is Aleksandar Kanchev.

I am a web developer and have been using Apache Solr for ten years now.

We have been having issues handling multi-lingual content for quite some
time now and we are sure it's not an Apache Solr deficiency but a knowledge
deficiency on our side.


*What we are trying to do*
Index and query content in different languages (English, Persian, Chinese,
Japanese, etc.) in one Solr field.

*The problem*

Different language groups need different tokenizers and filters. We can't
apply multiple tokenizers to the same Solr field type.

*How we tried to solve the problem*

   - Create a separate Solr field with a specific field type (different
   tokenizer and filters configuration) for each particular language group.
   Example:
   <field name="content_cjk" type="text_cjk" multiValued="false"
   indexed="true" stored="false" useDocValuesAsStored="false"/>
   <field name="content_pe" type="text_pe" multiValued="false"
   indexed="true" stored="false" useDocValuesAsStored="false"/>
   ...


   - Then create a combined field and copy the language-specific fields
   into it.
   Example:
   <field name="content" type="text_general" multiValued="true"
   indexed="true" stored="false" useDocValuesAsStored="false"/>
   <copyField source="content_cjk" dest="content"/>
   <copyField source="content_pe" dest="content"/>
   ...


*The issue*
copyField copies the raw source value before any analysis runs, so the
"content" field is analyzed only by its own field type (text_general) and we
lose the tokenization and filtering from the language-specific fields.
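
To make the issue concrete, here is a rough sketch of the kind of field
types involved. It is modelled on the CJK and Persian types from Solr's stock
schema and is an assumption for illustration, not our exact configuration:

   <!-- Assumed definition: CJK text is indexed as bigrams -->
   <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.CJKWidthFilterFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.CJKBigramFilterFactory"/>
     </analyzer>
   </fieldType>
   <!-- Assumed definition: Persian text gets Arabic/Persian normalization -->
   <fieldType name="text_pe" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <charFilter class="solr.PersianCharFilterFactory"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.ArabicNormalizationFilterFactory"/>
       <filter class="solr.PersianNormalizationFilterFactory"/>
     </analyzer>
   </fieldType>

With types like these, a value sent to content_cjk is bigrammed in that
field, but the copy that lands in "content" goes through the text_general
chain only, so none of the CJK or Persian filters are applied there.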

*Questions*

   - Can we use copy fields and preserve the generated tokens from the
   source fields?
   - Is there a better way to apply multiple language-specific tokenizers
   and filters on a single Solr field?


Thanks in advance!


Best,

Alex
