Hello,

I'm trying to index a large number of documents in different languages.
I don't know the language of the document, so I'm using TikaLanguageIdentifierUpdateProcessorFactory to identify it.

This is my configuration in solrconfig.xml:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <bool name="langid">true</bool>
    <str name="langid.fl">title,subtitle,content</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.threshold">0.3</str>
    <str name="langid.fallback">general</str>
    <str name="langid.whitelist">en,fr,de,it,es</str>
    <bool name="langid.map">true</bool>
    <bool name="langid.map.keepOrig">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The detection works fine, and I added some dynamic fields to schema.xml to store the results:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>

My main problem now is how to search without knowing the language of the document I'm looking for. I don't want a huge query string like ?q=title_en:+term+subtitle_en:+term+title_de:+term...

I could use copyField to copy all fields into the "text" field, but "text" has the type text_general, so the language-specific analysis is lost. I could at least use one combined field per language (like text_en, text_fr...), but the query string still gets very long, and adding new languages is very cumbersome.
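
To make the per-language combined fields concrete, this is roughly what I mean in schema.xml. It is only a sketch: it assumes that langid.map produces fields named title_en, subtitle_en, content_en and so on, and that text_en, text_fr... are covered by the dynamic fields above.

<!-- Sketch only: one combined field per language, filled via copyField. -->
<!-- Assumes the mapped field names title_<lang>, subtitle_<lang>, content_<lang>. -->
<copyField source="title_en"    dest="text_en"/>
<copyField source="subtitle_en" dest="text_en"/>
<copyField source="content_en"  dest="text_en"/>

<copyField source="title_fr"    dest="text_fr"/>
<copyField source="subtitle_fr" dest="text_fr"/>
<copyField source="content_fr"  dest="text_fr"/>
<!-- ...and the same three lines again for de, it and es. -->

Even with this, the query still has to name text_en, text_fr, text_de, text_it and text_es, and every new language means another block like the ones above.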

So, what can I do? Is there a better way to index and search documents in many languages without knowing the language of the document or of the query in advance?

- Geschan
