Hello,
I'm trying to index a large number of documents in different languages.
I don't know the language of each document in advance, so I'm using
TikaLanguageIdentifierUpdateProcessorFactory to identify it.
This is my configuration in solrconfig.xml:
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <bool name="langid">true</bool>
    <str name="langid.fl">title,subtitle,content</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.threshold">0.3</str>
    <str name="langid.fallback">general</str>
    <str name="langid.whitelist">en,fr,de,it,es</str>
    <bool name="langid.map">true</bool>
    <bool name="langid.map.keepOrig">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
The detection works fine, and I added some dynamic fields to
schema.xml to store the results:
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>
My main problem now is how to search the documents without knowing the
language of the document I'm looking for.
I don't want to have a huge query string like
?q=title_en:+term+subtitle_en:+term+title_de:+term...
I could use copyField to copy all fields into the "text" field, but
"text" has the type text_general, so the language-specific analysis
would be lost. I could at least use a combined field for every language
(like text_en, text_fr...), but the query string would still get very
long, and adding new languages would be terribly cumbersome.
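To illustrate the combined-field idea, here is a rough sketch of the
copyField rules it would need in schema.xml for two of the languages.
It assumes the mapped fields end up as title_en, subtitle_en,
content_en and so on, and the combined field names text_en and text_fr
are only my assumption; every additional language would need another
set of rules like these:

  <!-- Sketch only: source field names assume langid.map produces
       <field>_<lang>; text_en and text_fr are hypothetical names that
       would match the *_en and *_fr dynamic fields above. -->
  <copyField source="title_en"    dest="text_en"/>
  <copyField source="subtitle_en" dest="text_en"/>
  <copyField source="content_en"  dest="text_en"/>

  <copyField source="title_fr"    dest="text_fr"/>
  <copyField source="subtitle_fr" dest="text_fr"/>
  <copyField source="content_fr"  dest="text_fr"/>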
So, what can I do? Is there a better way to index and search
documents in many languages without knowing the language of the
document or the query in advance?
- Geschan