Hi,

Typically people try to determine the language of the query itself. Queries are short, though, so language identification (LID) on them is unreliable. Alternatively, the user's profile may indicate a language, or you can simply ask users which language they are searching in.
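
As a rough illustration of the profile-based approach, here is a minimal Python sketch that narrows the searched fields to the user's language and falls back to all languages when it is unknown. It assumes the mapped field names follow the title_en / subtitle_de pattern implied by langid.map=true in the quoted config, and that the request goes through the edismax query parser with its qf parameter; neither edismax nor the helper names below come from the original message, so treat them as assumptions.

```python
from urllib.parse import urlencode

# Languages taken from langid.whitelist in the quoted config.
SUPPORTED_LANGS = ("en", "fr", "de", "it", "es")
# Fields listed in langid.fl; with langid.map=true they are assumed to be
# indexed as title_en, subtitle_en, content_en, and so on.
BASE_FIELDS = ("title", "subtitle", "content")


def query_fields(lang=None):
    """Return the language-specific fields to search.

    If lang is a supported language code, only that language's fields are
    returned; otherwise the fields for every supported language are returned.
    """
    langs = (lang,) if lang in SUPPORTED_LANGS else SUPPORTED_LANGS
    return [f"{field}_{code}" for code in langs for field in BASE_FIELDS]


def build_params(term, profile_lang=None):
    """Build Solr request parameters (hypothetical helper, assumes edismax)."""
    return {
        "q": term,
        "defType": "edismax",
        "qf": " ".join(query_fields(profile_lang)),
    }


if __name__ == "__main__":
    # Print the URL-encoded query string for a user whose profile says "de".
    print(urlencode(build_params("haus", profile_lang="de")))
```

The client only ever sends the plain term in q; the list of fields lives in one place, so adding a language means extending SUPPORTED_LANGS rather than rewriting query strings.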

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Tue, Apr 9, 2013 at 2:32 PM, <d...@geschan.de> wrote:
>
> Hello,
>
> I'm trying to index a large number of documents in different languages.
> I don't know the language of the document, so I'm using
> TikaLanguageIdentifierUpdateProcessorFactory to identify it.
>
> So, this is my configuration in solrconfig.xml
>
> <updateRequestProcessorChain name="langid">
>   <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
>     <bool name="langid">true</bool>
>     <str name="langid.fl">title,subtitle,content</str>
>     <str name="langid.langField">language_s</str>
>     <str name="langid.threshold">0.3</str>
>     <str name="langid.fallback">general</str>
>     <str name="langid.whitelist">en,fr,de,it,es</str>
>     <bool name="langid.map">true</bool>
>     <bool name="langid.map.keepOrig">true</bool>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> So, the detection works fine and I put some dynamic fields in schema.xml to
> store the results:
> <dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
> <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
> <dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
> <dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
> <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>
>
> My main problem now is how to search the document without knowing the
> language of the searched document.
> I don't want to have a huge querystring like
> ?q=title_en:+term+subtitle_en:+term+title_de:+term...
> Okay, using copyField and copy all fields into the "text" field...but "text"
> has the type text_general, so the language specific indexing is not working.
> I could use at least a combined field for every language (like text_en,
> text_fr...) but still, my querystring gets very long and to add new
> languages is terribly uncomfortable.
>
> So, what can I do? Is there a better solution to index and search documents
> in many languages without knowing the language of the document and the query
> before?
>
> - Geschan