Re: Indexing and searching documents in different languages
Thanks, I'll try this approach.

Quoting Alexandre Rafalovitch arafa...@gmail.com:

Have you looked at edismax and the 'qf' fields parameter? It allows you to define the fields to search. Also, you can define those parameters in solrconfig.xml and not have to send them down the wire. Finally, you can define several different request handlers (e.g. /ensearch, /frsearch) and have each of them use different 'qf' values, possibly with 'fl' field also defined and with field name aliasing from language-specific to generic names.

On Tue, Apr 9, 2013 at 2:32 PM, d...@geschan.de wrote:

Is there a better solution to index and search documents in many languages without knowing the language of the document and the query beforehand?
Indexing and searching documents in different languages
Hello,

I'm trying to index a large number of documents in different languages. I don't know the language of each document in advance, so I'm using TikaLanguageIdentifierUpdateProcessorFactory to identify it. This is my configuration in solrconfig.xml:

    <updateRequestProcessorChain name="langid">
      <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
        <bool name="langid">true</bool>
        <str name="langid.fl">title,subtitle,content</str>
        <str name="langid.langField">language_s</str>
        <str name="langid.threshold">0.3</str>
        <str name="langid.fallback">general</str>
        <str name="langid.whitelist">en,fr,de,it,es</str>
        <bool name="langid.map">true</bool>
        <bool name="langid.map.keepOrig">true</bool>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

The detection works fine, and I put some dynamic fields in schema.xml to store the results:

    <dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true"/>
    <dynamicField name="*_es" type="text_es" indexed="true" stored="true" multiValued="true"/>

My main problem now is how to search the documents without knowing the language of the document I'm looking for. I don't want to have a huge query string like

    ?q=title_en:+term+subtitle_en:+term+title_de:+term...

Okay, I could use copyField and copy all fields into the text field, but text has the type text_general, so the language-specific analysis would not be applied. I could at least use a combined field for every language (like text_en, text_fr...), but my query string still gets very long, and adding new languages is terribly cumbersome.

So, what can I do? Is there a better solution to index and search documents in many languages without knowing the language of the document and the query beforehand?

- Geschan
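To make the effect of the langid settings above concrete, here is a rough sketch of what one document might look like before and after the chain runs. The id and the text values are invented for illustration, the result assumes Tika detects "de" with confidence above the 0.3 threshold, and it assumes the update request actually selects the chain (for example with update.chain=langid).

```xml
<!-- Hypothetical input document; the id and the text values are invented. -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <field name="title">Ein Beispieltitel</field>
    <field name="content">Dies ist ein kurzer deutscher Beispieltext.</field>
  </doc>
</add>

<!-- Fields one would expect in the index if the detected language is "de":

     id          doc-1
     language_s  de                  (written because of langid.langField)
     title       Ein Beispieltitel   (kept because langid.map.keepOrig is true)
     title_de    Ein Beispieltitel   (added by langid.map, matches *_de / text_de)
     content     Dies ist ein ...    (kept)
     content_de  Dies ist ein ...    (added by langid.map)
-->
```

With this layout each document carries its text in exactly one set of language-suffixed fields, which is why a search has to name all of the *_en, *_fr, *_de, *_it and *_es variants in some way.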
Re: Indexing and searching documents in different languages
Hi,

Typically people try to figure out the query language somehow. Queries are short, so language identification (LID) on them is hard. But a user profile could indicate a language, or users can be asked, and so on.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
Re: Indexing and searching documents in different languages
Have you looked at edismax and the 'qf' fields parameter? It allows you to define the fields to search. Also, you can define those parameters in solrconfig.xml and not have to send them down the wire.

Finally, you can define several different request handlers (e.g. /ensearch, /frsearch) and have each of them use different 'qf' values, possibly with 'fl' field also defined and with field name aliasing from language-specific to generic names.

Regards,
Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
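As a rough sketch of the edismax suggestion, the solrconfig.xml fragment below defines one handler that searches all language variants and one French-only handler. The handler name /search, the id field, and the exact field lists are assumptions built from the title, subtitle and content fields in the original post; /frsearch is the name used as an example in the reply. Treat it as a starting point, not a tested configuration.

```xml
<!-- Generic handler: the client sends only q=term; edismax expands it over
     every language-specific field and analyzes the term per field type. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
      title_en subtitle_en content_en
      title_fr subtitle_fr content_fr
      title_de subtitle_de content_de
      title_it subtitle_it content_it
      title_es subtitle_es content_es
    </str>
  </lst>
</requestHandler>

<!-- Per-language handler, for cases where the query language is known. -->
<requestHandler name="/frsearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_fr subtitle_fr content_fr</str>
    <!-- fl aliasing: return the French fields under generic names -->
    <str name="fl">id,language_s,title:title_fr,subtitle:subtitle_fr</str>
  </lst>
</requestHandler>
```

With handlers like these, a request such as /search?q=term stays short, and adding a language means extending the qf list in one place (plus its dynamicField and whitelist entry) instead of changing every client query.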