Re: Indexing and searching documents in different languages

2013-04-10 Thread dev

Thanks, I'll try this approach.

Quoting Alexandre Rafalovitch arafa...@gmail.com:


Have you looked at edismax and the 'qf' fields parameter? It allows you to
define the fields to search. Also, you can define those parameters in
solrconfig.xml and not have to send them down the wire.

Finally, you can define several different request handlers (e.g. /ensearch,
/frsearch) and have each of them use different 'qf' values, possibly with the
'fl' parameter also defined, using field name aliasing to map language-specific
names to generic ones.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Indexing and searching documents in different languages

2013-04-09 Thread dev


Hello,

I'm trying to index a large number of documents in different languages.
I don't know the language of each document in advance, so I'm using
TikaLanguageIdentifierUpdateProcessorFactory to identify it.


So, this is my configuration in solrconfig.xml:

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
      <bool name="langid">true</bool>
      <str name="langid.fl">title,subtitle,content</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.threshold">0.3</str>
      <str name="langid.fallback">general</str>
      <str name="langid.whitelist">en,fr,de,it,es</str>
      <bool name="langid.map">true</bool>
      <bool name="langid.map.keepOrig">true</bool>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

So, the detection works fine and I put some dynamic fields in
schema.xml to store the results:

  <dynamicField name="*_en" type="text_en" indexed="true"
                stored="true" multiValued="true"/>
  <dynamicField name="*_fr" type="text_fr" indexed="true"
                stored="true" multiValued="true"/>
  <dynamicField name="*_de" type="text_de" indexed="true"
                stored="true" multiValued="true"/>
  <dynamicField name="*_it" type="text_it" indexed="true"
                stored="true" multiValued="true"/>
  <dynamicField name="*_es" type="text_es" indexed="true"
                stored="true" multiValued="true"/>
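
For illustration only: with langid.map and langid.map.keepOrig both set to true, a document detected as German would be expected to carry fields roughly like the ones below. This is a sketch and not output from the poster's index. The id field and all field values are hypothetical.

```xml
<!-- Hypothetical document as it might look after the langid chain has run. -->
<doc>
  <field name="id">doc-1</field>                     <!-- assumed unique key -->
  <field name="language_s">de</field>                <!-- written to langid.langField -->
  <field name="title">Beispieltitel</field>          <!-- original kept by keepOrig -->
  <field name="title_de">Beispieltitel</field>       <!-- mapped copy, matches *_de -->
  <field name="content">Beispieltext ...</field>
  <field name="content_de">Beispieltext ...</field>
</doc>
```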


My main problem now is how to search the documents without knowing the
language of the document being searched for.
I don't want to have a huge query string like
?q=title_en:+term+subtitle_en:+term+title_de:+term...
I could use copyField to copy all fields into the text field, but
text has the type text_general, so the language-specific analysis
would be lost. I could at least use one combined field per language
(like text_en, text_fr...), but the query string would still get very
long, and adding new languages would be very cumbersome.


So, what can I do? Is there a better way to index and search
documents in many languages without knowing the language of the
document or the query in advance?


- Geschan



Re: Indexing and searching documents in different languages

2013-04-09 Thread Otis Gospodnetic
Hi,

Typically people try to figure out the query language somehow.
Queries are short, so language identification (LID) on them is hard.
But the user profile could indicate a language, or users could simply
be asked.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/








Re: Indexing and searching documents in different languages

2013-04-09 Thread Alexandre Rafalovitch
Have you looked at edismax and the 'qf' fields parameter? It allows you to
define the fields to search. Also, you can define those parameters in
solrconfig.xml and not have to send them down the wire.
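
A minimal sketch of what such a configuration could look like in solrconfig.xml. It assumes the mapped field names follow from the langid.fl setting (title, subtitle, content) and the five whitelisted languages. The handler name /multisearch is hypothetical.

```xml
<!-- Hypothetical handler: searches every language-specific field via edismax. -->
<requestHandler name="/multisearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- qf lists the fields to query; adding a language means adding one line here. -->
    <str name="qf">
      title_en subtitle_en content_en
      title_fr subtitle_fr content_fr
      title_de subtitle_de content_de
      title_it subtitle_it content_it
      title_es subtitle_es content_es
    </str>
  </lst>
</requestHandler>
```

With defaults like these, a request to this handler only needs ?q=term, and the field list stays out of the query string.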

Finally, you can define several different request handlers (e.g. /ensearch,
/frsearch) and have each of them use different 'qf' values, possibly with the
'fl' parameter also defined, using field name aliasing to map language-specific
names to generic ones.
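
As a sketch of the per-language variant, assuming the same mapped field names (title_en, subtitle_en, content_en) and an id field as unique key, which the thread does not confirm:

```xml
<!-- Sketch of a handler for English queries; /frsearch etc. would follow the same pattern. -->
<requestHandler name="/ensearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_en subtitle_en content_en</str>
    <!-- fl aliases (alias:field) return the language-specific fields under generic names. -->
    <str name="fl">id,language_s,title:title_en,subtitle:subtitle_en,content:content_en</str>
  </lst>
</requestHandler>
```

The client then picks the handler that matches the query language and always reads title, subtitle and content from the response, whichever language was searched.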

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

