Hi,

this is correct. Usually one does not know how a stemmer (or any
other language-specific filter) behaves in the context of a
foreign language.

But there is an exception that sometimes comes to the rescue:
If one has a stable dictionary of terms in all the languages
of interest, then one might put these terms in a synonym list
and also into a list of protected words for the stemmers. Then
searches for one of those terms in any language will return the
documents regardless of their own language.

Of course this does not solve the general problem of cross-language
search, but it helps in certain circumstances.
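
To illustrate the idea, here is a minimal Python sketch, not Solr
configuration. The suffix rules are toy stand-ins for the real Snowball
stemmers, and the names PROTECTED, SYNONYMS and analyze, as well as the
example terms, are made up for this illustration.

```python
# Toy illustration only: these suffix rules are NOT the real Snowball stemmers.
SUFFIXES = {
    "it": ("i", "o", "a", "e"),
    "de": ("en", "er", "e"),
}

# Hypothetical dictionary of terms that must survive stemming unchanged.
PROTECTED = {"auto", "wagen", "macchina"}

# Hypothetical synonym list: every protected term maps to one shared token.
SYNONYMS = {"auto": "car", "wagen": "car", "macchina": "car"}


def analyze(text, lang):
    """Lowercase, split on whitespace, then map or stem each token."""
    tokens = []
    for token in text.lower().split():
        if token in PROTECTED:
            # Protected term: skip the stemmer, apply the synonym mapping.
            tokens.append(SYNONYMS.get(token, token))
            continue
        for suffix in SUFFIXES[lang]:
            # Strip the first matching suffix, but keep at least 3 characters.
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        tokens.append(token)
    return tokens


italian_doc = analyze("la macchina rossa", "it")  # indexed with the Italian chain
german_query = analyze("Wagen", "de")             # query run through the German chain
print(set(german_query) & set(italian_doc))       # shared tokens mean the doc matches
```

Because the dictionary terms bypass the stemmer and collapse to the same
token in every language chain, the German query still hits the Italian
document. Terms outside the dictionary get no such help.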

Cheers,
   Sven

--On Thursday, 11 February 2010 13:45 -0800 Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

Claudio,

Ah, through multilingual indexing/search work (with
http://www.sematext.com/products/multilingual-indexer/index.html ) I
learned that cross-language search often doesn't really make sense,
unless the search involves "universal terms" (e.g. Fiat, BMW, Mercedes,
Olivetti, Tomi de Paola, Alberto Tomba...).  If the search involves
natural language-specific terms, then searching in the "foreign" language
doesn't work so well and doesn't make a ton of sense.  Imagine a search for
"ciao ragazzi".  I have no idea what the Italian stemmer does with that, but
say it turns it into "cia raga" (it doesn't, but just imagine).  If this
was done with Italian docs at index time, you will find the matching
docs.  But what happens if "ciao ragazzi" was analyzed by some German
analyzer?  Different tokens will be created and indexed, so a "ciao
ragazzi" search won't work.  And which Analyzer would you use to analyze
that query anyway?  Italian or German?
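
To make the mismatch concrete, here is a small Python sketch. Both
"stemmers" are invented suffix strippers (toy_italian and toy_german
are hypothetical names, and the real Snowball stemmers produce
different output); the only point is that index-time and query-time
analysis must agree for the tokens to match.

```python
# Toy suffix strippers; the real Snowball stemmers behave differently.
def toy_italian(token):
    # Drop one trailing vowel from words longer than 3 characters.
    return token[:-1] if len(token) > 3 and token[-1] in "aeio" else token


def toy_german(token):
    # Drop the first matching German-looking suffix, keeping at least 3 characters.
    for suffix in ("en", "er", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, stemmer):
    """Lowercase, split on whitespace, and stem every token."""
    return [stemmer(t) for t in text.lower().split()]


indexed_it = analyze("ciao ragazzi", toy_italian)  # Italian doc, Italian chain
indexed_de = analyze("ciao ragazzi", toy_german)   # same text, German chain
query_it = analyze("ciao ragazzi", toy_italian)    # query analyzed as Italian

print(query_it == indexed_it)  # same chain on both sides: tokens line up
print(query_it == indexed_de)  # different chains: tokens differ, no match
```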

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
From: Claudio Martella <claudio.marte...@tis.bz.it>
To: solr-user@lucene.apache.org
Sent: Thu, February 11, 2010 3:21:32 AM
Subject: Re: dismax and multi-language corpus

I'll try removing the '-'. I do need to search it now. The other option
would be to ask the user which language to query, but in my region we
use Italian and German in equal measure, so it would end up
querying both languages all the time. Or did you mean a more performant
way of querying both languages all the time? :)


Otis Gospodnetic wrote:
> Claudio - fields with '-' in them can be problematic.
>
> Side comment: do you really want to search across all languages at
> once?  If not, maybe 3 different dismax configs would make your
> searches better.
>
>  Otis
>
> ----- Original Message ----
>
>> From: Claudio Martella
>> To: solr-user@lucene.apache.org
>> Sent: Wed, February 10, 2010 3:15:40 PM
>> Subject: dismax and multi-language corpus
>>
>> Hello list,
>>
>> I have a corpus with 3 languages, so I set up a text content field
>> (with no stemming) and 3 text-[en|it|de] fields with specific
>> Snowball stemmers. I copyField the text to my language-aware fields.
>> So, I set up this dismax searchHandler:
>>
>>
>>
>>   dismax
>>   title^1.2 content-en^0.8 content-it^0.8 content-de^0.8
>>   title^1.2 content-en^0.8 content-it^0.8 content-de^0.8
>>   title^1.2 content-en^0.8 content-it^0.8 content-de^0.8
>>   0.1
>>
>>
>>
>>
>> but I get this error:
>>
>> HTTP Status 400 - org.apache.lucene.queryParser.ParseException:
>> Expected ',' at position 7 in 'content-en'
>>
>> type Status report
>>
>> message org.apache.lucene.queryParser.ParseException: Expected ',' at
>> position 7 in 'content-en'
>>
>> description The request sent by the client was syntactically incorrect
>> (org.apache.lucene.queryParser.ParseException: Expected ',' at
>> position 7 in 'content-en').
>>
>> Any idea?
>>
>> TIA
>>
>> Claudio
>>
>> --
>> Claudio Martella
>> Digital Technologies
>> Unit Research & Development - Analyst
>>
>> TIS innovation park
>> Via Siemens 19 | Siemensstr. 19
>> 39100 Bolzano | 39100 Bozen
>> Tel. +39 0471 068 123
>> Fax  +39 0471 068 129
>> claudio.marte...@tis.bz.it http://www.tis.bz.it
>>
>
>
>

