Re: Designing a multilingual index

Paul Libbrecht Tue, 03 Jan 2012 05:52:25 -0800

On 3 Jan 2012, at 13:56, heikki wrote:

> In our case, it is "known" in which language the user is searching (because
> he tells us, and if he doesn't, we use the current GUI language).


On the web it is often hard to trust such information (e.g. because people 
work in multiple languages, use internet cafés, ...), but it is your choice.

> Results
> are returned so that results in the requested language are ordered on top,
> and within that, ordered by relevance. Results in other languages are also
> returned, and presented after the requested-language results, ordered by
> relevance.

After?
Would "shallow matches" in the right language come after "precise matches" in a 
wrong language?

> If the results in the requested language contain say one which has term A
> and one which has term B, their positions in the relevance ranking (within
> the requested-language results on top) can be influenced by occurrences of
> terms A and B in the other languages, if a single search is used.
> 
> I agree to the apples/oranges remark: if a term occurs in more than one
> language, likely its IDF frequency is different for each language, so to
> have the best relevance ranking there should be separate indexes for each
> language. And searches should be really separate searches (no MultiSearcher
> which would produce combined relevance scores). So the results should also
> be presented as several, separate result sets.


I believe the right solution for this is simple: use different fields per 
language.

In both Solr and plain Lucene, using different fields allows different 
analyzers, which is what you want (e.g. a different stemmer per language).
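
As a rough illustration of the per-language field layout, here is a small 
Python sketch. The field names (text_en, text_fr, text_de) and the analyzer 
labels are made up for the example; they are not actual Solr or Lucene 
configuration, only a picture of "one field per language, one analyzer per 
field".

```python
# Hypothetical mapping from language code to (field name, analyzer label).
# The labels only name the kind of analysis intended; they are not real
# Solr or Lucene class names.
LANGUAGE_FIELDS = {
    "en": ("text_en", "english-stemming"),
    "fr": ("text_fr", "french-stemming"),
    "de": ("text_de", "german-stemming"),
}


def build_document(doc_id, language, text):
    """Put the text into the field that belongs to the document's language."""
    if language not in LANGUAGE_FIELDS:
        raise ValueError("unsupported language: %s" % language)
    field_name, _analyzer = LANGUAGE_FIELDS[language]
    return {"id": doc_id, "language": language, field_name: text}


doc = build_document("42", "fr", "Les chevaux sauvages")
```

All documents stay in one index; only the field that receives the text 
differs per language.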

Using different indexes is certainly a hassle; using different fields is not, 
really.

The important bit is to use query expansion.
Given a user query (with or without parameters, with text queries), expand it 
into a query where the "normal text" is expected to be in the right language 
but may also be in one of the other languages (those the browser announces, 
those your platform supports), with less weight of course.

In my case, query expansion is done by post-processing the result of the 
query parser.
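
A minimal sketch of that kind of post-processing, in Python rather than 
against the real Lucene Query classes. The Term and Bool types, the default 
field name "text", the text_<lang> field names and the 0.3 weight are all 
assumptions for the example, not my actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Stand-in for a parsed term clause (hypothetical, not a Lucene class)."""
    field: str
    text: str
    boost: float = 1.0


@dataclass(frozen=True)
class Bool:
    """Stand-in for a parsed boolean clause: op is "AND" or "OR"."""
    op: str
    clauses: tuple


def expand(query, main_language, other_languages, other_boost=0.3):
    """Rewrite every clause on the default field into per-language clauses."""
    if isinstance(query, Term):
        if query.field != "text":
            # Parametric clauses (e.g. author:...) are left untouched.
            return query
        variants = [Term("text_" + main_language, query.text, query.boost)]
        for lang in other_languages:
            if lang != main_language:
                # Other languages still match, but with less weight.
                variants.append(
                    Term("text_" + lang, query.text, query.boost * other_boost)
                )
        return Bool("OR", tuple(variants))
    return Bool(
        query.op,
        tuple(expand(c, main_language, other_languages, other_boost)
              for c in query.clauses),
    )


parsed = Bool("AND", (Term("text", "horse"), Term("author", "smith")))
expanded = expand(parsed, "en", ["en", "fr"])
```

Here the "horse" clause becomes a disjunction over text_en (weight 1.0) and 
text_fr (weight 0.3), while the author clause passes through unchanged.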

Then you can also distinguish fields by how precise the match is: one field 
with exact matches (using the whitespace tokenizer), one field with stemmed 
matches (e.g. using the Porter family of stemmers), and one field with 
phonetic matches.
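
To make the weighting idea concrete, here is a small Python sketch that 
combines a weight per language with a weight per kind of match. The field 
naming scheme (text_<lang>_<kind>) and all the numbers are invented for 
illustration; the right values depend on your data and need tuning.

```python
# Hypothetical weights: more precise kinds of match count more.
PRECISION_BOOSTS = {"exact": 3.0, "stem": 1.5, "phon": 0.5}


def weighted_fields(language_boosts):
    """Return (field, boost) pairs for every language and kind of match,
    sorted from the highest boost to the lowest."""
    pairs = []
    for lang, lang_boost in language_boosts.items():
        for kind, kind_boost in PRECISION_BOOSTS.items():
            field = "text_%s_%s" % (lang, kind)
            pairs.append((field, lang_boost * kind_boost))
    return sorted(pairs, key=lambda pair: -pair[1])


# Requested language "en" at full weight, "fr" at half weight.
fields = weighted_fields({"en": 1.0, "fr": 0.5})
```

With these particular numbers an exact match in the other language gets the 
same weight as a stemmed match in the requested language, which is one way 
to decide the "shallow match in the right language versus precise match in 
the wrong language" question.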

Hope it helps.

paul

> Does anyone have experience with this ? Opinions ? Is the improved relevance
> per language worth the "hassle" of having separate indexes, doing separate
> searches and presenting results per language ? We do already take care of
> using appropriate stopwords/different analyzers when indexing and searching a
> particular language, but that's a different issue obviously.
