Skewed IDF in multi lingual index

Markus Jelsma Thu, 08 Nov 2012 08:09:14 -0800

Hi,

We're testing a large multilingual index with _LANG fields for each language,
using dismax to query them all. Users provide language preferences, explicitly
or implicitly, which we use for either additive or multiplicative boosting on
the language of the document. However, additive boosting is not adequate
because it cannot overcome the extremely high IDF values for the same word in
another language, so regardless of the preference, foreign documents are
returned. Multiplicative boosting solves this problem but has the opposite
downside: the multiplicative boost is so strong that we can no longer use the
standard qf=field^boost to rank documents in another language above those in
the preferred language. We do use the def function
(boost=def(query($qq),.3)) to prevent a single boost query from returning 0 and
thus producing a product of 0 across all boost queries, but it doesn't help
that much.
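
As a rough illustration of the trade-off described above, here is a minimal
Python sketch. It is not Solr code: the scores, boost values and function names
are all made up for the example, and the fallback of 0.3 only loosely models
what def(query($qq),.3) does for a boost query that does not match.

```python
# Hypothetical model of additive vs. multiplicative language boosting.
# All numbers are invented; only the relative ordering matters.

def additive_score(base, boost):
    """Additive boosting: the boost is simply added to the base score."""
    return base + boost

def multiplicative_score(base, boosts, default=0.3):
    """Multiplicative boosting: the base score is multiplied by every boost.

    A boost of None stands for a boost query that did not match; it falls
    back to `default` instead of 0 so the whole product is not zeroed out.
    """
    product = 1.0
    for b in boosts:
        product *= default if b is None else b
    return base * product

# Base scores: the foreign document scores higher purely because the term
# has a much higher IDF in its language field.
native_base, foreign_base = 2.0, 9.0

# Additive: the language boost cannot close the gap, the foreign doc wins.
add_native = additive_score(native_base, 3.0)
add_foreign = additive_score(foreign_base, 0.0)

# Multiplicative: the preferred language wins easily ...
mul_native = multiplicative_score(native_base, [10.0])
mul_foreign = multiplicative_score(foreign_base, [None])

# ... but even doubling the foreign field weight (as with qf=field^2)
# can no longer lift the foreign document above the preferred one.
mul_foreign_weighted = multiplicative_score(foreign_base * 2.0, [None])

print(add_native, add_foreign)
print(mul_native, mul_foreign, mul_foreign_weighted)
```

With these invented numbers the additive variant ranks the foreign document
first, while the multiplicative variant keeps the preferred-language document
first even after the foreign field weight is doubled.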


This all comes down to IDF differences between the languages. Even common
words, such as country names like `india`, show large differences in IDF. Does
anyone have hints or experiences to share about skewed IDF in such an index?
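
To show how such a skew can arise, here is a small Python sketch. It assumes
the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), with
numDocs counting every document in the index rather than only the documents
that have the given language field. The field names and document counts are
hypothetical.

```python
import math

def classic_idf(num_docs, doc_freq):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical index: 10 million documents in total.
total_docs = 10_000_000

# The term occurs in 1% of the documents of each language, but the
# languages differ greatly in size (9,000,000 vs. 100,000 documents).
doc_freq_en = 90_000   # assumed docFreq of "india" in an English field
doc_freq_nl = 1_000    # assumed docFreq of "india" in a Dutch field

idf_en = classic_idf(total_docs, doc_freq_en)
idf_nl = classic_idf(total_docs, doc_freq_nl)

print(f"idf (large language): {idf_en:.2f}")
print(f"idf (small language): {idf_nl:.2f}")
```

Under these assumptions the term is equally common within each language, yet
the smaller language gets a much higher IDF because its document frequency is
measured against the size of the whole index.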

Thanks,
Markus
