Re: Skewed IDF in multi lingual index, again

Walter Underwood Thu, 30 Nov 2017 08:43:05 -0800

Expanding the query to use both the tagged and untagged term might work. I’m 
not sure the effect would be a lot different than boosting the preferred 
language.
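
A minimal sketch of what that expansion could look like, in case it helps. 
It only builds a query string in standard Lucene query syntax. The field 
name "text", the "[en]" prefix (standing in for a real language tag, as in 
my earlier mail), and both boost values are assumptions for illustration, 
not anything from an actual schema.

```python
def quote_term(term: str) -> str:
    """Wrap a term in double quotes, escaping backslashes and quotes,
    so characters such as [ and ] are not parsed as query operators."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def expand_term(term: str, lang: str, field: str = "text",
                tagged_boost: float = 1.0,
                untagged_boost: float = 0.7) -> str:
    """Build a clause matching the language-tagged OR the untagged term.

    All names and boost values here are hypothetical; the tagged form
    assumes terms were indexed with a "[lang]" prefix.
    """
    tagged = quote_term(f"[{lang}]{term}")
    untagged = quote_term(term)
    return (f"({field}:{tagged}^{tagged_boost} "
            f"OR {field}:{untagged}^{untagged_boost})")


print(expand_term("LaserJet", "en"))
```

The untagged clause keeps cross-language matches possible, and its lower 
boost plays roughly the same role as a deboost on the non-preferred 
languages, which is why I doubt the two approaches differ much.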


wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 30, 2017, at 8:35 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> This is unfortunately not what we want. Some customers use filters to 
> restrict language, but some customers don't. They want to be able to find 
> documents in all languages, so we use user preference to get their local 
> language on top. Very relevant documents in foreign languages should still 
> be able to rank highly, hence the deboost is not too low.
> 
> Thanks,
> Markus
> 
> 
> -----Original message-----
>> From:Walter Underwood <wun...@wunderwood.org>
>> Sent: Thursday 30th November 2017 17:29
>> To: solr-user@lucene.apache.org
>> Subject: Re: Skewed IDF in multi lingual index, again
>> 
>> I’ve occasionally considered using Unicode language tags (U+E0001 and 
>> friends) on each term. That would make a term specific to a language, so we 
>> would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a 
>> pretty big hammer, because it restricts matches to the same language. If the 
>> entire document is in one language, might as well use a filter query for 
>> that language. The tags would work for multiple languages in one document.
>> 
>> Maybe make the untagged term a synonym. For cross-language terms like 
>> “LaserJet”, the untagged one would have worse idf.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> 
>>> wrote:
>>> 
>>> Hello,
>>> 
>>> We already discussed this problem five years ago [1]. In short: documents 
>>> in foreign languages are scored higher for some terms.
>>> 
>>> It was solved back then by using docCount instead of maxDoc when 
>>> calculating idf, and it worked really well! But, probably due to index 
>>> changes, the problem is back for some terms, mostly proper nouns, just 
>>> like five years ago.
>>> 
>>> We already deboost documents that are not in the user's preferred 
>>> language by 0.7, but in some cases it is not enough. I could go on 
>>> reducing that boost, but that's not what I prefer.
>>> 
>>> I'd like to know if there are additional tricks to solve the problem.
>>> 
>>> Many thanks!
>>> Markus
>>> 
>>> [1] 
>>> http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
>> 
>>
