Expanding the query to use both the tagged and untagged terms might work. I’m not sure the effect would be much different from boosting the preferred language.
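
A minimal sketch of that expansion, in Python. It assumes terms were indexed with a Unicode language-tag prefix (U+E0001 followed by tag characters, as in the quoted message below) and that the query parser accepts Lucene-style `term^boost` and `OR`. The helper names and the boost of 2.0 are hypothetical, chosen only for illustration.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG


def language_tag(lang: str) -> str:
    """Spell an ASCII language code (e.g. "en") with Unicode tag characters."""
    if not lang.isascii():
        raise ValueError("language code must be ASCII")
    # Each tag character is the ASCII code point shifted into the U+E0000 block.
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in lang.lower())


def expand_term(term: str, lang: str, tagged_boost: float = 2.0) -> str:
    """Build a clause matching the language-tagged term (boosted) or the untagged term."""
    tagged = language_tag(lang) + term
    return f"({tagged}^{tagged_boost} OR {term})"


# The tag characters are invisible in most terminals, so show the escaped form.
print(ascii(expand_term("LaserJet", "en")))
```

The untagged clause keeps cross-language matches possible, while the boosted tagged clause favours documents in the user's language, which is why the net effect may end up close to a plain language boost.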
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 30, 2017, at 8:35 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> This is unfortunately not what we want. Some customers use filters to
> restrict language, but some customers don't. They want to be able to find
> documents in all languages, so we use user preference to get their local
> language on top. Except for very relevant documents in foreign languages,
> hence the deboost is not too low.
>
> Thanks,
> Markus
>
> -----Original message-----
>> From: Walter Underwood <wun...@wunderwood.org>
>> Sent: Thursday 30th November 2017 17:29
>> To: solr-user@lucene.apache.org
>> Subject: Re: Skewed IDF in multi lingual index, again
>>
>> I’ve occasionally considered using Unicode language tags (U+E0001 and
>> friends) on each term. That would make a term specific to a language, so we
>> would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a
>> pretty big hammer, because it restricts matches to the same language. If the
>> entire document is in one language, might as well use a filter query for
>> that language. The tags would work for multiple languages in one document.
>>
>> Maybe make the untagged term a synonym. For cross-language terms like
>> “LaserJet”, the untagged one would have worse idf.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>
>>> Hello,
>>>
>>> We already discussed this problem five years ago [1]. In short: documents
>>> in foreign languages are scored higher for some terms.
>>>
>>> It was solved back then by using docCount instead of maxDoc when
>>> calculating idf, it worked really well! But, probably due to index changes,
>>> the problem is back for some terms, mostly proper nouns, well, just like
>>> five years ago.
>>>
>>> We already deboost documents by 0.7 that are not in the user's preference
>>> language but in some cases it is not enough. I can go on by reducing that
>>> boost but that's not what I prefer.
>>>
>>> I'd like to know if there are additional tricks to solve the problem.
>>>
>>> Many thanks!
>>> Markus
>>>
>>> [1] http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
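
For readers following the thread, the docCount versus maxDoc effect described in the original message can be illustrated with a small Python sketch. It assumes the BM25 idf formula, log(1 + (N - df + 0.5) / (df + 0.5)), and assumes each language is indexed in its own field so that docCount differs per language. All document counts below are made up for illustration and do not come from the thread.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf: log(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index: a large preferred-language field and a small foreign one.
PREF_DOCS, FOREIGN_DOCS = 1_000_000, 10_000
PREF_DF, FOREIGN_DF = 5_000, 50          # documents containing "LaserJet"
MAX_DOC = PREF_DOCS + FOREIGN_DOCS       # every document in the index
DEBOOST = 0.7                            # deboost for non-preferred languages

# With maxDoc, the rare foreign-language field gets a much larger idf.
idf_pref_max = bm25_idf(PREF_DF, MAX_DOC)
idf_foreign_max = bm25_idf(FOREIGN_DF, MAX_DOC)

# With docCount, each field is measured against its own document population.
idf_pref_dc = bm25_idf(PREF_DF, PREF_DOCS)
idf_foreign_dc = bm25_idf(FOREIGN_DF, FOREIGN_DOCS)

print(f"maxDoc:   preferred={idf_pref_max:.2f}  foreign={idf_foreign_max:.2f}")
print(f"docCount: preferred={idf_pref_dc:.2f}  foreign={idf_foreign_dc:.2f}")
print("deboosted foreign idf still wins with maxDoc:",
      DEBOOST * idf_foreign_max > idf_pref_max)
```

Ignoring term frequency and length normalization, this shows how a fixed 0.7 deboost can fail to compensate once the idf gap between languages grows large, and why per-field docCount narrows that gap when the term is equally common in both languages.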