Re: Skewed IDF in multi lingual index

Robert Muir Mon, 26 Nov 2012 19:33:22 -0800

Hi again Markus. Sorry for the slow reply here.

I'm confused: are you saying the score goes negative? Are you sure there are
no 3.x segments? Can you check that docCount is not -1? Do you happen to
have a test, or can you share your modified similarity or give more details?


I just want to make sure there isn't a bug in Lucene here (we currently verify
this statistic in CheckIndex and other places, but there is always the
possibility).
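
As a rough check of the numbers in the message quoted below, here is a small
sketch. It assumes the classic TF-IDF formula idf = 1 + ln(numDocs / (docFreq + 1))
and a field present in a single document (docCount = 1, docFreq = 1); the
function name and the extra boost values are made up for illustration. Under
those assumptions the idf comes out around 0.31: positive, but below 1, so it
shrinks a product of multiplicative boosts rather than growing it.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Assumed scenario: the boosted field occurs in exactly one document,
# so docCount = 1 and the matching term has docFreq = 1.
scarce_idf = classic_idf(doc_freq=1, num_docs=1)
print(f"idf with docCount=1, docFreq=1: {scarce_idf:.3f}")

# Illustrative only: hypothetical boosts contributed by other clauses.
other_boosts = 2.0 * 1.5
# A factor below 1 lowers the product instead of raising it.
print("product is lowered:", other_boosts * scarce_idf < other_boosts)
```

This looks only at the idf factor; the full score also depends on term
frequency, norms and query normalization.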

On Mon, Nov 12, 2012 at 7:39 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> I'd like to add that multiplicative boosting on very scarce properties,
> e.g. boosting on a boolean value that only very few documents have,
> causes a problem in scoring when using docCount instead of maxDoc. If
> docCount is one, IDF will be ~0.3, and with the fieldWeight you'll end up
> with a score below 0. Because of this the product of all multiplicative
> boosts will be lower than the product of similar boosts, lowering the
> document in rank instead of boosting it.
>
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Fri 09-Nov-2012 10:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Skewed IDF in multi lingual index
> >
> > Robert, Tom,
> >
> > That's it indeed! Using maxDoc as the numerator as opposed to docCount
> > yields very skewed results for an unevenly distributed multi-lingual
> > index. We have one language dominating the other twenty, so the
> > dominating language contains no rare terms compared to the others.
> >
> > We're now checking results using docCount and it seems alright. I do
> > have to get used to the fact that document scores are now roughly 1000
> > times higher than before, but I'm already very happy with
> > CollectionStatistics and will see if all works well.
> >
> > Any other tips to share?
> >
> > Thanks,
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:Robert Muir <rcm...@gmail.com>
> > > Sent: Thu 08-Nov-2012 17:44
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Skewed IDF in multi lingual index
> > >
> > > Hi Markus: how are the languages distributed across documents?
> > >
> > > Imagine I have a text_en field and a text_fr field. Let's say I have
> > > 100 documents, 95 are English and only 5 are French.
> > > So the text_en field is populated 95% of the time, and the text_fr 5%
> > > of the time.
> > >
> > > But the default IDF computation doesn't look at things this way: it
> > > always uses '100' as maxDoc. So in such a situation, any terms against
> > > text_fr are "rare" :)
> > >
> > > The first thing I would look at is treating this situation as merging
> > > results from an English index with 95 docs and a French index with 5
> > > docs.
> > > So I would consider overriding the two idfExplain methods (term and
> > > phrase) to use CollectionStatistics.docCount() instead of
> > > CollectionStatistics.maxDoc().
> > > The former would be 95 for the English field (instead of 100), and 5
> > > for the French field (instead of 100).
> > >
> > > I don't think this will solve all your problems, but it might help.
> > >
> > > Note: you must ensure your index is fully upgraded to 4.0 to try this
> > > statistic, otherwise it will return -1 if you have any 3.x segments in
> > > your index.
> > >
> > > On Thu, Nov 8, 2012 at 11:13 AM, Markus Jelsma
> > > <markus.jel...@openindex.io> wrote:
> > > > Hi,
> > > >
> > > > We're testing a large multi lingual index with _LANG fields for
> > > > each language and using dismax to query them all. Users provide,
> > > > explicitly or implicitly, language preferences that we use for either
> > > > additive or multiplicative boosting on the language of the document.
> > > > However, additive boosting is not adequate because it cannot overcome
> > > > the extremely high IDF values for the same word in another language,
> > > > so regardless of the preference, foreign documents are returned.
> > > > Multiplicative boosting solves this problem but has another downside:
> > > > it doesn't allow us, with standard qf=field^boost, to prefer
> > > > documents in another language above the preferred language because
> > > > the multiplicative boost is so strong. We do use the def function
> > > > (boost=def(query($qq),.3)) to prevent one boost query from returning
> > > > 0 and thus a product of 0 for all boost queries, but it doesn't help
> > > > that much.
> > > >
> > > > This all comes down to IDF differences between the languages; even
> > > > common words such as country names like `india` show large
> > > > differences in IDF. Does anyone here have hints or experiences to
> > > > share about skewed IDF in such an index?
> > > >
> > > > Thanks,
> > > > Markus
> > >
> >
>
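
To make the 95/5 example from the thread concrete, the sketch below compares
idf computed against the index-wide maxDoc with idf computed against a
per-field docCount. It is plain arithmetic, not Lucene code: the formula
idf = 1 + ln(numDocs / (docFreq + 1)) is assumed to be the classic TF-IDF one,
and the docFreq values (40 for the English term, 2 for the French term) are
hypothetical numbers chosen for illustration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


MAX_DOC = 100                              # all documents in the index
DOC_COUNT = {"text_en": 95, "text_fr": 5}  # documents that have each field
DOC_FREQ = {"text_en": 40, "text_fr": 2}   # hypothetical term frequencies


def compare(field: str) -> tuple[float, float]:
    """Return (idf using maxDoc, idf using the field's docCount)."""
    doc_freq = DOC_FREQ[field]
    return (classic_idf(doc_freq, MAX_DOC),
            classic_idf(doc_freq, DOC_COUNT[field]))


for field in DOC_COUNT:
    by_max_doc, by_doc_count = compare(field)
    print(f"{field}: maxDoc idf={by_max_doc:.3f}, "
          f"docCount idf={by_doc_count:.3f}")
```

With these numbers the English idf barely moves, while the French idf drops
by ln(100/5), roughly 3, which is the kind of gap that lets terms in the
minority-language field dominate when maxDoc is used.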
