RE: IDF maxDocs / numDocs

Markus Jelsma Thu, 13 Mar 2014 02:01:26 -0700
Oh yes, I see what you mean. I would try SOLR-1632 to get distributed IDF, 
but it seems to be broken now.
 
-----Original message-----
> From:Steven Bower <smb-apa...@alcyon.net>
> Sent: Wednesday 12th March 2014 21:47
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: IDF maxDocs / numDocs
> 
> My problem is that maxDoc() and docCount() both include deleted documents
> in their values. Because of merging etc., those numbers can differ per
> replica (or at least that is what I'm seeing). I need a value that is
> consistent across replicas. I see the comment mentions not using
> IndexReader.numDocs(), but there doesn't seem to be a way to get hold of
> the IndexReader within a similarity implementation (only TermStatistics and
> CollectionStatistics are passed in, and neither contains a reference to the
> reader).
> 
> I am contemplating just using a static value for the "number of docs", as
> this won't change dramatically very often.
> 
> steve
> 
> 
> On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> 
> > Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in
> > idfExplain, but there's also a docCount(). We use docCount() in all our custom
> > similarities, also because it allows you to have multiple languages in one
> > index where one is much larger than the other. The small language will have
> > very high IDF scores using maxDoc(), but they are proportional enough using
> > docCount(). Using docCount() also fixes SolrCloud ranking problems, unless
> > one of your replicas becomes inconsistent ;)
> >
> >
> > https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
> >
> >
> >
> > -----Original message-----
> > > From:Steven Bower <smb-apa...@alcyon.net>
> > > Sent: Wednesday 12th March 2014 16:08
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: IDF maxDocs / numDocs
> > >
> > > I am noticing that maxDocs is consistently different between replicas and
> > > that it is used in the idf calculation, which causes idf scores for the
> > > same query/doc to differ between replicas. Obviously an optimize can
> > > normalize the maxDocs values, but that is only temporary. Is there a way
> > > to have idf use numDocs instead (as it should be consistent across
> > > replicas)?
> > >
> > > thanks,
> > >
> > > steve
> > >
> >
>
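
To make the effect discussed in this thread concrete, here is a minimal,
dependency-free sketch. It assumes the classic Lucene TF-IDF formula,
idf = 1 + ln(numDocs / (docFreq + 1)), and does not call Lucene itself. The
class name IdfDrift and all document counts are hypothetical. It also holds
docFreq fixed for simplicity, although in practice docFreq can include
deleted documents as well and may drift between replicas too.

```java
class IdfDrift {
    // Assumed formula: the classic Lucene TF-IDF idf,
    // idf = 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long docFreq = 99;           // hypothetical: documents containing the term
        long maxDocReplicaA = 1000;  // hypothetical: segments merged, no deletes left
        long maxDocReplicaB = 1200;  // hypothetical: 200 deleted docs not yet merged away
        long staticNumDocs = 1000;   // hypothetical fixed value configured on every replica

        // Using each replica's own maxDoc: the same term gets a different idf.
        System.out.println("replica A, maxDoc: " + idf(docFreq, maxDocReplicaA));
        System.out.println("replica B, maxDoc: " + idf(docFreq, maxDocReplicaB));

        // Using one static value everywhere: the idf is identical on both replicas.
        System.out.println("replica A, static: " + idf(docFreq, staticNumDocs));
        System.out.println("replica B, static: " + idf(docFreq, staticNumDocs));
    }
}
```

The static value trades accuracy for consistency: scores stay comparable
across replicas, but the constant has to be revisited if the collection
grows or shrinks substantially.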