RE: IDF maxDocs / numDocs
Oh yes, I see what you mean. I would try SOLR-1632 to get distributed IDF, but it seems to be broken right now.

-----Original message-----
> From: Steven Bower
> Sent: Wednesday 12th March 2014 21:47
> To: solr-user
> Subject: Re: IDF maxDocs / numDocs
>
> My problem is that maxDoc() and docCount() both include deleted
> documents in their values. Because of merging etc., those numbers can
> differ per replica (or at least that is what I'm seeing). I need a value
> that is consistent across replicas. I see the comment mentions not using
> IndexReader.numDocs(), but there doesn't seem to be a way to get hold of
> the IndexReader within a similarity implementation (only TermStatistics
> and CollectionStatistics are passed in, and neither contains a reference
> to the reader).
>
> I am contemplating just using a static value for the "number of docs",
> as this won't change dramatically very often.
>
> steve
Re: IDF maxDocs / numDocs
My problem is that maxDoc() and docCount() both include deleted documents in their values. Because of merging etc., those numbers can differ per replica (or at least that is what I'm seeing). I need a value that is consistent across replicas. I see the comment mentions not using IndexReader.numDocs(), but there doesn't seem to be a way to get hold of the IndexReader within a similarity implementation (only TermStatistics and CollectionStatistics are passed in, and neither contains a reference to the reader).

I am contemplating just using a static value for the "number of docs", as this won't change dramatically very often.

steve

On Wed, Mar 12, 2014 at 11:18 AM, Markus Jelsma wrote:

> Hi Steve, it seems most similarities use CollectionStatistics.maxDoc() in
> idfExplain, but there's also a docCount(). We use docCount() in all our
> custom similarities, also because it allows you to have multiple languages
> in one index where one is much larger than the other. The small language
> will have very high IDF scores using maxDoc(), but they are proportional
> enough using docCount(). Using docCount() also fixes SolrCloud ranking
> problems, unless one of your replicas becomes inconsistent ;)
>
> https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
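To make the replica drift described above concrete, here is a minimal Java sketch. It assumes the classic TF-IDF style formula, idf = 1 + ln(numDocs / (docFreq + 1)), and deliberately uses no Lucene classes. The class name, the replica figures, and the static count are all hypothetical values chosen for illustration.

```java
// Sketch only: no Lucene classes are used, and every name and number is hypothetical.
final class ReplicaIdfDemo {

    // Classic TF-IDF style formula: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long docFreq = 100;                // assumed identical on both replicas
        long maxDocReplicaA = 1_000_000;   // freshly merged, no deleted docs left
        long maxDocReplicaB = 1_150_000;   // still counts 150,000 deleted docs
        long staticDocCount = 1_000_000;   // fixed value chosen by hand

        // Same term, same live documents, but the two replicas disagree on IDF.
        System.out.println("replica A, maxDoc: " + idf(docFreq, maxDocReplicaA));
        System.out.println("replica B, maxDoc: " + idf(docFreq, maxDocReplicaB));

        // With a static document count both replicas compute the same value,
        // as long as docFreq itself is the same on both.
        System.out.println("either replica, static count: " + idf(docFreq, staticDocCount));
    }
}
```

One caveat with the static-value idea: Lucene's docFreq is not adjusted for deleted documents either, so it may also differ between replicas until segments are merged. A fixed document count removes only one of the two sources of drift.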
RE: IDF maxDocs / numDocs
Hi Steve, it seems most similarities use CollectionStatistics.maxDoc() in idfExplain, but there's also a docCount(). We use docCount() in all our custom similarities, also because it allows you to have multiple languages in one index where one is much larger than the other. The small language will have very high IDF scores using maxDoc(), but they are proportional enough using docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one of your replicas becomes inconsistent ;)

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29

-----Original message-----
> From: Steven Bower
> Sent: Wednesday 12th March 2014 16:08
> To: solr-user
> Subject: IDF maxDocs / numDocs
>
> I am noticing that maxDocs is consistently different between replicas,
> and because it is used in the IDF calculation, IDF scores for the same
> query/doc differ between replicas. Obviously an optimize can normalize
> the maxDocs values, but that is only temporary. Is there a way to have
> IDF use numDocs instead (as it should be consistent across replicas)?
>
> thanks,
>
> steve
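The multi-language point in the reply above can be sketched in a few lines of Java. This is an illustration rather than Lucene code: it assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)) and plugs in either the index-wide maxDoc or a per-field docCount (the number of documents that have at least one term in that field). The field names text_nl and text_en and all counts are hypothetical.

```java
// Sketch only: plain Java, no Lucene dependency; all names and counts are hypothetical.
final class DocCountIdfDemo {

    // Classic TF-IDF style formula: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000;        // all documents in the index
        long docCountSmall = 10_000;    // docs with a value in text_nl (small language)
        long docCountLarge = 990_000;   // docs with a value in text_en (large language)

        // Both terms occur in roughly 5% of the documents of their own language.
        long docFreqSmall = 500;
        long docFreqLarge = 49_500;

        // Using maxDoc, the small-language term looks far rarer than it is.
        System.out.println("text_nl with maxDoc:   " + idf(docFreqSmall, maxDoc));
        System.out.println("text_en with maxDoc:   " + idf(docFreqLarge, maxDoc));

        // Using the per-field docCount, equally common terms score about the same.
        System.out.println("text_nl with docCount: " + idf(docFreqSmall, docCountSmall));
        System.out.println("text_en with docCount: " + idf(docFreqLarge, docCountLarge));
    }
}
```

In a real similarity the two counts would presumably come from CollectionStatistics.maxDoc() and CollectionStatistics.docCount() inside idfExplain; the sketch only shows why the choice matters for the smaller language.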