RE: IDF maxDocs / numDocs

2014-03-13 Thread Markus Jelsma
Oh yes, I see what you mean. I would try SOLR-1632 to get distributed IDF,
but it seems to be broken now.
 


Re: IDF maxDocs / numDocs

2014-03-12 Thread Steven Bower
My problem is that maxDoc() and docCount() both include deleted documents
in their values. Because of merging etc., those numbers can differ per
replica (or at least that is what I'm seeing). I need a value that is
consistent across replicas... I see the comment mentions not using
IndexReader.numDocs(), but there doesn't seem to be a way to get hold of
the IndexReader within a similarity implementation (as only TermStatistics
and CollectionStatistics are passed in, and neither contains a ref to the
reader).

I am contemplating just using a static value for the "number of docs" as
this won't change dramatically very often.
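
A minimal sketch of the static-value idea, not actual Lucene code. It assumes
the classic formula log(numDocs / (docFreq + 1)) + 1 used by DefaultSimilarity
in Lucene 4.x; the class name FixedNumDocsIdf and the constant 1,000,000 are
hypothetical. With numDocs pinned, the per-replica maxDoc no longer enters the
calculation:

```java
// Hypothetical illustration; FixedNumDocsIdf is not a Lucene class.
final class FixedNumDocsIdf {
    // Assumed constant standing in for the "number of docs" in the collection.
    private final long fixedNumDocs;

    FixedNumDocsIdf(long fixedNumDocs) {
        if (fixedNumDocs <= 0) {
            throw new IllegalArgumentException("fixedNumDocs must be positive");
        }
        this.fixedNumDocs = fixedNumDocs;
    }

    // Classic idf, but with the document count taken from the constant
    // instead of from per-replica index statistics.
    float idf(long docFreq) {
        return (float) (Math.log(fixedNumDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        FixedNumDocsIdf sim = new FixedNumDocsIdf(1_000_000L);
        // Every replica that sees docFreq = 100 computes the same value.
        System.out.println(sim.idf(100));
    }
}
```

In a real similarity the same constant would replace the numDocs argument
passed to idf(). As far as I know docFreq can itself still count deleted
documents until segments merge, so this would only remove the maxDoc part of
the discrepancy, and the constant needs revisiting if the index grows a lot.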

steve




RE: IDF maxDocs / numDocs

2014-03-12 Thread Markus Jelsma
Hi Steve - it seems most similarities use CollectionStatistics.maxDoc() in
idfExplain but there's also a docCount(). We use docCount() in all our custom
similarities, also because it allows you to have multiple languages in one
index where one is much larger than the other. The small language will have
very high IDF scores using maxDoc() but they are proportional enough using
docCount(). Using docCount() also fixes SolrCloud ranking problems, unless one
of your replicas becomes inconsistent ;)

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/CollectionStatistics.html#docCount%28%29
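
To make the multi-language point concrete, here is a rough arithmetic sketch
rather than real similarity code. It assumes the classic formula
log(numDocs / (docFreq + 1)) + 1; the class DocCountVsMaxDoc, the helper
effectiveNumDocs and all the numbers are made up for illustration. The
fallback reflects that docCount() can be -1 when the codec does not store
that statistic:

```java
// Hypothetical illustration; not part of Lucene or Solr.
final class DocCountVsMaxDoc {
    // Classic idf formula, assumed here for illustration.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Prefer the per-field docCount; fall back to maxDoc when it is -1.
    static long effectiveNumDocs(long docCount, long maxDoc) {
        return docCount == -1 ? maxDoc : docCount;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000L; // whole index, dominated by the large language
        long docCount = 10_000L;  // docs that have a value in the small-language field
        long docFreq = 5_000L;    // a term that is common within the small language

        // With maxDoc the common term looks rare, so its idf is inflated.
        System.out.println("idf with maxDoc:   " + idf(docFreq, maxDoc));
        // With docCount the idf is relative to the field's own documents.
        System.out.println("idf with docCount: "
                + idf(docFreq, effectiveNumDocs(docCount, maxDoc)));
    }
}
```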

 
 
-Original message-
> From:Steven Bower 
> Sent: Wednesday 12th March 2014 16:08
> To: solr-user 
> Subject: IDF maxDocs / numDocs
> 
> I am noticing that maxDocs is consistently different between replicas, and
> because it is used in the idf calculation, idf scores for the same
> query/doc differ between replicas. Obviously an optimize can normalize the
> maxDocs values, but that is only temporary. Is there a way to have idf use
> numDocs instead (as it should be consistent across replicas)?
> 
> thanks,
> 
> steve
>
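
The effect described in the original question can be reproduced with plain
arithmetic. This is only a sketch, assuming the classic formula
log(numDocs / (docFreq + 1)) + 1; the class name ReplicaIdfSketch and the
document counts are hypothetical:

```java
// Hypothetical illustration; not part of Lucene or Solr.
final class ReplicaIdfSketch {
    // Classic idf formula, assumed here for illustration.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long docFreq = 100L;          // same term, same live documents
        long maxDocReplicaA = 1_000L; // replica with no deletes pending
        long maxDocReplicaB = 1_200L; // replica still holding 200 deleted docs

        // Same query and document, different idf, purely because maxDoc differs.
        System.out.println("replica A idf: " + idf(docFreq, maxDocReplicaA));
        System.out.println("replica B idf: " + idf(docFreq, maxDocReplicaB));
    }
}
```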