[ 
https://issues.apache.org/jira/browse/SOLR-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738708#comment-13738708
 ] 

James Dyer commented on SOLR-5122:
----------------------------------

The scenarios tested in testEstimatedHitCounts() seem to always pick a 
collector that does not accept docs out of order 
("TopFieldCollector$OneComparatorNonScoringCollector").  The problem appears to 
be that when a new segment/scorer is set, we get a new set of (segment-relative) 
doc ids.  Prior to random merges, the test naively assumed everything was in one 
segment.  Now, with multiple segments, all bets are off and I don't think we can 
be estimating hits.
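
To make the failure mode concrete, here is a minimal sketch of how a 
doc-id-based extrapolation can break once there is more than one segment.  
This is not the actual Solr code: the class name, method name and formula are 
assumptions made purely for illustration.

```java
// Hypothetical illustration only; not the Solr implementation.
final class SegmentRelativeEstimate {

    // Extrapolate total hits from the hits collected so far, assuming
    // lastDocId tells us how far into the index the collector got.
    static int estimateHits(int collected, int lastDocId, int maxDoc) {
        // Integer division: throws ArithmeticException if lastDocId == 0.
        return collected * maxDoc / lastDocId;
    }

    public static void main(String[] args) {
        int maxDoc = 100;

        // One segment: 5 hits collected, stopped at doc 49.
        System.out.println(estimateHits(5, 49, maxDoc)); // 5 * 100 / 49 = 10

        // Two segments of 50 docs each: the 5th hit is the first doc of the
        // second segment, so the collector sees segment-relative id 0
        // instead of the index-wide id 50.
        try {
            estimateHits(5, 0, maxDoc);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // "/ by zero"
        }
    }
}
```

Even when the last doc id is not zero, a segment-relative id says little about 
how much of the whole index was scanned, so the extrapolated number is suspect.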

I think the best fix is to dial back the functionality here and not offer hit 
estimates at all.  The functionality would still be beneficial in cases where 
the user does not require hit counts to be returned at all (for instance, ~rmuir 
mentioned using this feature with suggesters).  

Another option is to add together the doc ids for the various scorers that are 
looked at and pretend this is your max doc id.  I'm torn here because I'd hate 
to remove functionality that has been released, but on the other hand, if it is 
always going to give lousy estimates, then why fool people?
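
A rough sketch of that second option, assuming a collector that is told when 
it moves to a new segment.  The class and its methods are hypothetical and do 
not use the Lucene Collector API; they only show the bookkeeping.

```java
// Hypothetical illustration only; not the Solr or Lucene API.
final class SummedDocIdEstimate {
    private int docsInFinishedSegments = 0; // summed from segments already left
    private int lastDocInSegment = -1;      // segment-relative id of last hit
    private int collected = 0;

    // Called whenever a new segment/scorer is set.
    void nextSegment() {
        if (lastDocInSegment >= 0) {
            docsInFinishedSegments += lastDocInSegment + 1;
        }
        lastDocInSegment = -1;
    }

    void collect(int segmentDocId) {
        lastDocInSegment = segmentDocId;
        collected++;
    }

    // Pretend the summed doc ids are how far into the index we got.
    int estimateHits(int maxDoc) {
        int scanned = docsInFinishedSegments + lastDocInSegment + 1;
        if (scanned <= 0) {
            return 0; // nothing collected yet, so nothing to extrapolate
        }
        long estimate = (long) collected * maxDoc / scanned;
        return (int) Math.min(maxDoc, estimate);
    }

    public static void main(String[] args) {
        SummedDocIdEstimate e = new SummedDocIdEstimate();
        e.nextSegment();
        e.collect(9);
        e.collect(19);
        e.nextSegment();
        e.collect(0);
        System.out.println(e.estimateHits(100)); // 3 * 100 / 21 = 14
    }
}
```

This avoids the divide by zero, but docs after the last hit in each finished 
segment are never counted as scanned, so the estimate stays rough.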

Thoughts?
                
> spellcheck.collateMaxCollectDocs estimates seem to be meaningless -- can lead 
> to "ArithmeticException: / by zero"
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-5122
>                 URL: https://issues.apache.org/jira/browse/SOLR-5122
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.4
>            Reporter: Hoss Man
>            Assignee: James Dyer
>         Attachments: SOLR-5122.patch, SOLR-5122.patch
>
>
> As part of SOLR-4952 SpellCheckCollatorTest started using RandomMergePolicy, 
> and this (apparently) led to a failure in testEstimatedHitCounts.
> As far as I can tell: the test assumes that specific values would be returned 
> as the _estimated_ "hits" for a collation, and it appears that the change in 
> MergePolicy resulted in different segments with different term stats, 
> causing the estimation code to produce different values than what is expected.
> I made a quick attempt to improve the test to:
>  * expect explicit exact values only when spellcheck.collateMaxCollectDocs is 
> set such that the "estimate" should actually be exact (i.e. 
> collateMaxCollectDocs == 0 or collateMaxCollectDocs greater than the num 
> docs in the index)
>  * randomize the values used for collateMaxCollectDocs and confirm that the 
> estimates are never more than the num docs in the index
> This led to an odd "ArithmeticException: / by zero" error in the test, which 
> seems to suggest that there is a genuine bug in the code for estimating the 
> hits that only gets tickled in certain 
> mergepolicy/segment/collateMaxCollectDocs combinations.
> *Update:* This appears to be a general problem with collecting docs out of 
> order and the estimation of hits -- I believe even if there is no divide by 
> zero error, the estimates are largely meaningless since the docs are 
> collected out of order.

