[
https://issues.apache.org/jira/browse/SOLR-5122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-5122:
---------------------------
Attachment: SOLR-5122.patch
Here's a patch that improves EarlyTerminatingCollector to keep track of the
size of each reader it collects against, so that it can derive some meaning from
the docIds it collects. As part of this patch I eliminated the use of
"lastDocId" to try and discourage people from trying to find specific --
instead, the EarlyTerminatingCollectorException now just reports the number of
docs "collected" out of the total number of docs "scanned". The result is
that the collector doesn't really care which order it gets the
AtomicReaderContexts in; however, it still has to force documents to be
collected in order within a single reader, so that the stats for that reader
are meaningful.
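To illustrate how a "collected" out of "scanned" pair can be turned into an
estimate, here is a minimal sketch. It is not the code in the patch: the class
name, method name, and parameters are hypothetical, and it assumes the estimate
is a simple linear extrapolation of the observed hit rate over the whole index.

```java
// Hypothetical helper, NOT part of SOLR-5122.patch: extrapolates a total hit
// count from the docs collected so far and the docs scanned so far.
final class HitEstimateSketch {

    private HitEstimateSketch() {}

    /**
     * @param collected number of matching docs collected before terminating
     * @param scanned   number of docs scanned (matching or not) before terminating
     * @param totalDocs number of docs in the whole index
     * @return estimated number of hits across the whole index
     */
    static long estimateHits(long collected, long scanned, long totalDocs) {
        if (collected < 0 || scanned < collected || totalDocs < scanned) {
            throw new IllegalArgumentException(
                "expected 0 <= collected <= scanned <= totalDocs");
        }
        if (scanned == 0) {
            // Nothing was scanned, so there is no hit rate to extrapolate;
            // returning 0 avoids dividing by zero.
            return 0;
        }
        if (scanned == totalDocs) {
            // Every doc was scanned, so the count is exact, not an estimate.
            return collected;
        }
        // Linear extrapolation of the observed hit rate, capped at totalDocs.
        long estimate = Math.round((double) collected * totalDocs / scanned);
        return Math.min(totalDocs, estimate);
    }
}
```

For example, if a term is in every other doc of a 1000-doc index and collection
stops after 5 hits in 10 scanned docs, estimateHits(5, 10, 1000) extrapolates
to 500.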
The patch includes the previous tests, plus a new test loop that checks that
we get a reasonably accurate estimate for a term that is in every other doc in
the index.
[~jdyer] - does this look right to you? Does it address your concerns about
keeping the estimation code in place?
> spellcheck.collateMaxCollectDocs estimates seem to be meaningless -- can lead
> to "ArithmeticException: / by zero"
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-5122
> URL: https://issues.apache.org/jira/browse/SOLR-5122
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.4
> Reporter: Hoss Man
> Assignee: James Dyer
> Attachments: SOLR-5122.patch, SOLR-5122.patch, SOLR-5122.patch
>
>
> As part of SOLR-4952 SpellCheckCollatorTest started using RandomMergePolicy,
> and this (apparently) led to a failure in testEstimatedHitCounts.
> As far as I can tell: the test assumes that specific values will be returned
> as the _estimated_ "hits" for a collation, and it appears that the change in
> MergePolicy resulted in different segments with different term stats,
> causing the estimation code to produce different values than what is expected.
> I made a quick attempt to improve the test to:
> * expect explicit exact values only when spellcheck.collateMaxCollectDocs is
> set such that the "estimate" should actually be exact (ie:
> collateMaxCollectDocs == 0 or collateMaxCollectDocs greater than the num
> docs in the index)
> * randomize the values used for collateMaxCollectDocs and confirm that the
> estimates are never more than the num docs in the index
> This led to an odd "ArithmeticException: / by zero" error in the test, which
> seems to suggest that there is a genuine bug in the code for estimating the
> hits that only gets tickled in certain
> mergepolicy/segment/collateMaxCollectDocs combinations.
> *Update:* This appears to be a general problem with collecting docs out of
> order and the estimation of hits -- I believe that even if there is no divide
> by zero error, the estimates are largely meaningless since the docs are
> collected out of order.