[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920822#comment-13920822 ]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

{quote}
How do you count the number of hits before you execute the search?
{quote}

Well, I do a search of course, but I collect the hits using a {{TotalHitCountCollector}} and do not retrieve any stored values. I did not find any other way to determine for sure whether I needed to sample. I know this takes time. When I first implemented it, however, it was faster to do a count first and then decide whether to provide exact facets or to sample. Not optimal, but it worked. And because I could use sampling in the facets, the total time (one pass for counting, one pass for sampled facets) was still much less than the time it would take to compute exact facets and the count in a single pass.

When only considering the {{samplingThreshold}}, faceting should still be doable without counting first. It can be done by storing the first {{samplingThreshold}} documents (in the addDoc) in a separate array (in the collector) without sampling. This way the count is not needed to decide whether to sample, because sampling always takes place. The sampled result is simply discarded if the total number of hits <= minSampleSize.

I agree that this is not the nicest way to get a sample (but it can reduce the time to retrieve estimated facet results by a factor of 5 when using a sampling rate of 1 in 1000).

The alternative is to do it more like in your snippet (and more like the first approach): collect all documents and sample afterwards. This way you know the number of hits, and adjusting the sample rate based on parameters is more straightforward.

Either way is faster than using exact facets, so both ways are a win.
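A minimal sketch of the threshold approach described above, in plain Java with no Lucene dependencies. The class {{ThresholdSamplingCollector}}, its constructor parameters and the method {{docsForFaceting}} are hypothetical names chosen for illustration. They are not part of the Lucene API or of the attached patches, and a single threshold is assumed to play the role of both {{samplingThreshold}} and minSampleSize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for a facet hit collector; not a Lucene class. */
final class ThresholdSamplingCollector {
    private final int samplingThreshold; // at or below this many hits, exact results are kept
    private final int sampleRate;        // keep roughly 1 in sampleRate documents
    private final Random random;
    private final List<Integer> firstDocs = new ArrayList<>();   // unsampled prefix
    private final List<Integer> sampledDocs = new ArrayList<>(); // sample, always maintained
    private int totalHits = 0;

    ThresholdSamplingCollector(int samplingThreshold, int sampleRate, long seed) {
        this.samplingThreshold = samplingThreshold;
        this.sampleRate = sampleRate;
        this.random = new Random(seed);
    }

    /** Called once per matching document, in a single pass. */
    void addDoc(int docId) {
        totalHits++;
        // Keep the first samplingThreshold documents without sampling.
        if (firstDocs.size() < samplingThreshold) {
            firstDocs.add(docId);
        }
        // Sample every document regardless, so no prior count is needed.
        if (random.nextInt(sampleRate) == 0) {
            sampledDocs.add(docId);
        }
    }

    /** True when the hit count exceeded the threshold, so the sample is used. */
    boolean isSampled() {
        return totalHits > samplingThreshold;
    }

    /** Exact documents for small result sets, the sample otherwise. */
    List<Integer> docsForFaceting() {
        return isSampled() ? sampledDocs : firstDocs;
    }

    int totalHits() {
        return totalHits;
    }
}
```

The sample is discarded only at the end, once the total number of hits is known, which matches the single-pass idea: the extra cost is one bounded array of at most {{samplingThreshold}} document ids.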
> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?