[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920897#comment-13920897 ]
Shai Erera commented on LUCENE-5476:
------------------------------------

bq. Well, I do a search of course, but collect the hits using a TotalHitCountCollector and not retrieve any stored values

That means you evaluate the query twice, which is expensive. It also doesn't guarantee a "correct" sample. Say you find that the query matches 10M documents and decide that 100K docs are a good sample. You set the sampling ratio to 0.01, but you then apply this ratio per-segment (as in this patch), and you could easily end up with fewer than 100K docs (e.g. if the random selection didn't actually pick 0.01 of the documents in a certain segment).

I don't think we should keep up to {{minSampleSize}} docs in an int[] and move to a bitset once we have collected more. The collector works per-segment, whereas I think sampling should work on the entire result set. So the int[] wouldn't be part of MatchingDocs; it would need to be held inside the collector, and then you would need to know where to "cut" it for each MatchingDocs instance (per-segment).

I really think a simple solution is what we should start with. RandomSamplingFC only overrides {{.getMatchingDocs()}}, and it can determine whether sampling is needed at all, given {{minSampleSize}} and the sum of totalHits from all MatchingDocs. Then you sample per-segment, but with the "global picture" in mind, and you're able to correct the sample ratio so that we come as close to {{minSampleSize}} as possible.

To me, factoring a {{maxSampleSize}} into this issue is a bonus, but I can definitely see that happening in a separate issue, as it's a performance matter. We should focus on giving our users a collector which produces a good sample; otherwise the sampling isn't valuable.
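To make the "sample per-segment, but with the global picture in mind" idea concrete, here is a minimal sketch in plain Java. It is not the patch and it does not use the Lucene API: each segment's matching docs are modelled as a plain int[] standing in for a MatchingDocs instance, and the names {{GlobalSamplingSketch}}, {{sample}} and {{sampleSegment}} are hypothetical. It assumes the per-segment hit counts are already known once collection has finished, which is what lets it carry the remaining sample budget from segment to segment and correct rounding drift as it goes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: int[] stands in for one segment's matching docs.
final class GlobalSamplingSketch {

    // Picks exactly `quota` docs from `docs`, preserving order
    // (selection sampling: each doc is taken with probability needed/remaining).
    static int[] sampleSegment(int[] docs, int quota, Random random) {
        int[] out = new int[quota];
        int picked = 0;
        for (int i = 0; i < docs.length && picked < quota; i++) {
            int remaining = docs.length - i;
            int needed = quota - picked;
            if (random.nextInt(remaining) < needed) {
                out[picked++] = docs[i];
            }
        }
        return out;
    }

    // Samples across all segments so the total comes out at minSampleSize
    // whenever the result set is larger than that.
    static List<int[]> sample(List<int[]> segments, int minSampleSize, Random random) {
        long totalHits = 0;
        for (int[] seg : segments) {
            totalHits += seg.length;
        }
        if (totalHits <= minSampleSize) {
            return segments; // small result set: sampling is not needed
        }

        List<int[]> result = new ArrayList<>();
        long hitsLeft = totalHits;      // hits in segments not yet processed
        int sampleLeft = minSampleSize; // sample budget not yet spent
        for (int[] seg : segments) {
            int quota = 0;
            if (hitsLeft > 0) {
                // Share of the remaining budget proportional to this segment's
                // share of the remaining hits; later segments absorb rounding.
                quota = (int) Math.round((double) sampleLeft * seg.length / hitsLeft);
                quota = Math.min(quota, seg.length);
            }
            result.add(sampleSegment(seg, quota, random));
            hitsLeft -= seg.length;
            sampleLeft -= quota;
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> segments = new ArrayList<>();
        for (int size : new int[] {1000, 50, 5000, 3}) {
            int[] docs = new int[size];
            for (int i = 0; i < size; i++) {
                docs[i] = i;
            }
            segments.add(docs);
        }
        int total = 0;
        for (int[] seg : sample(segments, 100, new Random(42))) {
            System.out.println("sampled " + seg.length + " docs from segment");
            total += seg.length;
        }
        System.out.println("total sampled: " + total);
    }
}
```

Unlike applying a fixed 0.01 ratio independently in every segment, the per-segment quota here is derived from the budget that is still unspent, so the combined sample does not fall short of {{minSampleSize}}. A {{maxSampleSize}} cap is deliberately left out of the sketch.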
> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
>                      LUCENE-5476.patch, LUCENE-5476.patch,
>                      SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?

--
This message was sent by Atlassian JIRA (v6.2#6252)