[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918164#comment-13918164 ]
Shai Erera commented on LUCENE-5476:
------------------------------------

+1 for removing SamplingParams. I'm OK with losing the originally collected hits; let's just make sure we document this on the collector. Perhaps in the future (separate issue) we can add another collector (or enhance this one) that lets you get both the original docs and the sampled docs.

I wonder what the performance implications are of sampling during collection versus post-collection. In the past, Mike and I saw that not interfering with collection improves performance, though what we measured was accessing the ordinals-DV during collection. I'm just wondering whether in-collection performs better than post-collection. Especially if sampleRatio is low, sampling in-collection means we set far fewer bits on the FBS (FixedBitSet), instead of setting them all just to clear most of them afterwards. On the other hand, we add calls to random.nextInt/Double, so that's a tradeoff which would be good to measure. We don't even need random.nextDouble(); we can do what you/Mike suggested above: work in "bins", draw a random index for each bin, and discard all hits given to addDoc unless they fall on the binIndex.

I also think that if we keep the original FBS (whether we clone-and-clear, create a new sampled one, or whatever), we should iterate over the matching docs and not over all the bits. I don't see why iterating over all the bits would be good, unless bitset.cardinality() is very big (maybe 75% of the bitset size)? Of course, if we move to sampling in-collection, that's not an issue at all.

Rob, I want to make sure we are all on the same page as to what we want to achieve in this issue (it helps to scope it):

* Add SamplingFacetsCollector, which takes {{sampleRatio}} and an optional {{seed}} and randomly samples the matching docs so that facets are counted on fewer hits.
* We should note that, as a result, the weights computed for facets are approximations only, and the application needs to correct them if it wants (e.g. multiplying by the inverse of {{sampleRatio}}, or exact count, etc.).
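To make the "bins" idea and the count correction above concrete, here is a minimal sketch. It is not the patch attached to this issue and not an existing Lucene API: the class name {{BinSampler}} and its methods are hypothetical, and {{java.util.BitSet}} stands in for Lucene's FixedBitSet so the example runs on its own. It assumes a bin size of roughly 1 / {{sampleRatio}} and scales counts by the observed ratio of total hits to sampled hits.

```java
import java.util.BitSet;
import java.util.Random;

/**
 * Hypothetical helper (not part of Lucene): keeps one hit out of every
 * bin of consecutive hits, so only about sampleRatio of the hits are set.
 */
final class BinSampler {
  private final int binSize;                    // assumed to be ~ 1 / sampleRatio
  private final Random random;
  private final BitSet sampled = new BitSet();  // stand-in for FixedBitSet
  private int posInBin = 0;                     // position of the next hit in the current bin
  private int keepIndex;                        // the one position kept in the current bin
  private long totalHits = 0;

  BinSampler(double sampleRatio, long seed) {
    if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
      throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
    }
    long size = Math.max(1L, Math.round(1.0 / sampleRatio));
    this.binSize = (int) Math.min(Integer.MAX_VALUE, size);
    this.random = new Random(seed);
    this.keepIndex = random.nextInt(binSize);
  }

  /** Called once per matching doc; sets the bit only for the kept position. */
  void addDoc(int docId) {
    totalHits++;
    if (posInBin == keepIndex) {
      sampled.set(docId);
    }
    if (++posInBin == binSize) {
      // Bin is full: start a new one and draw a new position to keep.
      posInBin = 0;
      keepIndex = random.nextInt(binSize);
    }
  }

  BitSet sampledDocs() {
    return sampled;
  }

  long totalHits() {
    return totalHits;
  }

  /** Scales a count measured on the sample back up to an estimate for all hits. */
  long estimateCount(long sampledCount) {
    int sampledHits = sampled.cardinality();
    if (sampledHits == 0) {
      return 0;
    }
    return Math.round(sampledCount * (double) totalHits / sampledHits);
  }
}
```

With this sketch there is one call to random.nextInt per bin rather than one per hit, and no bits have to be cleared afterwards. Whether that is actually faster than sampling post-collection would still need to be measured.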
At any rate, correcting the counts is something that we don't plan to tackle in this issue, right?

Maybe we should call it RandomSamplingFacetsCollector to denote that it's a random sample (vs. e.g. the approaches mentioned by Gilad above)? Then it is clear what the {{seed}} parameter means.

> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?