[ https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665018#comment-13665018 ]
Gilad Barkai commented on LUCENE-5015:
--------------------------------------

Sampling, with its defaults, has its toll.

In its defaults, Sampling aims to produce the exact top-K results for each request, as if a {{StandardFacetAccumulator}} had been used. That is, it aims to produce the same top-K with the same counts.

The process begins by sampling the result set and computing the top-*cK* candidates for each of the *M* facet requests, producing amortized results. That part is faster than {{StandardFacetAccumulator}} because fewer documents' facet information gets processed.

The next part is the "fixing", using a {{SampleFixer}} retrieved from a {{Sampler}}, in which "fixed" counts are produced that correlate better with the original document result set than with the sampled one. The default (and currently only) implementation of such a fixer is {{TakmiSampleFixer}}, which produces _exact_ counts for each of the *cK* candidates for each of the *M* facet requests. The counts are not computed against the facet information of each document, but rather by matching the skiplist of the drill-down term of each such candidate category against the bitset of the (actual) document results. The number of matches is the count. This is equivalent to a total-hit collector with a drill-down query for the candidate category over the original query.

There is a tipping point at which not sampling is faster than sampling and fixing using *c* x *K* x *M* skiplist matches against the bitset representing the document results. *c* defaults to 2 (see overSampleFactor in SamplingParams). Over-sampling (a.k.a. *c*) is important for exact counts, as it is conceivable that the accuracy of a sampled top-K is not 100%, but according to some measurements we once ran it is very likely that the true top-K results are within the sampled *2K* results. Fixing those 2K with their actual counts and re-sorting them accordingly yields a much more accurate top-K.
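To make the cost of the fixing step described above concrete, here is a minimal sketch that only counts the skiplist matches implied by *c* x *K* x *M*. The class and method names are hypothetical and not part of the Lucene facet module, and rounding *c* x *K* up to a whole number of candidates is my assumption, not something taken from the Lucene code.

```java
// Hypothetical helper for illustration only; not part of the Lucene facet module.
final class SamplingCostSketch {

    /** Candidates fixed per facet request: c x K, rounded up (rounding is an assumption). */
    static int candidatesPerRequest(double overSampleFactor, int topK) {
        return (int) Math.ceil(overSampleFactor * topK);
    }

    /** Total skiplist-vs-bitset matches done while fixing: c x K x M. */
    static int skiplistMatches(double overSampleFactor, int topK, int numRequests) {
        return candidatesPerRequest(overSampleFactor, topK) * numRequests;
    }

    public static void main(String[] args) {
        // c = 2 (the default mentioned above), K = 10, M = 1..5 facet requests.
        for (int m = 1; m <= 5; m++) {
            System.out.println(m + " request(s): "
                    + skiplistMatches(2.0, 10, m) + " skiplist matches");
        }
    }
}
```

With *c* = 2 and *K* = 10, every additional facet request adds another 20 skiplist matches, so the fixing cost grows linearly with *M* while the cost of the sampling pass itself does not shrink any further.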

E.g. requesting 5 count requests for top-10 with an overSampleFactor of 2 results in 100 skiplist matches against the document results bitset.

If amortized results suffice, a different {{SampleFixer}} could be coded, one which e.g. amortizes the true count from the sampling ratio. E.g. if category C got a count of 3, and the sample was of 1,000 results out of 1,000,000, then the "AmortizedSampleFixer" would fix the count of C to be 3,000. Such fixing is very fast, and the overSampleFactor should be set to 1.0.

Edit: I now see that it is not that easy to code a different SampleFixer, nor to get it the information needed for the amortized result fixing as suggested above. I'll try to open the API some and make it more convenient.

> Unexpected performance difference between SamplingAccumulator and StandardFacetAccumulator
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5015
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/facet
>    Affects Versions: 4.3
>            Reporter: Rob Audenaerde
>            Priority: Minor
>
> I have an unexpected performance difference between the SamplingAccumulator
> and the StandardFacetAccumulator.
> The case is an index with about 5M documents and each document containing
> about 10 fields. I created a facet on each of those fields. When searching to
> retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is
> about twice as fast as the StandardFacetAccumulator. This is expected and a
> nice speed-up.
> However, when I use more CountFacetRequests to retrieve facet-counts for more
> than one field, the speed of the SamplingAccumulator decreases, to the point
> where the StandardFacetAccumulator is faster.
> {noformat}
> FacetRequests  Sampling    Standard
>  1              391 ms     1100 ms
>  2              531 ms     1095 ms
>  3              948 ms     1108 ms
>  4             1400 ms     1110 ms
>  5             1901 ms     1102 ms
> {noformat}
> Is this behaviour normal?
> I did not expect it, as the SamplingAccumulator needs to do less work?
> Some code to show what I do:
> {code}
> searcher.search( facetsQuery, facetsCollector );
> final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
> {code}
> {code}
> final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
> FacetsCollector facetsCollector;
> if ( isSampled )
> {
>     facetsCollector = FacetsCollector.create( new SamplingAccumulator( new RandomSampler(), facetSearchParams, searcher.getIndexReader(), taxo ) );
> }
> else
> {
>     facetsCollector = FacetsCollector.create( FacetsAccumulator.create( facetSearchParams, searcher.getIndexReader(), taxo ) );
> }
> {code}