[ 
https://issues.apache.org/jira/browse/LUCENE-5015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665018#comment-13665018
 ] 

Gilad Barkai commented on LUCENE-5015:
--------------------------------------

Sampling, with its default settings, comes at a cost. 

With its defaults, sampling aims to produce the exact top-K results for each 
request, as if a {{StandardFacetAccumulator}} had been used. That is, it aims 
to produce the same top-K with the same counts.

The process begins by sampling the result set and computing the top-*cK* 
candidates for each of the *M* facet requests, producing amortized results. 
That part is faster than {{StandardFacetAccumulator}} because fewer documents' 
facet information gets processed.

The next part is the "fixing", using a {{SampleFixer}} retrieved from a 
{{Sampler}}, in which "fixed" counts are produced that correlate better with 
the original document result set than with the sampled one. The default (and 
currently the only) implementation of such a fixer is {{TakmiSampleFixer}}, 
which produces _exact_ counts for each of the *cK* candidates for each of the 
*M* facet requests. The counts are not computed from the facet information of 
each document, but by matching the skiplist of each candidate category's 
drill-down term against the bitset of the (actual) document results. The 
number of matches is the count. 
This is equivalent to running a total-hit collector with a drill-down query for 
the candidate category over the original query. 
There is a tipping point at which not sampling is faster than sampling and 
then fixing, because fixing matches *c* x *K* x *M* skiplists against the 
bitset representing the document results. *c* defaults to 2 (see 
overSampleFactor in SamplingParams).
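
To make the fixing step concrete, here is a minimal sketch of counting one 
candidate category by matching its documents against the result bitset. It is 
not the {{TakmiSampleFixer}} code: the class and method names are made up for 
illustration, a plain sorted int array stands in for the drill-down term's 
skiplist, and {{java.util.BitSet}} stands in for the document results.

```java
import java.util.BitSet;

// Hypothetical illustration only; not part of the Lucene facet module.
class DrillDownCountSketch {

    /**
     * Counts how many documents tagged with a candidate category are also
     * in the actual result set. categoryDocIds plays the role of the
     * drill-down term's posting list; resultDocs is the result bitset.
     */
    static int exactCount(int[] categoryDocIds, BitSet resultDocs) {
        int count = 0;
        for (int docId : categoryDocIds) {
            if (resultDocs.get(docId)) {
                count++; // the document matches both the query and the category
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet results = new BitSet();
        results.set(1);
        results.set(4);
        results.set(7);
        results.set(9);
        int[] categoryDocs = {0, 1, 2, 7, 8};
        // Documents 1 and 7 are in both, so this prints 2.
        System.out.println(exactCount(categoryDocs, results));
    }
}
```

The cost of one such pass depends on the category's posting list rather than 
on the sample size, and it is paid once per candidate, which is why many 
candidates can outweigh the savings from sampling.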

Over-sampling (a.k.a. *c*) is important for exact counts, as it is conceivable 
that the accuracy of a sampled top-K is not 100%. According to some 
measurements we once ran, however, it is very likely that the true top-K 
results are within the sampled *2K* results. Fixing those 2K with their actual 
counts and re-sorting them accordingly yields a much more accurate top-K. 
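
The over-sample, fix and re-sort idea can be sketched roughly as follows. This 
is an illustration under assumptions, not the accumulator's actual code: the 
class and method names are hypothetical, categories are plain strings, and the 
{{exactCounts}} map stands in for whatever the fixer would compute.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Lucene facet module.
class OverSampleFixSketch {

    /** Returns the n categories with the highest counts (ties broken by name). */
    static List<String> topByCount(Map<String, Integer> counts, int n) {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(keys.subList(0, Math.min(n, keys.size())));
    }

    /** Takes the sampled top-cK, replaces their counts with exact ones, re-sorts. */
    static List<String> fixedTopK(Map<String, Integer> sampledCounts,
                                  Map<String, Integer> exactCounts,
                                  int k, int c) {
        List<String> candidates = topByCount(sampledCounts, c * k);
        Map<String, Integer> fixed = new HashMap<>();
        for (String category : candidates) {
            fixed.put(category, exactCounts.getOrDefault(category, 0));
        }
        return topByCount(fixed, k);
    }

    public static void main(String[] args) {
        Map<String, Integer> sampled = Map.of("A", 5, "B", 4, "C", 3, "D", 3, "E", 1);
        Map<String, Integer> exact =
                Map.of("A", 5000, "B", 2900, "C", 4100, "D", 3000, "E", 800);
        System.out.println(topByCount(sampled, 2));          // [A, B]
        System.out.println(fixedTopK(sampled, exact, 2, 2)); // [A, C]
    }
}
```

In this made-up data the sample ranks B above C, but C is still among the 2K 
candidates, so fixing recovers the true top-2.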


E.g., issuing 5 count requests for top-10 with an overSampleFactor of 2 results 
in 100 skiplist matches against the document results bitset.
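
The arithmetic behind that number, as a small sketch. The class and method 
names are hypothetical, and rounding *c* x *K* up to a whole number of 
candidates is my assumption for non-integer factors.

```java
// Hypothetical illustration only; not part of the Lucene facet module.
class SkiplistMatchCost {

    /** Number of skiplist-vs-bitset matches the fixing step performs: c x K x M. */
    static long skiplistMatches(double overSampleFactor, int topK, int numRequests) {
        // Assumption: a fractional candidate count is rounded up.
        long candidatesPerRequest = (long) Math.ceil(overSampleFactor * topK);
        return candidatesPerRequest * numRequests;
    }

    public static void main(String[] args) {
        // Top-10 with overSampleFactor 2, for 1 to 5 facet requests.
        for (int m = 1; m <= 5; m++) {
            System.out.println(m + " request(s): " + skiplistMatches(2.0, 10, m));
        }
    }
}
```

The count grows linearly with the number of facet requests: 20, 40, 60, 80 
and 100 matches for 1 to 5 requests.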


If amortized results suffice, a different {{SampleFixer}} could be coded, one 
which, e.g., amortizes the true count from the sampling ratio. E.g., if 
category C got a count of 3, and the sample was 1,000 results out of 
1,000,000, then the "AmortizedSampleFixer" would fix the count of C to be 3,000.
Such fixing is very fast, and the overSampleFactor should be set to 1.0.
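
The scaling itself would be along these lines. This is only the arithmetic, 
under assumed names; it does not implement the {{SampleFixer}} interface, and 
rounding to the nearest whole count is my choice.

```java
// Hypothetical illustration only; not part of the Lucene facet module.
class AmortizedCountSketch {

    /** Scales a count observed in the sample up to the full result set. */
    static long amortizedCount(long sampledCount, long sampleSize, long totalResults) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        double inverseSamplingRatio = (double) totalResults / sampleSize;
        return Math.round(sampledCount * inverseSamplingRatio);
    }

    public static void main(String[] args) {
        // Category C: count 3 in a sample of 1,000 out of 1,000,000 results.
        System.out.println(amortizedCount(3, 1_000, 1_000_000)); // 3000
    }
}
```

No postings are read here, so the cost per candidate is constant, at the 
price of counts that are estimates rather than exact.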

Edit:
I now see that it is not that easy to code a different SampleFixer, nor to 
give it the information needed for the amortized result fixing suggested above. 
I'll try to open up the API some and make it more convenient.
                
> Unexpected performance difference between SamplingAccumulator and 
> StandardFacetAccumulator
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5015
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5015
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/facet
>    Affects Versions: 4.3
>            Reporter: Rob Audenaerde
>            Priority: Minor
>
> I have an unexpected performance difference between the SamplingAccumulator 
> and the StandardFacetAccumulator. 
> The case is an index with about 5M documents and each document containing 
> about 10 fields. I created a facet on each of those fields. When searching to 
> retrieve facet-counts (using 1 CountFacetRequest), the SamplingAccumulator is 
> about twice as fast as the StandardFacetAccumulator. This is expected and a 
> nice speed-up. 
> However, when I use more CountFacetRequests to retrieve facet-counts for more 
> than one field, the speed of the SamplingAccumulator decreases, to the point 
> where the StandardFacetAccumulator is faster. 
> {noformat} 
> FacetRequests  Sampling    Standard
>  1               391 ms     1100 ms
>  2               531 ms     1095 ms 
>  3               948 ms     1108 ms
>  4              1400 ms     1110 ms
>  5              1901 ms     1102 ms
> {noformat} 
> Is this behaviour normal? I did not expect it, as the SamplingAccumulator 
> needs to do less work? 
> Some code to show what I do:
> {code}
>       searcher.search( facetsQuery, facetsCollector );
>       final List<FacetResult> collectedFacets = facetsCollector.getFacetResults();
> {code}
> {code}
> final FacetSearchParams facetSearchParams = new FacetSearchParams( facetRequests );
> FacetsCollector facetsCollector;
> if ( isSampled )
> {
>       // Sampled path: accumulate over a random sample of the results.
>       facetsCollector = FacetsCollector.create(
>               new SamplingAccumulator( new RandomSampler(), facetSearchParams,
>                       searcher.getIndexReader(), taxo ) );
> }
> else
> {
>       // Non-sampled path: accumulate over all matching documents.
>       facetsCollector = FacetsCollector.create(
>               FacetsAccumulator.create( facetSearchParams, searcher.getIndexReader(), taxo ) );
> }
> {code}
>                       
