[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915729#comment-13915729
 ] 

Michael McCandless commented on LUCENE-5476:
--------------------------------------------

This looks great!  

bq. To increase the speed it seems logical to me the FacetsCollector needs to 
return less hits.

*fewer* hits :)  (Sorry, pet peeve).

Maybe you could add a utility method on SamplingFacetsCollector to "fix up" the 
FacetResult, assuming it's a simple count (e.g., multiply the aggregate by the 
bin size)?  The old sampling code 
(https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_6/lucene/facet/src/java/org/apache/lucene/facet/sampling)
has something like this.
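
A minimal sketch of the count fix-up idea, assuming the sampled facet counts are available as a plain label-to-count map. The class and method names here (SampledCountFixup, scale, scaleAll) are hypothetical and are not part of Lucene or of the attached patch; they only illustrate multiplying each sampled count by the bin size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not a Lucene class. */
final class SampledCountFixup {

    /** Estimates the full count from a count taken on a 1-in-binSize sample. */
    static long scale(long sampledCount, int binSize) {
        if (binSize < 1) {
            throw new IllegalArgumentException("binSize must be >= 1");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(sampledCount, (long) binSize);
    }

    /** Applies scale() to every label, preserving the input iteration order. */
    static Map<String, Long> scaleAll(Map<String, Long> sampledCounts, int binSize) {
        Map<String, Long> estimated = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sampledCounts.entrySet()) {
            estimated.put(e.getKey(), scale(e.getValue(), binSize));
        }
        return estimated;
    }
}
```

The result is an estimate, not an exact count, and the scaling only makes sense for simple counts, not for other aggregations.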

It might be good to allow passing the random seed, for repeatable results?
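
To illustrate why a seed helps, here is a small sketch assuming the sampler draws from java.util.Random. RepeatableSampling and decisions are hypothetical names, not taken from the patch; the point is only that the same seed yields the same keep/skip decisions on every run.

```java
import java.util.Random;

/** Hypothetical illustration; not part of Lucene or the attached patch. */
final class RepeatableSampling {

    /** Returns the keep/skip decision for each of numDocs hits, given a seed. */
    static boolean[] decisions(long seed, int numDocs, int binSize) {
        Random random = new Random(seed); // same seed => same sequence of draws
        boolean[] keep = new boolean[numDocs];
        for (int i = 0; i < numDocs; i++) {
            // Keep a hit with probability 1/binSize.
            keep[i] = random.nextInt(binSize) == 0;
        }
        return keep;
    }
}
```

Calling decisions(42L, 1000, 10) twice returns identical arrays, which makes tests and benchmarks on sampled counts reproducible.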

Another option, which would save the 2nd pass, would be to do the sampling 
during Docs.addDoc.

Also, instead of the bin-checking, you could just pull the next random double 
and check if it's < 1.0/binSize?
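
A rough sketch of that check, assuming a java.util.Random source. BernoulliSampler and keep are hypothetical names used for illustration and do not come from the patch.

```java
import java.util.Random;

/** Hypothetical illustration; not part of Lucene or the attached patch. */
final class BernoulliSampler {
    private final Random random;
    private final double keepProbability;

    BernoulliSampler(Random random, int binSize) {
        if (binSize < 1) {
            throw new IllegalArgumentException("binSize must be >= 1");
        }
        this.random = random;
        this.keepProbability = 1.0 / binSize;
    }

    /** Decides independently for each hit whether it is part of the sample. */
    boolean keep() {
        // nextDouble() is uniform in [0.0, 1.0), so this is true with
        // probability 1/binSize.
        return random.nextDouble() < keepProbability;
    }
}
```

One trade-off: the sample size is then only 1/binSize of the hits on average, not exactly one hit per bin, so it varies a little from run to run unless the seed is fixed.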

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

