[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915650#comment-13915650
 ] 

Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

I'm currently experimenting with this. To increase the speed, it seems logical 
to me that the {{FacetsCollector}} needs to return fewer hits. I have a slightly 
modified version that I will attach. 
It uses a sampling technique that divides the total hits into 'bins' of a 
given size and takes one sample from each bin. I have implemented it by keeping 
that one sample as a 'hit' of the search if it was a hit, and clearing all other 
bits. See the attached file. 
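
As a rough illustration of the binning idea (this is not the attached file), 
the sketch below uses a plain {{java.util.BitSet}} instead of Lucene's own 
bitset classes. It assumes the bins are laid over the doc id range rather than 
over the list of hits, and that one random position is picked per bin. The 
class name {{BinSampler}} and the method {{sample}} are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical helper, not part of Lucene: keeps at most one hit per bin. */
public class BinSampler {

    /**
     * Splits the doc id range [0, maxDoc) into bins of binSize, picks one
     * random position in each bin, and keeps that position only if it was
     * a hit. All other bits stay cleared in the returned set.
     */
    public static BitSet sample(BitSet hits, int maxDoc, int binSize, Random random) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        BitSet sampled = new BitSet(maxDoc);
        for (int binStart = 0; binStart < maxDoc; binStart += binSize) {
            // The last bin may be shorter than binSize.
            int binLength = Math.min(binSize, maxDoc - binStart);
            int pick = binStart + random.nextInt(binLength);
            if (hits.get(pick)) {
                sampled.set(pick);
            }
        }
        return sampled;
    }
}
```

With this layout each document has a 1-in-binSize chance of being inspected, so 
on average about one in every binSize hits survives.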

By using this technique the distribution of the results should not be altered 
too much, while the performance gains can be significant. 

A quick test revealed that for 1M results and binsize 500, the sampled version 
is twice as fast.

The problem is that the resulting {{FacetResult}}s are not correct, as the 
number of hits is reduced. For counting facets this can be fixed afterwards by 
multiplying the counts by the binsize; but for other facets it will be more 
difficult or will require other approaches. 
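
The correction for counting facets could look roughly like the sketch below. It 
works on a plain map of label to sampled count rather than on {{FacetResult}} 
itself; the class {{SampledCounts}} and the method {{scale}} are hypothetical 
names, and the scaled values are estimates, not exact counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: scales sampled counts back up. */
public class SampledCounts {

    /**
     * Multiplies every sampled count by binSize to estimate the count over
     * the full result set. Uses long to avoid int overflow on large indexes.
     */
    public static Map<String, Long> scale(Map<String, Integer> sampledCounts, int binSize) {
        Map<String, Long> estimated = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : sampledCounts.entrySet()) {
            estimated.put(entry.getKey(), (long) entry.getValue() * binSize);
        }
        return estimated;
    }
}
```

Labels that never appear in the sample get no entry at all, so rare facet 
values can be missing from the estimate entirely.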

What do you think?

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

