[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915762#comment-13915762
 ] 

Shai Erera commented on LUCENE-5476:
------------------------------------

This looks like a great start! I have few comments/suggestions, based on the 
previous sampling impl:

* I think SamplingFC.createDocs should return a declared SampledDocs (see 
later) instead of anonymous class

* Currently that SampledDocs.getDocIdSet() creates a new sample on every call. 
This is bad not only from a performance point of view, but more because if 
there are few Facets* that call it, they will compute weights on different 
samples
** Instead, I think that SampledDocs.getDocIdSet() should return the same 
sampled DIS, i.e. by caching it.
** Also, I think it would be good if it exposes a getSampledDocIdSet which 
takes some parameters e.g. the sample size
** I think the original FBS should not be modified. E.g. in the previous 
sampling impl, you could compute the sampled top-facets but then return their 
exact counts by traversing the original matching docs and counting only them.
*** But maybe we should leave that for later, it could be a different SFC impl. 
But still please don't compute the sample on every getDocIdSet() call.

* The old implementation let you specify different parameters such as sample 
size, minimum number of documents to evaluate, maximum number of documents to 
evaluate etc. I think those are important since defining e.g. 10% sample size 
could still cause you to process 10M docs, which is slow.

I like that this impl samples per-segment as it allows to tune the sample on a 
per-segment basis. E.g. small segments (as in NRT) probably don't need to be 
sampled at all. If we allow passing different parameters such as sampleRatio, 
min/maxSampleSize, we could tune sampling per-segment.

I agree with Mike that we need to allow passing a seed for repeatability (I 
assume it will be important when testing). Maybe wrap all the parameters in a 
SamplingConfig? We can keep the sugar ctors on SFC to take only e.g. 
sampleRatio.

Also, one thing that we did in the old impl is not use Math.random() or Random 
at all. Rather some crazy formula was used to create faster samples. I don't 
say we need to do that now, but maybe drop a TODO. At any rate, I don't think 
we should use Math.random(), but rather init a Random once and call 
nextDouble(). This will also allow to reuse a seed.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to