[jira] [Commented] (LUCENE-5476) Facet sampling

Shai Erera (JIRA) Mon, 03 Mar 2014 07:41:09 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918164#comment-13918164
 ]


Shai Erera commented on LUCENE-5476:
------------------------------------

+1 for removing SamplingParams.

I'm OK if the original collected hits are lost; let's just make sure we document 
this on the collector. Perhaps in the future (separate issue) we can add 
another collector (or enhance this one) to allow you to get both the original 
docs and the sampled docs.

I wonder what the performance implications are of sampling during collection 
versus post-collection. In the past, Mike and I saw that not interfering with 
collection improves performance, though what we measured was accessing the 
ordinals-DV during collection. Just wondering if in-collection performs better 
than post-collection. Especially if sampleRatio is low, sampling in-collection 
means we set far fewer hits on the FBS, instead of setting all of them just to 
clear most of them afterwards. On the other hand we add calls to 
random.nextInt/Double, so that's a tradeoff which would be good to measure. We 
don't even need random.nextDouble(); we can do what you/Mike suggested above: 
work in "bins", and for each bin draw a random index and discard all hits given 
to addDoc except the one at binIndex.
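
To make the "bins" idea concrete, here is a minimal sketch in plain Java. It is 
not Lucene code: {{BinSampler}}, {{accept()}} and {{sample()}} are hypothetical 
names, and I'm assuming the bin size is simply round(1 / sampleRatio). A real 
collector would call something like {{accept()}} once per hit and only set the 
bit on the FBS when it returns true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of Lucene): keeps exactly one hit per bin. */
final class BinSampler {
  private final Random random;
  private final int binSize;   // assumed to be round(1 / sampleRatio), at least 1
  private int posInBin = 0;    // position of the next hit inside the current bin
  private int keepIndex;       // the single position in the current bin that is kept

  BinSampler(double sampleRatio, long seed) {
    if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
      throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
    }
    this.binSize = (int) Math.max(1, Math.round(1.0 / sampleRatio));
    this.random = new Random(seed);
    this.keepIndex = random.nextInt(binSize);
  }

  /** Call once per collected hit, in order; returns true if the hit is kept. */
  boolean accept() {
    boolean keep = (posInBin == keepIndex);
    posInBin++;
    if (posInBin == binSize) {
      // Bin is full: start a new one and draw its kept position.
      posInBin = 0;
      keepIndex = random.nextInt(binSize);
    }
    return keep;
  }

  /** Convenience for trying it out: samples an array of doc IDs. */
  static List<Integer> sample(int[] docIds, double sampleRatio, long seed) {
    BinSampler sampler = new BinSampler(sampleRatio, seed);
    List<Integer> kept = new ArrayList<>();
    for (int doc : docIds) {
      if (sampler.accept()) {
        kept.add(doc);
      }
    }
    return kept;
  }
}
```

This costs one random.nextInt call per bin rather than one random call per hit, 
and the same seed always yields the same sample.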

I also think that if we keep the original FBS (whether we clone-and-clear, 
create-new-sampled-one or whatever), we should iterate over the matching docs 
and not over all the bits. I don't see why iterating over all the bits would be 
better, unless bitset.cardinality() is very big (like maybe 75% of the bitset 
size)? Of course, if we move to sampling in-collection, that's not an issue at 
all.
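
A rough sketch of what I mean by iterating on the matching docs, using 
java.util.BitSet as a stand-in for the FBS (so the calls below are the JDK's, 
not Lucene's). {{SetBitSampling}} and {{sampleMatching}} are made-up names for 
illustration, and this variant creates a new sampled bitset instead of clearing 
bits in place.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical illustration: post-collection sampling that visits only set bits. */
final class SetBitSampling {
  static BitSet sampleMatching(BitSet matching, double sampleRatio, long seed) {
    Random random = new Random(seed);
    BitSet sampled = new BitSet(matching.length());
    // nextSetBit jumps from one matching doc to the next, skipping clear bits.
    for (int doc = matching.nextSetBit(0); doc >= 0; doc = matching.nextSetBit(doc + 1)) {
      if (random.nextDouble() < sampleRatio) {
        sampled.set(doc);
      }
      if (doc == Integer.MAX_VALUE) {
        break; // avoid int overflow in doc + 1
      }
    }
    return sampled;
  }
}
```

The loop does work proportional to the number of matching docs, which is the 
point when the bitset is sparse.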

Rob, I want to make sure we are all on the same page as to what we want to 
achieve in this issue (helps scope it):

* Add SamplingFacetsCollector which takes {{sampleRatio}} and optional {{seed}} 
and random-samples the matching docs so that facets are counted on fewer hits
* We should note that as a result the weights computed for facets are only 
approximations, and the application needs to correct them if it wants (e.g. by 
multiplying by the inverse of {{sampleRatio}}, or doing an exact count etc.). 
At any rate, this is something that we don't plan to tackle in this issue, right?
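
Just to illustrate the simplest correction an application could apply on its 
side (scaling by the inverse of {{sampleRatio}}), here is a small sketch. 
{{SampledCountCorrection}} and {{scale}} are hypothetical names, the facet 
counts are modelled as a plain map from label to sampled count, and the result 
is still only an estimate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical application-side helper: scales sampled counts up by 1 / sampleRatio. */
final class SampledCountCorrection {
  static Map<String, Long> scale(Map<String, Integer> sampledCounts, double sampleRatio) {
    if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
      throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
    }
    Map<String, Long> estimated = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : sampledCounts.entrySet()) {
      // Estimated count = sampled count * (1 / sampleRatio), rounded to a whole number.
      estimated.put(entry.getKey(), Math.round(entry.getValue() / sampleRatio));
    }
    return estimated;
  }
}
```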

Maybe we should call it RandomSamplingFacetsCollector to denote it's a random 
sample (vs. e.g. the approaches mentioned by Gilad above)? Then the parameter 
{{seed}} is clear as to what it means.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
