[jira] [Commented] (LUCENE-5476) Facet sampling

Gilad Barkai (JIRA) Thu, 06 Mar 2014 04:01:04 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922374#comment-13922374
 ]


Gilad Barkai commented on LUCENE-5476:
--------------------------------------

Hi Rob, patch looks great.

A few comments:
* Some imports are not used (o.a.l.u.Bits, o.a.l.s.Collector & o.a.l.s.DocIdSet)
* Perhaps the parameters initialized in the RandomSamplingFacetsCollector c'tor 
could be made {{final}}
* XORShift64Random.XORShift64Random() (default c'tor) is never used. Perhaps it 
was needed for usability when this was thought to be a core utility and was 
left by mistake? Should it be called somewhere?
* {{getMatchingDocs()}}
** when {{!sampleNeeded()}} there's a call to {{super.getMatchingDocs()}}, this 
may be redundant method call as 5 lines above we call it, and the code always 
compute the {{totalHits}} first. Perhaps the original matching docs could be 
stored as a member? This would also help for some implementations of correcting 
the sampled facet results.
** {{totalHits}} is redundantly computed again in line 147-152 
* {{needsSampling()}} could perhaps be protected, allowing other criteria for 
sampling to be added
* {{createSample()}}
** {{randomIndex}} is initialized to {{0}}, effectively making the first 
document of every segment's bin to be selected as the representative of that 
bin, neglecting the rest of the bin (regardless of the seed). So if a bin is 
the size of a 1000 documents, than there are 999 documents that regardless of 
the seed would always be neglected. It may be better so initialize as 
{{randomIndex = random.nextInt(binsize)}} as it happens for the 2nd and on 
bins. 
** While creating a new {{MatchingDocs}} with the sampled set, the original 
{{totalHits}} and original {{scores}} are used. I'm not 100% sure the first is 
an issue, but any facet accumulation which would rely on document scores would 
be hit by the second as the {{scores}} (at least by javadocs) are defined as 
non-sparse.


> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-5476) Facet sampling

Reply via email to