[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917407#comment-13917407 ]

Shai Erera commented on LUCENE-5476:
------------------------------------

I reviewed createSample in the patch, and I think something's wrong with it. As 
I understand it, you set binsize to {{1.0/sampleRatio}}, so if sampleRatio=0.1, 
binsize=10. This means that we should keep 1 out of every 10 matching 
documents. You then iterate over the bitset and, for every group of 10 
documents, draw a representative index at random and clear all the other 
documents.

But what seems wrong is that the iteration advances through the "bin" 
irrespective of which documents actually matched. So e.g. if the FixedBitSet 
covers 100 docs total, only docs 5, 15, 25, 35, ... matched, and 
random.nextInt(binsize) happens to pick any index other than the matching one 
in each bin, then, if I understand the code correctly, all bits will be 
cleared! I double-checked with the following short main:

{code}
import java.util.Random;

import org.apache.lucene.util.FixedBitSet;

public class SamplingBugDemo {
  public static void main(String[] args) {
    Random random = new Random();
    // 100 docs in the index; only docs 5, 15, 25, ... 95 match.
    FixedBitSet sampledBits = new FixedBitSet(100);
    for (int i = 5; i < 100; i += 10) {
      sampledBits.set(i);
    }
    int size = 100;
    int binsize = 10;

    // The patch's loop: walk fixed bins of the doc-id space and keep only
    // the doc at a random offset within each bin, whether it matched or not.
    int countInBin = 0;
    int randomIndex = random.nextInt(binsize);
    for (int i = 0; i < size; i++) {
      countInBin++;
      if (countInBin == binsize) {
        countInBin = 0;
        randomIndex = random.nextInt(binsize);
      }
      if (sampledBits.get(i) && !(countInBin == randomIndex)) {
        sampledBits.clear(i);
      }
    }

    // Print the docs that survived the sampling.
    for (int i = 0; i < 100; i++) {
      if (sampledBits.get(i)) {
        System.out.print(i + " ");
      }
    }
    System.out.println();
  }
}
{code}

And indeed, in many runs the main prints nothing, and in some only 2-3 docs 
"survive" the sampling.

So first, I think Gilad's pseudo-code is more correct in that it iterates over 
the matching documents only, and I think you should do the same here. Once you 
do that, you no longer need to check bits.get(i) in order to decide whether to 
clear it.
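
For illustration, here is a minimal sketch of that approach. The method name 
and signature are my own, not from the patch, and it assumes the Lucene 4.x 
FixedBitSet API, where nextSetBit returns -1 past the last set bit:

{code}
// Hypothetical sketch (not the patch's code): iterate only the matching
// documents and keep one randomly chosen doc out of every 'binsize' matches.
static FixedBitSet createSample(FixedBitSet bits, int binsize, Random random) {
  FixedBitSet sampled = new FixedBitSet(bits.length());
  int countInBin = 0;
  int randomIndex = random.nextInt(binsize);
  int doc = bits.nextSetBit(0);
  while (doc != -1) {
    if (countInBin == randomIndex) {
      sampled.set(doc); // the bin's randomly chosen representative
    }
    if (++countInBin == binsize) {
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    doc = doc + 1 < bits.length() ? bits.nextSetBit(doc + 1) : -1;
  }
  return sampled;
}
{code}

Building a fresh bitset (rather than clearing in place) also guarantees 
exactly one survivor per full bin of matching documents.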

I wonder if your benchmark results are correct (unless you ran a 
MatchAllDocsQuery?) -- can you confirm that you indeed count 10% of the 
documents?

Another point raised by Gilad is the method of sampling. While you implement 
random sampling, there are other methods, like "take every 10th document", 
which will be faster. I think one could experiment with these through an 
extension of SampledDocs.createSample (and SamplingFC.createDocs), but in case 
you want to give such a simple sampling method a try, it would be interesting 
to compare random sampling with something simpler, e.g. the sketch below.
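
Again just a hypothetical helper under the same FixedBitSet assumptions, not 
code from the patch:

{code}
// Hypothetical sketch of "take every Nth matching document": deterministic,
// no Random calls at all.
static FixedBitSet createSystematicSample(FixedBitSet bits, int binsize) {
  FixedBitSet sampled = new FixedBitSet(bits.length());
  int countInBin = 0;
  int doc = bits.nextSetBit(0);
  while (doc != -1) {
    if (countInBin == 0) {
      sampled.set(doc); // keep the first match of each bin of 'binsize'
    }
    countInBin = (countInBin + 1) % binsize;
    doc = doc + 1 < bits.length() ? bits.nextSetBit(doc + 1) : -1;
  }
  return sampled;
}
{code}

This trades randomness for speed, at the risk of bias if the matching 
documents are periodic in doc-id order.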

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?


