[ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-5476:
---------------------------------

    Attachment: SamplingComparison_SamplingFacetsCollector.java

Nice patch,

I was surprised by the gain - only a 3.5x speedup when sampling 1/1000 of the 
documents... I think this is because the random generation takes too long?

Perhaps it is too heavy to call {{.get(docId)}} on the original/clone and 
{{.clear()}} for every bit.

I crafted some code that creates a new (cleared) {{FixedBitSet}} instead of a 
clone and sets only the sampled bits. With it, instead of going over every 
document and calling {{.get(docId)}}, I used the iterator, which may be more 
efficient when it comes to skipping cleared bits (deleted docs?).

I ran the code on the scenario you described (a bitset sized at 10M, 2.5M 
randomly selected bits set, and then a thousandth of those sampled, leaving 
2,500 documents).
The results:
* non-cloned + iterator: ~20ms
* cloned + for loop on each docId: ~50ms

{code}
        int countInBin = 0;
        int randomIndex = random.nextInt(binsize);
        final DocIdSetIterator it = docIdSet.iterator();

        try {
          for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            if (++countInBin == binsize) {
              countInBin = 0;
              randomIndex = random.nextInt(binsize);
            }
            if (countInBin == randomIndex) {
              sampledBits.set(doc);
            }
          }
        } catch (IOException e) {
          // should not happen
          throw new RuntimeException(e);
        }
{code}
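
For anyone who wants to try the idea without a Lucene checkout, here is a minimal stand-alone sketch of the two strategies. It is not the attached file: {{java.util.BitSet}} stands in for {{FixedBitSet}}, the class and method names ({{SamplingSketch}}, {{sampleByCloning}}, {{sampleByIterating}}) are made up for illustration, and the clone-and-clear variant is only my reading of the original approach. Unlike the snippet above, it tests the slot before advancing the counter, so every bin covers slots 0..binSize-1.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical sketch; java.util.BitSet is a stand-in for Lucene's FixedBitSet. */
final class SamplingSketch {

  /** Clone the input, then visit every docId and clear the set bits that were not picked. */
  static BitSet sampleByCloning(BitSet docs, int maxDoc, int binSize, Random random) {
    BitSet sampled = (BitSet) docs.clone();
    int countInBin = 0;
    int randomIndex = random.nextInt(binSize);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (!sampled.get(doc)) {
        continue; // not a hit, nothing to sample
      }
      if (countInBin != randomIndex) {
        sampled.clear(doc); // this hit is not the chosen slot of its bin
      }
      if (++countInBin == binSize) {
        // bin is full: start a new bin and pick a new slot
        countInBin = 0;
        randomIndex = random.nextInt(binSize);
      }
    }
    return sampled;
  }

  /** Start from an empty set and visit only the set bits, setting the picked ones. */
  static BitSet sampleByIterating(BitSet docs, int maxDoc, int binSize, Random random) {
    BitSet sampled = new BitSet(maxDoc);
    int countInBin = 0;
    int randomIndex = random.nextInt(binSize);
    // nextSetBit skips over runs of cleared bits, returning -1 when none are left
    for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
      if (countInBin == randomIndex) {
        sampled.set(doc); // chosen slot of the current bin
      }
      if (++countInBin == binSize) {
        countInBin = 0;
        randomIndex = random.nextInt(binSize);
      }
    }
    return sampled;
  }
}
```

Given the same seed, both methods consume the random numbers in the same order and so should pick the same documents; the only difference is how much work is spent on bits that are not set or not picked. Timings will of course differ from the {{FixedBitSet}} numbers above.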

Also attaching the Java file with a main() which compares the two 
implementations (the original is still named {{createSample()}} and the 
proposed one is {{createSample2()}}).

Hopefully, if this proves consistent, the end-to-end sampling gain should come 
closer to the factor of 10 we hoped for.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
