[
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gilad Barkai updated LUCENE-5476:
---------------------------------
Attachment: SamplingComparison_SamplingFacetsCollector.java
Nice patch,
I was surprised by the gain - only 3.5 times faster when sampling 1/1000 of the
documents... I think this is because the random generation takes too long?
Perhaps it is too heavy to call {{.get(docId)}} on the original/clone and
{{.clear()}} for every bit.
I crafted some code that creates a new (cleared) {{FixedBitSet}} instead of a
clone and sets only the sampled bits. With it, instead of going over every
document and calling {{.get(docId)}}, I used the iterator, which may be more
efficient when it comes to skipping cleared bits (deleted docs?).
Ran the code on a scenario like the one you described (bitset sized at 10M,
2.5M randomly selected bits set, and then sampling a thousandth of them, down
to 2,500 documents).
The results:
* non-cloned + iterator: ~20ms
* cloned + for loop on each docId: ~50ms
{code}
// Position within the current bin, and the position chosen for sampling.
int countInBin = 0;
int randomIndex = random.nextInt(binsize);
final DocIdSetIterator it = docIdSet.iterator();
try {
  // The iterator visits only set bits, so cleared bits are skipped.
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS;
       doc = it.nextDoc()) {
    if (++countInBin == binsize) {
      // Start a new bin and pick a fresh random position for it.
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (countInBin == randomIndex) {
      sampledBits.set(doc);
    }
  }
} catch (IOException e) {
  // should not happen
  throw new RuntimeException(e);
}
{code}
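For anyone without Lucene on the classpath, here is a rough, self-contained sketch of the two approaches side by side. It is an illustration only, not the code from the patch or the attachment: {{java.util.BitSet}} stands in for {{FixedBitSet}}, {{nextSetBit()}} stands in for the {{DocIdSetIterator}}, and the class and method names ({{SamplingSketch}}, {{sampleByClone}}, {{sampleByIterator}}) are made up. The bin bookkeeping also checks before incrementing so that every bin spans exactly {{binSize}} documents, which is a small deviation from the snippet above.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical sketch: java.util.BitSet is a stand-in for Lucene's FixedBitSet. */
final class SamplingSketch {

    /** Clone-based variant: copy the source, visit every doc id, clear the unsampled bits. */
    static BitSet sampleByClone(BitSet docs, int numDocs, int binSize, Random random) {
        BitSet sampled = (BitSet) docs.clone();
        int countInBin = 0;
        int randomIndex = random.nextInt(binSize);
        for (int doc = 0; doc < numDocs; doc++) {
            if (!sampled.get(doc)) {
                continue; // cleared bit: still costs one get() call
            }
            if (countInBin != randomIndex) {
                sampled.clear(doc); // not the chosen position in this bin
            }
            if (++countInBin == binSize) {
                countInBin = 0;
                randomIndex = random.nextInt(binSize);
            }
        }
        return sampled;
    }

    /** Iterator-style variant: start empty, jump between set bits, set only the sampled ones. */
    static BitSet sampleByIterator(BitSet docs, int numDocs, int binSize, Random random) {
        BitSet sampled = new BitSet(numDocs);
        int countInBin = 0;
        int randomIndex = random.nextInt(binSize);
        // nextSetBit() returns -1 when no further bit is set.
        for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
            if (countInBin == randomIndex) {
                sampled.set(doc); // the chosen position in this bin
            }
            if (++countInBin == binSize) {
                countInBin = 0;
                randomIndex = random.nextInt(binSize);
            }
        }
        return sampled;
    }
}
```

Both methods draw random numbers in the same order, so with equally seeded {{Random}} instances they should select the same documents; the difference is only in how many bits each one has to touch.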
Also attaching the Java file with a main() which compares the two
implementations (the original is still named {{createSample()}} and the
proposed one is {{createSample2()}}).
Hopefully, if this proves consistent, the end-to-end sampling should come
closer to the factor-of-10 gain that was hoped for.
> Facet sampling
> --------------
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?