[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

I reviewed the patch more closely before committing:

* Modified a few javadocs.
* Removed needsSampling(), since we don't offer an extension any access to e.g. totalHits. We can add it when there's demand.
* Fixed a bug in how carryOver was implemented: it is replaced by two members, {{leftoverBin}} and {{leftoverIndex}}. Now, if {{leftoverBin != -1}}, we know to skip the first such documents in the next segment and, depending on {{leftoverIndex}}, whether we need to sample any of them. Before that, we didn't really skip over the leftover docs in the bin, but started to count a new bin.
* Added a CHANGES entry.

I reviewed the test. It would be good to write a unit test which specifically matches only a few documents in one segment compared to the rest. I will look into it, perhaps later.

I think it's ready, but if anyone wants to give createSample() another look, to make sure that leftover works well this time, I won't commit it before tomorrow anyway.

> Facet sampling
> --
>
> Key: LUCENE-5476
> URL: https://issues.apache.org/jira/browse/LUCENE-5476
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents) counting facets is rather expensive, as all the hits are collected and processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be brought back?
-- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
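For readers following the carry-over fix described in the message above, here is a minimal, self-contained sketch of how sampling bins can span segment boundaries. It is one possible reading of that description, not the code in the patch: the class CarryOverSampler, its method names, and the use of plain int[] hit lists and java.util.Random are assumptions made for illustration. Only the field names leftoverBin and leftoverIndex come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: one hit is sampled per bin of binSize hits; bins may span segments. */
final class CarryOverSampler {
  private final int binSize;
  private final Random random;
  private int leftoverBin = -1;   // hits still owed to the unfinished bin, -1 if none
  private int leftoverIndex = -1; // offset of the pending pick within those hits, -1 if already picked

  CarryOverSampler(int binSize, long seed) {
    this.binSize = binSize;
    this.random = new Random(seed);
  }

  /** Samples the hits of one segment; call once per segment, in order. */
  List<Integer> sampleSegment(int[] hits) {
    List<Integer> sampled = new ArrayList<>();
    int pos = 0;
    // First finish the bin that the previous segment left open.
    if (leftoverBin != -1) {
      int take = Math.min(leftoverBin, hits.length);
      if (leftoverIndex != -1 && leftoverIndex < take) {
        sampled.add(hits[leftoverIndex]);
        leftoverIndex = -1;
      } else if (leftoverIndex != -1) {
        leftoverIndex -= take; // the pick lies even further ahead
      }
      leftoverBin -= take;
      pos = take;
      if (leftoverBin == 0) {
        leftoverBin = -1;      // old bin is complete
      } else {
        return sampled;        // segment ended before the old bin was filled
      }
    }
    // Then process new bins, remembering any bin cut off by the segment end.
    while (pos < hits.length) {
      int pick = random.nextInt(binSize);
      int remaining = hits.length - pos;
      if (remaining >= binSize) {
        sampled.add(hits[pos + pick]);
        pos += binSize;
      } else {
        if (pick < remaining) {
          sampled.add(hits[pos + pick]);
          leftoverIndex = -1;
        } else {
          leftoverIndex = pick - remaining;
        }
        leftoverBin = binSize - remaining;
        pos = hits.length;
      }
    }
    return sampled;
  }
}
```

With this bookkeeping, the number of sampled hits depends only on the total number of hits, not on how they are split over segments: 40 hits with a bin size of 10 always yield 4 samples.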
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

Removed scores. Added javadoc explaining what happens to scores. Removed System.out.println.
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

New patch. I'm still not really sure about the scoring, but please take a look at it.
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

I have a patch ready that implements the random sampling by overriding {{.getMatchingDocs()}}. It passes the test, so it should be ok :). It is slower, however (only a 3x speedup), but maybe there is room for optimization?

Exact  : 168 ms
Sampled: 55 ms
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

* Reverted move of XorShift64Random.
* Added basic test for random sampling. It uses sampling ratios of 1.0d and 0.1d. 1.0d should behave like no sampling at all.
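To illustrate why a ratio of 1.0d should behave like no sampling, here is a small self-contained sketch. It is not the test or the collector from the patch: the class RatioSampler is hypothetical, java.util.BitSet stands in for Lucene's FixedBitSet, and deriving the bin size as round(1 / ratio) is an assumption about how the ratio maps to bins.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical stand-in for the sampling step; java.util.BitSet replaces FixedBitSet. */
final class RatioSampler {
  /** Keeps one randomly chosen hit out of every bin of round(1 / ratio) hits; ratio must be in (0, 1]. */
  static BitSet sample(BitSet hits, double ratio, Random random) {
    int binSize = (int) Math.round(1.0 / ratio);
    BitSet sampled = new BitSet(hits.length());
    int countInBin = 0;
    int randomIndex = random.nextInt(binSize); // always 0 when binSize == 1
    // Visit only the set bits, i.e. the hits.
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      if (countInBin == randomIndex) {
        sampled.set(doc);
      }
      if (++countInBin == binSize) {
        // Bin is full: start the next one with a fresh random position.
        countInBin = 0;
        randomIndex = random.nextInt(binSize);
      }
    }
    return sampled;
  }
}
```

With a ratio of 1.0 the bin size is 1, so the random position is always 0 and every hit is kept. With a ratio of 0.1 and 1000 hits, each of the 100 full bins contributes exactly one hit.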
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

I have taken these points into account in the new patch.

* Separated {{XORShift64Random}} to o.a.l.util.
* Moved SampledDocs and declared it private.
* Fixed some typos / redundancy in the javadocs.
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

New patch. My little benchmark results:

Exact  : 176 ms
Sampled: 36 ms

* I removed the {{SamplingParams}}.
* I use random sampling in bins in {{addDoc}}. This has almost no performance impact compared to fixed-interval sampling when running with a sampleRatio of 0.001.
* Renamed the class to {{RandomSamplingFacetsCollector}}.
* I decided to always add the first document of a segment. This counters the fact that the random index in the last bin can be larger than the segment size. It gives slightly better results.

I do have some questions:

* Code style-wise: I see {{this.}} is sometimes used, and sometimes not. What do you prefer?
* It seems the {{FixedBitSet}} contains a lot of empty space, as only 1 in 1000 hits is preserved. I'm not that familiar with all the DocIdSet implementations, but maybe there are more suitable subclasses?
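To put a rough number on the "empty space" question, the following sketch compares the storage of a one-bit-per-document set with a sorted array holding only the sampled doc ids. The class SampleStorage and both methods are hypothetical, object headers are ignored, and the sorted int array is only an illustrative sparse representation, not a specific Lucene DocIdSet implementation.

```java
/** Hypothetical back-of-the-envelope helper; ignores object headers and padding. */
final class SampleStorage {
  /** Bytes used by a bitset with one bit per document, backed by 64-bit words. */
  static long bitSetBytes(int maxDoc) {
    long words = (maxDoc + 63L) / 64L;
    return words * Long.BYTES;
  }

  /** Bytes used by a sorted int[] that holds only the sampled doc ids. */
  static long sortedIntArrayBytes(int sampledDocs) {
    return (long) sampledDocs * Integer.BYTES;
  }
}
```

For the setup used elsewhere in this thread (10M documents, 25% of them hits, a sampling ratio of 0.001, so 2,500 sampled documents), bitSetBytes(10_000_000) gives 1,250,000 bytes, while sortedIntArrayBytes(2_500) gives 10,000 bytes. The bitset cost depends on the index size, not on how many hits survive sampling.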
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilad Barkai updated LUCENE-5476:
-
Attachment: SamplingComparison_SamplingFacetsCollector.java

Nice patch. I was surprised by the gain: only a 3.5x gain for 1/1000 of the documents... I think this is because the random generation takes too long?

Perhaps it is too heavy to call {{.get(docId)}} on the original/clone and {{.clear()}} for every bit. I crafted some code for creating a new (cleared) {{FixedBitSet}} instead of a clone, and setting only the sampled bits. With it, instead of going over every document and calling {{.get(docId)}}, I used the iterator, which may be more efficient when it comes to skipping cleared bits (deleted docs?).

I ran the code on a scenario as you described (bitset sized at 10M, 2.5M randomly selected bits set, and then sampling a thousandth of them down to 2,500 documents). The results:

* non-cloned + iterator: ~20ms
* cloned + for loop on each docId: ~50ms

{code}
int countInBin = 0;
int randomIndex = random.nextInt(binsize);
final DocIdSetIterator it = docIdSet.iterator();
try {
  // Visit only the set bits (the hits) instead of testing every doc id.
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    if (++countInBin == binsize) {
      // Bin is full: start a new one and pick a new random position in it.
      countInBin = 0;
      randomIndex = random.nextInt(binsize);
    }
    if (countInBin == randomIndex) {
      sampledBits.set(doc);
    }
  }
} catch (IOException e) {
  // should not happen
  throw new RuntimeException(e);
}
{code}

Also attaching the java file with a main() which compares the two implementations (the original is still named {{createSample()}} and the proposed one is {{createSample2()}}).

Hopefully, if this proves consistent, the end-to-end sampling should be closer to the factor-of-10 gain that was hoped for.
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

Here is a patch. I'm not totally satisfied with it though. There are two loose ends:

* When a segment has many docids but very few hits, they might be skipped altogether. If this happens in many segments, the estimate gets more and more off.
* When a segment is small, there might be a few hits that are never touched, as the randomIndex in the 'bin-o-hits' is greater than the index of the hit.

Some small performance indicators, using an in-memory index with 10M documents, retrieving 25% of them and using a samplingRatio of 0.001 (skipping the first call, averaged over the next three):

Exact  : 184 ms
Sampled: 67 ms

Not the 10-fold increase I had hoped for, but significantly faster.

Btw, for my use case the two problems I described above are not a real issue. I do a global count anyway, and use this to determine if the facets need to be sampled or not. When they need to be sampled, I know for sure that there are enough hits to give a proper estimate. The price of counting and sampling is lower than the price of retrieving exact facets.
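The "count first, then decide" approach from the message above could look roughly like the sketch below. Everything in it is an assumption for illustration: the class SamplingDecision, the minSampleSize threshold, and the idea of scaling a sampled count back up by dividing by the ratio are not taken from the patch.

```java
/** Hypothetical helper for deciding between exact and sampled facet counting. */
final class SamplingDecision {
  /** Sample only when the expected sample still contains at least minSampleSize hits. */
  static boolean shouldSample(long totalHits, double samplingRatio, int minSampleSize) {
    return totalHits * samplingRatio >= minSampleSize;
  }

  /** Scales a count observed in the sample back up to an estimate over all hits. */
  static long estimateCount(long sampledCount, double samplingRatio) {
    return Math.round(sampledCount / samplingRatio);
  }
}
```

With 2.5M hits and a ratio of 0.001 the expected sample holds 2,500 hits, so a threshold of 1,000 would allow sampling. With 50,000 hits the expected sample is only 50, and exact counting would be used instead.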
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: LUCENE-5476.patch

Here is a patch (against 4.7) that covers some of the feedback.
[jira] [Updated] (LUCENE-5476) Facet sampling
[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Audenaerde updated LUCENE-5476:
---
Attachment: SamplingFacetsCollector.java