[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch Here the updated patch after some changes in trunk. It also adds missCount checks back to Caching*Filters, I lost then during cleanup. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch Patch that fixes the Weight.scoreDocsOutOfOrder method to return the inner weight's setting. The scorer can still return docs in order, but that was identical behaviour in previous unpatched trunk (IS looked at the out-of- order setting of the weight and uses correct collector, but once a filter was applied, the documents came in order). My patch only missed to pass this setting to our wrapper query. Mike: If you have time, can you check this? We may need a test, that uses a larger index and tests FilteredQuery on top of it, the current indexes used for filtering are simply too small and in most cases have only one segment :( There is no need for Robert's hack (that does not work correctly with aceptDocs != liveDocs), if different BooleanScorers return significant different scores, it as a bug, not a problem in FilteredQuery. Slight score changes and therefor different order in results is not a problem at all - this is just my opinion. bq. If we want to check that the results are identical, the benchmark test must explicitely request docs-in-order on trunk vs. patch to be consistent. But then it's no longer a benchmark. This is of course untrue, sorry. If the weight returns that docs *may* come out of order, the collector should handle this. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: (was: LUCENE-1536.patch) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch Sorry last patch upload was somehow corrupted, maybe because of JIRA issues. This is the one only implementing Weight.scoreDocsOutOfOrder(), no hacks - to have a clean start. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: (was: LUCENE-1536.patch) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: LUCENE-1536_hack.patch hack patch that computes the heuristic up front in weight init, so it scores all segments consistently and returns the proper scoresDocsOutOfOrder for BS1. Uwe's new test (the nestedFilterQuery) doesnt pass yet, don't know why. I recomputed the benchmarks: {noformat} Task QPS trunkStdDev trunk QPS patchStdDev patch Pct diff PhraseF1.0 11.990.207.790.23 -37% - -31% TermF0.5 135.147.62 116.570.36 -18% - -8% AndHighHighF100.0 17.340.78 15.440.15 -15% - -5% AndHighHighF95.0 17.280.66 15.480.17 -14% - -5% AndHighHighF90.0 17.310.76 15.580.19 -14% - -4% AndHighHighF99.0 17.051.02 15.450.17 -15% - -2% AndHighHighF75.0 17.470.78 16.030.15 -12% - -3% AndHighHighF5.0 20.690.95 19.780.23 -9% - 1% AndHighHighF1.0 35.111.46 33.640.36 -8% - 1% AndHighHighF0.1 136.043.70 132.001.41 -6% - 0% AndHighHigh 18.250.70 17.740.20 -7% - 2% AndHighHighF0.5 49.841.72 48.580.49 -6% - 1% TermF0.1 351.18 11.01 345.851.73 -4% - 2% Fuzzy2F100.0 95.524.21 94.332.07 -7% - 5% SloppyPhraseF100.08.010.287.910.09 -5% - 3% Fuzzy2F90.0 95.423.86 94.511.74 -6% - 5% Fuzzy2F95.0 95.204.86 94.331.83 -7% - 6% Fuzzy1F1.0 54.021.67 53.561.07 -5% - 4% PhraseF2.07.730.077.680.18 -3% - 2% SloppyPhraseF99.07.990.237.950.10 -4% - 3% AndHighHighF50.0 17.540.79 17.460.12 -5% - 4% Fuzzy2F0.1 105.393.93 105.343.74 -7% - 7% SpanNearF100.03.160.063.160.04 -2% - 2% Fuzzy2F99.0 94.026.86 94.211.97 -8% - 10% Fuzzy2F75.0 95.563.51 95.762.02 -5% - 6% WildcardF2.0 52.790.27 53.050.57 -1% - 2% Fuzzy1F0.5 58.121.83 58.431.22 -4% - 5% PhraseF0.1 66.340.78 66.731.68 -3% - 4% SloppyPhraseF0.1 56.151.52 56.790.64 -2% - 5% SloppyPhrase8.080.268.180.08 -2% - 5% PKLookup 176.595.07 178.965.71 -4% - 7% SpanNearF0.1 32.360.56 32.830.54 -1% - 4% OrHighHighF0.1 78.200.52 79.440.740% - 3% SloppyPhraseF95.07.910.088.050.090% - 3% Fuzzy2 94.873.72 96.491.62 -3% - 7% OrHighHighF0.5 31.410.47 31.960.330% - 4% SpanNearF99.03.120.063.180.030% - 4% WildcardF0.5 61.970.56 63.280.820% - 4% PhraseF0.5 19.780.26 20.290.310% - 5% SpanNear3.190.083.270.05 -1% - 6% WildcardF0.1 67.450.64 69.240.890% - 4% SloppyPhraseF90.08.000.298.210.12 -2% - 8% SpanNearF95.03.130.043.230.031% - 5% Wildcard 43.190.34 44.641.400% - 7% Fuzzy2F50.0 95.124.22 98.692.28 -2% - 11% Fuzzy1 55.284.53 57.680.76 -4% - 15% OrHighHigh 12.130.99 12.710.43 -6% - 18% Phrase3.600.043.810.043% - 7% SpanNearF90.03.150.053.350.043% - 9% Term 71.690.40 76.534.130% - 13% PhraseF99.03.430.033.680.045% - 9% PhraseF100.03.390.053.670.045% - 10% SloppyPhraseF75.08.04
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: changes-yonik-uwe.patch LUCENE-1536.patch LUCENE-1536-rewrite.patch Attached you will find a new patch LUCENE-1536.patch, incorporating Yonik's changes plus some minor improvements: - changed Javadocs of DIS.bits() to explain what you should do/not do. - Added another early exit condition in FilteredQuery#Weight.scorer(): As we already get the first matching doc of the filter iterator before looking at bits or creating the query scorer, we should erly exit, if the first matching doc is Disi.NO_MORE_DOCS. This saves us from creating the Query Scorer. - I removed Robert's safety TODO in SolrIndexSearcher. It no longer disabled random access completely. After Yoniks changes, all places in Solr that are not random access secure are disabled - e.g. SolrIndexSearcher.FilterImpl (not sure what this class does, maybe it should also implement bits()?) - we should do that in a Solr specific optimization issue. Some other cool thing with filters is ANDing filters without ChainedFilter (this approach is is very effective with random access as it does not allocate additional BitsSet). If you want to AND together several filters and apply them to a Query, do the following: {code:java} IS.search(new FilteredQuery(query,filter2), filter1,...); {code} You can chain even more filters in by adding more FilteredQueries. What this does: IS will automatically create another FilteredQuery to apply the filter and get the Weight of the top-level FilteredQuery. The scorer of this one will be top-level, get the filter and if it is random access, it will execute the filter with acceptDocs==liveDocs. The result bits of this filter will be passed to Weight.scorer of the second FilteredQuery as acceptDocs. This one will pass the acceptDocs (which are already filtered) to its Filter and if again random access pass those as acceptDocs to the inner Query's scorer. Finally the top-level IS will execute scorer.score(Collector), which in fact is the inner Query's scorer (no wrappers!) with all filtering applied in acceptDocs. This is incredible cool :-) One thing about large patches in an issue: If you are working on an issue and have you local changes in your checkout and posted a patch to an issue and somebody else, posted an updated patch to an issue, it is often nice to see the diff between those patches. I wanted to see what Yonik changed, but a 140 K patch is not easy to handle. The trick is interdiff from patchutils package: You can call interdiff LUCENE-1536-original.patch LUCENE-1536-yonik.patch and you get a patch of only changes applied by Yonik. This patch can even be applied to your local already patched checkout. The changes-yonik-uwe.patch was generated that way and shows, what changes I did in my last patch in contrast to Yoniks original. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: LUCENE-1536.patch patch moving the heuristic to FilteredQuery, AssertingIndexSearcher overrides wrapFilter() and returns random.nextBoolean(). In some tests we explicitly tested on/off, i fixed these to still do this, however i think its a little overkill (since newSearcher randomizes this anyway), but i left them working the same way. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch Roberts patch with TestFilteredQuery cleaned up a little bit (not thousands of anonymous subclasses) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch Improved the test for random access with BucktScorer to also chain two filters. This should pass the acceptDocs down so both filters are ANDed together. Also made AssertingIndexSearcher also use the autodetection. In half of all cases it uses autodetection from parent class, in the other half it decides between random access and iterator access. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, changes-yonik-uwe.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536-rewrite.patch Improved patch with modules and contrib fixed, Solr still on TODO: - BooleanFilter and ChainedFilter do not apply acceptDocs to their Filter clauses. Instead they apply the acceptDocs on the final DocIdSet using BitsFilteredDocIdSet (see below). This improves filter performance, as the deleted documents are not applied on each clause. - New helper class BitsFilteredDocIdSet, which supplies a wrap method, that can apply a Bits (e.g. acceptDocs) to a DocIdSet. This is useful for Filters, that build DocIdSets without respecting the acceptDocs parameter and only want to apply the deletions live. - CachingWrapperFilter and CachingSpanFilter now also use the acceptDocs wrapper, as the filters are cached without acceptDocs. - Fixed Javadocs, small changes to IndexSearcher if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: LUCENE-1536.patch updated patch: i fixed solr to the api changes and simply disabled the optimization in SolrIndexSearcher. I think this is the most conservative way to proceed, we can then open a followup issue to make whatever changes are necessary to Solr APIs so it can use the optimization (looks complex) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536-rewrite.patch Patch with Solr fixes by Robert improved. Now usage of acceptDocs is correct. There is room to optimize, but this is the correct way to solve (as Filters are required to respect deleted docs in trunk). Now the remaining caching wrapper changes, then we are ready. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: luceneutil.patch if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, luceneutil.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536-rewrite.patch A first rewrite of Lucene core to pass acceptDocs down to Filter.getDocIdSet: - optimized and simpliefied CachingWrapper* - no deletesmode anymore - FieldCacheTermsFilter has optimized DocIdSet - Added bits() to all DocIdSet - IndexSearcher.searchWithFilter was rewritten to pass liveDocs down. - AndBits is no longer needed The tests are not yet rewritten, still 55 compile errors This patch is just for review if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: LUCENE-1536.patch adding back this optimization, again. before committing please give me time to write tests to ensure we aren't losing these optimizations. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536-rewrite.patch Here a patch with almost all core tests rewritten (I left out the CachingWrapper tests, as I nuked DeletesMode). Its just for demonstartion. Some tests have really stupid filters and work only with optimized indexes. I added asserts in those filters (except one), that acceptDocs==null. The remaining one uses QueryUtils and I have no idea whats going on there, that the acceptDocs!=null. When looking at the code in IndexSearcher, I would propose to remove all Filter special handling in IndexSaercher and move all code over to FilteredQuery (with all our optimizations). If you call IS.search(query, filter,...), IndexSearcher would simply wrap with FilteredQuery and we would have no code duplication and much easier maintainability in IS. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536-rewrite.patch New patch (still only Lucene Core, no contrib/modules/solr modified): - Nuked Filter handling completely from IndexSearcher. Algorithms and Random access optimizations were added to FilteredQuery. IS.search(Query, Filter,...) now only wraps the query with the Filter, if filter!=null (small helper method). - The random access threshhold is still in IndexSearcher.setFilterRandomAccessThreshold(), FilteredQuery gets it in it's weight from IndexSearcher. This is maybe not the best solutions, we can also add a setter to FilteredQuery and IS passes it to FilteredQuery. What do you think? Mike: Can you do perf tests? if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536-rewrite.patch New patch: - Fixed the FilteredQuery-Scorer's advance by logic change. Its now much easier to understand. The corresponding tests are in TestFilteredQuery: All tests are executed 2 times: as random access filter and as iterator filter. Also FilteredQuery is added to BQ, so the conventional scorer (nextDoc/advance) is tested. The tests for CachingWrapper* are still disabled, have to rewrite them tomorrow. Then we can change contrib and Solr. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch New patch: - Fixed FieldCacheRangeFilterDocIdSet to implement bits - Removed deleted docs handling in this filter. We explicitely pass false. No the iterator and get(bits) may return deleted docs. - Made AndBits final (speed and no need to subclass) The question here: I see no filter that explicitely returns true for the deletedDocs handling. But e.g. most filters actually dont have deleted docs, like MultiTermQueryWF,... if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1536: -- Attachment: LUCENE-1536.patch I modified my patch for FieldCacheRangeFilter to enforce that it does not respect deletes (it throws UOE on the setter). I checked some combinations, setContainsOnlyLiveDocs() cannot go in like that!!! CachingWrapperFilter should have its own handling of this and return the correct setting, but not modify an already created bitset/whatever DocIdSet. My above patch does no longer respect deleted docs in FieldCacheRangeFilter and returns false from the beginning. But if you cache this filter, it suddenly tries to set the boolean to true - test fail Also all Filters that internally use deleted docs, should return true from the beginning. We can maybe add the setter to FixedBitSet, so filter impls can easily set this bit, but it should not be a setter in DocIdSet! if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: LUCENE-1536.patch updated patch: * implements heuristic with getter/setter (defaults to random access if we estimate filter accepts 1% of documents) * sets the inOrder/topLevel stuff correct when we use random access -- but we should add a test that we get BS1 here! * when the filter is sparse, but contains only live docs, we pass null as liveDocs. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1536: Attachment: LUCENE-1536.patch just a code style tweak, a test to make sure we get BS1 with random access filters, and randomization of the parameter value in LuceneTestCase. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-1536: --- Attachment: LUCENE-1536.patch Small improvements: - Changed isContainsOnlyLiveDocs to containsOnlyLiveDocs - Changed visibility of containsOnlyLiveDocs field since with the getter/setters it doesn't need to be protected. - Tidied the documentation on the getter/setters of the threshold in IS I think we're good to go? if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-1536: --- Attachment: LUCENE-1536.patch Patch updated following the changes in LUCENE-3474. (Note, this patch also fixes a missed change in LUCENE-3474, in MultiPhraseQuery.UnionDocsAndPositionsEnum). New patch features: - liveDocsOnly() - containsOnlyLiveDocs() - getRandomAccessBits() and useRandomAccess() are gone, replaced by some additional logic in IS. Now we check if the DocIdSet is a Bits impl, and then we consult a protected method in IS called useRandomAccess(). - At the moment useRandomAccess() just returns true, while we discuss what heuristics are best to apply here. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Attachment: LUCENE-1536.patch New patch folding in Uwe's ideas -- FCRF always uses random access; remove matchDoc and just override Bits.get; fixed CWF jdocs to not reference OpenBitSetDISI. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-1536: --- Attachment: LUCENE-1536.patch New patch which adds greater control over the random-access to DocIdSet. Also fixes most of the nocommits. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Attachment: LUCENE-1536.patch New patch, fixing a few failing tests, adding a couple comments. I downgraded the 2 nocommits in LTC to TODOs, and removed the other one (false is correct default for those two?). For DocIdSet, can we nuke getRandomAccessBits? Ie, if supportRandomAccess() returns true, then we can cast the instance to Bits? Maybe we should rename supportRandomAccess to useRandomAccess? (Ie, it may support it, but we only want to use random access when the filter is dense enough). I changed MTQWF's and CachingWrapperFilter's threshold to a double percent (of the reader's maxDoc) instead, default 1.0%. Hmm FieldCacheRangeFilter never enables random access... not sure how/when we should compute that. I think we are getting close! if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-1536: --- Attachment: LUCENE-1536.patch I think I have managed to update this to trunk. The only dated aspect is the use of OpenBitSet, which if I understand correctly, has been replaced by FixedBitSet for most cases. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Male updated LUCENE-1536: --- Attachment: LUCENE-1536.patch New patch which includes the missed merging, some small tidy-ups and converting over to FixedBitSet. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Attachment: LUCENE-1536.patch Patch. Lots of nocommits still but tests pass. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Attachment: LUCENE-1536.patch Initial patch for trunk... lots of nocommits, but tests all pass and I think this is [roughly] the approach we should take to get fast(er) Filter perf. Conceptually, this change is fairly easy, because the flex APIs all accept a Bits to apply low-level filtering. However, this Bits is inverted vs the Filter that callers pass to IndexSearcher (skipDocs vs keepDocs), so, my patch inverts 1) the meaning of this first arg to the Docs/AndPositions enums (it becomes an acceptDocs instead of skipDocs), and 2) deleted docs coming back from IndexReaders (renames IR.getDeletedDocs - IR.getNotDeletedDocs). That change (inverting the Bits to be keepDocs not skipDocs) is the vast majority of the patch. The real change is to add DocIdSet.getRandomAccessBits and bitsIncludesDeletedDocs, which IndexSearcher then consults to figure out whether to push the filter low instead of high. I then fixed OpenBitSet to return this from getRandomAccessBits, and fixed CachingWrapperFilter to turn this on/off as well as state whether deleted docs were folded into the filter. This means filters cached with CachingWrapperFilter will apply low, and if it's DeletesMode.RECACHE then it's a single filter that's applied (else I wrap with an AND NOT deleted check per docID), but custom filters are also free to impl these methods to have their filters applied low. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: core/search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-1536: Labels: mentor (was: gsoc2011 lucene-gsoc-11) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-1536: Labels: gsoc2011, lucene-gsoc-11 mentor, (was: mentor) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011,, lucene-gsoc-11, mentor, Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simon Willnauer updated LUCENE-1536: Labels: gsoc2011 lucene-gsoc-11 mentor (was: gsoc2011, lucene-gsoc-11 mentor,) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11, mentor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Labels: gsoc2011 lucene-gsoc-11 (was: ) if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Labels: gsoc2011, lucene-gsoc-11 Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Attachment: CachedFilterIndexReader.java By subclassing FilterIndexReader, and taking advantage of how the flex APIs now let you pass a custom skipDocs when pulling the postings, I created a prototype class (attached, named CachedFilterIndexReader) that up-front compiles the deleted docs for each segment with the negation of a Filter (that you provide), and returns a reader that applies that filter. This is nice because it's fully external to Lucene, and it gives awesome gains in many cases (see http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html). I don't think we should commit this class -- we should instead fix Filters correctly! But it's a nice workaround until we do that. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 4.0 Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1536: - Attachment: LUCENE-1536.patch * Added a IndexReader.termDocs(term, filter) method which culminates in passing a RandomAccessDocIdSet to the scorers. ConstantScoreQuery and FilterQuery need to AND the filters together. * Does explain need the filter? * Spans aren't supported yet if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Fix Version/s: (was: 2.9) 3.1 Moving out. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1536: - Attachment: LUCENE-1536.patch * Filter has a parameter for included deletes. IndexSearcher uses it for setting the RandomAccessDocIdSet on the Scorers * MatchAllDocsScorer in addition to TermScorer properly supports setting a RandomAccessDocIdSet * Added more test cases * AndRandomAccessDocIdSet.iterator() and BitVector.iterator() needs to be implemented if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1536: - Attachment: LUCENE-1536.patch The patch is a start at something that will work within the API (I think). * Created a RandomAccessDocIdSet class that BitVector and OpenBitSet implement. * SegmentTermDocs has a setter for RandomAccessDocIdSet with the option of including the deletedDocs * IndexSearcher.doSearch does the main work of setting the RandomAccessDocIdSet on the TermScorers * BooleanScorer2 and other classes implement getSequentialSubScorers * AndRandomAccessDocIdSet iterates over an array of RandomAccessDocIdSets if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1536: --- Fix Version/s: 2.9 Given the sizable performance gains, I think we should try to do this for 2.9, if possible. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch I ran some performance tests, comparing applying a filter via random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use random-access filter API in IndexSearcher's main loop. Method low means I use random-access filter API down in SegmentTermDocs (just like deleted docs today). * Baseline (QPS) is current trunk, where filter is applied as iterator up high (ie in IndexSearcher's search loop). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org