[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access
[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5938: - Attachment: LUCENE-5938.patch I updated the patch to recent trunk modifications and ran the benchmark again, I think it is ready. In summary this patch: - introduces a new doc-id set impl similar to FixedBitSet but which is much faster in the sparse case and a bit slower in the dense case (between 1.5x and 4x slower according to benchmarks). - introduces a doc-id set builder that supports random write access by starting with a sparse bit set and upgrading to a dense FixedBitSet when the cardinality of the set increases - changes MultiTermQueryWrapperFilter and TermsFilter to use this new builder - removes CONSTANT_SCORE_AUTO_REWRITE and makes CONSTANT_SCORE_FILTER_REWRITE the default For queries that match many documents ({{wikimedium10m.tasks}}, the builder always ends up building a FixedBitSet), this new patch can be slower or faster depending on the cost to iterate the matching terms (since they are enumerated only once now) vs. the cost of building the doc-id set. For queries that match few documents ({{low_freq.tasks}}, attached to this issue), this new patch makes things faster since it just sets a couple of bits in random order and then iterates over them instead of merging documents coming from several other iterators on the fly using a priority queue. Independently of the benchmarks, I think it's a good simplification to remove the constant-score auto rewrite mode. {noformat} wikimedium10m.tasks TaskQPS baseline StdDev QPS patch StdDev Pct diff IntNRQ8.79 (9.6%)8.41 (3.4%) -4.3% ( -15% -9%) Fuzzy2 60.83 (11.1%) 58.34 (8.7%) -4.1% ( -21% - 17%) OrNotHighMed 98.35 (13.8%) 97.12 (10.9%) -1.3% ( -22% - 27%) OrHighNotHigh 18.88 (13.7%) 18.71 (11.1%) -0.9% ( -22% - 27%) OrNotHighHigh 17.10 (13.4%) 17.03 (11.2%) -0.4% ( -22% - 27%) OrNotHighLow 126.52 (13.6%) 126.85 (10.9%) 0.3% ( -21% - 28%) OrHighMed 76.90 (14.0%) 77.25 (11.4%) 0.5% ( -21% - 30%) OrHighNotLow 41.29 (14.3%) 41.51 (12.4%) 0.5% ( -22% - 31%) OrHighNotMed 57.70 (13.6%) 58.03 (11.6%) 0.6% ( -21% - 29%) OrHighLow 73.14 (14.7%) 73.74 (12.0%) 0.8% ( -22% - 32%) LowSloppyPhrase 127.22 (8.6%) 128.43 (3.8%) 1.0% ( -10% - 14%) OrHighHigh 29.11 (15.1%) 29.41 (12.2%) 1.0% ( -22% - 33%) HighSloppyPhrase 12.83 (10.4%) 13.02 (5.3%) 1.4% ( -12% - 19%) Prefix3 113.92 (9.9%) 115.71 (2.4%) 1.6% ( -9% - 15%) PKLookup 297.13 (9.2%) 302.03 (3.5%) 1.6% ( -10% - 15%) MedSloppyPhrase 38.60 (8.8%) 39.26 (3.7%) 1.7% ( -9% - 15%) AndHighHigh 71.39 (6.9%) 72.67 (0.9%) 1.8% ( -5% - 10%) HighTerm 87.17 (7.9%) 88.85 (2.1%) 1.9% ( -7% - 12%) MedPhrase 74.60 (9.3%) 76.10 (4.3%) 2.0% ( -10% - 17%) LowPhrase 21.67 (9.6%) 22.12 (4.0%) 2.1% ( -10% - 17%) AndHighMed 297.13 (9.4%) 303.73 (2.1%) 2.2% ( -8% - 15%) HighPhrase 16.65 (8.2%) 17.04 (3.7%) 2.3% ( -8% - 15%) HighSpanNear4.56 (10.7%)4.67 (6.1%) 2.4% ( -12% - 21%) LowTerm 769.53 (7.8%) 788.24 (2.0%) 2.4% ( -6% - 13%) AndHighLow 726.76 (10.6%) 744.51 (4.2%) 2.4% ( -11% - 19%) MedSpanNear 65.27 (9.3%) 67.00 (3.2%) 2.6% ( -9% - 16%) Wildcard 114.28 (9.1%) 118.05 (7.4%) 3.3% ( -12% - 21%) LowSpanNear 174.75 (10.3%) 180.83 (3.5%) 3.5% ( -9% - 19%) Fuzzy1 67.63 (11.3%) 70.08 (3.2%) 3.6% ( -9% - 20%) MedTerm 209.00 (9.3%) 216.71 (1.9%) 3.7% ( -6% - 16%) Respell 48.30 (10.6%) 50.58 (1.7%) 4.7% ( -6% - 18%) low_freq.tasks TaskQPS baseline StdDev QPS patch StdDev Pct diff PKLookup 278.50 (8.8%) 297.48 (13.9%) 6.8% ( -14% -
[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access
[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5938: - Attachment: LUCENE-5938.patch Thanks for the review Mike, here is a new patch that tries to address your concerns. bq. Maybe fix word to be long instead Done. bq. do we need to guard against zeroWords being 0? I added a comment as well as a test as you suggested. {quote} Maybe add some more comments around tricky parts of SparseFixedBitSet. E.g., the different branches inside set? And, it looks strange doing 1L i, but in fact the JVM/processor make that 1L (i % 64). And Iterator.currentOrNextDoc is scary looking... do we have enough tests here? {quote} I added more comments, hopefully they make sense. Please let me know if there are still things that are not clear. currentOrNextDoc is a bit complicated because of the different cases that need to be handled but the structure is actually quite simple so at least get and set should be easy to understand. I extracted the insertion of a new long to a separate method, this should make set easier to read? Regarding the shift, indeed it relies on the fact that shifts are mod 64 (FixedBitSet does the same). I added some comments about it. Regarding the tests, BaseDocIdSetTestCase.testAgainstBitSet tests various densities and assertEquals checks nextDoc(), docId(), interleaved calls to nextDoc() and advance(), and that the oal.util.Bits representation is consistent with the iterator. I think that is good? bq. I hit this test failure, which reproduces with the patch I dug that one, and the reason is that the searcher is created with threads, so the exception is indeed wrapped into an ExecutionException, which is in turn wrapped into a RuntimeException to by-pass the fact that ExecutionException is checked. It doesn't reproduce on trunk because the default rewrite method reads data from the index in MultiTermQuery.rewrite (collectTerms) which does not run in a thread pool. You can reproduce the issue on trunk by setting the rewrite method of the term range query to {{CONSTANT_SCORE_FILTER_REWRITE}}. I fixed the test in the patch to walk down the causes of the exception that is thrown. New DocIdSet implementation with random write access Key: LUCENE-5938 URL: https://issues.apache.org/jira/browse/LUCENE-5938 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, low_freq.tasks We have a great cost API that is supposed to help make decisions about how to best execute queries. However, due to the fact that several of our filter implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, either we use the cost API and make bad decisions, or need to fall back to heuristics which are not as good such as RandomAccessFilterStrategy.useRandomAccess which decides that random access should be used if the first doc in the set is less than 100. On the other hand, we also have some nice compressed and cacheable DocIdSet implementation but we cannot make use of them because TermsFilter requires a DocIdSet that has random write access, and FixedBitSet is the only DocIdSet that we have that supports random access. I think it would be nice to replace FixedBitSet in those filters with another DocIdSet that would also support random write access but would have a better cost? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access
[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5938: - Attachment: low_freq.tasks LUCENE-5938.patch OK, I did something slightly different. It happens that all queries in the tasks file match a pretty large number of documents, which favors FixedBitSet. So now I've configured a threshold: FixedBitSet is used when more than maxDoc / 16384 docs match and SparseFixedBitSet is used otherwise. Since SparseFixedBitSet is much faster than FixedBitSet for such low densities, the cost to start by creating a SparseFixedBitSet and then upgrading to a FixedBitSet is negligible compared to starting with a FixedBitSet from the beginning (see http://people.apache.org/~jpountz/doc_id_sets2.html). So now the benchmark looks better for those queries that match many documents: {noformat} IntNRQ7.10 (6.3%)6.57 (9.6%) -7.4% ( -21% -9%) Prefix3 110.36 (14.8%) 109.88 (9.5%) -0.4% ( -21% - 28%) Wildcard 62.83 (14.5%) 66.93 (9.5%) 6.5% ( -15% - 35%) {noformat} I don't think the improvement with {{Wildcard}} is noise, I can reproduce it easily. I think the reason is that since the default is filter rewrite now, we don't have to compute the terms intersection twice, which is costly with wildcard queries. I also wanted to see what happens with queries that match fewer documents compared to boolean rewrite, so I generated a set of wildcard queries that are expanded to a couple of terms and don't match too many documents (see tasks file attached): {noformat} Wildcard 99.90 (9.0%) 294.66 (30.6%) 194.9% ( 142% - 257%) {noformat} For such queries, the new default rewrite method looks much better. New DocIdSet implementation with random write access Key: LUCENE-5938 URL: https://issues.apache.org/jira/browse/LUCENE-5938 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, low_freq.tasks We have a great cost API that is supposed to help make decisions about how to best execute queries. However, due to the fact that several of our filter implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, either we use the cost API and make bad decisions, or need to fall back to heuristics which are not as good such as RandomAccessFilterStrategy.useRandomAccess which decides that random access should be used if the first doc in the set is less than 100. On the other hand, we also have some nice compressed and cacheable DocIdSet implementation but we cannot make use of them because TermsFilter requires a DocIdSet that has random write access, and FixedBitSet is the only DocIdSet that we have that supports random access. I think it would be nice to replace FixedBitSet in those filters with another DocIdSet that would also support random write access but would have a better cost? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access
[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5938: - Attachment: LUCENE-5938.patch I have been trying to think about how to do it and here is a first iteration. This DocIdSet, called SparseFixedBitSet is similar to FixedBitSet except that it only stores words that have at least one bit that is set. It is inspired from hash array mapped tries, and uses Long.bitCount/numberOfTrailingZeros operations in order to lookup words given an index. Some more notes: - like FixedBitSet it supports random access and implements the Bits interface - although not completely accurate, its cost() method should return numbers that are very close to the set cardinality, especially on sparse sets that have bits uniformly dispersed - memory usage is much better on sparse sets compared to FixedBitSet - it supports random write access, so could be used eg. in TermsFilter I re-ran the benchmark that I had for our doc-id sets with this new impl and it seems to perform very well: http://people.apache.org/~jpountz/doc_id_sets2.html New DocIdSet implementation with random write access Key: LUCENE-5938 URL: https://issues.apache.org/jira/browse/LUCENE-5938 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Attachments: LUCENE-5938.patch We have a great cost API that is supposed to help make decisions about how to best execute queries. However, due to the fact that several of our filter implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, either we use the cost API and make bad decisions, or need to fall back to heuristics which are not as good such as RandomAccessFilterStrategy.useRandomAccess which decides that random access should be used if the first doc in the set is less than 100. On the other hand, we also have some nice compressed and cacheable DocIdSet implementation but we cannot make use of them because TermsFilter requires a DocIdSet that has random write access, and FixedBitSet is the only DocIdSet that we have that supports random access. I think it would be nice to replace FixedBitSet in those filters with another DocIdSet that would also support random write access but would have a better cost? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-5938) New DocIdSet implementation with random write access
[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5938: - Attachment: LUCENE-5938.patch Here is an updated patch: - this sparse set is used in MultiTermQueryWrapperFilter and TermsFilter - constant-score auto rewrite has been removed in favor of filter rewrite I didn't change BooleanFilter and ChainedFilter because I think they need more work. For example, it looks wrong to consume all documents of a filter when computing a conjunction instead of taking a leap-frog approach. Let's tackle this in another issue? Tests pass. New DocIdSet implementation with random write access Key: LUCENE-5938 URL: https://issues.apache.org/jira/browse/LUCENE-5938 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Attachments: LUCENE-5938.patch, LUCENE-5938.patch We have a great cost API that is supposed to help make decisions about how to best execute queries. However, due to the fact that several of our filter implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, either we use the cost API and make bad decisions, or need to fall back to heuristics which are not as good such as RandomAccessFilterStrategy.useRandomAccess which decides that random access should be used if the first doc in the set is less than 100. On the other hand, we also have some nice compressed and cacheable DocIdSet implementation but we cannot make use of them because TermsFilter requires a DocIdSet that has random write access, and FixedBitSet is the only DocIdSet that we have that supports random access. I think it would be nice to replace FixedBitSet in those filters with another DocIdSet that would also support random write access but would have a better cost? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org