[ https://issues.apache.org/jira/browse/LUCENE-5938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-5938: --------------------------------- Attachment: LUCENE-5938.patch I updated the patch to recent trunk modifications and ran the benchmark again, I think it is ready. In summary this patch: - introduces a new doc-id set impl similar to FixedBitSet but which is much faster in the sparse case and a bit slower in the dense case (between 1.5x and 4x slower according to benchmarks). - introduces a doc-id set builder that supports random write access by starting with a sparse bit set and upgrading to a dense FixedBitSet when the cardinality of the set increases - changes MultiTermQueryWrapperFilter and TermsFilter to use this new builder - removes CONSTANT_SCORE_AUTO_REWRITE and makes CONSTANT_SCORE_FILTER_REWRITE the default For queries that match many documents ({{wikimedium10m.tasks}}, the builder always ends up building a FixedBitSet), this new patch can be slower or faster depending on the cost to iterate the matching terms (since they are enumerated only once now) vs. the cost of building the doc-id set. For queries that match few documents ({{low_freq.tasks}}, attached to this issue), this new patch makes things faster since it just sets a couple of bits in random order and then iterates over them instead of merging documents coming from several other iterators on the fly using a priority queue. Independently of the benchmarks, I think it's a good simplification to remove the constant-score auto rewrite mode. {noformat} wikimedium10m.tasks TaskQPS baseline StdDev QPS patch StdDev Pct diff IntNRQ 8.79 (9.6%) 8.41 (3.4%) -4.3% ( -15% - 9%) Fuzzy2 60.83 (11.1%) 58.34 (8.7%) -4.1% ( -21% - 17%) OrNotHighMed 98.35 (13.8%) 97.12 (10.9%) -1.3% ( -22% - 27%) OrHighNotHigh 18.88 (13.7%) 18.71 (11.1%) -0.9% ( -22% - 27%) OrNotHighHigh 17.10 (13.4%) 17.03 (11.2%) -0.4% ( -22% - 27%) OrNotHighLow 126.52 (13.6%) 126.85 (10.9%) 0.3% ( -21% - 28%) OrHighMed 76.90 (14.0%) 77.25 (11.4%) 0.5% ( -21% - 30%) OrHighNotLow 41.29 (14.3%) 41.51 (12.4%) 0.5% ( -22% - 31%) OrHighNotMed 57.70 (13.6%) 58.03 (11.6%) 0.6% ( -21% - 29%) OrHighLow 73.14 (14.7%) 73.74 (12.0%) 0.8% ( -22% - 32%) LowSloppyPhrase 127.22 (8.6%) 128.43 (3.8%) 1.0% ( -10% - 14%) OrHighHigh 29.11 (15.1%) 29.41 (12.2%) 1.0% ( -22% - 33%) HighSloppyPhrase 12.83 (10.4%) 13.02 (5.3%) 1.4% ( -12% - 19%) Prefix3 113.92 (9.9%) 115.71 (2.4%) 1.6% ( -9% - 15%) PKLookup 297.13 (9.2%) 302.03 (3.5%) 1.6% ( -10% - 15%) MedSloppyPhrase 38.60 (8.8%) 39.26 (3.7%) 1.7% ( -9% - 15%) AndHighHigh 71.39 (6.9%) 72.67 (0.9%) 1.8% ( -5% - 10%) HighTerm 87.17 (7.9%) 88.85 (2.1%) 1.9% ( -7% - 12%) MedPhrase 74.60 (9.3%) 76.10 (4.3%) 2.0% ( -10% - 17%) LowPhrase 21.67 (9.6%) 22.12 (4.0%) 2.1% ( -10% - 17%) AndHighMed 297.13 (9.4%) 303.73 (2.1%) 2.2% ( -8% - 15%) HighPhrase 16.65 (8.2%) 17.04 (3.7%) 2.3% ( -8% - 15%) HighSpanNear 4.56 (10.7%) 4.67 (6.1%) 2.4% ( -12% - 21%) LowTerm 769.53 (7.8%) 788.24 (2.0%) 2.4% ( -6% - 13%) AndHighLow 726.76 (10.6%) 744.51 (4.2%) 2.4% ( -11% - 19%) MedSpanNear 65.27 (9.3%) 67.00 (3.2%) 2.6% ( -9% - 16%) Wildcard 114.28 (9.1%) 118.05 (7.4%) 3.3% ( -12% - 21%) LowSpanNear 174.75 (10.3%) 180.83 (3.5%) 3.5% ( -9% - 19%) Fuzzy1 67.63 (11.3%) 70.08 (3.2%) 3.6% ( -9% - 20%) MedTerm 209.00 (9.3%) 216.71 (1.9%) 3.7% ( -6% - 16%) Respell 48.30 (10.6%) 50.58 (1.7%) 4.7% ( -6% - 18%) low_freq.tasks TaskQPS baseline StdDev QPS patch StdDev Pct diff PKLookup 278.50 (8.8%) 297.48 (13.9%) 6.8% ( -14% - 32%) Wildcard 124.50 (7.9%) 250.26 (19.3%) 101.0% ( 68% - 139%) {noformat} > New DocIdSet implementation with random write access > ---------------------------------------------------- > > Key: LUCENE-5938 > URL: https://issues.apache.org/jira/browse/LUCENE-5938 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Assignee: Adrien Grand > Attachments: LUCENE-5938.patch, LUCENE-5938.patch, LUCENE-5938.patch, > LUCENE-5938.patch, low_freq.tasks > > > We have a great cost API that is supposed to help make decisions about how to > best execute queries. However, due to the fact that several of our filter > implementations (eg. TermsFilter and BooleanFilter) return FixedBitSets, > either we use the cost API and make bad decisions, or need to fall back to > heuristics which are not as good such as > RandomAccessFilterStrategy.useRandomAccess which decides that random access > should be used if the first doc in the set is less than 100. > On the other hand, we also have some nice compressed and cacheable DocIdSet > implementation but we cannot make use of them because TermsFilter requires a > DocIdSet that has random write access, and FixedBitSet is the only DocIdSet > that we have that supports random access. > I think it would be nice to replace FixedBitSet in those filters with another > DocIdSet that would also support random write access but would have a better > cost? -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org