[jira] [Updated] (LUCENE-7055) Better execution path for costly queries
[ https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-7055: - Attachment: LUCENE-7055.patch Thanks for the great feedback! I did the following changes: - renamed {{estimateCost}} to {{estimatePointCount}}. I kept {{Point}} in the name to make it clear it was about points rather than docs. - renamed {{LazyScorer}} to {{ScorerSupplier}} to have a consistent naming with the JDK, hopefully that works for you? - fixed {{FakeScorerSupplier}} and added a test to test the tester - made the cost caching in {{Boolean2ScorerSupplier}} more explicit bq. We don't need to implement it now, but I'm curious how we'll implement the cost method for multi term queries? It seems like merely computing the cost (enumerating all terms & summing their sumDocFreq) would be a big part of the overall cost of executing such queries. I guess we would also need a doc-values based query here too, e.g. one that checks the automaton on a binary doc values field or something? Right, I did not think too much about these ones. When figuring out the number of matching terms is cheap (TermsQuery, TermRangeQuery, PrefixQuery), we could return eg. {{num_matching_terms * sum_doc_freq / size}}. For more complex automata, this looks more complicated however. > Better execution path for costly queries > > > Key: LUCENE-7055 > URL: https://issues.apache.org/jira/browse/LUCENE-7055 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Attachments: LUCENE-7055.patch, LUCENE-7055.patch, LUCENE-7055.patch > > > In Lucene 5.0, we improved the execution path for queries that run costly > operations on a per-document basis, like phrase queries or doc values > queries. But we have another class of costly queries, that return fine > iterators, but these iterators are very expensive to build. This is typically > the case for queries that leverage DocIdSetBuilder, like TermsQuery, > multi-term queries or the new point queries. Intersecting such queries with a > selective query is very inefficient since these queries build a doc id set of > matching documents for the entire index. > Is there something we could do to improve the execution path for these > queries? > One idea that comes to mind is that most of these queries could also run on > doc values, so maybe we could come up with something that would help decide > how to run a query based on other parts of the query? (Just thinking out > loud, other ideas are very welcome) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7055) Better execution path for costly queries
[ https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-7055: - Attachment: LUCENE-7055.patch Here is an updated patch. It adds docs, tests and improves the way BooleanQuery implements this new API so that cost estimations would be propagated lazily through deep query trees (similarly to the fact that two-phase iteration still executes slow bits in deep leaves after fast bits that are in the top-level BooleanQuery). Only PointRangeQuery and BooleanQuery implement this API right now, since they are the two important ones to get started, but I think we might be able to generalize to other slow queries, like TermsQuery for instance, by leveraging the sumDocFreq index statistic. But this belongs to another issue. Reviews are warmly welcome. > Better execution path for costly queries > > > Key: LUCENE-7055 > URL: https://issues.apache.org/jira/browse/LUCENE-7055 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Attachments: LUCENE-7055.patch, LUCENE-7055.patch > > > In Lucene 5.0, we improved the execution path for queries that run costly > operations on a per-document basis, like phrase queries or doc values > queries. But we have another class of costly queries, that return fine > iterators, but these iterators are very expensive to build. This is typically > the case for queries that leverage DocIdSetBuilder, like TermsQuery, > multi-term queries or the new point queries. Intersecting such queries with a > selective query is very inefficient since these queries build a doc id set of > matching documents for the entire index. > Is there something we could do to improve the execution path for these > queries? > One idea that comes to mind is that most of these queries could also run on > doc values, so maybe we could come up with something that would help decide > how to run a query based on other parts of the query? (Just thinking out > loud, other ideas are very welcome) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7055) Better execution path for costly queries
[ https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-7055: - Attachment: LUCENE-7055.patch I have been looking at slow queries recently, which were slow for that exact reason. They were running a point range query that covered most of the index, intersected with selective queries, which is typically the case when doc values would perform better than points. So I started exploring how we could improve this and got the following: - PointValues get a new method that computes an estimate of the cost of a visitor. For the Lucene70 codec it basically counts the numbers of leaf blocks that intersect the visitor, and multiplies that number by the number of points on leaf blocks. - A new API on Weight allows to get an estimate of the cost of a Scorer before building it. The underlying idea is that in the case of a conjunction that contains a range, the range should use points if it has the least cost (ie. it will lead the iteration) and doc values otherwise since the scorer will only be used to validate that the current document matched. I attached a patch if someone is interested to look into how that works. I tried to make it as little invasive as possible: the new API on Weight is optional, and we do not need to implement giant queries that know both how to use points and doc values, instead there is a wrapper query called IndexOrDocValuesQuery that wraps both a point/index query and a doc values query and it figures out which one to use based on costs. It is neither complete not commitable, just a proof of concept to trigger some discussion. > Better execution path for costly queries > > > Key: LUCENE-7055 > URL: https://issues.apache.org/jira/browse/LUCENE-7055 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand > Attachments: LUCENE-7055.patch > > > In Lucene 5.0, we improved the execution path for queries that run costly > operations on a per-document basis, like phrase queries or doc values > queries. But we have another class of costly queries, that return fine > iterators, but these iterators are very expensive to build. This is typically > the case for queries that leverage DocIdSetBuilder, like TermsQuery, > multi-term queries or the new point queries. Intersecting such queries with a > selective query is very inefficient since these queries build a doc id set of > matching documents for the entire index. > Is there something we could do to improve the execution path for these > queries? > One idea that comes to mind is that most of these queries could also run on > doc values, so maybe we could come up with something that would help decide > how to run a query based on other parts of the query? (Just thinking out > loud, other ideas are very welcome) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org