subject:"\[jira\] \[Updated\] \(LUCENE\-7055\) Better execution path for costly queries"

[jira] [Updated] (LUCENE-7055) Better execution path for costly queries

2017-01-04 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7055:
-
Attachment: LUCENE-7055.patch

Thanks for the great feedback! I did the following changes:
 - renamed {{estimateCost}} to {{estimatePointCount}}. I kept {{Point}} in the 
name to make it clear it was about points rather than docs.
 - renamed {{LazyScorer}} to {{ScorerSupplier}} to have a consistent naming 
with the JDK, hopefully that works for you?
 - fixed {{FakeScorerSupplier}} and added a test to test the tester
 - made the cost caching in {{Boolean2ScorerSupplier}} more explicit

bq. We don't need to implement it now, but I'm curious how we'll implement the 
cost method for multi term queries? It seems like merely computing the cost 
(enumerating all terms & summing their sumDocFreq) would be a big part of the 
overall cost of executing such queries. I guess we would also need a doc-values 
based query here too, e.g. one that checks the automaton on a binary doc values 
field or something?

Right, I did not think too much about these ones. When figuring out the number 
of matching terms is cheap (TermsQuery, TermRangeQuery, PrefixQuery), we could 
return eg. {{num_matching_terms * sum_doc_freq / size}}. For more complex 
automata, this looks more complicated however.

> Better execution path for costly queries
> 
>
> Key: LUCENE-7055
> URL: https://issues.apache.org/jira/browse/LUCENE-7055
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Attachments: LUCENE-7055.patch, LUCENE-7055.patch, LUCENE-7055.patch
>
>
> In Lucene 5.0, we improved the execution path for queries that run costly 
> operations on a per-document basis, like phrase queries or doc values 
> queries. But we have another class of costly queries, that return fine 
> iterators, but these iterators are very expensive to build. This is typically 
> the case for queries that leverage DocIdSetBuilder, like TermsQuery, 
> multi-term queries or the new point queries. Intersecting such queries with a 
> selective query is very inefficient since these queries build a doc id set of 
> matching documents for the entire index.
> Is there something we could do to improve the execution path for these 
> queries?
> One idea that comes to mind is that most of these queries could also run on 
> doc values, so maybe we could come up with something that would help decide 
> how to run a query based on other parts of the query? (Just thinking out 
> loud, other ideas are very welcome)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7055) Better execution path for costly queries

2017-01-03 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7055:
-
Attachment: LUCENE-7055.patch

Here is an updated patch. It adds docs, tests and improves the way BooleanQuery 
implements this new API so that cost estimations would be propagated lazily 
through deep query trees (similarly to the fact that two-phase iteration still 
executes slow bits in deep leaves after fast bits that are in the top-level 
BooleanQuery).

Only PointRangeQuery and BooleanQuery implement this API right now, since they 
are the two important ones to get started, but I think we might be able to 
generalize to other slow queries, like TermsQuery for instance, by leveraging 
the sumDocFreq index statistic. But this belongs to another issue.

Reviews are warmly welcome.

> Better execution path for costly queries
> 
>
> Key: LUCENE-7055
> URL: https://issues.apache.org/jira/browse/LUCENE-7055
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Attachments: LUCENE-7055.patch, LUCENE-7055.patch
>
>
> In Lucene 5.0, we improved the execution path for queries that run costly 
> operations on a per-document basis, like phrase queries or doc values 
> queries. But we have another class of costly queries, that return fine 
> iterators, but these iterators are very expensive to build. This is typically 
> the case for queries that leverage DocIdSetBuilder, like TermsQuery, 
> multi-term queries or the new point queries. Intersecting such queries with a 
> selective query is very inefficient since these queries build a doc id set of 
> matching documents for the entire index.
> Is there something we could do to improve the execution path for these 
> queries?
> One idea that comes to mind is that most of these queries could also run on 
> doc values, so maybe we could come up with something that would help decide 
> how to run a query based on other parts of the query? (Just thinking out 
> loud, other ideas are very welcome)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7055) Better execution path for costly queries

2016-12-23 Thread Adrien Grand (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7055:
-
Attachment: LUCENE-7055.patch

I have been looking at slow queries recently, which were slow for that exact 
reason. They were running a point range query that covered most of the index, 
intersected with selective queries, which is typically the case when doc values 
would perform better than points.

So I started exploring how we could improve this and got the following:
 - PointValues get a new method that computes an estimate of the cost of a 
visitor. For the Lucene70 codec it basically counts the numbers of leaf blocks 
that intersect the visitor, and multiplies that number by the number of points 
on leaf blocks.
 - A new API on Weight allows to get an estimate of the cost of a Scorer before 
building it. The underlying idea is that in the case of a conjunction that 
contains a range, the range should use points if it has the least cost (ie. it 
will lead the iteration) and doc values otherwise since the scorer will only be 
used to validate that the current document matched.

I attached a patch if someone is interested to look into how that works. I 
tried to make it as little invasive as possible: the new API on Weight is 
optional, and we do not need to implement giant queries that know both how to 
use points and doc values, instead there is a wrapper query called 
IndexOrDocValuesQuery that wraps both a point/index query and a doc values 
query and it figures out which one to use based on costs. It is neither 
complete not commitable, just a proof of concept to trigger some discussion.

> Better execution path for costly queries
> 
>
> Key: LUCENE-7055
> URL: https://issues.apache.org/jira/browse/LUCENE-7055
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Attachments: LUCENE-7055.patch
>
>
> In Lucene 5.0, we improved the execution path for queries that run costly 
> operations on a per-document basis, like phrase queries or doc values 
> queries. But we have another class of costly queries, that return fine 
> iterators, but these iterators are very expensive to build. This is typically 
> the case for queries that leverage DocIdSetBuilder, like TermsQuery, 
> multi-term queries or the new point queries. Intersecting such queries with a 
> selective query is very inefficient since these queries build a doc id set of 
> matching documents for the entire index.
> Is there something we could do to improve the execution path for these 
> queries?
> One idea that comes to mind is that most of these queries could also run on 
> doc values, so maybe we could come up with something that would help decide 
> how to run a query based on other parts of the query? (Just thinking out 
> loud, other ideas are very welcome)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-7055) Better execution path for costly queries

[jira] [Updated] (LUCENE-7055) Better execution path for costly queries

[jira] [Updated] (LUCENE-7055) Better execution path for costly queries

3 matches

Site Navigation

Mail list logo

Footer information