askalt opened a new issue, #19929:
URL: https://github.com/apache/datafusion/issues/19929
This issue covers two related filter push-down improvements.
## Pass previously pushed filters to supports_filters_pushdown
Currently, the filter push-down optimization does not pass filters that were
pushed in a previous run (`TableScan::filters`) to
`TableProvider::supports_filters_pushdown(...)`.
If the optimizer runs multiple times, it may try to push filters into the
table provider multiple times. In our DataFusion-based project,
`supports_filters_pushdown(...)` has context-dependent behavior: the provider
supports any single filter like `column = value`, but not multiple such filters
at the same time.
Consider the following optimizer pipeline pattern:
1. Try to push `a = 1, b = 1`.
   `supports_filters_pushdown` returns `[Exact, Inexact]`.
   OK: the optimizer records that `a = 1` is pushed and creates a filter
   node for `b = 1`.

   ... another optimization iteration ...

2. Try to push `b = 1`.
   `supports_filters_pushdown` returns `[Exact]`. Of course, the table
   provider can't remember all previously pushed filters, so it has no
   choice but to answer `Exact`.

Now the optimizer thinks the conjunction `a = 1 AND b = 1` is supported
exactly, but it is not.
To prevent this problem, I suggest passing filters that were already pushed
into the scan earlier to `supports_filters_pushdown(...)`.
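
To make the two-pass problem concrete, here is a minimal, self-contained sketch. It does not use the real DataFusion API: `PushDown`, `EqFilter`, `SingleFilterProvider`, and the method `supports_filters_pushdown_with_pushed` are hypothetical stand-ins, and the toy provider assumes that `pushed` contains only filters that were previously accepted as `Exact`.

```rust
// Hypothetical, simplified stand-ins for DataFusion types; not the real API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PushDown {
    Exact,
    Inexact,
}

/// A filter of the form `column = value`, reduced to its column name.
type EqFilter = &'static str;

/// Toy provider: it can evaluate at most one equality filter exactly.
struct SingleFilterProvider;

impl SingleFilterProvider {
    /// Current shape: only the new candidate filters are visible.
    fn supports_filters_pushdown(&self, filters: &[EqFilter]) -> Vec<PushDown> {
        filters
            .iter()
            .enumerate()
            .map(|(i, _)| if i == 0 { PushDown::Exact } else { PushDown::Inexact })
            .collect()
    }

    /// Suggested shape (assumed signature): filters already pushed into the
    /// scan as `Exact` are passed alongside the new candidates.
    fn supports_filters_pushdown_with_pushed(
        &self,
        pushed: &[EqFilter],
        filters: &[EqFilter],
    ) -> Vec<PushDown> {
        // The single "exact" slot is taken if anything was pushed before.
        let mut exact_slot_taken = !pushed.is_empty();
        filters
            .iter()
            .map(|_| {
                if exact_slot_taken {
                    PushDown::Inexact
                } else {
                    exact_slot_taken = true;
                    PushDown::Exact
                }
            })
            .collect()
    }
}

/// Replays the two optimizer passes and returns the second-pass answer for
/// `b = 1` as (without pushed filters, with pushed filters).
fn replay_two_passes() -> (PushDown, PushDown) {
    let provider = SingleFilterProvider;
    let candidates = ["a", "b"];

    // Pass 1: `a = 1` is accepted exactly, `b = 1` stays in a filter node.
    let first = provider.supports_filters_pushdown(&candidates);
    let pushed: Vec<EqFilter> = candidates
        .iter()
        .zip(&first)
        .filter(|(_, answer)| **answer == PushDown::Exact)
        .map(|(filter, _)| *filter)
        .collect();
    let remaining: Vec<EqFilter> = candidates
        .iter()
        .zip(&first)
        .filter(|(_, answer)| **answer != PushDown::Exact)
        .map(|(filter, _)| *filter)
        .collect();

    // Pass 2: the provider is asked about `b = 1` again.
    let stateless = provider.supports_filters_pushdown(&remaining)[0];
    let informed = provider.supports_filters_pushdown_with_pushed(&pushed, &remaining)[0];
    (stateless, informed)
}
```

In this sketch the second pass without the pushed filters answers `Exact` for `b = 1`, while the variant that also receives `a = 1` answers `Inexact`, so the filter node for `b = 1` would be kept.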
## Do not assume that the filter support decision is stable
Consider the following scenario:
1. `supports_filters_pushdown` returns `Exact` for some filter, e.g. `a = 1`,
   where column `a` is not required by the query projection.
2. `a` is removed from the table provider projection by the "optimize
   projection" rule.
3. On a later run, `supports_filters_pushdown` changes its decision and
   returns `Inexact` for this filter. For example, the input filters changed
   and the provider prefers to use a new one.
4. `a` is not returned to the table provider projection, which leads to a
   filter that references a column that is not part of the schema.
I suggest extending the logic with the following steps:
1. Collect the columns that are not in the current table provider
   projection but are required by filter expressions. Call this set
   `additional_projection`.
2. If `additional_projection` is empty, leave everything as is.
3. Otherwise, extend the table provider projection and wrap the plan in an
   additional projection node to preserve the schema used prior to this rule.
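
The three steps above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: columns are plain `usize` indices into the table schema, and `additional_projection`, `extend_projection`, and `Rewritten` are hypothetical names, not existing DataFusion items.

```rust
use std::collections::BTreeSet;

/// Step 1: columns referenced by filters but absent from the scan projection.
/// Both arguments are indices into the table schema (an assumption of this sketch).
fn additional_projection(projection: &[usize], filter_columns: &[usize]) -> Vec<usize> {
    let projected: BTreeSet<usize> = projection.iter().copied().collect();
    // A set removes duplicates and gives a deterministic (sorted) order.
    let missing: BTreeSet<usize> = filter_columns
        .iter()
        .copied()
        .filter(|c| !projected.contains(c))
        .collect();
    missing.into_iter().collect()
}

/// Outcome of the rewrite: the (possibly widened) scan projection, plus an
/// optional outer projection that restores the original output schema.
/// The outer projection holds positions in the widened scan output.
#[derive(Debug)]
struct Rewritten {
    scan_projection: Vec<usize>,
    outer_projection: Option<Vec<usize>>,
}

fn extend_projection(projection: &[usize], filter_columns: &[usize]) -> Rewritten {
    let extra = additional_projection(projection, filter_columns);

    // Step 2: nothing is missing, so the plan is left untouched.
    if extra.is_empty() {
        return Rewritten {
            scan_projection: projection.to_vec(),
            outer_projection: None,
        };
    }

    // Step 3: append the missing columns to the scan projection and keep only
    // the original leading columns in the wrapping projection node.
    let mut scan_projection = projection.to_vec();
    scan_projection.extend(extra);
    let outer_projection = Some((0..projection.len()).collect());
    Rewritten {
        scan_projection,
        outer_projection,
    }
}
```

For example, with a scan projection of `[1, 2]` and a filter that references columns `0` and `2`, the sketch widens the scan to `[1, 2, 0]` and wraps it in a projection that keeps the first two output columns, so the schema seen by the rest of the plan does not change.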