Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/20387
For doing pushdown at logical or physical phase, I don't have a strong
preference. I think at logical phase we should try our best to push down
data-size-reduction operators(like filter, aggregate, limit, etc.) close to the
bottom of the plan, as it should always be good. We should apply pushdown to
data sources at physical phase, as it's not always good and we need to consider
the cost. Currently it's done in logical phase because of the `computeStats`
problem. Eventually we should compute the statistics and apply pushdown to data
sources in physical phase.
About how to apply pushdown to data sources, I think `PhysicalOperation` is
in the right direction and the new pushdown rule also follows it. Generally the
logical phase is responsible for pushing down data-size-reduction operators
close to the data source relation, and in the physical phase we collect
supported operators(currently it's only project and filter) above the data
source relation and apply the pushdown once, so this doesn't need to be
incremental.
We definitely need to document the contract for ordering and interactions
between different types of pushdowns. For now we don't need to worry about it
as we only support column pruning and filter push down, and these 2 are
orthogonal, it doesn't matter if data source run project first or filter first.
Here are some initial thoughts on how to define the contract.
Let's say Data Source V2 framework supports pushing down required
columns(column pruning), filter, limit, aggregate. Semantically filter, limit
and aggregate are not exchangeable, we should respect their order in the query.
If we have all these operators in a query, how to tell the data source about
the order of these operators?
My proposal is, since `DataSourceReader` is mutable(not the plan!), we can
ask the data source to remember which operators have been pushed down, via the
order of when these `pushXXX` methods are called. And data source
implementations should respect the order of pushdown when applying them
internally.
About `PhysicalOperation`, it's pretty simple and we probably need to
change it a lot when adding more operator pushdown. Another concern is,
`PhysicalOperation` is used in a lot of places, not only data source pushdown.
For safety, I wanna keep it unchanged, and start something new for data source
v2 only.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]