Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20387
  
    For doing pushdown at logical or physical phase, I don't have a strong 
preference. I think at logical phase we should try our best to push down 
data-size-reduction operators(like filter, aggregate, limit, etc.) close to the 
bottom of the plan, as it should always be good. We should apply pushdown to 
data sources at physical phase, as it's not always good and we need to consider 
the cost. Currently it's done in logical phase because of the `computeStats` 
problem. Eventually we should compute the statistics and apply pushdown to data 
sources in physical phase.
    
    About how to apply pushdown to data sources, I think `PhysicalOperation` is 
in the right direction and the new pushdown rule also follows it. Generally the 
logical phase is responsible for pushing down data-size-reduction operators 
close to the data source relation, and in the physical phase we collect 
supported operators(currently it's only project and filter) above the data 
source relation and apply the pushdown once, so this doesn't need to be 
incremental.
    
    We definitely need to document the contract for ordering and interactions 
between different types of pushdowns. For now we don't need to worry about it 
as we only support column pruning and filter push down, and these 2 are 
orthogonal, it doesn't matter if data source run project first or filter first. 
Here are some initial thoughts on how to define the contract.
    
    Let's say Data Source V2 framework supports pushing down required 
columns(column pruning), filter, limit, aggregate. Semantically filter, limit 
and aggregate are not exchangeable, we should respect their order in the query. 
If we have all these operators in a query, how to tell the data source about 
the order of these operators?
    
    My proposal is, since `DataSourceReader` is mutable(not the plan!), we can 
ask the data source to remember which operators have been pushed down, via the 
order of when these `pushXXX` methods are called. And data source 
implementations should respect the order of pushdown when applying them 
internally.
    
    About `PhysicalOperation`, it's pretty simple and we probably need to 
change it a lot when adding more operator pushdown. Another concern is, 
`PhysicalOperation` is used in a lot of places, not only data source pushdown. 
For safety, I wanna keep it unchanged, and start something new for data source 
v2 only.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to