Github user marmbrus commented on the issue:

    https://github.com/apache/spark/pull/20387
  
    Regarding `computeStats`, the logical plan seems like it might not be the 
right place.  As we move towards more CBO it seems like we are going to need to 
pick physical operators before we can really reason about the cost of a 
subplan. With the caveat that I haven't thought hard about this, I'd be 
supportive of moving these kinds of metrics to the physical plan. +1 that we need 
to be able to consider pushdown when producing stats either way.
    
    On the second point, I don't think I understand DataSourceV2 enough yet to 
know the answer, but you ask a lot of questions that I think need to be defined 
as part of the API (if we haven't already).  What is the contract for ordering 
and interactions between different types of pushdown? Is it valid to push down 
in pieces, or will we only call the method once? (Sorry if this is written down 
and I've just missed it.)
    
    My gut feeling is that we don't really want to fuse incrementally.  It 
seems hard to reason about correctness and interactions between different 
things that have been pushed.  As I hinted at before, I think it's most natural 
to split the concerns of pushdown within a query plan and fusing of operators. 
But maybe this is limited in some way I don't realize.


