GitHub user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21230
  
    Hi @rdblue, thanks for your new approach! As you said, the major problem 
is statistics. This is, unfortunately, a flaw in Spark's CBO design: statistics 
should belong to the physical node, but they currently belong to the logical 
node.
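
    To make the problem concrete, here is a minimal sketch (written against 
Spark 2.3+ catalyst internals, not code from this PR) of why this matters: the 
broadcast-vs-shuffle decision reads `stats.sizeInBytes` from the *logical* 
plan, so statistics that are only corrected at physical planning time come 
too late.

    ```scala
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Simplified version of the size check JoinSelection performs: the
    // estimate comes from the logical plan, before any physical node exists.
    def canBroadcast(plan: LogicalPlan, autoBroadcastJoinThreshold: Long): Boolean =
      plan.stats.sizeInBytes >= 0 &&
        plan.stats.sizeInBytes <= autoBroadcastJoinThreshold
    ```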
    
    For file-based data sources, since they are built-in, we can add rules 
that update statistics during the logical phase, e.g. 
`PruneFileSourcePartitions`. But for external sources like Iceberg, we cannot 
update statistics before planning, so a shuffle join may be wrongly planned 
even when a broadcast join is applicable. In other words, users may need to 
write custom optimizer rules to make their data sources work well.
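
    The per-source workaround this implies would look roughly like the sketch 
below. `FixupExternalSourceStats` and `ExternalSourceExtensions` are 
hypothetical names; `injectOptimizerRule` and `Rule[LogicalPlan]` are the real 
`SparkSessionExtensions` hooks.

    ```scala
    import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical rule an external source would have to ship so that its
    // post-pushdown statistics become visible to join selection.
    case class FixupExternalSourceStats(spark: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        // A real rule would match the source's relation node here and rebuild
        // it with statistics that reflect the pushed filters.
        case other => other
      }
    }

    class ExternalSourceExtensions extends (SparkSessionExtensions => Unit) {
      override def apply(extensions: SparkSessionExtensions): Unit =
        extensions.injectOptimizerRule(FixupExternalSourceStats)
    }
    ```

    Each user would then have to register this via the `spark.sql.extensions` 
configuration, which is exactly the extra burden described above.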
    
    That said, I do like your approach if we can fix the statistics problem 
first. I'm not sure how hard that is or how soon it can be fixed, cc @wzhfy 
    
    Before that, I'd still prefer to keep the pushdown logic in the optimizer 
and leave the hard work to Spark instead of users. What do you think?
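
    For reference, this is roughly what the optimizer-driven pushdown contract 
already looks like from a source's point of view in the Spark 2.3 
DataSourceV2 API (a sketch; the schema and read-factory methods are elided 
with `???`):

    ```scala
    import java.util.{List => JList}

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.sources.v2.reader.{
      DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
    import org.apache.spark.sql.types.StructType

    class ExampleReader extends DataSourceReader with SupportsPushDownFilters {
      private var pushed: Array[Filter] = Array.empty

      // The optimizer hands all candidate filters to the source; the return
      // value is the subset Spark must still evaluate itself (none here).
      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        pushed = filters
        Array.empty
      }

      override def pushedFilters(): Array[Filter] = pushed

      // Remaining DataSourceReader methods elided for brevity.
      override def readSchema(): StructType = ???
      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = ???
    }
    ```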

