Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21230

Hi @rdblue , thanks for your new approach! As you said, the major problem is statistics. This is unfortunately a flaw in Spark's CBO design: statistics should belong to the physical node, but they currently belong to the logical node. For file-based data sources, since they are built-in, we can create rules to update statistics in the logical phase, e.g. `PruneFileSourcePartitions`. But for external sources like Iceberg, we would not be able to update statistics before planning, so a shuffle join may be wrongly planned even when a broadcast join is applicable. In other words, users may need to write custom optimizer rules to make their data sources work well.

That said, I do like your approach if we can fix the statistics problem first. I'm not sure how hard it is or how soon it can be fixed, cc @wzhfy. Until then, I'd like to keep the pushdown logic in the optimizer and leave the hard work to Spark instead of users. What do you think?
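To make the mis-planning risk concrete, here is a hedged, self-contained sketch (not Spark's actual planner code) of how join strategy selection depends on size statistics; the threshold mirrors the idea behind Spark's `spark.sql.autoBroadcastJoinThreshold` setting, and all names below are illustrative:

```scala
// Simplified sketch of statistics-driven join selection.
// In real Spark, the planner compares a logical plan's estimated
// sizeInBytes against spark.sql.autoBroadcastJoinThreshold.
object JoinStrategySketch {
  // Default threshold in Spark is 10 MB; hard-coded here for illustration.
  val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  sealed trait JoinStrategy
  case object BroadcastJoin extends JoinStrategy
  case object ShuffleJoin extends JoinStrategy

  // sizeInBytes plays the role of the logical plan's statistics.
  def chooseJoin(leftSizeInBytes: Long, rightSizeInBytes: Long): JoinStrategy =
    if (math.min(leftSizeInBytes, rightSizeInBytes) <= autoBroadcastJoinThreshold)
      BroadcastJoin
    else
      ShuffleJoin
}
```

The point of the sketch: if an external source can only report a conservative (pre-pushdown) size because its pushdown has not run before planning, `chooseJoin` picks a shuffle join even when the post-pushdown data would fit under the broadcast threshold.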