Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/19864 @viirya yes, we can get more accurate stats later, however, the first stats is also important as it enables the user to pay less for `the first run` which writes cache. The current implementation always chooses the most expensive plan in the first run, e.g. always resort to sortmergejoin instead of broadcastjoin even it is possible, CBO is actually disabled for any operator which locates in downstream of InMemoryRelation. Additionally, it makes execution plan inconsistent even for the same query over the same dataset. Of course, all of these issues happen in the first run. IMHO, we have a chance to make it better, why not?
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org