Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16677

@sujith71955 To optimize `executeTake` this way, we would need to collect statistics on the RDD. But `executeTake` scans partitions incrementally: ideally it scans only a few partitions to return `n` rows, and the remaining partitions are skipped and never materialized. So, going back to the beginning, IMHO, collecting the statistics would force us to materialize all partitions, which seems to run counter to `executeTake`'s optimization.
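To make the trade-off concrete, here is a minimal sketch in plain Python of the incremental scan described above. It is not Spark's implementation: `incremental_take`, `collect_row_counts`, `make_partitions`, and the `scale_up` growth factor are hypothetical names chosen for illustration, and each partition is modeled as a function that records when it is materialized.

```python
from typing import Callable, List, Sequence, Tuple

Partition = Callable[[], List[int]]


def incremental_take(partitions: Sequence[Partition], n: int,
                     scale_up: int = 4) -> Tuple[List[int], int]:
    """Return up to n rows and the number of partitions scanned.

    Scans one partition first, then grows the batch size by `scale_up`
    (an arbitrary factor for this sketch) until n rows are found.
    """
    rows: List[int] = []
    scanned = 0
    batch = 1
    while len(rows) < n and scanned < len(partitions):
        end = min(scanned + batch, len(partitions))
        for compute in partitions[scanned:end]:
            rows.extend(compute())  # materializes this partition only
        scanned = end
        batch *= scale_up
    return rows[:n], scanned


def collect_row_counts(partitions: Sequence[Partition]) -> List[int]:
    """A stand-in for statistics collection: it must compute every partition."""
    return [len(compute()) for compute in partitions]


def make_partitions(num: int, rows_per: int, log: List[int]) -> List[Partition]:
    """Build lazy partitions that append their index to `log` when computed."""
    def make(i: int) -> Partition:
        def compute() -> List[int]:
            log.append(i)
            return list(range(i * rows_per, (i + 1) * rows_per))
        return compute
    return [make(i) for i in range(num)]


take_log: List[int] = []
rows, scanned = incremental_take(make_partitions(100, 10, take_log), 5)

stats_log: List[int] = []
counts = collect_row_counts(make_partitions(100, 10, stats_log))

print(f"take(5) materialized {len(take_log)} partition(s)")
print(f"statistics materialized {len(stats_log)} partition(s)")
```

In this toy setup the take touches a single partition, while gathering per-partition row counts computes all 100, which is the cost the comment is pointing at.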