[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

viirya Wed, 10 Oct 2018 03:09:01 -0700

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/16677
  
    @sujith71955 For `executeTake`, optimizing it this way requires collecting 
statistics of the RDD. But `executeTake` scans partitions incrementally. Ideally, it 
scans only a few partitions to return `n` rows, and the remaining partitions 
are skipped and never materialized. So, going back to the 
beginning, IMHO, if we collect the statistics, we will materialize 
all partitions, and that seems to work against `executeTake`'s optimization.
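
To make the trade-off concrete, here is a toy model in plain Python, not Spark code. The names `take_incremental`, `take_with_statistics`, and the `scale_up` parameter are hypothetical and only loosely mimic how an incremental take grows the number of partitions it scans. Each partition is a function that produces its rows only when called, so we can count how many partitions get materialized.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]
Partition = Callable[[], List[Row]]  # calling it "materializes" the partition


def take_incremental(partitions: List[Partition], n: int,
                     scale_up: int = 4) -> Tuple[List[Row], int]:
    """Scan partitions in growing batches until n rows are found.

    Returns the rows and the number of partitions materialized.
    `scale_up` is an assumed growth factor for illustration only.
    """
    rows: List[Row] = []
    scanned = 0
    batch = 1  # start by scanning a single partition
    while len(rows) < n and scanned < len(partitions):
        for compute in partitions[scanned:scanned + batch]:
            rows.extend(compute())
        scanned = min(scanned + batch, len(partitions))
        batch *= scale_up  # try more partitions on the next round
    return rows[:n], scanned


def take_with_statistics(partitions: List[Partition],
                         n: int) -> Tuple[List[Row], int, List[int]]:
    """Collect per-partition row counts first, then take n rows.

    Gathering the counts forces every partition to be computed.
    """
    materialized = [compute() for compute in partitions]
    sizes = [len(p) for p in materialized]
    rows = [r for p in materialized for r in p]
    return rows[:n], len(materialized), sizes


# Ten partitions of five rows each; row (i, j) is row j of partition i.
partitions = [lambda i=i: [(i, j) for j in range(5)] for i in range(10)]

rows, scanned = take_incremental(partitions, 3)
print(scanned)  # partitions materialized by the incremental scan

rows, scanned, sizes = take_with_statistics(partitions, 3)
print(scanned)  # partitions materialized when statistics are collected first
```

In this sketch the incremental scan touches one partition to return 3 rows, while the statistics-first variant computes all ten before returning anything.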



