[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

cloud-fan Tue, 18 Sep 2018 19:48:51 -0700

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16677
  
    Let me take an example from the PR description
    > For example, we have three partitions with rows (100, 100, 50) 
respectively. In global limit of 100 rows, we may take (34, 33, 33) rows for 
each partition locally. After global limit we still have three partitions.
    
    Without this patch, we need to take the first 100 rows from each partition, 
and then perform a shuffle to send all data into one partition and take the 
first 100 rows.
    
    So if the limit is big, this patch is super useful, if the limit is small, 
this patch is not that useful but should not be slower.
    
    The only overhead I can think of is, `MapStatus` needs to carry the 
numRecords metrics. It should be a small overhead, as `MapStatus` already 
carries many information.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

Reply via email to