[ 
https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19355:
------------------------------------

    Assignee:     (was: Apache Spark)

> Use map output statistices to improve global limit's parallelism
> ----------------------------------------------------------------
>
>                 Key: SPARK-19355
>                 URL: https://issues.apache.org/jira/browse/SPARK-19355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>
> A logical Limit is performed actually by two physical operations LocalLimit 
> and GlobalLimit.
> In most of time, before GlobalLimit, we will perform a shuffle exchange to 
> shuffle data to single partition. When the limit number is very big, we 
> shuffle a lot of data to a single partition and significantly reduce 
> parallelism, except for the cost of shuffling.
> This change tries to perform GlobalLimit without shuffling data to single 
> partition. Instead, we perform the map stage of the shuffling and collect the 
> statistics of the number of rows in each partition. Shuffled data are 
> actually all retrieved locally without from remote executors.
> Once we get the number of output rows in each partition, we only take the 
> required number of rows from the locally shuffled data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to