[ https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-19355: ------------------------------------ Assignee: (was: Apache Spark) > Use map output statistices to improve global limit's parallelism > ---------------------------------------------------------------- > > Key: SPARK-19355 > URL: https://issues.apache.org/jira/browse/SPARK-19355 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Liang-Chi Hsieh > > A logical Limit is performed actually by two physical operations LocalLimit > and GlobalLimit. > In most of time, before GlobalLimit, we will perform a shuffle exchange to > shuffle data to single partition. When the limit number is very big, we > shuffle a lot of data to a single partition and significantly reduce > parallelism, except for the cost of shuffling. > This change tries to perform GlobalLimit without shuffling data to single > partition. Instead, we perform the map stage of the shuffling and collect the > statistics of the number of rows in each partition. Shuffled data are > actually all retrieved locally without from remote executors. > Once we get the number of output rows in each partition, we only take the > required number of rows from the locally shuffled data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org