[jira] [Updated] (SPARK-38010) Push-based shuffle disabled due to insufficient mergeLocations

gaoyajun02 (Jira) Mon, 24 Jan 2022 18:58:04 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-38010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


gaoyajun02 updated SPARK-38010:
-------------------------------
    Description: 
The current shuffle merger locations are obtained from the hosts of the 
active or dead executors.
When dynamic executor allocation is enabled and an application submits its 
first few stages, there are often not enough locations to satisfy push 
merge, so most shuffles do not benefit from push-based shuffle.
The first few shuffle write stages of Spark applications are generally the 
stages that read tables or data sources, and they account for a large amount 
of the shuffled data. Because push-based merge is disabled for these stages, 
the end-to-end improvement for Spark applications is very limited.
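
The failure mode described above can be illustrated with a simplified model. 
This is only a sketch, not Spark's actual selection code: it assumes that 
candidate merger locations are the distinct hosts of the executors seen so 
far, and that push-based merge is enabled for a stage only when the number of 
candidates reaches a minimum. The function name and the threshold parameter 
are hypothetical.

```python
def select_merger_locations(executor_hosts, min_mergers_needed):
    """Return candidate merger hosts, or an empty list if there are too few.

    executor_hosts: hosts of the executors seen so far (active or dead).
    min_mergers_needed: hypothetical minimum for enabling push-based merge.
    """
    # Deduplicate hosts while keeping first-seen order.
    candidates = list(dict.fromkeys(executor_hosts))
    if len(candidates) < min_mergers_needed:
        return []  # too few locations: push-based merge is off for this stage
    return candidates


# Early in the application, dynamic allocation has started only two executors.
early = select_merger_locations(["host-1", "host-2"], min_mergers_needed=5)

# Later, executors have been started on more hosts.
late = select_merger_locations(
    ["host-1", "host-2", "host-3", "host-4", "host-5", "host-1"],
    min_mergers_needed=5,
)

print(early)  # []
print(late)   # ['host-1', 'host-2', 'host-3', 'host-4', 'host-5']
```

In this model the early stages get an empty result even though enough hosts 
become available shortly afterwards.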

I have thought of a possible approach, but I am not sure whether it is feasible:
 *  Lazily initialize the shuffle merger locations: after the mapper writes the 
local shuffle data, it obtains the merge locations in the push thread.
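
A minimal sketch of the lazy-initialization idea, under stated assumptions: 
the class name, the `fetch_locations` callable, and the `min_mergers_needed` 
threshold are all hypothetical and do not correspond to any existing Spark 
API. The point is only that the locations are resolved on first use (for 
example from the push thread) rather than when the stage is submitted.

```python
import threading


class LazyMergerLocations:
    """Resolve merger locations on first use instead of at stage submission."""

    def __init__(self, fetch_locations, min_mergers_needed):
        self._fetch = fetch_locations      # hypothetical callable returning hosts
        self._min = min_mergers_needed     # hypothetical threshold
        self._lock = threading.Lock()
        self._locations = None             # None means "not resolved yet"

    def get(self):
        with self._lock:
            if self._locations is None:
                hosts = self._fetch()
                # Cache only a sufficient result, so a later caller can retry.
                if len(hosts) >= self._min:
                    self._locations = list(hosts)
            return self._locations or []


# Hosts known to the cluster; this list grows as executors are allocated.
cluster_hosts = ["host-1"]
lazy = LazyMergerLocations(lambda: list(cluster_hosts), min_mergers_needed=3)

early_attempt = lazy.get()               # too few hosts, nothing is cached
cluster_hosts += ["host-2", "host-3"]    # executors arrive during the map stage
after_map_write = lazy.get()             # resolved now and cached from here on
```

Caching the first sufficient result models a single shared decision. A real 
implementation would presumably need all mappers of a stage to agree on the 
same locations, which this sketch sidesteps by using one shared object.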

I am looking for advice and possible solutions for this issue.

  was:
The current shuffle merger position is obtained from the host of the active 
or dead executor.
When dynamic resource allocation is enabled and the application submits its 
first few stages, there are often not enough locations to satisfy push 
merge, so most shuffles do not benefit from push-based shuffle.
The first few shuffle write stages of Spark applications are generally the 
stages that read tables or data sources, and they account for a large amount 
and proportion of the shuffled data. Because push cannot be used, the 
end-to-end improvement for Spark applications is very limited.

I have thought of a possible approach, but I am not sure whether it is feasible:
 *  Lazily initialize the shuffle merger locations: after the mapper writes the 
local shuffle data, it obtains the merge locations in the push thread.

I am looking for advice and possible solutions for this issue.


> Push-based shuffle disabled due to insufficient mergeLocations
> --------------------------------------------------------------
>
>                 Key: SPARK-38010
>                 URL: https://issues.apache.org/jira/browse/SPARK-38010
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: gaoyajun02
>            Priority: Major


