[
https://issues.apache.org/jira/browse/SPARK-41953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656480#comment-17656480
]
Mridul Muralidharan commented on SPARK-41953:
-
A few things:
* Looking at SPARK-27637, we should revisit it - the {{IsExecutorAlive}}
request does not make sense in the case of dynamic resource allocation (DRA)
with an external shuffle service (ESS) enabled: we should not be making that
call. Thoughts [~Ngone51]? +CC [~csingh]
This also means that relying on ExecutorDeadException when DRA is enabled and
an ESS is configured won't be useful.
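To make the ESS point concrete, here is a minimal sketch (plain Python, hypothetical names, not Spark's actual API) of why executor liveness is irrelevant to block fetches once an external shuffle service serves the blocks:

```python
# Hedged sketch, not Spark code: with an ESS, reducers fetch from the shuffle
# service on the node, which outlives the executor that wrote the blocks, so
# an IsExecutorAlive-style check adds nothing. All names are hypothetical.

def resolve_fetch_address(block_host, executor_port, ess_port,
                          ess_enabled, executor_alive):
    """Return the (host, port) a reducer should fetch a shuffle block from."""
    if ess_enabled:
        # Blocks are served by the shuffle service on the node; executor
        # liveness does not affect their availability.
        return (block_host, ess_port)
    # Without an ESS, shuffle files are served by the executor itself,
    # so a dead executor means the blocks are lost.
    if not executor_alive:
        raise RuntimeError("ExecutorDeadException: blocks lost with executor")
    return (block_host, executor_port)
```

With `ess_enabled=True` the result is the same whether or not the executor is alive, which is the reason the liveness RPC looks redundant in the DRA + ESS case.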
For the rest of the proposal ...
For (2), I am not sure about 'Make MapOutputTracker support fetch latest
output without epoch provided.' - this could have nontrivial interactions with
other parts of the system, and I will need to think through it. I am not sure
whether we can model node decommission - where a block is moved from host A to
host B without any other change - as not requiring an epoch update; that
requires analysis.
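The epoch concern can be illustrated with a toy model (hypothetical names, not Spark's MapOutputTracker implementation): executor-side caches of map output locations are only invalidated when the master's epoch advances, so a decommission move that does not bump the epoch would be invisible to any reader relying on the cached state.

```python
# Toy model of epoch-gated map-output caching. Names are invented for
# illustration; this is not the Spark class hierarchy.

class TrackerMaster:
    def __init__(self):
        self.locations = {}   # map_id -> host currently holding the output
        self.epoch = 0

    def register(self, map_id, host, bump_epoch=True):
        self.locations[map_id] = host
        if bump_epoch:
            self.epoch += 1

class TrackerWorker:
    def __init__(self, master):
        self.master = master
        self.cache = {}
        self.seen_epoch = -1

    def get_location(self, map_id):
        # The cache is refreshed only when the master's epoch has advanced.
        if self.master.epoch != self.seen_epoch:
            self.cache = dict(self.master.locations)
            self.seen_epoch = self.master.epoch
        return self.cache[map_id]
```

In this model, a `register(..., bump_epoch=False)` move from host A to host B leaves workers serving the stale host A location indefinitely, which is why treating decommission moves as "no epoch update needed" deserves careful analysis.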
Assuming we sort out how to get the updated state, (3) looks like a reasonable
approach.
> Shuffle output location refetch during shuffle migration in decommission
>
>
> Key: SPARK-41953
> URL: https://issues.apache.org/jira/browse/SPARK-41953
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.3.1
> Reporter: Zhongwei Zhu
> Priority: Major
>
> When shuffle migration is enabled during Spark decommission, shuffle data
> will be migrated to live executors, and the latest locations are then updated
> in the MapOutputTracker. This has some issues:
> # Executors fetch map output locations only at the beginning of the reduce
> stage, so any shuffle output location change in the middle of the reduce
> stage will cause FetchFailed as reducers fetch from the old location. Even
> though stage retries can solve this, they still waste a lot of resources, as
> all shuffle reads and computation that happened before the FetchFailed
> partition are wasted.
> # During stage retries, fewer running tasks cause more executors to be
> decommissioned, and shuffle data locations keep changing. In the worst case,
> the stage could need many retries, further breaking the SLA.
> So I propose to support refetching map output locations during the reduce
> phase if shuffle migration is enabled and the FetchFailed is caused by a
> decommissioned dead executor. The detailed steps are as follows:
> # When `BlockTransferService` fails to fetch blocks from a decommissioned
> dead executor, an ExecutorDeadException (with isDecommission set to true)
> will be thrown.
> # Make MapOutputTracker support fetch latest output without epoch provided.
> # `ShuffleBlockFetcherIterator` will refetch the latest output locations
> from the MapOutputTrackerMaster. For all the shuffle blocks on this
> decommissioned executor, there should be a new location on another executor.
> If not, throw an exception as the current code does. If yes, create new local
> and remote requests to fetch these migrated shuffle blocks. The flow will be
> similar to the fallback fetch when a push-merged fetch fails.
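The refetch flow proposed in the quoted steps could be sketched roughly as follows (hypothetical Python model, not the actual `ShuffleBlockFetcherIterator` API; the helper names are invented):

```python
# Hedged sketch of the proposed flow: on a fetch failure from a decommissioned
# dead executor, ask the tracker for the latest (migrated) location and retry;
# otherwise preserve the current FetchFailed behaviour.

def fetch_with_refetch(block_id, tracker, fetch, is_decommissioned_dead):
    """Fetch a shuffle block, refetching its location once if the original
    executor died due to decommission and the block was migrated."""
    loc = tracker.get_location(block_id)
    try:
        return fetch(loc, block_id)
    except RuntimeError:
        if not is_decommissioned_dead(loc):
            raise  # unrelated failure: keep current behaviour (FetchFailed)
        # Corresponds to step 2 of the proposal: query the latest location
        # without relying on an epoch bump.
        new_loc = tracker.get_latest_location(block_id)
        if new_loc is None or new_loc == loc:
            raise  # no migrated copy exists: throw as the current code does
        return fetch(new_loc, block_id)  # retry from the migrated location
```

This only models the per-block decision; the real proposal would batch the surviving blocks into new local and remote fetch requests, as the description notes.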
--
This message was sent by Atlassian Jira
(v8.20.10#820010)