[ 
https://issues.apache.org/jira/browse/SPARK-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27637:
----------------------------
    Component/s: Shuffle

> If exception occurred while fetching blocks by netty block transfer service, 
> check whether the relative executor is alive before retry
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27637
>                 URL: https://issues.apache.org/jira/browse/SPARK-27637
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Shuffle
>    Affects Versions: 2.3.2, 2.4.2
>            Reporter: feiwang
>            Priority: Major
>
> There are several kinds of shuffle client, including blockTransferService 
> and externalShuffleClient.
> The externalShuffleClient is backed by a corresponding external shuffle 
> service, which serves the shuffle block data regardless of the state of the 
> executors.
> The blockTransferService is used to fetch broadcast blocks, and to fetch 
> shuffle data when the external shuffle service is not enabled. 
> When fetching data through blockTransferService, the shuffle client 
> connects to the corresponding executor's blockManager, so if that executor 
> is dead, the fetch can never succeed.
> When spark.shuffle.service.enabled is true and 
> spark.dynamicAllocation.enabled is true, an executor is removed once it 
> has been idle for longer than idleTimeout.
> If a blockTransferService connects to the corresponding executor 
> successfully, but that executor is removed just as the fetch of a 
> broadcast block begins, the client will retry (see RetryingBlockFetcher), 
> which is ineffective.
> If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, 
> such as 30s and 10 retries, this wastes 5 minutes (10 x 30s = 300s).
> So I think we should check whether the corresponding executor is alive 
> before retrying.
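
The proposal and the 5-minute figure above can be illustrated with a minimal
sketch. This is not Spark code and not the RetryingBlockFetcher API: the names
fetch_with_retry, is_executor_alive and FetchResult are hypothetical, the
liveness check is assumed to be available as a simple callback, and Python is
used only for illustration. The sleep function is injected so the example runs
instantly instead of really waiting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FetchResult:
    succeeded: bool
    attempts: int          # fetch attempts actually made
    waited_seconds: float  # total time spent waiting between retries


def fetch_with_retry(
    fetch: Callable[[], bool],
    is_executor_alive: Callable[[], bool],  # hypothetical liveness callback
    max_retries: int,
    retry_wait_seconds: float,
    sleep: Callable[[float], None],
) -> FetchResult:
    """Retry a failing fetch, but stop once the remote executor is known dead."""
    attempts = 0
    waited = 0.0
    while True:
        attempts += 1
        if fetch():
            return FetchResult(True, attempts, waited)
        # Proposed check: a retry against a dead executor cannot succeed,
        # so give up immediately instead of waiting out every retry.
        if attempts > max_retries or not is_executor_alive():
            return FetchResult(False, attempts, waited)
        sleep(retry_wait_seconds)
        waited += retry_wait_seconds


def no_sleep(seconds: float) -> None:
    """Stand-in for time.sleep so the example does not actually wait."""


def always_fail() -> bool:
    """Models a fetch from an executor that has already been removed."""
    return False


# Without a useful liveness check (modelled as "always alive"),
# all 10 retries are attempted: 10 waits of 30s each.
without_check = fetch_with_retry(always_fail, lambda: True, 10, 30.0, no_sleep)
print(without_check.waited_seconds)  # 300.0

# With the check reporting a dead executor, the fetch fails fast.
with_check = fetch_with_retry(always_fail, lambda: False, 10, 30.0, no_sleep)
print(with_check.waited_seconds)  # 0.0
```

In this sketch the liveness check runs only after a failed attempt, so a
healthy fetch pays no extra cost; how the client would actually learn that an
executor is dead (for example by asking the driver) is left open here.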



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
