[jira] [Updated] (SPARK-32091) Ignore timeout error when remove blocks on the lost executor

wuyi (Jira) Wed, 24 Jun 2020 08:14:05 -0700


     [ https://issues.apache.org/jira/browse/SPARK-32091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


wuyi updated SPARK-32091:
-------------------------
    Description: 
When removing blocks (e.g. RDD, broadcast, shuffle), BlockManagerMasterEndpoint
makes RPC calls to each known BlockManagerSlaveEndpoint to remove the
specific blocks. An RPC call can end in a timeout when the executor has been
lost but the BlockManagerMasterEndpoint is only notified of the loss after the
remove call has already been made. The timeout can therefore fail the whole
query.

In this case, we could simply ignore the error, since the blocks on the lost
executor can be considered already removed.

  was:
When removing blocks (e.g. RDD, broadcast, shuffle), BlockManagerMasterEndpoint
makes RPC calls to each known BlockManagerSlaveEndpoint to remove the
specific blocks. An RPC call can end in a timeout when the executor has been
lost but the BlockManagerSlaveEndpoint is only notified of the loss after the
remove call has already been made. The timeout can therefore fail the whole
query.

In this case, we could simply ignore the error, since the blocks on the lost
executor can be considered already removed.
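The proposed behavior can be illustrated with a small simulation. This is a minimal Python sketch under stated assumptions, not Spark's actual Scala code: a thread pool stands in for the RPC layer, and `remove_blocks`, `is_alive`, `remove_rpc` and `fake_rpc` are hypothetical names. It tolerates a timeout only when the executor is no longer alive at the moment the timeout fires, and still fails on a timeout from a live executor.

```python
import concurrent.futures as cf
import time


def remove_blocks(executors, is_alive, remove_rpc, timeout_s):
    """Ask every known executor to remove a block; return {executor: removed_count}.

    A timed-out call is tolerated only if the executor is no longer alive when
    the timeout fires: its blocks are gone with it, so it counts as 0 removed.
    A timeout from an executor that is still alive is re-raised as a failure.
    """
    results = {}
    with cf.ThreadPoolExecutor(max_workers=max(1, len(executors))) as pool:
        # Send the removal request to every executor before waiting on any reply.
        futures = {ex: pool.submit(remove_rpc, ex) for ex in executors}
        for ex, fut in futures.items():
            try:
                results[ex] = fut.result(timeout=timeout_s)
            except cf.TimeoutError:
                # Check liveness only now: the loss notification may have
                # arrived after the request was already sent.
                if is_alive(ex):
                    raise
                results[ex] = 0
    return results


# Usage (hypothetical): exec-2 is lost after its request was sent.
alive = {"exec-1", "exec-2"}


def fake_rpc(ex):
    if ex == "exec-2":
        alive.discard(ex)  # the master learns about the loss...
        time.sleep(0.5)    # ...while the reply is still outstanding
        return 99          # too late: the caller has already timed out
    return 3


print(remove_blocks(["exec-1", "exec-2"], lambda ex: ex in alive, fake_rpc, 0.1))
# prints {'exec-1': 3, 'exec-2': 0}
```

Checking liveness after the timeout, rather than before sending the request, is what covers the race described above: at send time the executor still looked alive.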


> Ignore timeout error when remove blocks on the lost executor
> ------------------------------------------------------------
>
>                 Key: SPARK-32091
>                 URL: https://issues.apache.org/jira/browse/SPARK-32091
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: wuyi
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)