[ https://issues.apache.org/jira/browse/SPARK-33158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215244#comment-17215244 ]

dzcxzl edited comment on SPARK-33158 at 10/16/20, 8:08 AM:
-----------------------------------------------------------

[SPARK-13669|https://issues.apache.org/jira/browse/SPARK-13669] / 
[SPARK-20898|https://issues.apache.org/jira/browse/SPARK-20898] provide the 
ability to add a host to the blacklist when a fetch fails, and 
[SPARK-27272|https://issues.apache.org/jira/browse/SPARK-27272] tries to enable 
this feature by default.

If we want to avoid this problem, we can set:
spark.blacklist.enabled=true
spark.blacklist.application.fetchFailure.enabled=true
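
For example, these could be set when building the session (a minimal sketch; the app name is a placeholder, only the two spark.blacklist.* keys come from above):

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch: blacklist hosts on fetch failure for the whole application.
// Only the two spark.blacklist.* keys come from the comment above;
// the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("blacklist-on-fetch-failure-example")
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.application.fetchFailure.enabled", "true")
  .getOrCreate()
{code}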

Sometimes we stop or decommission a NodeManager (NM) for a period of time. The 
NM does not guarantee that all container processes are killed when it stops, so 
a container may still be running while the NM no longer provides the shuffle 
service, which causes fetch failures. 
https://issues.apache.org/jira/browse/YARN-72?focusedCommentId=13505398&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13505398

Although spark.files.fetchFailure.unRegisterOutputOnHost can be turned on to 
unregister all shuffle output on the host, tasks may still be assigned to that 
host when the stage is rerun. Since the executor does not know whether the 
shuffle service is available, it keeps writing shuffle data to disk, and the 
next round of shuffle reads will fail again.
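
A simple way to check whether the external shuffle service is still reachable is a plain TCP connect against its port (a hedged sketch; shuffleServiceReachable is a hypothetical helper and 7337 is only the default spark.shuffle.service.port):

{code:scala}
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Hypothetical helper: returns true if a TCP connection to the external
// shuffle service can be opened within the timeout. 7337 is only the
// default value of spark.shuffle.service.port.
def shuffleServiceReachable(host: String, port: Int = 7337, timeoutMs: Int = 2000): Boolean =
  Try {
    val socket = new Socket()
    try socket.connect(new InetSocketAddress(host, port), timeoutMs)
    finally socket.close()
  }.isSuccess
{code}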



> Check whether the executor and external service connection is available
> -----------------------------------------------------------------------
>
>                 Key: SPARK-33158
>                 URL: https://issues.apache.org/jira/browse/SPARK-33158
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.1
>            Reporter: dzcxzl
>            Priority: Trivial
>
> At present, the executor only establishes a connection with the external 
> shuffle service once, at initialization, when it registers.
> In YARN, the NodeManager may stop working and the shuffle service stops with 
> it, but the container/executor process is still running, ShuffleMapTask can 
> still be executed, and the returned MapStatus still points to the address of 
> the external shuffle service.
> When the next stage reads shuffle data, it cannot connect to the shuffle 
> service, and the job eventually fails.
> The approach I thought of:
> Before a ShuffleMapTask starts to write data, check whether the connection is 
> available, or periodically test whether the connection is healthy, similar to 
> the driver and executor heartbeat check threads (see the sketch below).
>  
>  
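
A rough, self-contained sketch of the periodic variant of that idea, reusing the same TCP reachability check as above (hypothetical names throughout; not the actual executor heartbeat code):

{code:scala}
import java.net.{InetSocketAddress, Socket}
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Try

// Hypothetical illustration of "periodically test whether the connection is
// healthy": a single scheduler thread probes the external shuffle service
// port and reports when it becomes unreachable. 7337 is only the default
// spark.shuffle.service.port; this is not the actual executor code.
object ShuffleServiceProbe {
  private def reachable(host: String, port: Int, timeoutMs: Int = 2000): Boolean =
    Try {
      val socket = new Socket()
      try socket.connect(new InetSocketAddress(host, port), timeoutMs)
      finally socket.close()
    }.isSuccess

  def start(host: String, port: Int = 7337, intervalSeconds: Long = 60L): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val probe = new Runnable {
      override def run(): Unit =
        if (!reachable(host, port))
          println(s"External shuffle service at $host:$port is unreachable")
    }
    scheduler.scheduleAtFixedRate(probe, 0L, intervalSeconds, TimeUnit.SECONDS)
  }
}
{code}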


