[ https://issues.apache.org/jira/browse/SPARK-16314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356401#comment-15356401 ]

Yesha Vora commented on SPARK-16314:
------------------------------------

Thanks [~jerryshao] for the analysis.

{code}
Looking through the log, I think we're running into some RPC timeout and retry 
problems. In this scenario NM recovery is enabled:
1. We kill and restart the NM, which can hit a race condition: a container has 
been allocated and its executor is starting to connect to the external shuffle 
service; if the NM goes down at that moment, the executor fails (it cannot 
connect to the external shuffle service).
2. Once the executor exits, the driver issues RPC requests to the AM asking for 
the reason of the failure. Until the AM reports back, the failed executors are 
in a zombie state: the driver still keeps their metadata and only cleans them 
up once the AM responds. But with the NM down, the AM cannot get the failed 
container state until the RPC times out (120s), and the timed-out RPC is 
retried (waiting another 120s per attempt).
3. In the meantime, if more than 3 executors fail because of this issue, the AM 
and driver exit. If the NM is then restarted, it reports the failed containers 
to the AM, and the AM sends a RemoveExecutor message to the driver; since the 
driver has already exited, the message is never delivered, so the AM again 
waits for the timeout (120s) and retries.
These cumulative timeouts block the application from exiting and delay its 
reattempt, which is why the application appears to hang.
I think in this test we're running into that corner case. 
{code}
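
For context, here is a minimal sketch (mine, not from the report) of the configuration knobs behind the behavior described above; the 120s waits in steps 2 and 3 come from the RPC/network timeout and its retries, and the "more than 3 executors" threshold from the YARN executor-failure limit. The values shown are just the usual defaults, for illustration:

{code}
// Minimal sketch, not part of the original report: the Spark settings that
// drive the waits and the abort threshold described above.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // External shuffle service: executors register with the NM-hosted service,
  // which is why an NM restart can fail a starting executor (step 1).
  .set("spark.shuffle.service.enabled", "true")
  // spark.rpc.askTimeout falls back to this value; it is the 120s wait the
  // driver and AM hit on each ask in steps 2 and 3.
  .set("spark.network.timeout", "120s")
  // Timed-out asks are retried, so the waits accumulate.
  .set("spark.rpc.numRetries", "3")
  // The AM gives up the attempt once executor failures exceed this threshold
  // (the "more than 3 executors" in step 3); by default it is derived from
  // the requested executor count.
  .set("spark.yarn.max.executor.failures", "3")
{code}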

> Spark application got stuck when NM running executor is restarted
> -----------------------------------------------------------------
>
>                 Key: SPARK-16314
>                 URL: https://issues.apache.org/jira/browse/SPARK-16314
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Yesha Vora
>
> Spark application hangs if the NodeManager running an executor is stopped.
> * Start the LogQuery application.
> * The application starts 2 executors, each on a different node.
> * Restart one of the NodeManagers (example commands sketched below).
> The application stays stuck at 10% progress for at least 12 minutes. 
> Expected behavior: the application should either pass or fail; it should not 
> hang. 
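
For the NM restart step, something along these lines on the affected NodeManager host should work on a Hadoop 2.x layout (an assumption on my part; the exact paths and commands used in the test may differ):

{code}
# Run on the host of the NodeManager to be bounced (Hadoop 2.x layout assumed).
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
# With yarn.nodemanager.recovery.enabled=true the NM recovers its containers
# and external shuffle service state when it comes back.
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
{code}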


