[ https://issues.apache.org/jira/browse/SPARK-16314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356401#comment-15356401 ]
Yesha Vora commented on SPARK-16314:
------------------------------------

Thanks [~jerryshao] for the analysis.

{code}
Looking through the log, I think we're running into RPC timeout and retry problems. In this scenario NM recovery is enabled:

1. We kill and restart the NM. This creates a race condition: a container has been allocated and its executor is starting to connect to the external shuffle service; if the NM goes down in that window, the executor fails (it cannot connect to the external shuffle service).

2. Once an executor exits, the driver issues an RPC request asking the AM for the reason of the failure. Until the AM reports back, the failed executor is in a "zombie" state: the driver keeps its metadata and only cleans it up once the AM replies. But with the NM down, the AM cannot get the failed container's state until the RPC times out (120s), and the timed-out RPC is then retried (waiting up to another 120s).

3. Meanwhile, if more than 3 executors fail this way, the AM and driver exit. If the NM is then restarted, it reports the failed containers to the AM, and the AM sends RemoveExecutor to the driver. Since the driver has already exited, that message is never delivered; the AM waits until the timeout (120s) and retries.

These cumulative timeouts stall the application's exit and delay its reattempt, which is why the application appears to hang. I think this test hit a corner case.
{code}

> Spark application got stuck when NM running executor is restarted
> -----------------------------------------------------------------
>
>                 Key: SPARK-16314
>                 URL: https://issues.apache.org/jira/browse/SPARK-16314
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Yesha Vora
>
> The Spark application hangs if the NodeManager running an executor is stopped:
> * Start the LogQuery application. It starts 2 executors, each on a different node.
> * Restart one of the NodeManagers.
> The application stays at 10% progress for 12 minutes.
> Expected behavior: the application should either pass or fail. It should not hang.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
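The 120s RPC timeouts in the quoted analysis line up with the observed 12-minute hang. A back-of-the-envelope sketch of that arithmetic (a simplified model with assumptions noted in the comments, not Spark's actual retry logic):

```python
# Rough model of the cumulative hang described in the analysis above.
# Assumptions (simplifications, not Spark internals): each RPC waits the
# full 120 s ask timeout, is retried exactly once after timing out, and
# the waits for the three failed executors are fully serialized.

RPC_TIMEOUT_S = 120      # spark.rpc.askTimeout default (spark.network.timeout)
RETRIES = 1              # one retry after the first timeout (simplified)
FAILED_EXECUTORS = 3     # failures after which the AM/driver gives up

def worst_case_wait(timeout_s, retries, failed_executors):
    """Total serialized wait if every ask times out and is retried."""
    per_request = timeout_s * (1 + retries)
    return per_request * failed_executors

print(worst_case_wait(RPC_TIMEOUT_S, RETRIES, FAILED_EXECUTORS))  # 720
```

720 seconds is 12 minutes, matching the progress stall reported above; the real schedule of retries may differ, but the order of magnitude is set by `spark.rpc.askTimeout` and the retry count.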