[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin resolved SPARK-17022.
------------------------------------
    Resolution: Fixed
      Assignee: Tao Wang
 Fix Version/s: 2.1.0
                2.0.1

> Potential deadlock in driver handling message
> ---------------------------------------------
>
>                 Key: SPARK-17022
>                 URL: https://issues.apache.org/jira/browse/SPARK-17022
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>            Reporter: Tao Wang
>            Assignee: Tao Wang
>            Priority: Critical
>             Fix For: 2.0.1, 2.1.0
>
> Suppose t1 < t2 < t3.
> At t1, a thread calls YarnSchedulerBackend.doRequestTotalExecutors from one of three methods: CoarseGrainedSchedulerBackend.killExecutors, CoarseGrainedSchedulerBackend.requestTotalExecutors, or CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the `CoarseGrainedSchedulerBackend` lock.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors message to `yarnSchedulerEndpoint` and blocks waiting for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint`, and it is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint therefore handles RemoveExecutor before RequestExecutors.
> While handling RemoveExecutor, it sends the same message on to `driverEndpoint` and blocks waiting for the reply.
> To handle that message, `driverEndpoint` must acquire the `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> The result is a deadlock.
> We found this issue in our deployment: it blocked the driver from handling any messages until both asks timed out.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
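The blocking chain described in the report can be sketched with plain threads instead of Spark's RPC machinery. This is a minimal illustrative model, not Spark code: `backend_lock` stands in for the `CoarseGrainedSchedulerBackend` monitor, the queue for the endpoint's ordered mailbox, and all names are made up. Timeouts are added so the sketch terminates; without them, both waits would block forever, which is exactly the reported deadlock.

```python
import threading
import queue

backend_lock = threading.Lock()      # stands in for the CoarseGrainedSchedulerBackend monitor
inbox = queue.Queue()                # the endpoint's ordered mailbox (one message at a time)
request_reply = threading.Event()    # reply channel for the blocking RequestExecutors ask
remove_lock_timed_out = False        # did RemoveExecutor handling give up on the lock?

def endpoint_loop():
    """Process messages strictly in arrival order, like a single-threaded RPC endpoint."""
    global remove_lock_timed_out
    while True:
        msg = inbox.get()
        if msg == "RemoveExecutor":
            # Handling RemoveExecutor needs the backend lock, which the
            # driver thread is already holding while it waits for a reply.
            if backend_lock.acquire(timeout=1.0):
                backend_lock.release()
            else:
                remove_lock_timed_out = True   # without a timeout, this wait never ends
        elif msg == "RequestExecutors":
            request_reply.set()                # reply to the driver's blocked ask
        elif msg == "stop":
            return

inbox.put("RemoveExecutor")          # t2: queued ahead of the driver's ask
endpoint = threading.Thread(target=endpoint_loop, daemon=True)

with backend_lock:                   # t1: lock taken, as in killExecutors/requestExecutors
    endpoint.start()
    inbox.put("RequestExecutors")    # t3: lands behind RemoveExecutor in the mailbox
    replied = request_reply.wait(timeout=5.0)

inbox.put("stop")
endpoint.join()
print(replied, remove_lock_timed_out)
```

Run as written, the endpoint's lock acquisition times out first, the queued RequestExecutors is then processed, and the driver's wait succeeds; replace both timeouts with unbounded waits and the two threads block each other permanently, matching the cycle in the report.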