Shixiong Zhu created SPARK-15262:
------------------------------------

             Summary: race condition in killing an executor and reregistering 
an executor
                 Key: SPARK-15262
                 URL: https://issues.apache.org/jira/browse/SPARK-15262
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Shixiong Zhu


There is a race condition when killing an executor and reregistering an 
executor happen at the same time. Here are the execution steps to reproduce it.

1. The master finds that a worker is dead.
2. The master tells the driver to remove the executor.
3. The driver removes the executor.
4. BlockManagerMasterEndpoint removes the block manager.
5. The executor finds via heartbeat that it is not registered.
6. The executor sends a request to reregister its block manager.
7. The block manager is registered.
8. The executor is killed by the worker.
9. CoarseGrainedSchedulerBackend ignores onDisconnected because this address is 
not in the executor list.
10. BlockManagerMasterEndpoint.blockManagerInfo contains dead block managers.
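
The ordering above can be replayed in a small toy model. This is only a sketch, 
not Spark code: the Driver class, its method names, and the "exec-1" id are 
hypothetical stand-ins for the executor list kept by 
CoarseGrainedSchedulerBackend and for 
BlockManagerMasterEndpoint.blockManagerInfo. It also assumes that 
reregistration is accepted without consulting the executor list, which is what 
the steps imply.

```python
# Toy model of the race. All names here are illustrative, not Spark's API.

class Driver:
    def __init__(self):
        # Stand-in for the scheduler backend's executor list.
        self.executors = set()
        # Stand-in for BlockManagerMasterEndpoint.blockManagerInfo.
        self.block_manager_info = {}

    def register_executor(self, exec_id):
        self.executors.add(exec_id)
        self.block_manager_info[exec_id] = "registered"

    def remove_executor(self, exec_id):
        # Steps 3-4: drop the executor and its block manager.
        self.executors.discard(exec_id)
        self.block_manager_info.pop(exec_id, None)

    def heartbeat_known(self, exec_id):
        # Step 5: the heartbeat reply says whether the block manager is known.
        return exec_id in self.block_manager_info

    def register_block_manager(self, exec_id):
        # Step 7 (assumption): accepted without checking the executor list.
        self.block_manager_info[exec_id] = "registered"

    def on_disconnected(self, exec_id):
        # Step 9: ignored when the executor is not in the executor list.
        if exec_id not in self.executors:
            return False
        self.remove_executor(exec_id)
        return True


def replay_race():
    driver = Driver()
    driver.register_executor("exec-1")
    driver.remove_executor("exec-1")              # steps 1-4
    if not driver.heartbeat_known("exec-1"):      # step 5
        driver.register_block_manager("exec-1")   # steps 6-7
    handled = driver.on_disconnected("exec-1")    # steps 8-9
    return driver, handled


driver, handled = replay_race()
print("disconnect handled:", handled)
print("executors:", sorted(driver.executors))
print("blockManagerInfo:", sorted(driver.block_manager_info))
```

In this model the disconnect is ignored, so the entry added in step 7 is never 
cleaned up, which matches step 10.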

Because BlockManagerMasterEndpoint.blockManagerInfo contains some dead block 
managers, when we unpersist an RDD, remove a broadcast, or clean a shuffle block 
via an RPC endpoint of a dead block manager, we will get a 
ClosedChannelException.
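
A minimal sketch of that failure mode, again as a toy model rather than Spark 
code: Endpoint, ClosedChannelError, send_to_all, and the message string are 
hypothetical names introduced for illustration. ClosedChannelError stands in 
for java.nio.channels.ClosedChannelException.

```python
# Toy model: sending a cleanup message to every registered block manager.
# All names here are illustrative, not Spark's API.

class ClosedChannelError(Exception):
    """Stand-in for java.nio.channels.ClosedChannelException."""


class Endpoint:
    def __init__(self, alive):
        self.alive = alive

    def send(self, message):
        # A dead block manager's channel is closed, so the send fails.
        if not self.alive:
            raise ClosedChannelError(message)
        return "ok"


def send_to_all(block_manager_info, message):
    """Send message to every entry and record the outcome per block manager."""
    results = {}
    for bm_id, endpoint in block_manager_info.items():
        try:
            results[bm_id] = endpoint.send(message)
        except ClosedChannelError:
            results[bm_id] = "closed channel"
    return results


# "exec-1" models the stale entry left behind by the race.
info = {"exec-1": Endpoint(alive=False), "exec-2": Endpoint(alive=True)}
results = send_to_all(info, "remove RDD 0")
print(results)
```

The live block manager handles the message, while the stale entry fails on 
every such cleanup request for as long as it stays in the map.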



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
