GitHub user zsxwing opened a pull request:

    https://github.com/apache/spark/pull/11455

    Sync worker's state after registering with master

    ## What changes were proposed in this pull request?
    
    Here is a list of all the cases in which Master cannot talk to Worker for
    a while and then the network comes back.
    
    1. Master doesn't know about the network issue (Worker has not timed out yet)
    
      a. Worker doesn't know about the network issue (onDisconnected is not called)
        - Worker keeps sending Heartbeat. Neither Worker nor Master knows about
          the network issue, so there is nothing to do. (Eventually, Master will
          notice the heartbeat timeout if the network does not recover.)
    
      b. Worker knows about the network issue (onDisconnected is called)
        - Worker stops sending Heartbeat and sends `RegisterWorker` to Master.
          Master replies `RegisterWorkerFailed("Duplicate worker ID")` and
          Worker calls `System.exit(1)`. (Eventually, Master will notice the
          heartbeat timeout if the network does not recover.) This may leak
          driver processes; see
          [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602).
    
    2. Worker disconnects and times out (Master knows about the network issue).
       In this case, Master removes the Worker and its executors and drivers.
    
      a. Worker doesn't know about the network issue (onDisconnected is not called)
        - Worker keeps sending Heartbeat.
        - When the network comes back and Master receives a Heartbeat, Master
          sends `ReconnectWorker` to Worker.
        - Worker sends `RegisterWorker` to Master.
        - Master accepts `RegisterWorker` but doesn't know about the executors
          and drivers on the Worker. (May leak executors.)
    
      b. Worker knows about the network issue (onDisconnected is called)
        - Worker stops sending `Heartbeat` and sends `RegisterWorker` to Master.
        - Master accepts `RegisterWorker` but doesn't know about the executors
          and drivers on the Worker. (May leak executors.)
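
To make the four cases easier to compare, here is a minimal Python sketch that maps the two conditions (whether Master has timed out the Worker, and whether Worker's onDisconnected was called) to the outcome described above. The names `Outcome` and `outcome` are hypothetical and exist only for this illustration; this is not code from Spark.

```python
from enum import Enum


class Outcome(Enum):
    # Case 1.a: heartbeats continue, neither side noticed anything.
    NOTHING_TO_DO = "nothing to do"
    # Case 1.b: Master still holds the old registration and rejects the new one.
    WORKER_EXITS = "RegisterWorker rejected as duplicate; Worker exits"
    # Cases 2.a and 2.b: Master already removed the Worker's executors and drivers.
    MAY_LEAK = "Worker re-registers; Master no longer knows its executors and drivers"


def outcome(master_timed_out: bool, worker_saw_disconnect: bool) -> Outcome:
    """Map the two conditions from the list above to the described outcome."""
    if master_timed_out:
        # 2.a and 2.b differ only in how the re-registration is triggered
        # (ReconnectWorker from Master vs. Worker's own onDisconnected).
        return Outcome.MAY_LEAK
    if worker_saw_disconnect:
        return Outcome.WORKER_EXITS
    return Outcome.NOTHING_TO_DO
```

Only the `MAY_LEAK` branch, which covers 2.a and 2.b, is relevant to the executor and driver leak discussed here.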
    
    This PR fixes the executor and driver leak in 2.a and 2.b when Worker
    re-registers with Master. The approach is to make Worker send
    `WorkerSchedulerStateResponse` to sync its state after registering with
    Master successfully. Master then asks Worker to kill any executors and
    drivers it does not know about.
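
The reconciliation step can be pictured as a set difference. The following is a rough Python sketch, not the actual change in this PR: `WorkerState`, `KillRequests` and `reconcile` are hypothetical names, and the executor and driver IDs are made up. It assumes the state Worker reports after registering is just the IDs of the executors and drivers it is still running.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class WorkerState:
    """Hypothetical stand-in for the state a Worker reports after registering."""
    worker_id: str
    executor_ids: Set[str]
    driver_ids: Set[str]


@dataclass
class KillRequests:
    """Hypothetical stand-in for what Master asks the Worker to kill."""
    executor_ids: List[str]
    driver_ids: List[str]


def reconcile(known_executors: Set[str],
              known_drivers: Set[str],
              reported: WorkerState) -> KillRequests:
    """Return everything the Worker reports that Master does not know about."""
    unknown_executors = reported.executor_ids - known_executors
    unknown_drivers = reported.driver_ids - known_drivers
    # Sorted only so the result is deterministic for this illustration.
    return KillRequests(sorted(unknown_executors), sorted(unknown_drivers))


# In cases 2.a and 2.b Master has already removed the Worker's executors and
# drivers, so its known sets are empty and everything reported is unknown.
reported = WorkerState("worker-1", {"app-1/0", "app-1/1"}, {"driver-1"})
requests = reconcile(set(), set(), reported)
```

Under these assumptions, `requests` names both executors and the driver, which is the cleanup that prevents the leak.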
    
    ## How was this patch tested?
    
    This patch should not break existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zsxwing/spark orphan-executors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11455.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11455
    
----
commit 6c13702ea10973af27885c3ecaa4213f2f3f0892
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-03-02T00:19:32Z

    Sync worker's state after registering with master

----

