[GitHub] spark pull request: Block Manager - Double Register Crash

tsliwowicz Mon, 20 Oct 2014 03:13:32 -0700

GitHub user tsliwowicz opened a pull request:

    https://github.com/apache/spark/pull/2854


    Block Manager - Double Register Crash

        In long-running contexts, we encountered a double register without a 
remove in between. The cause is unknown; we assume it was a temporary 
network issue.
        
        However, since the second register arrives with a BlockManagerId on a 
different port, blockManagerInfo.contains() returns false, while 
blockManagerIdByExecutor returns Some. This inconsistency is caught in a 
conditional statement that calls System.exit(1), which is a major robustness 
issue for us.
        
        The fix: remove the old id from both maps during register when this 
happens. This mimics the behavior of expireDeadHosts() by doing a local 
cleanup of the maps before adding the new entries.
        
        Also added some logging for register and unregister.
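
    The cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: `BlockManagerRegistry`, the reduced `BlockManagerId` case class, and storing only a memory size in `blockManagerInfo` are assumptions made to keep the example self-contained. Only the two map names and the "remove the old id from both maps, then add the new one" idea come from the description.

```scala
import scala.collection.mutable

// Simplified stand-in for Spark's BlockManagerId (assumption: reduced to three fields).
final case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical, stripped-down holder for the two maps named in the description.
class BlockManagerRegistry {
  // Assumption: the value here is just the max memory size, not a full info object.
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, Long]
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  def register(id: BlockManagerId, maxMemSize: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(oldId) =>
          // Same executor, different id (for example a new port): drop the
          // stale entry from both maps instead of exiting the process.
          println(s"Replacing stale block manager $oldId with $id")
          blockManagerInfo.remove(oldId)
          blockManagerIdByExecutor.remove(id.executorId)
        case None =>
          // First registration for this executor; nothing to clean up.
      }
      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) = maxMemSize
    }
  }
}
```

    With this shape, registering the same executor twice on different ports leaves a single entry per map, pointing at the most recent id.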
    
    https://issues.apache.org/jira/browse/SPARK-4006
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/taboola/spark branch-0.9.2-block-mgr-removal

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2854.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2854
    
----
commit efd93f2026ddc427e84fa03e8a595ded2b1a81ce
Author: Tal Sliwowicz <ta...@taboola.com>
Date:   2014-10-12T08:35:20Z

    In long-running contexts, we encountered a double register without a 
remove in between. The cause is unknown; we assume it was a temporary network 
issue.
    
    However, since the second register arrives with a BlockManagerId on a 
different port, blockManagerInfo.contains() returns false, while 
blockManagerIdByExecutor returns Some. This inconsistency is caught in a 
conditional statement that calls System.exit(1), which is a major robustness 
issue for us.
    
    The fix: remove the old id from both maps during register when this 
happens. This mimics the behavior of expireDeadHosts() by doing a local 
cleanup of the maps before adding the new entries.
    
    Also added some logging for register and unregister.

commit 81d69f088e421b19e47495d06e8b187a0ec29075
Author: Tal Sliwowicz <ta...@taboola.com>
Date:   2014-10-12T08:41:53Z

    fixed comment

----

