[jira] [Created] (SPARK-11131) Worker registration protocol is racy

Marcelo Vanzin (JIRA) Thu, 15 Oct 2015 10:08:06 -0700

Marcelo Vanzin created SPARK-11131:
--------------------------------------

             Summary: Worker registration protocol is racy
                 Key: SPARK-11131
                 URL: https://issues.apache.org/jira/browse/SPARK-11131
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.0
            Reporter: Marcelo Vanzin
            Priority: Minor



I ran into this while making changes to the new RPC framework. Because the 
Worker registration protocol is based on sending unrelated messages between 
Master and Worker, it's possible for another message (e.g. caused by an app 
trying to allocate workers) to arrive at the Worker before it knows the Master 
has registered it. This triggers the following code:

{code}
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
{code}

This may or may not be made worse by SPARK-11098.

A simple workaround is to use an {{ask}} instead of a {{send}} for these 
messages. That should at least narrow the race. 
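
To make the ordering problem concrete, here is a minimal toy model of the two styles. It is only a sketch: {{ToyWorker}}, {{registerWithAsk}}, {{RegistrationRaceSketch}} and the master URL are hypothetical names made up for illustration, not Spark's actual classes or RPC API, and the "bad" delivery order is hard-coded rather than produced by a real transport. It also shows only the ordering an {{ask}} is meant to encourage; as noted above, a real {{ask}} would narrow the race rather than remove it.

```scala
import scala.collection.mutable

// Toy model only: hypothetical stand-ins, not Spark's Master/Worker classes.
sealed trait Msg
case class RegisteredWorker(masterUrl: String) extends Msg
case class LaunchExecutor(masterUrl: String, execId: Int) extends Msg

class ToyWorker {
  private var activeMasterUrl: Option[String] = None
  val log: mutable.Buffer[String] = mutable.Buffer.empty[String]

  // One-way messages land here, in whatever order the transport delivers them.
  def receive(msg: Msg): Unit = msg match {
    case RegisteredWorker(url) =>
      activeMasterUrl = Some(url)
    case LaunchExecutor(url, execId) =>
      if (!activeMasterUrl.contains(url)) {
        log += s"Invalid Master ($url) attempted to launch executor."
      } else {
        log += s"Launching executor $execId"
      }
  }

  // Ask-style registration: the confirmation is the reply to the worker's own
  // request, so it is applied before any later message is handled.
  def registerWithAsk(ask: () => RegisteredWorker): Unit =
    receive(ask())
}

object RegistrationRaceSketch {
  val masterUrl = "spark://master:7077" // assumed example URL

  // send-style: confirmation and launch request are independent messages,
  // so the launch request may be delivered first (hard-coded here).
  def sendStyle(): Seq[String] = {
    val worker = new ToyWorker
    val inFlight = Seq(LaunchExecutor(masterUrl, 0), RegisteredWorker(masterUrl))
    inFlight.foreach(worker.receive)
    worker.log.toSeq
  }

  // ask-style: registration completes as a request/reply pair before the
  // launch request is processed.
  def askStyle(): Seq[String] = {
    val worker = new ToyWorker
    worker.registerWithAsk(() => RegisteredWorker(masterUrl))
    worker.receive(LaunchExecutor(masterUrl, 0))
    worker.log.toSeq
  }

  def main(args: Array[String]): Unit = {
    println(sendStyle())
    println(askStyle())
  }
}
```

In this model the send-style run logs the "Invalid Master" warning because the launch request is handled while {{activeMasterUrl}} is still unset, while the ask-style run launches the executor normally.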

Note this is more of a problem in {{local-cluster}} mode, used a lot by unit 
tests, where Master and Worker instances are coming up as part of the app 
itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

