Colin Scott created SPARK-9256:
----------------------------------

             Summary: Message delay causes Master crash upon registering application
                 Key: SPARK-9256
                 URL: https://issues.apache.org/jira/browse/SPARK-9256
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Colin Scott
            Priority: Minor


This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM". I 
believe it can only be triggered in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.
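
To make the timing concrete, here is a minimal Python model of the retry loop described above. It is a sketch, not Spark code: `register_messages_sent`, its parameters, and the rule that a retry fires whenever the response has not arrived by the timeout are all assumptions made for illustration.

```python
def register_messages_sent(response_delay, timeout, max_attempts):
    """Count RegisterApplication messages sent before the response arrives.

    Hypothetical model: the client sends at t = 0, timeout, 2 * timeout, ...
    and stops once the response has arrived or max_attempts is reached.
    """
    sent = 0
    t = 0
    # A retry fires at time t only if the response has not arrived before t.
    while sent < max_attempts and t <= response_delay:
        sent += 1
        t += timeout
    return sent


# With a hypothetical 20-second timeout, a 5-second delay yields one
# message, while a 25-second delay yields a second (duplicate) message.
print(register_messages_sent(5, 20, 4))   # 1
print(register_messages_sent(25, 20, 4))  # 2
```

Under this model, any response delay of at least one timeout period puts a second registration request in flight for the same AppClient.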

Upon receiving the second RegisterApplication message, the Master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.
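
The failure mode can be sketched with a toy file-backed store in Python. This is only an analogue of the behavior described above, not the Spark implementation: `FileSystemStore`, `handle_register`, the `app_` file prefix, and the use of `RuntimeError` in place of IllegalStateException are all hypothetical.

```python
import os
import tempfile


class FileSystemStore:
    """Toy stand-in for a file-backed persistence engine (hypothetical)."""

    def __init__(self, directory):
        self.directory = directory

    def persist(self, name, payload):
        path = os.path.join(self.directory, name)
        try:
            # Mode "x" creates the file and fails if it already exists.
            with open(path, "x") as f:
                f.write(payload)
        except FileExistsError:
            # Python analogue of the IllegalStateException in the report.
            raise RuntimeError("Could not create file: " + path)


def handle_register(store, app_id):
    # No duplicate check: every registration message is persisted.
    store.persist("app_" + app_id, "application info")


def second_registration_fails():
    with tempfile.TemporaryDirectory() as d:
        store = FileSystemStore(d)
        handle_register(store, "demo")      # first message succeeds
        try:
            handle_register(store, "demo")  # duplicate message
        except RuntimeError:
            return True
        return False


print(second_registration_fails())  # True
```

If nothing catches the exception in the message handler, the process that owns the handler goes down with it, which matches the crash reported here.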

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.
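
One possible way to handle the scenario is to make registration idempotent per client. The Python sketch below shows the idea only; it is not the fix adopted in Spark, and the `Master` class, the `apps_by_client` map, and the id format are all invented for illustration.

```python
class Master:
    """Toy model of duplicate-safe registration (names are hypothetical)."""

    def __init__(self):
        self.apps_by_client = {}  # client address -> application id
        self.persist_calls = 0    # counts writes to the (simulated) store

    def _persist(self, app_id):
        # Stand-in for writing the ApplicationInfo to disk.
        self.persist_calls += 1

    def handle_register(self, client_address):
        existing = self.apps_by_client.get(client_address)
        if existing is not None:
            # Duplicate request: skip persistence and answer with the
            # id assigned the first time.
            return existing
        app_id = "app-%04d" % len(self.apps_by_client)
        self.apps_by_client[client_address] = app_id
        self._persist(app_id)
        return app_id


master = Master()
first = master.handle_register("client-host:1234")
second = master.handle_register("client-host:1234")  # delayed retry
print(first == second, master.persist_calls)  # True 1
```

Keying on the client address assumes one application per AppClient endpoint; a real fix would need to decide what identifies a duplicate.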

I can reproduce this bug on an old version of Spark (1.0.1), and from 
inspecting the latest version of the code it still appears possible to 
trigger. Let me know if you would like the reproduction steps for the old 
version.



