[ 
https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Scott updated SPARK-9256:
-------------------------------
    Priority: Major  (was: Minor)

> Message delay causes Master crash upon registering application
> --------------------------------------------------------------
>
>                 Key: SPARK-9256
>                 URL: https://issues.apache.org/jira/browse/SPARK-9256
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Colin Scott
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM". I 
> believe it can only be triggered in production when the AppClient and 
> Master run on different machines.
> As part of initialization, the AppClient 
> [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
>  with the Master by repeatedly sending a RegisterApplication message until it 
> receives a RegisteredApplication response.
> If the RegisteredApplication response is delayed by at least 
> REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
> RegisterApplication RPC), it is possible for the Master to receive *two* 
> RegisterApplication messages for the same AppClient.
> Upon receiving the second RegisterApplication message, the Master 
> [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
>  to persist the ApplicationInfo to disk. Since the file already exists, 
> FileSystemPersistenceEngine 
> [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
>  an IllegalStateException, and the Master crashes.
> Incidentally, it appears that there is already a 
> [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
>  in the code to handle this scenario.
> I have a reproduction of this bug on an old version of Spark (1.0.1), and 
> from inspecting the latest version of the code it appears that the bug can 
> still be triggered there. Let me know if you would like the reproduction 
> steps for the old version of Spark.
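> It should be possible to trigger this bug even if the underlying transport 
> protocol is TCP, since TCP guarantees in-order delivery only within each 
> direction of a connection and imposes no ordering between the two 
> directions. A delayed RegisteredApplication response therefore does not 
> stop a retried RegisterApplication from reaching the Master.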
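
The failure sequence described above can be modeled outside Spark. The following is a minimal Python sketch, not Spark code: `FileSystemPersistence`, `Master`, `receive_register`, and `run` are hypothetical stand-ins, and keying the duplicate check on a client address is an assumption about how a guard might identify a retried registration. It only illustrates that persisting on every RegisterApplication fails on the second message, while re-acknowledging a known client does not.

```python
import os
import tempfile


class FileSystemPersistence:
    """Toy file-backed store (hypothetical; not Spark's FileSystemPersistenceEngine)."""

    def __init__(self, directory):
        self.directory = directory

    def persist(self, name, payload):
        path = os.path.join(self.directory, name)
        # Modeled on the behavior in the report: refuse to overwrite an existing file.
        if os.path.exists(path):
            raise RuntimeError("file already exists: " + name)
        with open(path, "w") as f:
            f.write(payload)


class Master:
    """Toy master that persists application info when a client registers."""

    def __init__(self, engine, dedupe):
        self.engine = engine
        self.dedupe = dedupe
        self.registered_clients = set()

    def receive_register(self, client_address):
        # Assumed guard: a client that is already registered is simply
        # acknowledged again, without a second write to disk.
        if self.dedupe and client_address in self.registered_clients:
            return "RegisteredApplication"
        self.engine.persist("app_" + client_address, "application info")
        self.registered_clients.add(client_address)
        return "RegisteredApplication"


def run(dedupe):
    """Deliver the same registration twice, as after a delayed acknowledgement."""
    with tempfile.TemporaryDirectory() as directory:
        master = Master(FileSystemPersistence(directory), dedupe)
        replies = []
        try:
            # First message, then the retry sent after the registration timeout.
            replies.append(master.receive_register("client-1"))
            replies.append(master.receive_register("client-1"))
        except RuntimeError:
            return "crashed", replies
        return "ok", replies
```

Calling `run(dedupe=False)` models the crash on the second message, and `run(dedupe=True)` models the guarded path where both messages are acknowledged. Whether a real fix should deduplicate in the Master or make the persistence write idempotent is a design decision this sketch does not settle.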



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

