[jira] [Updated] (SPARK-9256) Message delay causes Master crash upon registering application

Colin Scott (JIRA) Wed, 22 Jul 2015 12:39:39 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Colin Scott updated SPARK-9256:
-------------------------------
    Description: 
This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM", and I 
believe it is only possible to trigger in production when the AppClient and 
Master are on different machines.

As part of initialization, the AppClient 
[registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
 with the Master by repeatedly sending a RegisterApplication message until it 
receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least 
REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the 
RegisterApplication RPC), it is possible for the Master to receive *two* 
RegisterApplication messages for the same AppClient.
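
The timing can be illustrated with a small model. This is a rough sketch, not Spark code: `simulate_registration` is a hypothetical helper, the timeout value of 20 seconds and the retry limit of 3 are assumptions chosen for illustration, and request delivery is treated as instantaneous.

```python
# Hypothetical model of the retry loop; names and values are illustrative only.
REGISTRATION_TIMEOUT_SECONDS = 20  # assumed value, for illustration


def simulate_registration(response_delay, max_retries=3):
    """Return the times at which the Master receives RegisterApplication.

    The client sends at t=0 and resends every REGISTRATION_TIMEOUT_SECONDS
    until the RegisteredApplication response (a reply to the first request)
    arrives, `response_delay` seconds after the first send.
    """
    received_by_master = []
    for attempt in range(max_retries):
        send_time = attempt * REGISTRATION_TIMEOUT_SECONDS
        if attempt > 0 and response_delay < send_time:
            break  # the response already arrived, so the client stops retrying
        received_by_master.append(send_time)
    return received_by_master
```

With a response delay shorter than the timeout the Master sees one message; once the delay reaches the timeout it sees at least two, which is the precondition for the crash described next.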

Upon receiving the second RegisterApplication message, the Master 
[attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
 to persist the ApplicationInfo to disk. Since the file already exists, 
FileSystemPersistenceEngine 
[throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
 an IllegalStateException, and the Master crashes.
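
The failure mode can be sketched as follows. This is an assumption-laden analogue rather than the actual FileSystemPersistenceEngine: `FilePersistenceSketch`, `register_twice`, the file name `app_app-0001`, and the use of Python's `RuntimeError` in place of IllegalStateException are all hypothetical.

```python
import os
import tempfile


class FilePersistenceSketch:
    """Hypothetical stand-in for a file-backed persistence engine."""

    def __init__(self, directory):
        self.directory = directory

    def persist(self, name, payload):
        path = os.path.join(self.directory, name)
        try:
            # Mode "x" creates the file and fails if it already exists.
            with open(path, "x") as f:
                f.write(payload)
        except FileExistsError:
            # Stands in for the IllegalStateException described above.
            raise RuntimeError("Could not create file: " + path)


def register_twice():
    """Persist the same application twice; return the error message, if any."""
    with tempfile.TemporaryDirectory() as d:
        engine = FilePersistenceSketch(d)
        engine.persist("app_app-0001", "ApplicationInfo")
        try:
            engine.persist("app_app-0001", "ApplicationInfo")
        except RuntimeError as e:
            return str(e)
    return None
```

The second `persist` call fails because the first one already created the file; if nothing catches that error, the process handling the message goes down with it.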

Incidentally, it appears that there is already a 
[TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
 in the code to handle this scenario.
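
One possible way to handle the scenario is to make registration idempotent by ignoring a repeat request from an already-registered sender. The sketch below is only an illustration of that idea, not the fix the TODO has in mind: `MasterSketch`, `handle_register_application`, and the address-keyed map are hypothetical names.

```python
class MasterSketch:
    """Hypothetical Master that ignores duplicate registrations."""

    def __init__(self):
        self.address_to_app = {}  # sender address -> application name
        self.persisted = []       # applications written to the recovery store

    def handle_register_application(self, sender_address, app_name):
        """Register the application once; return False for a duplicate."""
        if sender_address in self.address_to_app:
            # Duplicate request: skip persistence so nothing is written twice.
            return False
        self.address_to_app[sender_address] = app_name
        self.persisted.append(app_name)
        return True
```

This relies on the original response being delayed rather than lost; if the response could be lost, the duplicate branch would also need to resend it.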

I have a reproducing scenario for this bug on an old version of Spark (1.0.1), 
and from inspecting the latest version of the code it appears that the bug can 
still be triggered. Let me know if you would like the reproduction steps for 
the old version.

It should be possible to trigger this bug even if the underlying transport 
protocol is TCP, since TCP guarantees in-order delivery only within each 
direction of a connection, not across the two directions.

> Message delay causes Master crash upon registering application
> --------------------------------------------------------------
>
>                 Key: SPARK-9256
>                 URL: https://issues.apache.org/jira/browse/SPARK-9256
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Colin Scott
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h


