[ https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Scott closed SPARK-9256.
------------------------------
    Resolution: Invalid

> Message delay causes Master crash upon registering application
> --------------------------------------------------------------
>
>                 Key: SPARK-9256
>                 URL: https://issues.apache.org/jira/browse/SPARK-9256
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Colin Scott
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM", and
> I believe it is only possible to trigger in production when the AppClient and
> Master are on different machines.
> As part of initialization, the AppClient
> [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
> with the Master by repeatedly sending a RegisterApplication message until it
> receives a RegisteredApplication response.
> If the RegisteredApplication response is delayed by at least
> REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the
> RegisterApplication RPC), it is possible for the Master to receive *two*
> RegisterApplication messages for the same AppClient.
> Upon receiving the second RegisterApplication message, the master
> [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
> to persist the ApplicationInfo to disk. Since the file already exists,
> FileSystemPersistenceEngine
> [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
> an IllegalStateException, and the Master crashes.
> Incidentally, it appears that there is already a
> [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
> in the code to handle this scenario.
> I have a reproducing scenario for this bug on an old version of Spark
> (1.0.1), but upon inspecting the latest version of the code it appears that
> it is still possible to trigger it. Let me know if you would like reproducing
> steps for triggering it on the old version of Spark.
> It should be possible to trigger this bug even if the underlying transport
> protocol is TCP, since TCP only guarantees in-order delivery in each
> direction of the connection but not in both directions.
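
For readers following the report, the sequence it describes can be modelled in a few lines of Python. This is only a sketch of the scenario as the reporter states it, not Spark code, and the issue was closed as Invalid, so it should not be read as confirmed Master behavior. The names FilePersistence, ToyMaster, and simulate_delayed_reply are hypothetical, and the sketch assumes (as the report does) that both RegisterApplication messages from one client map to the same file name.

```python
import os
import tempfile


class FilePersistence:
    """Toy stand-in for a file-backed persistence engine (hypothetical)."""

    def __init__(self, directory):
        self.directory = directory

    def persist(self, name, payload):
        path = os.path.join(self.directory, name)
        try:
            # Mode "x" refuses to overwrite, modelling the
            # "file already exists" failure described in the report.
            with open(path, "x") as f:
                f.write(payload)
        except FileExistsError as exc:
            raise RuntimeError(f"file already exists: {name}") from exc


class ToyMaster:
    """Toy master that persists every registration it receives."""

    def __init__(self, persistence):
        self.persistence = persistence
        self.replies = []

    def on_register_application(self, client_id):
        # Assumption taken from the report: the file name depends only on
        # the client, so a repeated message collides with the first one.
        self.persistence.persist(f"app_{client_id}", "application info")
        self.replies.append(("RegisteredApplication", client_id))


def simulate_delayed_reply(directory):
    """Deliver the same registration twice, as if the reply was delayed."""
    master = ToyMaster(FilePersistence(directory))
    master.on_register_application("client-1")  # first attempt, reply delayed
    try:
        master.on_register_application("client-1")  # retry after the timeout
    except RuntimeError as err:
        return master.replies, str(err)
    return master.replies, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_delayed_reply(tmp))
```

In this model the first message is persisted and acknowledged, and the retry fails on the existing file. Whether the real Master reaches that state depends on whether a repeated registration actually reuses the same file name, which this sketch simply assumes.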