Colin Scott created SPARK-9256:
----------------------------------

             Summary: Message delay causes Master crash upon registering application
                 Key: SPARK-9256
                 URL: https://issues.apache.org/jira/browse/SPARK-9256
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Colin Scott
            Priority: Minor

This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM", and I believe it can only be triggered in production when the AppClient and Master are on different machines.

As part of initialization, the AppClient [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124] with the Master by repeatedly sending a RegisterApplication message until it receives a RegisteredApplication response.

If the RegisteredApplication response is delayed by at least REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the RegisterApplication RPC), the Master can receive *two* RegisterApplication messages for the same AppClient.

Upon receiving the second RegisterApplication message, the Master [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274] to persist the ApplicationInfo to disk. Since the file already exists, FileSystemPersistenceEngine [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59] an IllegalStateException, and the Master crashes.

Incidentally, it appears that there is already a [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266] in the code to handle this scenario.

I can reproduce this bug on an old version of Spark (1.0.1), and from inspecting the latest version of the code it appears that it can still be triggered there. Let me know if you would like steps for reproducing it on the old version of Spark.
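
For readers who want to see the failure sequence in isolation, here is a minimal toy model in Python. It is not Spark code: `ToyMaster`, `ToyPersistenceEngine`, `IllegalStateError`, the `app_` file prefix and the `dedupe` guard are all hypothetical names chosen for illustration. It also assumes, as the report describes, that the second registration maps to a file that already exists. The guard is only one possible way the duplicate could be tolerated, not a proposed patch.

```python
import os
import tempfile


class IllegalStateError(Exception):
    """Stand-in for the JVM IllegalStateException named in the report."""


class ToyPersistenceEngine:
    """Toy persistence layer that refuses to overwrite an existing file."""

    def __init__(self, directory):
        self.directory = directory

    def persist(self, name, payload):
        path = os.path.join(self.directory, name)
        if os.path.exists(path):
            raise IllegalStateError("file already exists: " + name)
        with open(path, "w") as handle:
            handle.write(payload)


class ToyMaster:
    """Toy Master that persists application info on every registration."""

    def __init__(self, engine, dedupe):
        self.engine = engine
        self.dedupe = dedupe
        self.known_clients = set()

    def on_register_application(self, client_id):
        """Handle one RegisterApplication message and return the reply."""
        if self.dedupe and client_id in self.known_clients:
            # Hypothetical guard: acknowledge again without persisting twice.
            return "RegisteredApplication"
        # Assumption: the file name depends only on the client, so a
        # retried or duplicated message maps to the same file.
        self.engine.persist("app_" + client_id, "application info")
        self.known_clients.add(client_id)
        return "RegisteredApplication"


def simulate(dedupe):
    """Deliver the same registration twice and report what happens."""
    with tempfile.TemporaryDirectory() as directory:
        master = ToyMaster(ToyPersistenceEngine(directory), dedupe)
        try:
            master.on_register_application("client-1")
            master.on_register_application("client-1")  # retry or duplicate
        except IllegalStateError:
            return "crashed"
        return "survived"
```

Calling `simulate(False)` models the reported behaviour, where the second message reaches the persistence step and the exception escapes. Calling `simulate(True)` models a Master that recognises the repeated registration and simply re-sends the acknowledgement.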