[ 
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14736:
------------------------------------

    Assignee: Apache Spark

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-14736
>                 URL: https://issues.apache.org/jira/browse/SPARK-14736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1, 1.5.0, 1.6.0
>         Environment: unix, Spark cluster with a custom 
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
>            Reporter: niranda perera
>            Assignee: Apache Spark
>            Priority: Critical
>
> I have encountered the following issue in standalone recovery mode.
> Suppose an application A is running in the cluster. Due to some issue,
> the entire cluster, together with application A, goes down.
> Later, the cluster comes back online and the master goes into the
> RECOVERING state, because the persistence engine reports apps, workers and
> drivers that were previously in the cluster. While recovery is in
> progress, the application comes back online, but now with a different
> ID, say B.
> However, under the master's application registration logic, application
> B will NOT be added to 'waitingApps'; the master only logs the message
> "Attempted to re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
>     // ... (rest of the method omitted)
>   }
> The problem is that the master is trying to recover application A, which
> no longer exists. After the recovery process, app A is therefore
> dropped. However, app A's successor, app B, was also left out of the
> 'waitingApps' list because it had the same address as app A.
> This creates a deadlock in the cluster: neither app A nor app B is
> available.
> While the master is in the RECOVERING state, shouldn't it first add all
> registering apps to a list and then, after recovery has completed
> (once the unsuccessful recoveries have been removed), deploy the apps
> that are new?
> I think this would resolve the deadlock.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
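
The deferral proposed in the description can be sketched as a small standalone model. This is not Spark's Master code and not necessarily how the issue was eventually fixed: `MasterModel`, `AppInfo`, `deferred`, and the method signatures are hypothetical stand-ins, and only the address check mirrors the excerpt quoted above. The sketch assumes a recovered app keeps its address reserved until recovery completes, and that registrations arriving in the meantime are buffered and replayed afterwards.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the Master's internals.
object RecoveryState extends Enumeration {
  val Recovering, Alive = Value
}

final case class AppInfo(id: String, driverAddress: String)

class MasterModel {
  var state: RecoveryState.Value = RecoveryState.Recovering
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]

  // Registrations that arrive while the master is still recovering.
  private val deferred = mutable.ArrayBuffer.empty[AppInfo]

  /** Load apps from the persistence engine; each holds its address until recovery ends. */
  def beginRecovery(stored: Seq[AppInfo]): Unit = {
    state = RecoveryState.Recovering
    stored.foreach(app => addressToApp(app.driverAddress) = app)
  }

  /** Returns true only if the app was placed in waitingApps right away. */
  def registerApplication(app: AppInfo): Boolean = {
    if (state == RecoveryState.Recovering) {
      // Proposed change: buffer the registration instead of rejecting it.
      deferred += app
      false
    } else if (addressToApp.contains(app.driverAddress)) {
      // Same address check as in the quoted excerpt.
      false
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
      true
    }
  }

  /** Drop recovered apps that never responded, then replay buffered registrations. */
  def completeRecovery(respondedIds: Set[String]): Unit = {
    val stale = addressToApp.values.filterNot(a => respondedIds.contains(a.id)).toList
    stale.foreach(a => addressToApp.remove(a.driverAddress))
    state = RecoveryState.Alive
    val pending = deferred.toList
    deferred.clear()
    pending.foreach(app => registerApplication(app))
  }
}
```

In this model, if app A is loaded by `beginRecovery`, app B registers from the same address during recovery, and `completeRecovery` is then called with no responding app IDs, A is dropped and B is replayed into `waitingApps`. A real change would also have to decide how to answer the driver while its registration is pending, which this sketch leaves out.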


