[jira] [Commented] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-06-11 Thread niranda perera (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326203#comment-15326203 ]

niranda perera commented on SPARK-14736:


Hi guys, 

Is there any update on this? We are seeing this deadlock quite often in our
custom recovery mode implementation.

Best

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> ---
>
>                 Key: SPARK-14736
>                 URL: https://issues.apache.org/jira/browse/SPARK-14736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.1, 1.5.0, 1.6.0
>         Environment: unix, Spark cluster with a custom
>                      StandaloneRecoveryModeFactory and a custom
>                      PersistenceEngine
>            Reporter: niranda perera
>            Priority: Critical
>
> I have encountered the following issue in the standalone recovery mode.
> Suppose an application A is running in the cluster. Due to some issue, the
> entire cluster, together with application A, goes down.
> Later, the cluster comes back online and the master enters the RECOVERING
> state, because the Persistence Engine reports apps, workers and drivers that
> were previously in the cluster. While recovery is in progress, the
> application comes back online, but now with a different ID, say B.
> However, under the master's application registration logic, application B
> will NOT be added to 'waitingApps'; the master only logs the message
> "Attempted to re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     // An app is already known at this driver address, so the request is ignored.
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
>     // ... (rest of the method omitted)
>   }
> The problem is that the master is trying to recover application A, which no
> longer exists. Therefore, after the recovery process, app A is dropped.
> However, app A's successor, app B, was also omitted from the 'waitingApps'
> list because it has the same address that app A had previously.
> This creates a deadlock in the cluster: neither app A nor app B is available
> in the cluster.
> When the master is in the RECOVERING mode, shouldn't it first add all
> registering apps to a list and then, after recovery has completed (once the
> unsuccessful recoveries are removed), deploy the apps that are new?
> I think this would resolve the deadlock.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
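
The deferral proposed in the description could look roughly like the sketch below. It is a simplified model, not Spark's actual Master code. MiniMaster, AppInfo, pendingApps, beginRecovery and completeRecovery are hypothetical names introduced only for illustration. Only addressToApp, waitingApps and registerApplication echo names from the issue.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins; these are NOT Spark's real classes.
object RecoveryState extends Enumeration {
  val Recovering, Alive = Value
}

final case class AppInfo(id: String, driverAddress: String)

class MiniMaster {
  private var state = RecoveryState.Recovering
  private val addressToApp = mutable.HashMap.empty[String, AppInfo]
  // Assumption: registrations that arrive during recovery are parked here.
  private val pendingApps = mutable.ArrayBuffer.empty[AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]

  /** Load the apps read back from the persistence engine. */
  def beginRecovery(stored: Seq[AppInfo]): Unit =
    stored.foreach { app => addressToApp(app.driverAddress) = app }

  def registerApplication(app: AppInfo): Unit = {
    if (state == RecoveryState.Recovering) {
      // Defer the decision until stale recovered apps have been removed.
      pendingApps += app
    } else if (!addressToApp.contains(app.driverAddress)) {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
    // Otherwise an app is already registered at this address; ignore the request.
  }

  /** Finish recovery; `respondedIds` holds the stored apps that answered the master. */
  def completeRecovery(respondedIds: Set[String]): Unit = {
    // Drop stored apps that never responded (app A in the description).
    val dead = addressToApp.values.filterNot(a => respondedIds.contains(a.id)).toList
    dead.foreach(a => addressToApp.remove(a.driverAddress))
    state = RecoveryState.Alive
    // Replay the deferred registrations against the cleaned-up address map.
    val buffered = pendingApps.toList
    pendingApps.clear()
    buffered.foreach(registerApplication)
  }
}
```

In this model, if app A is loaded from the persistence engine and app B then registers from the same address during recovery, B is parked in pendingApps. Once completeRecovery removes the unresponsive A, B is replayed and ends up in waitingApps instead of being silently dropped.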



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-04-19 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248693#comment-15248693 ]

Apache Spark commented on SPARK-14736:
--

User 'nirandaperera' has created a pull request for this issue:
https://github.com/apache/spark/pull/12506



