I haven't looked closely at this, but I think your proposal makes sense.
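
Without having checked it against the real Master code, here is a minimal
sketch of how I read the proposal quoted below: hold back registrations that
arrive while the master is recovering, and replay them once the unrecovered
apps have been dropped. Everything in it (MasterSketch, AppInfo,
recoverPersistedApp, the respondedAppIds parameter) is a hypothetical,
simplified stand-in and not Spark's actual API.

```scala
import scala.collection.mutable

// Hypothetical stand-in for ApplicationInfo: just an ID and a driver address.
final case class AppInfo(id: String, driverAddress: String)

// Simplified model of the master's bookkeeping; NOT the real Master class.
class MasterSketch {
  var recovering: Boolean = true
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]
  // Registrations that arrived while recovery was still in progress.
  private val deferred = mutable.ArrayBuffer.empty[AppInfo]

  // Apps read back from the persistence engine still occupy their old address.
  def recoverPersistedApp(app: AppInfo): Unit =
    addressToApp(app.driverAddress) = app

  def registerApplication(app: AppInfo): Unit = {
    if (recovering) {
      // Proposed change: do not reject yet, decide after recovery completes.
      deferred += app
    } else if (addressToApp.contains(app.driverAddress)) {
      println(s"Attempted to re-register application at same address: ${app.driverAddress}")
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  // respondedAppIds: IDs of persisted apps that answered during recovery (assumed input).
  def completeRecovery(respondedAppIds: Set[String]): Unit = {
    // Drop persisted apps that never came back, freeing their addresses.
    val dead = addressToApp.values.filterNot(a => respondedAppIds.contains(a.id)).toList
    dead.foreach(a => addressToApp.remove(a.driverAddress))
    recovering = false
    // Replay the deferred registrations through the normal path.
    val pending = deferred.toList
    deferred.clear()
    pending.foreach(registerApplication)
  }
}
```

In this model, app B registering at app A's old address during recovery is
queued rather than rejected, and it ends up in waitingApps once app A has
been dropped. Whether the real recovery path can be restructured this way is
something I have not verified.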

On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera <niranda.per...@gmail.com>
wrote:

> Hi guys,
>
> Any update on this?
>
> Best
>
> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <niranda.per...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have encountered a small issue in the standalone recovery mode.
>>
>> Let's say an application A was running in the cluster. Due to some
>> issue, the entire cluster, together with application A, goes down.
>>
>> Later on, the cluster comes back online and the master goes into the
>> 'recovering' mode, because the Persistence Engine shows that some apps,
>> workers and drivers were already in the cluster. During the recovery
>> process, the application comes back online, but now it has a different
>> ID, let's say B.
>>
>> But then, per the master's application registration logic, this
>> application B will NOT be added to 'waitingApps'; the master only logs
>> the message "Attempted to re-register application at same address". [1]
>>
>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>     val appAddress = app.driver.address
>>     // An app is already registered at this driver address: skip it.
>>     if (addressToApp.contains(appAddress)) {
>>       logInfo("Attempted to re-register application at same address: " + appAddress)
>>       return
>>     }
>>     // ... (rest of the method omitted)
>>   }
>>
>>
>> The problem here is that the master is trying to recover application A,
>> which is no longer there. Therefore, after the recovery process, app A
>> will be dropped. However, app A's successor, app B, was also omitted from
>> the 'waitingApps' list because it had the same address as app A.
>>
>> This creates a deadlock in the cluster: neither app A nor app B is
>> available in the cluster.
>>
>> When the master is in the RECOVERING mode, shouldn't it add all the
>> registering apps to a list first, and then after the recovery is completed
>> (once the unsuccessful recoveries are removed), deploy the apps which are
>> new?
>>
>> This would resolve the deadlock, IMO.
>>
>> Looking forward to hearing from you.
>>
>> best
>>
>> [1]
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>
>> --
>> Niranda
>> @n1r44 <https://twitter.com/N1R44>
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>
>
> --
> Niranda
> @n1r44 <https://twitter.com/N1R44>
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>
