[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-10-11 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-5677:
---
Affects Version/s: 2.8.0 (was: 3.0.0-alpha1)

> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch, YARN-5677.002.patch,
> YARN-5677.003.patch, YARN-5677.004.patch, YARN-5677.005.patch
>
>
> In trunk, there is no maximum number of retries that I see. It appears the
> connection will be retried forever, with the active never figuring out it's
> no longer active. In my testing, the active-active state lasted almost 2
> hours with no sign of stopping before I killed it. The solution appears to
> be to cap the number of retries or amount of time spent retrying.
>
> This issue is significant because of the asynchronous nature of job
> submission. If the active doesn't know it's not active, it will buffer up
> job submissions until it finally realizes it has become the standby. Then it
> will fail all the job submissions in bulk. In high-volume workflows, that
> behavior can create huge mass job failures.
>
> This issue is also important because the node managers will not fail over to
> the new active until the old active realizes it's the standby. Workloads
> submitted after the old active loses contact with ZK will therefore fail to
> be executed regardless of which RM the clients contact.
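
The proposed remedy in the description, capping the number of retries or the time spent retrying, can be illustrated with a minimal sketch. This is not the code from any of the attached patches, whose contents do not appear in this thread. The class `BoundedRetry`, the method `callWithBounds`, and all parameter names are hypothetical, and the operation being retried stands in for whatever call the RM makes to ZK.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an operation under both an attempt cap and a time budget. */
public class BoundedRetry {

    /** Thrown once either bound is exhausted; the cause is the last failure seen. */
    public static class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    public static <T> T callWithBounds(Callable<T> op, int maxAttempts,
                                       Duration maxElapsed, Duration pause)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        // nanoTime is monotonic, so wall-clock adjustments cannot stretch the budget.
        final long deadline = System.nanoTime() + maxElapsed.toNanos();
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                throw e; // never swallow interruption
            } catch (Exception e) {
                boolean outOfAttempts = attempt >= maxAttempts;
                boolean outOfTime = System.nanoTime() - deadline >= 0;
                if (outOfAttempts || outOfTime) {
                    // The caller can treat this as the signal to give up the active role.
                    throw new RetriesExhaustedException(
                        "gave up after " + attempt + " attempt(s)", e);
                }
            }
            Thread.sleep(pause.toMillis());
        }
    }
}
```

Checking the time budget alongside the attempt count matters when each attempt is slow: a count-only cap bounds the number of tries but not how long they take in total.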



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-10-06 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Attachment: YARN-5677.005.patch

And here's a quick update to address the checkstyle complaint.

> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch, YARN-5677.002.patch,
> YARN-5677.003.patch, YARN-5677.004.patch, YARN-5677.005.patch






[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-10-06 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Attachment: YARN-5677.004.patch

Patch to address comments.

> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch, YARN-5677.002.patch,
> YARN-5677.003.patch, YARN-5677.004.patch






[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-10-05 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Attachment: YARN-5677.003.patch

Here's a patch that adds tests.

> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch, YARN-5677.002.patch,
> YARN-5677.003.patch






[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-10-05 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Attachment: YARN-5677.002.patch

This patch addresses the race.  I was not planning to tackle the 
{{ZKRMStateStore.VerifyActiveStatusThread}} issues in this patch.  Let's work 
out the right thing to do on YARN-5694 and resolve it there.

> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch, YARN-5677.002.patch






[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-09-30 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Description: 
In trunk, there is no maximum number of retries that I see.  It appears the 
connection will be retried forever, with the active never figuring out it's no 
longer active.  In my testing, the active-active state lasted almost 2 hours 
with no sign of stopping before I killed it.  The solution appears to be to cap 
the number of retries or amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. 
 If the active doesn't know it's not active, it will buffer up job submissions 
until it finally realizes it has become the standby. Then it will fail all the 
job submissions in bulk. In high-volume workflows, that behavior can create 
huge mass job failures.

This issue is also important because the node managers will not fail over to 
the new active until the old active realizes it's the standby.  Workloads 
submitted after the old active loses contact with ZK will therefore fail to be 
executed regardless of which RM the clients contact.

  was:
Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses 
contact with the ZK node(s).

In branch-2.7, the RM will retry the connection 1000 times by default.  
Attempting to contact a node which cannot be reached is slow, which means the 
active can take over an hour to realize it is no longer active.  I clocked it 
at about an hour and a half in my tests.  The solution appears to be to add 
some time awareness into the retry loop.

In branch-2.8/trunk, there is no maximum number of retries that I see.  It 
appears the connection will be retried forever, with the active never figuring 
out it's no longer active.  In my testing, the active-active state lasted 
almost 2 hours with no sign of stopping before I killed it.  The solution 
appears to be to cap the number of retries or amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. 
 If the active doesn't know it's not active, it will buffer up job submissions 
until it finally realizes it has become the standby. Then it will fail all the 
job submissions in bulk. In high-volume workflows, that behavior can create 
huge mass job failures.

This issue is also important because the node managers will not fail over to 
the new active until the old active realizes it's the standby.  Workloads 
submitted after the old active loses contact with ZK will therefore fail to be 
executed regardless of which RM the clients contact.


> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch






[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-09-28 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Attachment: YARN-5677.001.patch

This patch fixes the issue in trunk.  I opted to be conservative and wait out 
the ZK session timeout rather than failing over immediately.  The delay extends 
the period of time that the cluster is in active-active, but it hopefully 
reduces jitter in the face of minor network disturbances.
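
The trade-off described above, waiting out the session timeout instead of stepping down on the first disconnect, can be sketched roughly as follows. This is an illustration only and not the contents of YARN-5677.001.patch: the class `DisconnectGracePeriod`, its callbacks, and the `stepDown` action are hypothetical names, and the disconnect and reconnect notifications are assumed to arrive from whatever ZK client the RM uses.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: steps down only if a disconnect outlasts the session timeout. */
public class DisconnectGracePeriod implements AutoCloseable {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    private final long sessionTimeoutMs;
    private final Runnable stepDown;
    private ScheduledFuture<?> pending; // guarded by "this"

    public DisconnectGracePeriod(long sessionTimeoutMs, Runnable stepDown) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.stepDown = stepDown;
    }

    /** Call when the connection drops: start the countdown unless one is already running. */
    public synchronized void onDisconnected() {
        if (pending == null || pending.isDone()) {
            pending = timer.schedule(stepDown, sessionTimeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    /** Call when the connection returns: a brief network blip causes no failover. */
    public synchronized void onReconnected() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

With this shape, the worst-case active-active window is bounded by the configured timeout, while disconnects shorter than the timeout are absorbed without a transition.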

I'll need to post an entirely different fix for branch-2.7.  Should I open a 
second JIRA for that?

> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.3, 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
> Attachments: YARN-5677.001.patch






[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-09-26 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Description: 
Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses 
contact with the ZK node(s).

In branch-2.7, the RM will retry the connection 1000 times by default.  
Attempting to contact a node which cannot be reached is slow, which means the 
active can take over an hour to realize it is no longer active.  I clocked it 
at about an hour and a half in my tests.  The solution appears to be to add 
some time awareness into the retry loop.

In branch-2.8/trunk, there is no maximum number of retries that I see.  It 
appears the connection will be retried forever, with the active never figuring 
out it's no longer active.  In my testing, the active-active state lasted 
almost 2 hours with no sign of stopping before I killed it.  The solution 
appears to be to cap the number of retries or amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. 
 If the active doesn't know it's not active, it will buffer up job submissions 
until it finally realizes it has become the standby. Then it will fail all the 
job submissions in bulk. In high-volume workflows, that behavior can create 
huge mass job failures.

This issue is also important because the node managers will not fail over to 
the new active until the old active realizes it's the standby.  Workloads 
submitted after the old active loses contact with ZK will therefore fail to be 
executed regardless of which RM the clients contact.

  was:
Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses 
contact with the ZK node(s).

In branch-2.7, the RM will retry the connection 1000 times by default.  
Attempting to contact a node which cannot be reached is slow, which means the 
active can take over an hour to realize it is no longer active.  I clocked it 
at about an hour and a half in my tests.  The solution appears to be to add 
some time awareness into the retry loop.

In branch-2.8/trunk, there is no maximum number of retries that I see.  It 
appears the connection will be retried forever, with the active never figuring 
out it's no longer active.  I have a test running, and I'll update this 
description with empirical findings when I'm done.  The solution appears to be 
to cap the number of retries or amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. 
 If the active doesn't know it's not active, it will buffer up job submissions 
until it finally realizes it has become the standby. Then it will fail all the 
job submissions in bulk. In high-volume workflows, that behavior can create 
huge mass job failures.

This issue is also important because the node managers will not fail over to 
the new active until the old active realizes it's the standby.  Workloads 
submitted after the old active loses contact with ZK will therefore fail to be 
executed regardless of which RM the clients contact.


> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.3, 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical

[jira] [Updated] (YARN-5677) RM can be in active-active state for an extended period

2016-09-26 Thread Daniel Templeton (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Templeton updated YARN-5677:
---
Description: 
Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses 
contact with the ZK node(s).

In branch-2.7, the RM will retry the connection 1000 times by default.  
Attempting to contact a node which cannot be reached is slow, which means the 
active can take over an hour to realize it is no longer active.  I clocked it 
at about an hour and a half in my tests.  The solution appears to be to add 
some time awareness into the retry loop.

In branch-2.8/trunk, there is no maximum number of retries that I see.  It 
appears the connection will be retried forever, with the active never figuring 
out it's no longer active.  I have a test running, and I'll update this 
description with empirical findings when I'm done.  The solution appears to be 
to cap the number of retries or amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. 
 If the active doesn't know it's not active, it will buffer up job submissions 
until it finally realizes it has become the standby. Then it will fail all the 
job submissions in bulk. In high-volume workflows, that behavior can create 
huge mass job failures.

This issue is also important because the node managers will not fail over to 
the new active until the old active realizes it's the standby.  Workloads 
submitted after the old active loses contact with ZK will therefore fail to be 
executed regardless of which RM the clients contact.

  was:
Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses 
contact with the ZK node(s).

In branch-2.7, the RM will retry the connection 1000 times by default.  
Attempting to contact a node which cannot be reached is slow, which means the 
active can take over an hour to realize it is no longer active.  I clocked it 
at about an hour and a half in my tests.  The solution appears to be to add 
some time awareness into the retry loop.

In branch-2.8/trunk, there is no maximum number of retries that I see.  It 
appears the connection will be retried forever, with the active never figuring 
out it's no longer active.  I have a test running, and I'll update this 
description with empirical findings when I'm done.  The solution appears to be 
to cap the number of retries or amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. 
 If the active doesn't know it's not active, it will buffer up job submissions 
until it finally realizes it has become the standby. Then it will fail all the 
job submissions in bulk. In high-volume workflows, that behavior can create 
huge mass job failures.


> RM can be in active-active state for an extended period
> ---
>
> Key: YARN-5677
> URL: https://issues.apache.org/jira/browse/YARN-5677
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.7.3, 3.0.0-alpha1
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical


