[jira] [Commented] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-31 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897046#comment-16897046
 ] 

Hadoop QA commented on YARN-9712:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
44s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 16s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 20s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 47s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
21s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}168m 36s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9712 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12976312/YARN-9712.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux b41bd1bf8b90 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d4ab9ae |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/24440/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24440/testReport/ |
| Max. process+thread count | 911 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 

[jira] [Commented] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-31 Thread Tarun Parimi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896947#comment-16896947
 ] 

Tarun Parimi commented on YARN-9712:


Thanks, [~Prabhu Joseph]. Yes, it looks like the branch I was checking was 
missing YARN-4593, which should prevent this issue.

> ResourceManager goes into a deadlock while transitioning to standby
> ---
>
> Key: YARN-9712
> URL: https://issues.apache.org/jira/browse/YARN-9712
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, RM
>Affects Versions: 2.9.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9712.001.patch
>
>
> We have observed the RM go into a deadlock while transitioning to standby in a 
> heavily loaded production cluster that experiences random connection loss to 
> its ZooKeeper session and also receives a large number of RMDelegationToken 
> requests due to Oozie jobs.
> On analyzing the jstack and the logs, this seems to happen when the below 
> sequence of events occurs.
> 1. The ZooKeeper session is lost, so the ActiveStandbyElector service triggers 
> transitionToStandby. Since transitionToStandby is a synchronized method, it 
> acquires a lock on the ResourceManager.
> {code:java}
> 2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. 
> Entering neutral mode and rejoining... 
> 2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager 
> (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby 
> state 
> {code}
> 2. While transitioning to standby, a java.lang.InterruptedException occurs in 
> the RMStateStore while removing/storing an RMDelegationToken, because the 
> RMSecretManagerService is stopped during the transition.
> {code:java}
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
> (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken 
> and SequenceNumber
> java.lang.InterruptedException
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore 
> (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store 
> operation failed 
> java.lang.InterruptedException 
> {code}
> 3. When the state store error occurs, an RMFatalEvent of type 
> STATE_STORE_FENCED is sent.
> {code:java}
> 2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(767)) - Received RMFatalEvent of type 
> STATE_STORE_FENCED, caused by java.lang.InterruptedException 
> {code}
> 4. The problem occurs when the RMFatalEventDispatcher calls getConfig(). This 
> also needs the lock on the ResourceManager, since it is a synchronized method. 
> This causes the rmDispatcher eventHandlingThread to become blocked.
> {code:java}
>   private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
>     @Override
>     public void handle(RMFatalEvent event) {
>       LOG.error("Received " + event);
>       if (HAUtil.isHAEnabled(getConfig())) {
>         // If we're in an HA config, the right answer is always to go into
>         // standby.
>         LOG.warn("Transitioning the resource manager to standby.");
>         handleTransitionToStandByInNewThread();
> {code}
> 5. transitionToStandby waits forever because the eventHandlingThread of the 
> rmDispatcher is blocked. This causes a deadlock, and the RM will not become 
> active until restarted (a minimal sketch of the pattern follows this list).
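> To make the lock cycle concrete, here is a minimal, self-contained sketch of 
> the same pattern (a hypothetical DeadlockSketch class, not the actual RM 
> code): one thread holds the instance monitor inside a synchronized method and 
> joins a second thread, while that second thread is blocked trying to enter 
> another synchronized method on the same instance, so the join never returns.
> {code:java}
> import java.util.concurrent.CountDownLatch;
> 
> public class DeadlockSketch {
>   private final CountDownLatch monitorHeld = new CountDownLatch(1);
>   private Thread dispatcher;
> 
>   // Stands in for ResourceManager.transitionToStandby(): holds the monitor
>   // on 'this' while joining the dispatcher thread.
>   public synchronized void transitionToStandby() throws InterruptedException {
>     monitorHeld.countDown(); // signal that the monitor is now held
>     dispatcher.join();       // waits forever: dispatcher needs this monitor
>   }
> 
>   // Stands in for the synchronized getConfig() called by the dispatcher.
>   public synchronized void getConfig() {
>   }
> 
>   public static void main(String[] args) throws InterruptedException {
>     DeadlockSketch rm = new DeadlockSketch();
>     rm.dispatcher = new Thread(() -> {
>       try {
>         rm.monitorHeld.await(); // proceed only once the monitor is taken
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         return;
>       }
>       rm.getConfig(); // blocks on the monitor held by transitionToStandby()
>     }, "rmDispatcher-sketch");
>     rm.dispatcher.start();
>     rm.transitionToStandby(); // never returns
>   }
> }
> {code}
> A jstack of this sketch shows the same shape as the traces below: one thread 
> WAITING in Thread.join() while holding the instance monitor, and the other 
> BLOCKED waiting to enter a synchronized method on the same object.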
> Below are the relevant threads in the jstack captured.
> The transitionToStandby thread that waits forever.
> {code:java}
> "main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x7fea473b2800 
> nid=0x2f411 in Object.wait() [0x7fda5bef5000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1245)
> - locked <0x7fdb6c5059a0> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1319)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
> at 
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> - locked <0x7fdb6c538ca0> (a java.lang.Object)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
> - locked <0x7fdb33e418f0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:355)
> - locked 

[jira] [Commented] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-31 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896943#comment-16896943
 ] 

Prabhu Joseph commented on YARN-9712:
-

Hi [~tarunparimi], thanks for sharing the detailed analysis. The patch looks 
good, but this issue should already be fixed by YARN-4593.


[jira] [Commented] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby

2019-07-29 Thread Tarun Parimi (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895033#comment-16895033
 ] 

Tarun Parimi commented on YARN-9712:


bq. 2. While transitioning to standby, a java.lang.InterruptedException occurs 
in RMStateStore while removing/storing RMDelegationToken. This is because 
RMSecretManagerService will be stopped while transitioning to standby.
Looks like this scenario can be prevented with the fix in YARN-6647.
