[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259320#comment-14259320
 ] 

Varun Saxena commented on YARN-2936:


bq. Maybe a simply way is to do this.
Hmm...Sounds better. Will make the change.

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch
>
>
> After YARN-2743, the setters were removed from YARNDelegationTokenIdentifier, 
> so that when constructing an object which extends 
> YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, 
> when we call getProto() on it, we just get an empty proto object.
> It seems to do no harm to the production code path, as we always call 
> getBytes() when generating the password, before using proto to persist the 
> DT in the state store.
> I think the setters were removed to avoid setting the fields a second time 
> when getBytes() is called. However, YARNDelegationTokenIdentifier doesn't 
> work properly alone. It is tightly coupled with the logic in secretManager, 
> and it is vulnerable if something changes in secretManager. For example, in 
> the test case of YARN-2837, I spent time figuring out that we need to 
> execute getBytes() first to make sure the testing DTs can be properly put 
> into the state store.
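
The coupling described above can be illustrated with a minimal sketch. It does not use the Hadoop classes: LazyTokenIdentifier and its map-based "builder" are hypothetical stand-ins, and the only assumption carried over from the report is that the builder is filled in as a side effect of getBytes() rather than at construction time.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a token identifier; NOT the Hadoop class.
class LazyTokenIdentifier {
    private final String owner;
    private final int sequenceNumber;
    // Plays the role of proto.builder: stays empty until fields are copied in.
    private final Map<String, Object> builder = new LinkedHashMap<>();

    LazyTokenIdentifier(String owner, int sequenceNumber) {
        this.owner = owner;
        this.sequenceNumber = sequenceNumber;
        // The builder is deliberately not populated here (no setters are called).
    }

    // Serialization is the only place where the builder gets filled.
    byte[] getBytes() {
        builder.put("owner", owner);
        builder.put("sequenceNumber", sequenceNumber);
        return builder.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns a snapshot of whatever the builder currently holds.
    Map<String, Object> getProto() {
        return new LinkedHashMap<>(builder);
    }
}

class LazyTokenDemo {
    public static void main(String[] args) {
        LazyTokenIdentifier id = new LazyTokenIdentifier("alice", 7);
        System.out.println("before getBytes(): " + id.getProto());
        id.getBytes();
        System.out.println("after getBytes():  " + id.getProto());
    }
}
```

Any caller that reads getProto() without first calling getBytes() sees an empty object, which is the ordering trap the description mentions for the YARN-2837 test case.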



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2936:
---
Attachment: YARN-2936.002.patch

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch


[jira] [Updated] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2936:
---
Attachment: (was: YARN-2936.002.patch)

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch


[jira] [Updated] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2936:
---
Attachment: YARN-2936.002.patch

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch


[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259342#comment-14259342
 ] 

Hadoop QA commented on YARN-2936:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12689215/YARN-2936.002.patch
  against trunk revision 1454efe.

{color:green}+1 @author{color}.  The patch does not contain any @author tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6198//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6198//console

This message is automatically generated.

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch


[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259347#comment-14259347
 ] 

Hudson commented on YARN-2992:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #54 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/54/])
YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik 
Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java


> ZKRMStateStore crashes due to session expiry
> 
>
> Key: YARN-2992
> URL: https://issues.apache.org/jira/browse/YARN-2992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session 
> expiry, we should gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired 
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) 
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) 
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958) 
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687) 
> {noformat}
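
The behaviour requested in the description (step down on session expiry instead of treating it as fatal) might look roughly like the sketch below. Every type here is hypothetical and self-contained: SessionExpiredError stands in for the ZooKeeper session-expired exception, and StoreFailureHandler is not the ResourceManager's actual event handler. It only shows the decision, not how the committed patch implements it.

```java
// Hypothetical types for illustration only; none are Hadoop or ZooKeeper classes.
class SessionExpiredError extends Exception {
    SessionExpiredError(String message) {
        super(message);
    }
}

enum HaState { ACTIVE, STANDBY, SHUTDOWN }

class StoreFailureHandler {
    private HaState state = HaState.ACTIVE;

    HaState state() {
        return state;
    }

    // Decide how to react to a failed state-store operation.
    void onStoreOperationFailed(Exception cause) {
        if (cause instanceof SessionExpiredError) {
            // Session expiry: give up the active role instead of crashing.
            state = HaState.STANDBY;
        } else {
            // Any other store failure is still treated as fatal in this sketch.
            state = HaState.SHUTDOWN;
        }
    }
}

class StoreFailureDemo {
    public static void main(String[] args) {
        StoreFailureHandler handler = new StoreFailureHandler();
        handler.onStoreOperationFailed(new SessionExpiredError("Session expired"));
        System.out.println("state after session expiry: " + handler.state());
    }
}
```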





[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259348#comment-14259348
 ] 

Hudson commented on YARN-2993:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #54 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/54/])
YARN-2993. Several fixes (missing acl check, error log msg ...) and some 
refinement in AdminService. (Contributed by Yi Liu) (junping_du: rev 
40ee4bff65b2bfdabfd16ee7d9be3382a0476565)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/CHANGES.txt


> Several fixes (missing acl check, error log msg ...) and some refinement in 
> AdminService
> 
>
> Key: YARN-2993
> URL: https://issues.apache.org/jira/browse/YARN-2993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: YARN-2993.001.patch
>
>
> This JIRA is to resolve the following issues in 
> {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}:
> *1.* There is no ACL check for {{refreshServiceAcls}}.
> *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be 
> "... Can not refresh Admin ACLs." instead of "... Can not refresh user-groups."
> *3.* Some unnecessary imports.
> *4.* {code}
> if (!isRMActive()) {
>   RMAuditLogger.logFailure(user.getShortUserName(), argName,
>       adminAcl.toString(), "AdminService",
>       "ResourceManager is not active. Can not remove labels.");
>   throwStandbyException();
> }
> {code}
> is common to lots of methods; only the message differs. We should factor 
> it out into one common method.
> *5.* {code}
> LOG.info("Exception remove labels", ioe);
> RMAuditLogger.logFailure(user.getShortUserName(), argName,
>     adminAcl.toString(), "AdminService", "Exception remove label");
> throw RPCUtil.getRemoteException(ioe);
> {code}
> is common to lots of methods; only the message differs. We should factor 
> it out into one common method.
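
Items 4 and 5 amount to extracting two helpers. The sketch below shows one possible shape of that refactoring. It is self-contained and hypothetical: AdminServiceSketch, checkRMStatus and logAndWrap are illustrative names, a plain list replaces RMAuditLogger, and IOException stands in for both the standby exception and the remote exception. It is not the code from the attached patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for an admin service; all names are illustrative.
class AdminServiceSketch {
    private final boolean active;
    // Stands in for the audit logger so the sketch needs no external classes.
    final List<String> auditLog = new ArrayList<>();

    AdminServiceSketch(boolean active) {
        this.active = active;
    }

    // Item 4: one shared active-state check; only the action phrase varies per caller.
    private void checkRMStatus(String user, String operation, String action)
            throws IOException {
        if (!active) {
            String msg = "ResourceManager is not active. Can not " + action + ".";
            auditLog.add("FAILURE user=" + user + " op=" + operation + " msg=" + msg);
            throw new IOException(msg);
        }
    }

    // Item 5: one shared way to record a failure and wrap it for the caller.
    private IOException logAndWrap(IOException cause, String user,
            String operation, String msg) {
        auditLog.add("FAILURE user=" + user + " op=" + operation + " msg=" + msg);
        return new IOException(msg, cause);
    }

    void removeLabels(String user) throws IOException {
        checkRMStatus(user, "removeLabels", "remove labels");
        try {
            doRemoveLabels();
        } catch (IOException ioe) {
            throw logAndWrap(ioe, user, "removeLabels", "Exception remove label");
        }
    }

    void refreshAdminAcls(String user) throws IOException {
        checkRMStatus(user, "refreshAdminAcls", "refresh Admin ACLs");
        // The actual refresh work is omitted in this sketch.
    }

    private void doRemoveLabels() throws IOException {
        // Placeholder for the real work; it does nothing here.
    }
}
```

With the message built in one place, a wrong phrase such as the one reported in item 2 would have to be passed in explicitly by the caller, which makes it easier to spot in review.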





[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259354#comment-14259354
 ] 

Hudson commented on YARN-2993:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #788 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/788/])
YARN-2993. Several fixes (missing acl check, error log msg ...) and some 
refinement in AdminService. (Contributed by Yi Liu) (junping_du: rev 
40ee4bff65b2bfdabfd16ee7d9be3382a0476565)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java


> Several fixes (missing acl check, error log msg ...) and some refinement in 
> AdminService
> 
>
> Key: YARN-2993
> URL: https://issues.apache.org/jira/browse/YARN-2993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: YARN-2993.001.patch


[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259353#comment-14259353
 ] 

Hudson commented on YARN-2992:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #788 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/788/])
YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik 
Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt


> ZKRMStateStore crashes due to session expiry
> 
>
> Key: YARN-2992
> URL: https://issues.apache.org/jira/browse/YARN-2992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: yarn-2992-1.patch


[jira] [Updated] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately

2014-12-27 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2958:
---
Summary: RMStateStore seems to unnecessarily and wrongly store sequence 
number separately  (was: RMStateStore seems to unnecessarily and wronly store 
sequence number separately)

> RMStateStore seems to unnecessarily and wrongly store sequence number 
> separately
> 
>
> Key: YARN-2958
> URL: https://issues.apache.org/jira/browse/YARN-2958
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
>Priority: Blocker
> Attachments: YARN-2958.001.patch, YARN-2958.002.patch, 
> YARN-2958.003.patch
>
>
> It seems that RMStateStore updates the last sequence number when storing or 
> updating each individual DT, in order to recover the latest sequence number 
> when the RM restarts.
> First, the current logic seems to be problematic:
> {code}
>   public synchronized void updateRMDelegationTokenAndSequenceNumber(
>       RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate,
>       int latestSequenceNumber) {
>     if (isFencedState()) {
>       LOG.info("State store is in Fenced state. Can't update RM Delegation Token.");
>       return;
>     }
>     try {
>       updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier,
>           renewDate, latestSequenceNumber);
>     } catch (Exception e) {
>       notifyStoreOperationFailed(e);
>     }
>   }
> {code}
> {code}
>   @Override
>   protected void updateStoredToken(RMDelegationTokenIdentifier id,
>       long renewDate) {
>     try {
>       LOG.info("updating RMDelegation token with sequence number: "
>           + id.getSequenceNumber());
>       rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id,
>           renewDate, id.getSequenceNumber());
>     } catch (Exception e) {
>       LOG.error("Error in updating persisted RMDelegationToken with sequence number: "
>           + id.getSequenceNumber());
>       ExitUtil.terminate(1, e);
>     }
>   }
> {code}
> According to the code above, even when renewing a DT, the last sequence 
> number is updated in the store, which is wrong. For example, take the 
> following sequence:
> 1. Get DT 1 (seq = 1)
> 2. Get DT 2 (seq = 2)
> 3. Renew DT 1 (seq = 1)
> 4. Restart RM
> The stored and then recovered last sequence number is 1. As a result, the 
> next DT created after the RM restarts would conflict with DT 2 on sequence 
> number.
> Second, the aforementioned bug doesn't actually happen, because the 
> recovered last sequence number is overwritten by the correct one.
> {code}
>   public void recover(RMState rmState) throws Exception {
>     LOG.info("recovering RMDelegationTokenSecretManager.");
>     // recover RMDTMasterKeys
>     for (DelegationKey dtKey : rmState.getRMDTSecretManagerState()
>         .getMasterKeyState()) {
>       addKey(dtKey);
>     }
>     // recover RMDelegationTokens
>     Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens =
>         rmState.getRMDTSecretManagerState().getTokenState();
>     this.delegationTokenSequenceNumber =
>         rmState.getRMDTSecretManagerState().getDTSequenceNumber();
>     for (Map.Entry<RMDelegationTokenIdentifier, Long> entry :
>         rmDelegationTokens.entrySet()) {
>       addPersistedDelegationToken(entry.getKey(), entry.getValue());
>     }
>   }
> {code}
> The code above recovers delegationTokenSequenceNumber by reading the last 
> sequence number in the store, which could be wrong. Fortunately, 
> delegationTokenSequenceNumber is then updated to the right number:
> {code}
>     if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) {
>       setDelegationTokenSeqNum(identifier.getSequenceNumber());
>     }
> {code}
> All the stored identifiers are iterated over, and 
> delegationTokenSequenceNumber is set to the largest sequence number among 
> them. Therefore, a new DT is always assigned a sequence number larger than 
> that of every recovered DT.
> To sum up, two negatives make a positive, but it's good to fix the issue. 
> Please let me know if I've missed something here.
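
The four-step scenario above can be replayed with a small, self-contained simulation. The Store class and both methods are hypothetical simplifications, not the RMStateStore or secret manager code. They only model two things taken from the description: the last sequence number is overwritten on every store or update (including renewals), and recovery raises the counter to the largest sequence number among the stored tokens.

```java
import java.util.ArrayList;
import java.util.List;

class SequenceRecoveryDemo {
    // Hypothetical in-memory store; NOT the RMStateStore API.
    static final class Store {
        final List<Integer> tokenSeqNums = new ArrayList<>();
        int lastSeqNum;

        // Called for both new tokens and renewals, as in the description.
        void storeOrUpdateToken(int seq) {
            if (!tokenSeqNums.contains(seq)) {
                tokenSeqNums.add(seq);
            }
            lastSeqNum = seq; // overwritten even when an old token is renewed
        }
    }

    // Start from the (possibly stale) stored value, then take the maximum
    // over all recovered tokens, mirroring the compare-and-set in the report.
    static int recover(Store store) {
        int seq = store.lastSeqNum;
        for (int stored : store.tokenSeqNums) {
            if (stored > seq) {
                seq = stored;
            }
        }
        return seq;
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.storeOrUpdateToken(1); // 1. get DT 1
        store.storeOrUpdateToken(2); // 2. get DT 2
        store.storeOrUpdateToken(1); // 3. renew DT 1
        // 4. restart RM: compare the stored value with the recovered one
        System.out.println("stored last sequence number: " + store.lastSeqNum);
        System.out.println("recovered sequence number:   " + recover(store));
    }
}
```

In this model the separately stored value never influences the result as long as at least one token is recovered, which is consistent with the issue title calling the separate storage unnecessary.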





[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259368#comment-14259368
 ] 

Hudson commented on YARN-2992:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1986 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1986/])
YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik 
Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt


> ZKRMStateStore crashes due to session expiry
> 
>
> Key: YARN-2992
> URL: https://issues.apache.org/jira/browse/YARN-2992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: yarn-2992-1.patch


[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259369#comment-14259369
 ] 

Hudson commented on YARN-2993:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1986 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1986/])
YARN-2993. Several fixes (missing acl check, error log msg ...) and some 
refinement in AdminService. (Contributed by Yi Liu) (junping_du: rev 
40ee4bff65b2bfdabfd16ee7d9be3382a0476565)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java


> Several fixes (missing acl check, error log msg ...) and some refinement in 
> AdminService
> 
>
> Key: YARN-2993
> URL: https://issues.apache.org/jira/browse/YARN-2993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: YARN-2993.001.patch


[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259373#comment-14259373
 ] 

Hudson commented on YARN-2992:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #51 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/51/])
YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik 
Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt


> ZKRMStateStore crashes due to session expiry
> 
>
> Key: YARN-2992
> URL: https://issues.apache.org/jira/browse/YARN-2992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session 
> expiry, we should gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
> {noformat}
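
The behaviour requested in the description (a session expiry should lead to a transition to standby rather than a crash) could be sketched roughly as below. This is only an illustration and not the attached patch: StateStoreFailureSketch, Reaction and the local SessionExpiredException are hypothetical stand-ins, and the real ResourceManager and ZooKeeper classes are deliberately not used.

```java
// Hypothetical sketch: choose a reaction to a failed state-store operation.
// None of these names come from Hadoop or ZooKeeper.
class StateStoreFailureSketch {

    /** What the RM could do after a STATE_STORE_OP_FAILED event. */
    enum Reaction { TRANSITION_TO_STANDBY, SHUTDOWN }

    /** Stand-in for org.apache.zookeeper.KeeperException.SessionExpiredException. */
    static class SessionExpiredException extends Exception {
        SessionExpiredException() {
            super("KeeperErrorCode = Session expired");
        }
    }

    /**
     * A session expiry is treated as recoverable (give up the active role),
     * while any other cause keeps the existing fatal handling.
     */
    static Reaction onStateStoreOpFailed(Exception cause) {
        if (cause instanceof SessionExpiredException) {
            return Reaction.TRANSITION_TO_STANDBY;
        }
        return Reaction.SHUTDOWN;
    }
}
```

The point of the sketch is only the classification step: the cause of the failure is inspected before deciding whether the process has to exit.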



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259374#comment-14259374
 ] 

Hudson commented on YARN-2993:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #51 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/51/])
YARN-2993. Several fixes (missing acl check, error log msg ...) and some 
refinement in AdminService. (Contributed by Yi Liu) (junping_du: rev 
40ee4bff65b2bfdabfd16ee7d9be3382a0476565)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/CHANGES.txt


> Several fixes (missing acl check, error log msg ...) and some refinement in 
> AdminService
> 
>
> Key: YARN-2993
> URL: https://issues.apache.org/jira/browse/YARN-2993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: YARN-2993.001.patch
>
>
> This JIRA is to resolve the following issues in
> {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}:
> *1.* There is no ACL check for {{refreshServiceAcls}}.
> *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can
> not refresh Admin ACLs." instead of "... Can not refresh user-groups."
> *3.* There are some unnecessary imports.
> *4.* {code}
> if (!isRMActive()) {
>   RMAuditLogger.logFailure(user.getShortUserName(), argName,
>       adminAcl.toString(), "AdminService",
>       "ResourceManager is not active. Can not remove labels.");
>   throwStandbyException();
> }
> {code}
> is common to many methods, and only the message differs; we should refactor
> it into one common method.
> *5.* {code}
> LOG.info("Exception remove labels", ioe);
> RMAuditLogger.logFailure(user.getShortUserName(), argName,
>     adminAcl.toString(), "AdminService", "Exception remove label");
> throw RPCUtil.getRemoteException(ioe);
> {code}
> is likewise common to many methods, and only the message differs; we should
> refactor it into one common method.
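
Items 4 and 5 of the description could be factored out along the following lines. This is a minimal, self-contained sketch and not the attached patch: the helper names checkRMStatus and logAndWrapException, the StandbySketchException class and the logFailure stand-in are all assumptions made for illustration, and the real RMAuditLogger and RPCUtil are replaced by simple placeholders.

```java
import java.io.IOException;

// Hypothetical sketch of the two shared helpers suggested in items 4 and 5.
class AdminServiceSketch {

    /** Stand-in for the audit log: remembers the last failure message. */
    static String lastAuditMessage;

    /** Placeholder for RMAuditLogger.logFailure(...). */
    static void logFailure(String user, String operation, String message) {
        lastAuditMessage = user + " " + operation + ": " + message;
    }

    /** Placeholder for the exception raised when the RM is in standby. */
    static class StandbySketchException extends IOException {
        StandbySketchException(String message) {
            super(message);
        }
    }

    private final boolean rmActive;

    AdminServiceSketch(boolean rmActive) {
        this.rmActive = rmActive;
    }

    /** Item 4: one shared "is the RM active?" check; only msg varies per caller. */
    void checkRMStatus(String user, String argName, String msg)
            throws StandbySketchException {
        if (!rmActive) {
            logFailure(user, argName, "ResourceManager is not active. Can not " + msg);
            throw new StandbySketchException("ResourceManager is not active");
        }
    }

    /** Item 5: one shared "audit and wrap" step; only msg varies per caller. */
    IOException logAndWrapException(IOException ioe, String user,
            String argName, String msg) {
        logFailure(user, argName, "Exception " + msg);
        return new IOException(msg, ioe);
    }
}
```

A caller would then start with something like checkRMStatus(user, argName, "remove labels.") and, in its catch block, rethrow the result of logAndWrapException(ioe, user, argName, "remove label").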





[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259382#comment-14259382
 ] 

Hudson commented on YARN-2993:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #55 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/55/])
YARN-2993. Several fixes (missing acl check, error log msg ...) and some 
refinement in AdminService. (Contributed by Yi Liu) (junping_du: rev 
40ee4bff65b2bfdabfd16ee7d9be3382a0476565)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java


> Several fixes (missing acl check, error log msg ...) and some refinement in 
> AdminService
> 
>
> Key: YARN-2993
> URL: https://issues.apache.org/jira/browse/YARN-2993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: YARN-2993.001.patch
>
>
> This JIRA is to resolve the following issues in
> {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}:
> *1.* There is no ACL check for {{refreshServiceAcls}}.
> *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can
> not refresh Admin ACLs." instead of "... Can not refresh user-groups."
> *3.* There are some unnecessary imports.
> *4.* {code}
> if (!isRMActive()) {
>   RMAuditLogger.logFailure(user.getShortUserName(), argName,
>       adminAcl.toString(), "AdminService",
>       "ResourceManager is not active. Can not remove labels.");
>   throwStandbyException();
> }
> {code}
> is common to many methods, and only the message differs; we should refactor
> it into one common method.
> *5.* {code}
> LOG.info("Exception remove labels", ioe);
> RMAuditLogger.logFailure(user.getShortUserName(), argName,
>     adminAcl.toString(), "AdminService", "Exception remove label");
> throw RPCUtil.getRemoteException(ioe);
> {code}
> is likewise common to many methods, and only the message differs; we should
> refactor it into one common method.





[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259381#comment-14259381
 ] 

Hudson commented on YARN-2992:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #55 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/55/])
YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik 
Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* hadoop-yarn-project/CHANGES.txt


> ZKRMStateStore crashes due to session expiry
> 
>
> Key: YARN-2992
> URL: https://issues.apache.org/jira/browse/YARN-2992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session 
> expiry, we should gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
> {noformat}





[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259386#comment-14259386
 ] 

Hudson commented on YARN-2993:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2005 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2005/])
YARN-2993. Several fixes (missing acl check, error log msg ...) and some 
refinement in AdminService. (Contributed by Yi Liu) (junping_du: rev 
40ee4bff65b2bfdabfd16ee7d9be3382a0476565)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/CHANGES.txt


> Several fixes (missing acl check, error log msg ...) and some refinement in 
> AdminService
> 
>
> Key: YARN-2993
> URL: https://issues.apache.org/jira/browse/YARN-2993
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: YARN-2993.001.patch
>
>
> This JIRA is to resolve the following issues in
> {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}:
> *1.* There is no ACL check for {{refreshServiceAcls}}.
> *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can
> not refresh Admin ACLs." instead of "... Can not refresh user-groups."
> *3.* There are some unnecessary imports.
> *4.* {code}
> if (!isRMActive()) {
>   RMAuditLogger.logFailure(user.getShortUserName(), argName,
>       adminAcl.toString(), "AdminService",
>       "ResourceManager is not active. Can not remove labels.");
>   throwStandbyException();
> }
> {code}
> is common to many methods, and only the message differs; we should refactor
> it into one common method.
> *5.* {code}
> LOG.info("Exception remove labels", ioe);
> RMAuditLogger.logFailure(user.getShortUserName(), argName,
>     adminAcl.toString(), "AdminService", "Exception remove label");
> throw RPCUtil.getRemoteException(ioe);
> {code}
> is likewise common to many methods, and only the message differs; we should
> refactor it into one common method.





[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry

2014-12-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259385#comment-14259385
 ] 

Hudson commented on YARN-2992:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2005 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2005/])
YARN-2992. ZKRMStateStore crashes due to session expiry. Contributed by Karthik 
Kambatla (jianhe: rev 1454efe5d4fe4214ec5ef9142d55dbeca7dab953)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java


> ZKRMStateStore crashes due to session expiry
> 
>
> Key: YARN-2992
> URL: https://issues.apache.org/jira/browse/YARN-2992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.7.0
>
> Attachments: yarn-2992-1.patch
>
>
> We recently saw the RM crash with the following stacktrace. On session 
> expiry, we should gracefully transition to standby. 
> {noformat}
> 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958)
>   at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687)
> {noformat}





[jira] [Commented] (YARN-2978) Null pointer in YarnProtos

2014-12-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259395#comment-14259395
 ] 

Varun Saxena commented on YARN-2978:


[~jtufo], thanks for raising the issue. May I know the scenario?
Were you trying to get queue info for the same queue from multiple clients?
Also, which scheduler was configured?

> Null pointer in YarnProtos
> --
>
> Key: YARN-2978
> URL: https://issues.apache.org/jira/browse/YARN-2978
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.5.1
>Reporter: Jason Tufo
>Assignee: Varun Saxena
>
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625)
>   at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939)
>   at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290)
>   at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157)
>   at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128)
>   at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104)
>   at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111)
>   at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53)
>   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235)
>   at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)





[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259545#comment-14259545
 ] 

Jian He commented on YARN-2936:
---

Thanks, looks good. Mind writing a unit test? TestYARNTokenIdentifier.java has
some existing tests.

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch
>
>
> After YARN-2743, the setters were removed from YARNDelegationTokenIdentifier,
> so that when constructing an object which extends
> YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on,
> when we call getProto() on it, we just get an empty proto object.
> This seems to do no harm on the production code path, as we always call
> getBytes() before using proto to persist the DT in the state store, when we
> are generating the password.
> I think the setters were removed to avoid setting the fields twice when
> getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work
> properly alone. YARNDelegationTokenIdentifier is tightly coupled with the
> logic in secretManager. It's vulnerable if something is changed in
> secretManager. For example, in the test case of YARN-2837, I spent time
> figuring out that we need to execute getBytes() first to make sure the testing
> DTs can be properly put into the state store.
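
The coupling described above can be reproduced in miniature with the sketch below. It does not use the real YARNDelegationTokenIdentifier or protobuf classes: LazyIdentifier, EagerIdentifier and the map standing in for proto.builder are hypothetical, and populating the builder in the constructor is just one possible remedy, not necessarily what the attached patches do.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical miniature of the problem: the "builder" is only filled by getBytes().
class TokenIdentifierSketch {

    /** Mirrors the reported behaviour: getProto() is empty until getBytes() runs. */
    static class LazyIdentifier {
        private final String owner;
        private final int sequenceNumber;
        // Stand-in for proto.builder: a plain field map.
        private final Map<String, String> builder = new LinkedHashMap<>();

        LazyIdentifier(String owner, int sequenceNumber) {
            this.owner = owner;
            this.sequenceNumber = sequenceNumber;
        }

        /** Copies the Java fields into the builder stand-in. */
        final void copyFieldsToBuilder() {
            builder.put("owner", owner);
            builder.put("sequenceNumber", Integer.toString(sequenceNumber));
        }

        /** Serializing has the side effect of populating the builder. */
        byte[] getBytes() {
            copyFieldsToBuilder();
            return builder.toString().getBytes(StandardCharsets.UTF_8);
        }

        /** Returns a snapshot of whatever the builder currently holds. */
        Map<String, String> getProto() {
            return new LinkedHashMap<>(builder);
        }
    }

    /** One possible remedy: populate the builder at construction time. */
    static class EagerIdentifier extends LazyIdentifier {
        EagerIdentifier(String owner, int sequenceNumber) {
            super(owner, sequenceNumber);
            copyFieldsToBuilder();
        }
    }
}
```

With LazyIdentifier, a test that stores getProto() without first calling getBytes() persists an empty record, which is the trap hit in the YARN-2837 test case; EagerIdentifier works on its own, independent of the order of calls in the secret manager.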





[jira] [Updated] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-2936:
---
Attachment: YARN-2936.003.patch

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch, 
> YARN-2936.003.patch
>
>
> After YARN-2743, the setters were removed from YARNDelegationTokenIdentifier,
> so that when constructing an object which extends
> YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on,
> when we call getProto() on it, we just get an empty proto object.
> This seems to do no harm on the production code path, as we always call
> getBytes() before using proto to persist the DT in the state store, when we
> are generating the password.
> I think the setters were removed to avoid setting the fields twice when
> getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work
> properly alone. YARNDelegationTokenIdentifier is tightly coupled with the
> logic in secretManager. It's vulnerable if something is changed in
> secretManager. For example, in the test case of YARN-2837, I spent time
> figuring out that we need to execute getBytes() first to make sure the testing
> DTs can be properly put into the state store.





[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259582#comment-14259582
 ] 

Hadoop QA commented on YARN-2936:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12689248/YARN-2936.003.patch
  against trunk revision 1454efe.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/6199//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6199//console

This message is automatically generated.

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch, 
> YARN-2936.003.patch
>
>
> After YARN-2743, the setters were removed from YARNDelegationTokenIdentifier,
> so that when constructing an object which extends
> YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on,
> when we call getProto() on it, we just get an empty proto object.
> This seems to do no harm on the production code path, as we always call
> getBytes() before using proto to persist the DT in the state store, when we
> are generating the password.
> I think the setters were removed to avoid setting the fields twice when
> getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work
> properly alone. YARNDelegationTokenIdentifier is tightly coupled with the
> logic in secretManager. It's vulnerable if something is changed in
> secretManager. For example, in the test case of YARN-2837, I spent time
> figuring out that we need to execute getBytes() first to make sure the testing
> DTs can be properly put into the state store.





[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now

2014-12-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259585#comment-14259585
 ] 

Varun Saxena commented on YARN-2936:


Weird. How is it -1 overall when it's +1 for everything?

> YARNDelegationTokenIdentifier doesn't set proto.builder now
> ---
>
> Key: YARN-2936
> URL: https://issues.apache.org/jira/browse/YARN-2936
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
>Assignee: Varun Saxena
> Attachments: YARN-2936.001.patch, YARN-2936.002.patch, 
> YARN-2936.003.patch
>
>
> After YARN-2743, the setters were removed from YARNDelegationTokenIdentifier,
> so that when constructing an object which extends
> YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on,
> when we call getProto() on it, we just get an empty proto object.
> This seems to do no harm on the production code path, as we always call
> getBytes() before using proto to persist the DT in the state store, when we
> are generating the password.
> I think the setters were removed to avoid setting the fields twice when
> getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work
> properly alone. YARNDelegationTokenIdentifier is tightly coupled with the
> logic in secretManager. It's vulnerable if something is changed in
> secretManager. For example, in the test case of YARN-2837, I spent time
> figuring out that we need to execute getBytes() first to make sure the testing
> DTs can be properly put into the state store.


