[jira] [Created] (YARN-1851) Unable to parse launch time from job history file
Fengdong Yu created YARN-1851: - Summary: Unable to parse launch time from job history file Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a '-' in the queue name 'test-queue', we split the job history file name by '-' and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
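As a minimal, self-contained illustration of the failure described above (this is not the Hadoop source; only the file-name layout and the JOB_START_TIME_INDEX constant come from the report):
{code}
public class JhistSplitDemo {
  // Index constant quoted from FileNameIndexUtils in the report above.
  private static final int JOB_START_TIME_INDEX = 9;

  public static void main(String[] args) {
    String name = "job_1395204058904_0003-1395206473646-root-test_one_word"
        + "-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist";
    String[] parts = name.split("-");
    // The '-' inside the queue name "root.test-queue" shifts every later
    // field by one, so parts[9] is "queue" instead of the launch time.
    System.out.println(parts[JOB_START_TIME_INDEX]); // prints "queue"
    // Parsing it as a number reproduces the logged exception.
    Long.parseLong(parts[JOB_START_TIME_INDEX]);     // NumberFormatException
  }
}
{code}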
[jira] [Updated] (YARN-1851) Unable to parse launch time from job history file
[ https://issues.apache.org/jira/browse/YARN-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated YARN-1851: -- Description: When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. was: When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a - in the queue name 'test-queue', we split the job history file name by - and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. Unable to parse launch time from job history file - Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1850) Make enabling timeline service configurable
[ https://issues.apache.org/jira/browse/YARN-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1850: -- Attachment: YARN-1850.1.patch Created a patch that can disable the timeline service: the timeline client won't put entities and events to the timeline server. I set the default to true so as not to disturb users who have already played with this feature. I've tested the patch locally: when the timeline service is disabled, the DS client won't put any data to the timeline server. Make enabling timeline service configurable Key: YARN-1850 URL: https://issues.apache.org/jira/browse/YARN-1850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1850.1.patch Like the generic history service, we'd better make enabling the timeline service configurable, in case the timeline server is not up -- This message was sent by Atlassian JIRA (v6.2#6252)
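A sketch of how a client could consult such a flag before publishing. The property name "yarn.timeline-service.enabled" is an assumption based on YARN's configuration naming convention, not confirmed by the patch; per the comment above, the default is true:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineFlagCheck {
  public static boolean timelineEnabled() {
    Configuration conf = new YarnConfiguration();
    // Assumed property name; defaulting to true so existing timeline
    // users are not disturbed, as the comment above describes.
    return conf.getBoolean("yarn.timeline-service.enabled", true);
  }
  // Callers would create a TimelineClient and put entities/events only
  // when this returns true.
}
{code}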
[jira] [Commented] (YARN-1850) Make enabling timeline service configurable
[ https://issues.apache.org/jira/browse/YARN-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940261#comment-13940261 ] Hadoop QA commented on YARN-1850: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635497/YARN-1850.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.TestRMFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3396//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3396//console This message is automatically generated. Make enabling timeline service configurable Key: YARN-1850 URL: https://issues.apache.org/jira/browse/YARN-1850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1850.1.patch Like the generic history service, we'd better make enabling the timeline service configurable, in case the timeline server is not up -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
Rohith created YARN-1852: Summary: Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs Key: YARN-1852 URL: https://issues.apache.org/jira/browse/YARN-1852 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Rohith Assignee: Rohith Priority: Minor Recovering a failed/killed application throws InvalidStateTransitonException. These are logged during recovery of applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
[ https://issues.apache.org/jira/browse/YARN-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940394#comment-13940394 ] Rohith commented on YARN-1852: -- Here is the exception stack trace. For a killed application, state=KILLED:
{noformat}
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1394526371652_0004 with 1 attempts and final state = KILLED
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=Application Finished - Killed TARGET=RMAppManager RESULT=SUCCESS APPID=application_1394526371652_0003
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1394526371652_0004_01 with final state: KILLED
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1394526371652_0003,name=Sleep job,user=root,queue=default,state=KILLED,trackingUrl=host-10-18-40-77:45020/cluster/app/application_1394526371652_0003,appMasterHost=N/A,startTime=1394526759247,finishTime=1394527194947,finalStatus=KILLED
2014-03-19 14:26:11,619 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1394526371652_0004_01
2014-03-19 14:26:11,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1394526371652_0004_01 State change from NEW to KILLED
2014-03-19 14:26:11,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1394526371652_0004 State change from NEW to KILLED
2014-03-19 14:26:11,619 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at KILLED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:632)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:82)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:690)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:674)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:662)
{noformat}
For a failed application, state=FAILED:
{noformat}
2014-03-19 14:26:11,614 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1394528000856_0003 with 2 attempts and final state = FAILED
2014-03-19 14:26:11,614 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1395139734891_0003,name=Sleep job,user=root,queue=d,state=FINISHED,trackingUrl=http://host-10-18-40-77:45020/proxy/application_1395139734891_0003/jobhistory/job/job_1395139734891_0003,appMasterHost=N/A,startTime=1395141914653,finishTime=1395141933121,finalStatus=SUCCEEDED
2014-03-19 14:26:11,614 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1394528000856_0003_01 with final state: FAILED
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1394528000856_0003_02 with final state: FAILED
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1394528000856_0003_01 State change from NEW to FAILED
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1394528000856_0003_02
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1394528000856_0003_02 State change from NEW to FAILED
2014-03-19 14:26:11,616 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1394528000856_0003 State change from NEW to FAILED
2014-03-19 14:26:11,616 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_FAILED at FAILED
{noformat}
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940400#comment-13940400 ] Hudson commented on YARN-1690: -- FAILURE: Integrated in Hadoop-Yarn-trunk #514 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/514/]) YARN-1690. Made DistributedShell send timeline entities+events. Contributed by Mayank Bansal. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.4.0 Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1705) Reset cluster-metrics on transition to standby
[ https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940399#comment-13940399 ] Hudson commented on YARN-1705: -- FAILURE: Integrated in Hadoop-Yarn-trunk #514 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/514/]) YARN-1705. Reset cluster-metrics on transition to standby. (Rohith via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579014) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Reset cluster-metrics on transition to standby -- Key: YARN-1705 URL: https://issues.apache.org/jira/browse/YARN-1705 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: YARN-1705.1.patch, YARN-1705.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1705) Reset cluster-metrics on transition to standby
[ https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940496#comment-13940496 ] Hudson commented on YARN-1705: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1706 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1706/]) YARN-1705. Reset cluster-metrics on transition to standby. (Rohith via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579014) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Reset cluster-metrics on transition to standby -- Key: YARN-1705 URL: https://issues.apache.org/jira/browse/YARN-1705 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: YARN-1705.1.patch, YARN-1705.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940497#comment-13940497 ] Hudson commented on YARN-1690: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1706 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1706/]) YARN-1690. Made DistributedShell send timeline entities+events. Contributed by Mayank Bansal. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.4.0 Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940538#comment-13940538 ] Hudson commented on YARN-1690: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1731 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1731/]) YARN-1690. Made DistributedShell send timeline entities+events. Contributed by Mayank Bansal. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.4.0 Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1705) Reset cluster-metrics on transition to standby
[ https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940537#comment-13940537 ] Hudson commented on YARN-1705: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1731 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1731/]) YARN-1705. Reset cluster-metrics on transition to standby. (Rohith via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579014) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Reset cluster-metrics on transition to standby -- Key: YARN-1705 URL: https://issues.apache.org/jira/browse/YARN-1705 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: YARN-1705.1.patch, YARN-1705.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-1833: -- Fix Version/s: 2.4.0 TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed: {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails when the sizes of groupWithInit and groupBefore are the same. I do not think we need this assert here. Moreover, we are also checking that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940543#comment-13940543 ] Jonathan Eagles commented on YARN-1833: --- Added this test-only fix to the 2.4.0 release since it is really hindering my testing efforts on that line. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed: {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails when the sizes of groupWithInit and groupBefore are the same. I do not think we need this assert here. Moreover, we are also checking that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
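A hedged sketch of the disjointness check the description says already covers this case (variable names follow the issue description; the actual test code may differ):
{code}
// Sufficient without the size assert: no pre-refresh user group may
// appear in the freshly initialized default mapping.
for (String group : groupBefore) {
  Assert.assertFalse("initial groups should not contain user group " + group,
      groupWithInit.contains(group));
}
{code}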
[jira] [Created] (YARN-1853) Allow containers to be ran under real user even in insecure mode
Andrey Stepachev created YARN-1853: -- Summary: Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Andrey Stepachev Currently an insecure cluster runs all containers under one user (typically 'nobody'). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled: YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1853) Allow containers to be ran under real user even in insecure mode
[ https://issues.apache.org/jira/browse/YARN-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1853: --- Attachment: YARN-1853.patch My proposal is to use the parameter 'yarn.nodemanager.linux-container-executor.nonsecure-mode.impersonate' (defaulting to true), which will control whether YARN impersonates the container user in insecure mode or runs the container under the real user. Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Andrey Stepachev Attachments: YARN-1853.patch Currently an insecure cluster runs all containers under one user (typically 'nobody'). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled: YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
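An illustrative sketch of how the proposed flag could pick the run-as user; the property name is taken from the comment, while the surrounding code and the variables 'nonsecureLocalUser' and 'appSubmitter' are hypothetical, not names from the attached patch:
{code}
// Default true preserves today's behavior (all containers impersonated
// as the fixed nonsecure user, e.g. "nobody").
boolean impersonate = conf.getBoolean(
    "yarn.nodemanager.linux-container-executor.nonsecure-mode.impersonate",
    true);
// When impersonation is off, run the container as the submitting user.
String runAsUser = impersonate ? nonsecureLocalUser : appSubmitter;
{code}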
[jira] [Updated] (YARN-1853) Allow containers to be ran under real user even in insecure mode
[ https://issues.apache.org/jira/browse/YARN-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1853: --- Affects Version/s: 2.2.0 Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Andrey Stepachev Attachments: YARN-1853.patch Currently an insecure cluster runs all containers under one user (typically 'nobody'). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled: YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1854) TestRMHA#testStartAndTransitions Fails
Mit Desai created YARN-1854: --- Summary: TestRMHA#testStartAndTransitions Fails Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-1855: - Summary: TestRMFailover#testRMWebAppRedirect fails in trunk (was: TestRMFailover#testRMWebAppRedirect fails occasionally in trunk) TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940625#comment-13940625 ] Ted Yu commented on YARN-1855: -- I tried this:
{code}
Index: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
===================================================================
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java (revision 1579270)
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java (working copy)
@@ -286,7 +286,8 @@
     try {
       Map<String, List<String>> map = new URL(url).openConnection().getHeaderFields();
-      fieldHeader = map.get(field).get(0);
+      List<String> lst = map.get(field);
+      if (lst != null) fieldHeader = lst.get(0);
     } catch (Exception e) {
       // throw new RuntimeException(e);
     }
{code}
However, the next assertion fails:
{code}
header = getHeader("Refresh", rm2Url + "/ws/v1/cluster/apps");
assertTrue(header.contains("; url=" + rm1Url));
{code}
header was null. TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1854: -- Priority: Blocker (was: Major) Target Version/s: 2.4.0 TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Priority: Blocker
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails occasionally in trunk
Ted Yu created YARN-1855: Summary: TestRMFailover#testRMWebAppRedirect fails occasionally in trunk Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940644#comment-13940644 ] Zhijie Shen commented on YARN-1855: --- I can reproduce the test failure locally as well, and Jenkins reported it too: https://builds.apache.org/job/PreCommit-YARN-Build/3396//testReport/org.apache.hadoop.yarn.client/TestRMFailover/testRMWebAppRedirect/ TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940609#comment-13940609 ] Karthik Kambatla commented on YARN-1854: YARN-1705 introduced this check - I ran it multiple times while committing it, and it succeeded. [~mitdesai] - are you able to reproduce this deterministically? TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
[ https://issues.apache.org/jira/browse/YARN-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1852: -- Priority: Major (was: Minor) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs - Key: YARN-1852 URL: https://issues.apache.org/jira/browse/YARN-1852 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Rohith Assignee: Rohith Recovering a failed/killed application throws InvalidStateTransitonException. These are logged during recovery of applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1855: -- Priority: Critical (was: Major) TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940685#comment-13940685 ] Karthik Kambatla commented on YARN-1855: [~cindyli] - will you be able to take a look at this? Otherwise, I can jump on it tomorrow. TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940707#comment-13940707 ] Mit Desai commented on YARN-1854: - I got that failing in our nightly builds. When I tested it on my local machine, I got the same error. But now when I try testing it again, I get the following error intermittently:
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 1.755 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric appsPending expected:<1> but was:<0>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:384)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:154)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:154->verifyClusterMetrics:384->assertMetric:396 Incorrect value for metric appsPending expected:<1> but was:<0>
{noformat}
TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Priority: Blocker
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1842) InvalidApplicationMasterRequestException raised during AM-requested shutdown
[ https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940705#comment-13940705 ] Janos Matyas commented on YARN-1842: Hi, this seems to be an issue on OS/X and Debian only. We have just tried it on CentOS (for an automatic Hoya install on CentOS, feel free to use this script - https://github.com/sequenceiq/hadoop-docker/blob/master/hoya-centos-install.sh) and it works fine launching HBase containers. We have also tried our custom Apache Flume provider (https://github.com/sequenceiq/hoya) and it works well - launching and stopping containers as expected. A quick note: on Debian and OS/X there are different exceptions depending on whether you launch the containers using the IP address or localhost (hoya create hbase --role master 1 --role worker 1 --manager localhost:8032 --filesystem hdfs://localhost:9000 --image hdfs://localhost:9000/hbase.tar.gz --appconf file:///tmp/hoya-master/hoya-core/src/main/resources/org/apache/hoya/providers/hbase/conf --zkhosts localhost) InvalidApplicationMasterRequestException raised during AM-requested shutdown Key: YARN-1842 URL: https://issues.apache.org/jira/browse/YARN-1842 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Steve Loughran Priority: Minor Attachments: hoyalogs.tar.gz Report of the RM raising a stack trace [https://gist.github.com/matyix/9596735] during AM-initiated shutdown. The AM could just swallow this and exit, but it could be a sign of a race condition YARN-side, or maybe just in the RM client code/AM dual signalling the shutdown. I haven't replicated this myself; maybe the stack will help track down the problem. Otherwise: what is the policy YARN apps should adopt for AMs handling errors on shutdown? Go straight to an exit(-1)? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940730#comment-13940730 ] Karthik Kambatla commented on YARN-1854: Thanks Mit. It is likely a race in the test. [~rohithsharma] - will you be able to look into this? Otherwise, I'll be able to jump on it tomorrow. TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Priority: Blocker
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
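If the failure is indeed an async-dispatcher race, one common way to make such metric assertions race-free is to poll instead of asserting immediately after the transition. A sketch using Hadoop's GenericTestUtils and Guava's Supplier; the getAppsPending() accessor is a hypothetical stand-in for however the test reads the metric:
{code}
// Poll the metric until the dispatcher has drained, rather than
// asserting right after the RM transition.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return metrics.getAppsPending() == 1; // hypothetical accessor
  }
}, 100, 5000); // check every 100 ms, time out after 5 s
{code}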
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940746#comment-13940746 ] Hadoop QA commented on YARN-1849: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635573/yarn-1849-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3397//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3397//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1850) Make enabling timeline service configurable
[ https://issues.apache.org/jira/browse/YARN-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940646#comment-13940646 ] Zhijie Shen commented on YARN-1850: --- The test failure is not related. See YARN-1855. Make enabling timeline service configurable Key: YARN-1850 URL: https://issues.apache.org/jira/browse/YARN-1850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1850.1.patch Like the generic history service, we'd better make enabling the timeline service configurable, in case the timeline server is not up -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-1.patch NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1851) Unable to parse launch time from job history file
[ https://issues.apache.org/jira/browse/YARN-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA reassigned YARN-1851: --- Assignee: Akira AJISAKA Unable to parse launch time from job history file - Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1855: -- Target Version/s: 2.4.0 I believe this also affects 2.4. TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1851) Unable to parse launch time from job history file
[ https://issues.apache.org/jira/browse/YARN-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940779#comment-13940779 ] Akira AJISAKA commented on YARN-1851: - I looked around the code and found that the user name and the job name are escaped, but the queue name is not. I'll create a patch to escape the queue name shortly. Unable to parse launch time from job history file - Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
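A sketch of the escaping idea: percent-encoding the '-' delimiter is an assumption modeled on how the other fields are already escaped, and the exact helper used by the eventual patch may differ:
{code}
// Encode the delimiter when composing the history file name...
String escapedQueue = queueName.replace("-", "%2D");
// ...and decode the field back when parsing the file name.
String decodedQueue = escapedQueue.replace("%2D", "-");
{code}
With the queue escaped this way, splitting on '-' yields a stable field count, so JOB_START_TIME_INDEX points at the launch time again.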
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940640#comment-13940640 ] Karthik Kambatla commented on YARN-1849: This time around, it turns out the master container is null:
{code}
if (rmAppAttempt != null) {
  if (rmAppAttempt.getMasterContainer().getId()
      .equals(containerStatus.getContainerId())
      && containerStatus.getState() == ContainerState.COMPLETE) {
{code}
Looks like it is not necessary for an UnmanagedAM to have a master container. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
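A minimal sketch of the guard this implies (an assumption about the fix, not the attached patch): treat a missing master container like the other ignorable nulls.
{code}
if (rmAppAttempt != null
    && rmAppAttempt.getMasterContainer() != null // UnmanagedAM may have none
    && rmAppAttempt.getMasterContainer().getId()
        .equals(containerStatus.getContainerId())
    && containerStatus.getState() == ContainerState.COMPLETE) {
  // handle the completed master container
}
{code}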
[jira] [Commented] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
[ https://issues.apache.org/jira/browse/YARN-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940822#comment-13940822 ] Jian He commented on YARN-1852: --- This is most likely because we are replaying the attempt's BaseFinalTransition logic, which sends a new FAILED/KILLED event while the RMApp has already moved to the FAILED/KILLED state. We covered the case for the FINISHED state, but it seems we missed this one. Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs - Key: YARN-1852 URL: https://issues.apache.org/jira/browse/YARN-1852 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Rohith Assignee: Rohith Recovering a failed/killed application throws InvalidStateTransitonException. These are logged during recovery of applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
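One way such a transition table is usually made tolerant, sketched with YARN's StateMachineFactory API; the eventual patch may fix this differently (for example, by not re-sending the event during recovery at all):
{code}
// Let a recovered app swallow the replayed attempt event in its
// terminal state instead of throwing InvalidStateTransitonException.
.addTransition(RMAppState.FAILED, RMAppState.FAILED,
    EnumSet.of(RMAppEventType.ATTEMPT_FAILED))
.addTransition(RMAppState.KILLED, RMAppState.KILLED,
    EnumSet.of(RMAppEventType.ATTEMPT_KILLED))
{code}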
[jira] [Created] (YARN-1856) cgroups based memory monitoring for containers
Karthik Kambatla created YARN-1856: -- Summary: cgroups based memory monitoring for containers Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1747) Better physical memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940832#comment-13940832 ] Karthik Kambatla commented on YARN-1747: Re-purposing this JIRA to use cgroups for memory monitoring and assigning to myself. Better physical memory monitoring for containers Key: YARN-1747 URL: https://issues.apache.org/jira/browse/YARN-1747 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla YARN currently uses RSS to compute the physical memory being used by a container. This can lead to issues, as noticed in HDFS-5957. -- This message was sent by Atlassian JIRA (v6.2#6252)
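An illustrative sketch of what cgroups-based monitoring could read; the cgroup mount path, hierarchy name, and the containerId variable are assumptions, and the eventual patch may go through the NM's cgroups handler instead:
{code}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read usage straight from the container's memory cgroup rather than
// summing RSS from /proc, which over-counts pages shared between processes.
long usageBytes = Long.parseLong(Files.readAllLines(
    Paths.get("/sys/fs/cgroup/memory/hadoop-yarn/" + containerId
        + "/memory.usage_in_bytes"), StandardCharsets.UTF_8).get(0).trim());
{code}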
[jira] [Assigned] (YARN-1747) Better physical memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1747: -- Assignee: Karthik Kambatla Better physical memory monitoring for containers Key: YARN-1747 URL: https://issues.apache.org/jira/browse/YARN-1747 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla YARN currently uses RSS to compute the physical memory being used by a container. This can lead to issues, as noticed in HDFS-5957. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1855: -- Assignee: Cindy Li TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Cindy Li Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console : {code} testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover) Time elapsed: 5.39 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-2.patch Thinking more, thought we could benefit from better logging for the various null cases even if we are ignoring all of them. The new patch does that and factors handling the ContainerStatus to a different method. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-2.patch Cosmetic import fix. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940879#comment-13940879 ] Mayank Bansal commented on YARN-1809: - Thanks [~zjshen] for the patch. Here are some comments: 1. Rename ApplicationInformationProtocol to something like ApplicationBaseProtocol. 2. Why can't we move the delegation-token-related APIs to the base protocol? 3. ApplicationHistoryClientService - why are we removing the protocol handler? I think we should keep it as it was. 4. I am not sure why we removed ApplicationContext; I think it should be retained. Wouldn't it be good to have the following structure: ApplicationContext derives from ApplicationBaseProtocol? Thoughts? 5. There is a lot of refactoring in the patch, which is good, but we could have separated it into two JIRAs to keep each change focused on a specific issue. Thoughts? Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch After YARN-953, the web-UI of the generic history service provides more information than that of the RM: the details about app attempts and containers. It's good to provide similar web-UIs that retrieve the data from separate sources, i.e., the RM cache and the history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940936#comment-13940936 ] Hadoop QA commented on YARN-1849: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635622/yarn-1849-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1492 javac compiler warnings (more than the trunk's current 1491 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3398//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3398//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3398//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
Thomas Graves created YARN-1857: --- Summary: CapacityScheduler headroom doesn't account for other AM's running Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves It's possible to get an application to hang forever (or for a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit but doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit (100%) - user consumed) can be 0 even though the cluster is 100% full, because the remaining space is being used by ApplicationMasters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their ApplicationMasters started but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is based only on the user limit (which is 100% of the cluster capacity): it's using, let's say, 95% of the cluster for reduces, and the other 5% is being used by other users running ApplicationMasters. The MRAppMaster thinks it still has 5% headroom, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could make the application take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
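To make the arithmetic concrete, a small worked example with hypothetical numbers matching the scenario above:
{code}
// Cluster of 100 units, one queue, user limit = 100% of capacity.
int userLimit    = 100; // the limit the headroom calculation is based on
int userConsumed = 95;  // user 1's reduces
int otherAMs     = 5;   // other users' ApplicationMasters

int headroomSent = userLimit - userConsumed;            // 5: what the AM is told
int actuallyFree = userLimit - userConsumed - otherAMs; // 0: what is really left
// Seeing a headroom of 5, the MRAppMaster never kills a reduce to run the
// remaining maps, even though nothing is actually free.
{code}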
[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1857: Description: Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. was: Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit (100%) - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. 
The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-3.patch Test failure is unrelated. New patch fixes javac warning. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940984#comment-13940984 ] Vinod Kumar Vavilapalli commented on YARN-1857: --- This is just one of the items tracked at YARN-1198. Will convert it as a sub-task. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1857: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1198 CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940997#comment-13940997 ] Vinod Kumar Vavilapalli commented on YARN-1856: --- Duplicate of YARN-3? cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941008#comment-13941008 ] Hadoop QA commented on YARN-1849: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635635/yarn-1849-3.patch against trunk revision . {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3400//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941023#comment-13941023 ] Karthik Kambatla commented on YARN-1856: As discussed on YARN-3, using cgroups for memory isolation/enforcement can be problematic as it enforces an upper-bound on the amount of memory tasks can consume and hence doesn't tolerate any momentary spikes. Using it for monitoring, however, would help address YARN-1747. I haven't yet looked at the cgroups-related source closely enough. Can post an update once I do that. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941024#comment-13941024 ] Karthik Kambatla commented on YARN-1856: bq. When we use cgroups, we don't need (or want) explicit monitoring. If we set the limits much higher than what we want to enforce, we can use them for monitoring instead. The goal, again, is not to enforce. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
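A minimal sketch of what such monitoring could look like, assuming a memory cgroup mounted at the conventional /sys/fs/cgroup/memory path; the per-container group name is hypothetical and this is not the eventual patch:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CgroupMemorySampler {
  // Read the current memory usage of a container's cgroup, in bytes.
  // With the limit set far above the enforcement target, this file can be
  // polled for monitoring without the kernel ever killing the container.
  public static long usageBytes(String containerGroup) throws IOException {
    String path = "/sys/fs/cgroup/memory/" + containerGroup
        + "/memory.usage_in_bytes";
    return Long.parseLong(Files.readAllLines(
        Paths.get(path), StandardCharsets.UTF_8).get(0).trim());
  }
}
{code}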
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941018#comment-13941018 ] Vinod Kumar Vavilapalli commented on YARN-1856: --- When we use cgroups, we don't need (or want) explicit monitoring. Cgroups are going to constrain the memory usage of the process (and its tree) if the right values are set when creating the group. There were some discussions about this on YARN-3 and related JIRAs. In essence, the ContainersMonitor is really a monitor to be used only when such an OS feature is not available to properly constrain memory usage. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941037#comment-13941037 ] Jian He commented on YARN-1849: --- Hi, I want to take a look at the patch, can you wait for some time ? I'll do it today. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941007#comment-13941007 ] Alejandro Abdelnur commented on YARN-1849: -- +1 pending jenkins. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941011#comment-13941011 ] Karthik Kambatla commented on YARN-1856: Nope. YARN-3, IIUC, is just for CPU. Also, we don't want to enforce memory through cgroups - this is just for monitoring. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941004#comment-13941004 ] Karthik Kambatla commented on YARN-1849: Tested the newest patch on a secure cluster with UAM and RM HA. Failover works fine. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941062#comment-13941062 ] Hadoop QA commented on YARN-1849: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635635/yarn-1849-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3399//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3399//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1815) RM should recover only Managed AMs
[ https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941073#comment-13941073 ] Karthik Kambatla commented on YARN-1815: Even if the UAM finishes successfully, there is no way for the RM to know. At least, not until YARN-556. Today, the RM tries to recover the app, but can't recover the UAM. The corresponding RMApp transitions to FAILED after a while. This JIRA only avoids those recovery attempts and marks the app FAILED early. RM should recover only Managed AMs -- Key: YARN-1815 URL: https://issues.apache.org/jira/browse/YARN-1815 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: Unmanaged AM recovery.png, yarn-1815-1.patch, yarn-1815-2.patch, yarn-1815-2.patch RM should not recover unmanaged AMs until YARN-1823 is fixed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1051: --- Attachment: techreport.pdf Attaching an updated tech report which states more clearly what we intend to achieve, presents results from our P-o-C, and aligns with the design doc on how we propose to implement this in YARN. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-1051-design.pdf, curino_MSR-TR-2013-108.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, workflows, and helps for gang scheduling. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1775: -- Priority: Major (was: Minor) Fix Version/s: (was: 2.5.0) Started looking at the patch. But first up, please don't use fix-version to specify your intention. Target-version is what you should use, fix-version is set by committers at the time of commit. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941094#comment-13941094 ] Vinod Kumar Vavilapalli commented on YARN-1775: --- Some comments: - This new class is just an extension of ProcfsBasedProcessTree, so is better served by using the same util class (from the admin's point of view) but with an additional configuration option to better track RSS - Can you explain why we are doing this {code} total += Math.min(info.sharedDirty, info.pss) + info.privateDirty + info.privateClean; {code} Test - Most of the test-code is duplicating testProcfsBasedProcess. Can you avoid that? - Reuse at least some of MemoryMappingInfo, ProcessMemInfo etc from the regular code instead of duplicating in the test? Lots of white space in the patch, mostly empty lines. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
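For context on what such a process tree reads, a rough sketch of summing PSS from /proc/<pid>/smaps; this is illustrative only, and the patch's actual accounting (as the quoted line shows) combines private and shared-dirty pages differently:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SmapsPssReader {
  // Sum the Pss lines of /proc/<pid>/smaps. PSS charges shared pages to each
  // process proportionally, avoiding the double counting RSS suffers from.
  public static long pssKb(int pid) throws IOException {
    long total = 0;
    for (String line : Files.readAllLines(
        Paths.get("/proc/" + pid + "/smaps"), StandardCharsets.UTF_8)) {
      if (line.startsWith("Pss:")) {          // e.g. "Pss:        1024 kB"
        total += Long.parseLong(line.replaceAll("\\D", ""));
      }
    }
    return total;
  }
}
{code}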
[jira] [Commented] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941120#comment-13941120 ] Chris Nauroth commented on YARN-1775: - [~rajesh.balamohan], thank you for explaining the testing. These results sound very promising! Also interesting would be confirming that containers still get killed for exceeding the limit with private/non-shared pages. I wonder then if counting RSS still has some potential advantages in certain deployments, or if the PSS approach is always superior. Your testing so far seems to indicate that PSS is always superior. Therefore, should this just be combined right into the current code? (This echoes Vinod's prior comment about folding the logic back into {{ProcfsBasedProcessTree}}.) A conservative approach is to introduce a config flag, try to get some experience running it in real-world clusters, and then we can flip the default in a later release if it goes well. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941234#comment-13941234 ] Mayank Bansal commented on YARN-1809: - I have tested this patch locally. It works OK with running apps; however, as soon as an app is finished, the URLs start giving errors when they should be redirected to the AHS URLs. Thoughts? Thanks, Mayank Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch After YARN-953, the web-UI of the generic history service provides more information than that of the RM: the details about app attempts and containers. It's good to provide similar web-UIs that retrieve the data from separate sources, i.e., the RM cache and the history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1640: Attachment: YARN-1640.1.patch Uploading the patch without a test case added. Will add a comment to show how I did the tests in a two-node secure cluster. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941248#comment-13941248 ] Cindy Li commented on YARN-1855: Seems YARN-1690 is the one that broke the test case. [~zjshen], can you take a look too? TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Cindy Li Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console : {code} testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover) Time elapsed: 5.39 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941261#comment-13941261 ] Xuan Gong commented on YARN-1640: - I did tests in a two-node secure YARN cluster. I did kdestroy on all the nodes and started the ResourceManagers and the NM, then ran kinit using admin (also tried rm, nm, dn, nn, http keytabs), transitioned rm1 to active, and verified that the NM can connect to rm1 successfully. Then I transitioned rm2 to active and verified that the NM can connect to rm2. Also successfully ran a MapReduce job and a distributedShell job. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941275#comment-13941275 ] Jian He commented on YARN-1849: --- Those NULL checks should be valid only for a UAM; they should not happen for a normal AM, and if they do, it's a bug. I suggest that instead of those NULL checks, which may hide bugs, we check whether it is a UAM and, if it is, do not send the container-finished events. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941277#comment-13941277 ] Xuan Gong commented on YARN-1640: - The reason it fails is that when we start the RM in manual failover, we still start the admin service using the configured RM principal. When we call transitionToActive using a different principal, the SASL client compares the principal from the admin server with its configured principal, and at this point the authentication passes. But since we are using a different principal to call transitionToActive, it actually creates the RPC and starts all active services with the second principal. So when the NM tries to connect to the RM, the authentication fails. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941283#comment-13941283 ] Karthik Kambatla commented on YARN-1849: Thanks [~jianhe]. I agree with you partially; in fact, I was thinking of doing that initially. However, if we do end up with these NULLs for managed AMs, not handling them brings the NM down. Logging the errors will let us know that things are wrong without taking the nodes down. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941282#comment-13941282 ] Xuan Gong commented on YARN-1640: - In this patch, we create an rmLoginUGI to save the UGI which is used to doSecureLogin, and use it to start the active services. In secure mode, the rmLoginUGI will be the login UGI, and in non-secure mode, it will be the current UGI. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
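A minimal sketch of the approach described above, using the standard UserGroupInformation API; the variable names and the startActiveServices hook are illustrative rather than copied from the patch:
{code}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: capture the login UGI once at startup and reuse it for every
// transition to active, regardless of which principal called transitionToActive.
void startActiveServicesAsRM() throws Exception {
  final UserGroupInformation rmLoginUGI =
      UserGroupInformation.isSecurityEnabled()
          ? UserGroupInformation.getLoginUser()     // secure: the RM principal
          : UserGroupInformation.getCurrentUser();  // non-secure: current user
  rmLoginUGI.doAs(new PrivilegedExceptionAction<Void>() {
    @Override
    public Void run() throws Exception {
      startActiveServices();  // assumed hook starting the RM's active services
      return null;
    }
  });
}
{code}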
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941309#comment-13941309 ] Hadoop QA commented on YARN-1640: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635698/YARN-1640.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3401//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3401//console This message is automatically generated. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941310#comment-13941310 ] Vinod Kumar Vavilapalli commented on YARN-1849: --- Haven't looked at the patch, but in general there is a constant tussle between keeping things up vs failing fast so as to be able to fix bugs. I would in general avoid null checks unless I am sure - failing the RM/NM at least uncovers the bug instead of limping along with it and then breaking somewhere else, at which point it becomes hard to root-cause. If possible, let's fix what is actually broken here instead of putting in a lot of null checks (if that is what the above comments are talking about). Sure, we may run into one more issue that we haven't foreseen, but we can at least take comfort in knowing that we are addressing the right corner cases. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941312#comment-13941312 ] Vinod Kumar Vavilapalli commented on YARN-1855: --- Do we know the actual bug and the corresponding bug-fix? TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Cindy Li Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console : {code} testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover) Time elapsed: 5.39 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1858) Allow containers can be allocated in groups
Michael Lv created YARN-1858: Summary: Allow containers can be allocated in groups Key: YARN-1858 URL: https://issues.apache.org/jira/browse/YARN-1858 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Michael Lv Currently, when applications running on YARN send resource requests to the RM to allocate resources, there is no good way, after the response is received, to associate the results with the original requests. We propose to add a field in each request to identify the resource request, so the resources received can be grouped by resource request. This new field can be user managed, and YARN only needs to carry it forward into the responses so the user application can associate the received resources with the original request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1858) Allow containers can be allocated in groups
[ https://issues.apache.org/jira/browse/YARN-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941320#comment-13941320 ] Michael Lv commented on YARN-1858: -- The concept is similar to an HTTP cookie: within each AM/App scope, resource requests can be tagged using the new field so that resources come back in groups when they are received/updated via the RM/AM heartbeat. Allow containers can be allocated in groups --- Key: YARN-1858 URL: https://issues.apache.org/jira/browse/YARN-1858 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Michael Lv Currently, when applications running on YARN send resource requests to the RM to allocate resources, there is no good way, after the response is received, to associate the results with the original requests. We propose to add a field in each request to identify the resource request, so the resources received can be grouped by resource request. This new field can be user managed, and YARN only needs to carry it forward into the responses so the user application can associate the received resources with the original request. -- This message was sent by Atlassian JIRA (v6.2#6252)
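Purely to illustrate the proposal (no such field exists in the current allocate API), a hypothetical AM-side use of the cookie-like tag:
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

// Hypothetical: tag each request with an opaque, client-managed group id and
// match allocated containers back to that group on the heartbeat response.
ResourceRequest req = ResourceRequest.newInstance(
    Priority.newInstance(1), ResourceRequest.ANY,
    Resource.newInstance(1024, 1), 4);
// req.setRequestGroupId("reduce-wave-1");  // proposed field, not in YARN today
// On allocation, a matching getRequestGroupId() on the container (also
// proposed) would identify which request group it satisfies.
{code}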
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941328#comment-13941328 ] Vinod Kumar Vavilapalli commented on YARN-1640: --- This looks good. Hard to write unit tests I guess. Good to know the manual tests that you have done. +1, checking this in. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941344#comment-13941344 ] Hudson commented on YARN-1640: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5362 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5362/]) YARN-1640. Fixed manual failover of ResourceManagers to work correctly in secure clusters. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579510) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-1854: Assignee: Rohith TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Rohith Priority: Blocker {noformat} testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 5.883 sec FAILURE! java.lang.AssertionError: Incorrect value for metric availableMB expected:2048 but was:4096 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160) Results : Failed tests: TestRMHA.testStartAndTransitions:160-verifyClusterMetrics:387-assertMetric:396 Incorrect value for metric availableMB expected:2048 but was:4096 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941364#comment-13941364 ] Rohith commented on YARN-1854: -- I will look into Test Case Failure. TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Rohith Priority: Blocker {noformat} testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 5.883 sec FAILURE! java.lang.AssertionError: Incorrect value for metric availableMB expected:2048 but was:4096 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160) Results : Failed tests: TestRMHA.testStartAndTransitions:160-verifyClusterMetrics:387-assertMetric:396 Incorrect value for metric availableMB expected:2048 but was:4096 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Summary: NPE in ResourceTrackerService#registerNodeManager for UAM (was: NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters) NPE in ResourceTrackerService#registerNodeManager for UAM - Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1776) renewDelegationToken should survive RM failover
[ https://issues.apache.org/jira/browse/YARN-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1776: -- Attachment: YARN-1776.1.patch I created a patch: 1. Add updateRMDelegationTokenAndSequenceNumber to RMStateStore. 2. For MemoryRMStateStore, we don't need to make the method atomic, as the memory is lost when the RM fails. Therefore, it is just a simple wrapper around storeRMDelegationTokenAndSequenceNumber and removeRMDelegationToken. 3. For ZKRMStateStore, I make use of opList to group the delete and store operations together, to ensure that either all operations succeed or none do. 4. FileSystemRMStateStore is a difficult case: since we're not just touching a single file, it's hard to make the fs operations all-or-nothing. Therefore, I just leave it as what I've done for MemoryRMStateStore. Meanwhile, storeRMDelegationTokenAndSequenceNumber itself is not atomic either. The good thing is that RM failover is supposed to work with the ZK impl. Hopefully it is still OK. Thoughts? 5. RMDelegationTokenSecretManager#updateStoredToken then calls updateRMDelegationTokenAndSequenceNumber. 6. Add a test for updateRMDelegationTokenAndSequenceNumber. renewDelegationToken should survive RM failover --- Key: YARN-1776 URL: https://issues.apache.org/jira/browse/YARN-1776 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1776.1.patch When a delegation token is renewed, two RMStateStore operations happen: 1) removing the old DT, and 2) storing the new DT. If the RM fails in between, there would be a problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
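A sketch of the opList idea for ZKRMStateStore using ZooKeeper's multi API; paths, versions, and ACLs are simplified here, and the real store wraps this differently:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: commit the delete of the old token node and the create of the new
// one as a single ZooKeeper transaction: all ops succeed or none do.
void updateTokenAtomically(ZooKeeper zkClient, String oldTokenPath,
    String newTokenPath, byte[] tokenData) throws Exception {
  List<Op> opList = new ArrayList<Op>();
  opList.add(Op.delete(oldTokenPath, -1));  // -1: ignore the node version
  opList.add(Op.create(newTokenPath, tokenData,
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
  zkClient.multi(opList);
}
{code}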
[jira] [Commented] (YARN-1776) renewDelegationToken should survive RM failover
[ https://issues.apache.org/jira/browse/YARN-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941420#comment-13941420 ] Hadoop QA commented on YARN-1776: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635722/YARN-1776.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3402//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3402//console This message is automatically generated. renewDelegationToken should survive RM failover --- Key: YARN-1776 URL: https://issues.apache.org/jira/browse/YARN-1776 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1776.1.patch When a delegation token is renewed, two RMStateStore operations: 1) removing the old DT, and 2) storing the new DT will happen. If RM fails in between. There would be problem. -- This message was sent by Atlassian JIRA (v6.2#6252)