[jira] [Created] (YARN-1851) Unable to parse launch time from job history file
Fengdong Yu created YARN-1851: - Summary: Unable to parse launch time from job history file Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a '-' in the queue name 'test-queue', we split the job history file name by '-' and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
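As a minimal, self-contained illustration of the failure described above (this is not the Hadoop source; only the file-name layout and the JOB_START_TIME_INDEX constant come from the report):
{code}
public class JhistSplitDemo {
  // Index constant quoted from FileNameIndexUtils in the report above.
  private static final int JOB_START_TIME_INDEX = 9;

  public static void main(String[] args) {
    String name = "job_1395204058904_0003-1395206473646-root-test_one_word"
        + "-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist";
    String[] parts = name.split("-");
    // The '-' inside the queue name "root.test-queue" shifts every later
    // field by one, so parts[9] is "queue" instead of the launch time.
    System.out.println(parts[JOB_START_TIME_INDEX]); // prints "queue"
    // Parsing it as a number reproduces the logged exception.
    Long.parseLong(parts[JOB_START_TIME_INDEX]);     // NumberFormatException
  }
}
{code}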
[jira] [Updated] (YARN-1851) Unable to parse launch time from job history file
[ https://issues.apache.org/jira/browse/YARN-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated YARN-1851: -- Description: When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. was: When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a - in the queue name 'test-queue', we split the job history file name by - and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. Unable to parse launch time from job history file - Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1850) Make enabling timeline service configurable
[ https://issues.apache.org/jira/browse/YARN-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1850: -- Attachment: YARN-1850.1.patch Created a patch that can disable the timeline service: the timeline client won't put entities and events to the timeline server. I set the default to true so as not to disturb users who have already played with this feature. I've tested the patch locally: when the timeline service is disabled, the DS client won't put any data to the timeline server. Make enabling timeline service configurable Key: YARN-1850 URL: https://issues.apache.org/jira/browse/YARN-1850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1850.1.patch Like the generic history service, we'd better make enabling the timeline service configurable, in case the timeline server is not up -- This message was sent by Atlassian JIRA (v6.2#6252)
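A sketch of how a client could consult such a flag before publishing. The property name "yarn.timeline-service.enabled" is an assumption based on YARN's configuration naming convention, not confirmed by the patch; per the comment above, the default is true:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineFlagCheck {
  public static boolean timelineEnabled() {
    Configuration conf = new YarnConfiguration();
    // Assumed property name; defaulting to true so existing timeline
    // users are not disturbed, as the comment above describes.
    return conf.getBoolean("yarn.timeline-service.enabled", true);
  }
  // Callers would create a TimelineClient and put entities/events only
  // when this returns true.
}
{code}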
[jira] [Commented] (YARN-1850) Make enabling timeline service configurable
[ https://issues.apache.org/jira/browse/YARN-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940261#comment-13940261 ] Hadoop QA commented on YARN-1850: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635497/YARN-1850.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.TestRMFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3396//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3396//console This message is automatically generated. Make enabling timeline service configurable Key: YARN-1850 URL: https://issues.apache.org/jira/browse/YARN-1850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1850.1.patch Like the generic history service, we'd better make enabling the timeline service configurable, in case the timeline server is not up -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
Rohith created YARN-1852: Summary: Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs Key: YARN-1852 URL: https://issues.apache.org/jira/browse/YARN-1852 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Rohith Assignee: Rohith Priority: Minor Recovering a failed/killed application throws InvalidStateTransitonException. These are logged during recovery of applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
[ https://issues.apache.org/jira/browse/YARN-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940394#comment-13940394 ] Rohith commented on YARN-1852: -- Here is the exception stack trace. For a killed application, state=KILLED:
{noformat}
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1394526371652_0004 with 1 attempts and final state = KILLED
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=Application Finished - Killed TARGET=RMAppManager RESULT=SUCCESS APPID=application_1394526371652_0003
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1394526371652_0004_01 with final state: KILLED
2014-03-19 14:26:11,618 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1394526371652_0003,name=Sleep job,user=root,queue=default,state=KILLED,trackingUrl=host-10-18-40-77:45020/cluster/app/application_1394526371652_0003,appMasterHost=N/A,startTime=1394526759247,finishTime=1394527194947,finalStatus=KILLED
2014-03-19 14:26:11,619 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1394526371652_0004_01
2014-03-19 14:26:11,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1394526371652_0004_01 State change from NEW to KILLED
2014-03-19 14:26:11,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1394526371652_0004 State change from NEW to KILLED
2014-03-19 14:26:11,619 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_KILLED at KILLED
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:632)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:82)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:690)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:674)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:662)
{noformat}
For a failed application, state=FAILED:
{noformat}
2014-03-19 14:26:11,614 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1394528000856_0003 with 2 attempts and final state = FAILED
2014-03-19 14:26:11,614 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1395139734891_0003,name=Sleep job,user=root,queue=d,state=FINISHED,trackingUrl=http://host-10-18-40-77:45020/proxy/application_1395139734891_0003/jobhistory/job/job_1395139734891_0003,appMasterHost=N/A,startTime=1395141914653,finishTime=1395141933121,finalStatus=SUCCEEDED
2014-03-19 14:26:11,614 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1394528000856_0003_01 with final state: FAILED
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1394528000856_0003_02 with final state: FAILED
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1394528000856_0003_01 State change from NEW to FAILED
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1394528000856_0003_02
2014-03-19 14:26:11,615 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1394528000856_0003_02 State change from NEW to FAILED
2014-03-19 14:26:11,616 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1394528000856_0003 State change from NEW to FAILED
2014-03-19 14:26:11,616 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ATTEMPT_FAILED at FAILED
{noformat}
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940400#comment-13940400 ] Hudson commented on YARN-1690: -- FAILURE: Integrated in Hadoop-Yarn-trunk #514 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/514/]) YARN-1690. Made DistributedShell send timeline entities+events. Contributed by Mayank Bansal. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.4.0 Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1705) Reset cluster-metrics on transition to standby
[ https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940399#comment-13940399 ] Hudson commented on YARN-1705: -- FAILURE: Integrated in Hadoop-Yarn-trunk #514 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/514/]) YARN-1705. Reset cluster-metrics on transition to standby. (Rohith via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579014) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Reset cluster-metrics on transition to standby -- Key: YARN-1705 URL: https://issues.apache.org/jira/browse/YARN-1705 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: YARN-1705.1.patch, YARN-1705.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1705) Reset cluster-metrics on transition to standby
[ https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940496#comment-13940496 ] Hudson commented on YARN-1705: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1706 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1706/]) YARN-1705. Reset cluster-metrics on transition to standby. (Rohith via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579014) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Reset cluster-metrics on transition to standby -- Key: YARN-1705 URL: https://issues.apache.org/jira/browse/YARN-1705 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: YARN-1705.1.patch, YARN-1705.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940497#comment-13940497 ] Hudson commented on YARN-1690: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1706 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1706/]) YARN-1690. Made DistributedShell send timeline entities+events. Contributed by Mayank Bansal. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.4.0 Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1690) Sending timeline entities+events from Distributed shell
[ https://issues.apache.org/jira/browse/YARN-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940538#comment-13940538 ] Hudson commented on YARN-1690: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1731 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1731/]) YARN-1690. Made DistributedShell send timeline entities+events. Contributed by Mayank Bansal. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579123) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/test/java/org/apache/hadoop/yarn/applications/distributedshell/TestDistributedShell.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java Sending timeline entities+events from Distributed shell Key: YARN-1690 URL: https://issues.apache.org/jira/browse/YARN-1690 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Fix For: 2.4.0 Attachments: YARN-1690-1.patch, YARN-1690-2.patch, YARN-1690-3.patch, YARN-1690-4.patch, YARN-1690-5.patch, YARN-1690-6.patch, YARN-1690-7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1705) Reset cluster-metrics on transition to standby
[ https://issues.apache.org/jira/browse/YARN-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940537#comment-13940537 ] Hudson commented on YARN-1705: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1731 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1731/]) YARN-1705. Reset cluster-metrics on transition to standby. (Rohith via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579014) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java Reset cluster-metrics on transition to standby -- Key: YARN-1705 URL: https://issues.apache.org/jira/browse/YARN-1705 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Rohith Attachments: YARN-1705.1.patch, YARN-1705.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-1833: -- Fix Version/s: 2.4.0 TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed: {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails when the sizes of groupWithInit and groupBefore are the same. I do not think we need this assert here. Moreover, we are also checking that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1833) TestRMAdminService Fails in trunk and branch-2
[ https://issues.apache.org/jira/browse/YARN-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940543#comment-13940543 ] Jonathan Eagles commented on YARN-1833: --- Added this test-only fix to the 2.4.0 release since it is really hindering my testing efforts on that line. TestRMAdminService Fails in trunk and branch-2 -- Key: YARN-1833 URL: https://issues.apache.org/jira/browse/YARN-1833 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.4.0 Reporter: Mit Desai Assignee: Mit Desai Labels: Test Fix For: 3.0.0, 2.4.0, 2.5.0 Attachments: YARN-1833-v2.patch, YARN-1833.patch In the test testRefreshUserToGroupsMappingsWithFileSystemBasedConfigurationProvider, the following assert is not needed: {code} Assert.assertTrue(groupWithInit.size() != groupBefore.size()); {code} As the assert takes the default groups for groupWithInit (which in my case are users, sshusers and wheel), it fails when the sizes of groupWithInit and groupBefore are the same. I do not think we need this assert here. Moreover, we are also checking that groupWithInit does not contain the userGroups that are in groupBefore, so removing the assert should not be harmful. -- This message was sent by Atlassian JIRA (v6.2#6252)
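A hedged sketch of the disjointness check the description says already covers this case (variable names follow the issue description; the actual test code may differ):
{code}
// Sufficient without the size assert: no pre-refresh user group may
// appear in the freshly initialized default mapping.
for (String group : groupBefore) {
  Assert.assertFalse("initial groups should not contain user group " + group,
      groupWithInit.contains(group));
}
{code}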
[jira] [Created] (YARN-1853) Allow containers to be ran under real user even in insecure mode
Andrey Stepachev created YARN-1853: -- Summary: Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Andrey Stepachev Currently an insecure cluster runs all containers under one user (typically 'nobody'). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled: YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1853) Allow containers to be ran under real user even in insecure mode
[ https://issues.apache.org/jira/browse/YARN-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1853: --- Attachment: YARN-1853.patch My proposal is to use the parameter 'yarn.nodemanager.linux-container-executor.nonsecure-mode.impersonate' (defaulting to true), which will control whether YARN impersonates the container user in insecure mode or runs the container under the real user. Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Andrey Stepachev Attachments: YARN-1853.patch Currently an insecure cluster runs all containers under one user (typically 'nobody'). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled: YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
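An illustrative sketch of how the proposed flag could pick the run-as user; the property name is taken from the comment, while the surrounding code and the variables 'nonsecureLocalUser' and 'appSubmitter' are hypothetical, not names from the attached patch:
{code}
// Default true preserves today's behavior (all containers impersonated
// as the fixed nonsecure user, e.g. "nobody").
boolean impersonate = conf.getBoolean(
    "yarn.nodemanager.linux-container-executor.nonsecure-mode.impersonate",
    true);
// When impersonation is off, run the container as the submitting user.
String runAsUser = impersonate ? nonsecureLocalUser : appSubmitter;
{code}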
[jira] [Updated] (YARN-1853) Allow containers to be ran under real user even in insecure mode
[ https://issues.apache.org/jira/browse/YARN-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Stepachev updated YARN-1853: --- Affects Version/s: 2.2.0 Allow containers to be ran under real user even in insecure mode Key: YARN-1853 URL: https://issues.apache.org/jira/browse/YARN-1853 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.2.0 Reporter: Andrey Stepachev Attachments: YARN-1853.patch Currently an insecure cluster runs all containers under one user (typically 'nobody'). That is not appropriate, because YARN applications don't play well with HDFS when permissions are enabled: YARN applications try to write data (as expected) into /user/nobody regardless of the user who launched the application. Another side effect is that it is not possible to configure cgroups for particular users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1854) TestRMHA#testStartAndTransitions Fails
Mit Desai created YARN-1854: --- Summary: TestRMHA#testStartAndTransitions Fails Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated YARN-1855: - Summary: TestRMFailover#testRMWebAppRedirect fails in trunk (was: TestRMFailover#testRMWebAppRedirect fails occasionally in trunk) TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940625#comment-13940625 ] Ted Yu commented on YARN-1855: -- I tried this:
{code}
Index: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
===================================================================
--- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java (revision 1579270)
+++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java (working copy)
@@ -286,7 +286,8 @@
     try {
       Map<String, List<String>> map = new URL(url).openConnection().getHeaderFields();
-      fieldHeader = map.get(field).get(0);
+      List<String> lst = map.get(field);
+      if (lst != null) fieldHeader = lst.get(0);
     } catch (Exception e) {
       // throw new RuntimeException(e);
     }
{code}
However, the next assertion fails:
{code}
header = getHeader("Refresh", rm2Url + "/ws/v1/cluster/apps");
assertTrue(header.contains("; url=" + rm1Url));
{code}
header was null. TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1854: -- Priority: Blocker (was: Major) Target Version/s: 2.4.0 TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Priority: Blocker
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails occasionally in trunk
Ted Yu created YARN-1855: Summary: TestRMFailover#testRMWebAppRedirect fails occasionally in trunk Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940644#comment-13940644 ] Zhijie Shen commented on YARN-1855: --- I can reproduce the test failure locally as well, and Jenkins reported it too: https://builds.apache.org/job/PreCommit-YARN-Build/3396//testReport/org.apache.hadoop.yarn.client/TestRMFailover/testRMWebAppRedirect/ TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940609#comment-13940609 ] Karthik Kambatla commented on YARN-1854: YARN-1705 introduced this check - I ran it multiple times while committing it, and it succeeded. [~mitdesai] - are you able to reproduce this deterministically? TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
[ https://issues.apache.org/jira/browse/YARN-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1852: -- Priority: Major (was: Minor) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs - Key: YARN-1852 URL: https://issues.apache.org/jira/browse/YARN-1852 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Rohith Assignee: Rohith Recovering a failed/killed application throws InvalidStateTransitonException. These are logged during recovery of applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1855: -- Priority: Critical (was: Major) TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940685#comment-13940685 ] Karthik Kambatla commented on YARN-1855: [~cindyli] - will you be able to take a look at this? Otherwise, I can jump on it tomorrow. TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940707#comment-13940707 ] Mit Desai commented on YARN-1854: - I got that failing in our nightly builds. When I tested it on my local machine, I got the same error. But now when I try testing it again, I get the following error intermittently:
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 1.755 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric appsPending expected:<1> but was:<0>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:384)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:154)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:154->verifyClusterMetrics:384->assertMetric:396 Incorrect value for metric appsPending expected:<1> but was:<0>
{noformat}
TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Priority: Blocker
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1842) InvalidApplicationMasterRequestException raised during AM-requested shutdown
[ https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940705#comment-13940705 ] Janos Matyas commented on YARN-1842: Hi, this seems to be an issue on OS/X and Debian only. We have just tried it on CentOS (for an automatic Hoya install on CentOS, feel free to use this script - https://github.com/sequenceiq/hadoop-docker/blob/master/hoya-centos-install.sh) and it works fine launching HBase containers. We have also tried our custom Apache Flume provider (https://github.com/sequenceiq/hoya) and it works well - launching and stopping containers as expected. A quick note: on Debian and OS/X there are different exceptions depending on whether you launch the containers using the IP address or localhost (hoya create hbase --role master 1 --role worker 1 --manager localhost:8032 --filesystem hdfs://localhost:9000 --image hdfs://localhost:9000/hbase.tar.gz --appconf file:///tmp/hoya-master/hoya-core/src/main/resources/org/apache/hoya/providers/hbase/conf --zkhosts localhost) InvalidApplicationMasterRequestException raised during AM-requested shutdown Key: YARN-1842 URL: https://issues.apache.org/jira/browse/YARN-1842 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Steve Loughran Priority: Minor Attachments: hoyalogs.tar.gz Report of the RM raising a stack trace [https://gist.github.com/matyix/9596735] during AM-initiated shutdown. The AM could just swallow this and exit, but it could be a sign of a race condition YARN-side, or maybe just in the RM client code/AM dual signalling the shutdown. I haven't replicated this myself; maybe the stack will help track down the problem. Otherwise: what is the policy YARN apps should adopt for AMs handling errors on shutdown? Go straight to an exit(-1)? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940730#comment-13940730 ] Karthik Kambatla commented on YARN-1854: Thanks Mit. It is likely a race in the test. [~rohithsharma] - will you be able to look into this? Otherwise, I'll be able to jump on it tomorrow. TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Priority: Blocker
{noformat}
testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA)  Time elapsed: 5.883 sec  <<< FAILURE!
java.lang.AssertionError: Incorrect value for metric availableMB expected:<2048> but was:<4096>
    at org.junit.Assert.fail(Assert.java:93)
    at org.junit.Assert.failNotEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:128)
    at org.junit.Assert.assertEquals(Assert.java:472)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160)

Results :

Failed tests:
  TestRMHA.testStartAndTransitions:160->verifyClusterMetrics:387->assertMetric:396 Incorrect value for metric availableMB expected:<2048> but was:<4096>
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
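If the failure is indeed an async-dispatcher race, one common way to make such metric assertions race-free is to poll instead of asserting immediately after the transition. A sketch using Hadoop's GenericTestUtils and Guava's Supplier; the getAppsPending() accessor is a hypothetical stand-in for however the test reads the metric:
{code}
// Poll the metric until the dispatcher has drained, rather than
// asserting right after the RM transition.
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    return metrics.getAppsPending() == 1; // hypothetical accessor
  }
}, 100, 5000); // check every 100 ms, time out after 5 s
{code}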
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940746#comment-13940746 ] Hadoop QA commented on YARN-1849: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635573/yarn-1849-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3397//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3397//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1850) Make enabling timeline service configurable
[ https://issues.apache.org/jira/browse/YARN-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940646#comment-13940646 ] Zhijie Shen commented on YARN-1850: --- The test failure is not related. See YARN-1855. Make enabling timeline service configurable Key: YARN-1850 URL: https://issues.apache.org/jira/browse/YARN-1850 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1850.1.patch Like the generic history service, we'd better make enabling the timeline service configurable, in case the timeline server is not up -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-1.patch NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1851) Unable to parse launch time from job history file
[ https://issues.apache.org/jira/browse/YARN-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA reassigned YARN-1851: --- Assignee: Akira AJISAKA Unable to parse launch time from job history file - Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1855: -- Target Version/s: 2.4.0 I believe this also affects 2.4. TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console :
{code}
testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover)  Time elapsed: 5.39 sec  <<< ERROR!
java.lang.NullPointerException: null
    at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1851) Unable to parse launch time from job history file
[ https://issues.apache.org/jira/browse/YARN-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940779#comment-13940779 ] Akira AJISAKA commented on YARN-1851: - I looked around the code and found that the user name and the job name are escaped, but the queue name is not. I'll create a patch to escape the queue name shortly. Unable to parse launch time from job history file - Key: YARN-1851 URL: https://issues.apache.org/jira/browse/YARN-1851 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Fengdong Yu Assignee: Akira AJISAKA Priority: Minor Fix For: 2.4.0 When a job completes, there are WARN complaints in the log: {code} 2014-03-19 13:31:10,036 WARN org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse launch time from job history file job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist : java.lang.NumberFormatException: For input string: "queue" {code} Because there is a (-) in the queue name 'test-queue', we split the job history file name by (-) and take the ninth item as the job start time. FileNameIndexUtils.java: {code} private static final int JOB_START_TIME_INDEX = 9; {code} There is another potential issue: if I also include '-' in the job name ('test_one_word' in this case), all the fields are misparsed. -- This message was sent by Atlassian JIRA (v6.2#6252)
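A sketch of the escaping idea: percent-encoding the '-' delimiter is an assumption modeled on how the other fields are already escaped, and the exact helper used by the eventual patch may differ:
{code}
// Encode the delimiter when composing the history file name...
String escapedQueue = queueName.replace("-", "%2D");
// ...and decode the field back when parsing the file name.
String decodedQueue = escapedQueue.replace("%2D", "-");
{code}
With the queue escaped this way, splitting on '-' yields a stable field count, so JOB_START_TIME_INDEX points at the launch time again.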
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940640#comment-13940640 ] Karthik Kambatla commented on YARN-1849: This time around, it turns out the master container is null:
{code}
if (rmAppAttempt != null) {
  if (rmAppAttempt.getMasterContainer().getId()
      .equals(containerStatus.getContainerId())
      && containerStatus.getState() == ContainerState.COMPLETE) {
{code}
Looks like it is not necessary for an UnmanagedAM to have a master container. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
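A minimal sketch of the guard this implies (an assumption about the fix, not the attached patch): treat a missing master container like the other ignorable nulls.
{code}
if (rmAppAttempt != null
    && rmAppAttempt.getMasterContainer() != null // UnmanagedAM may have none
    && rmAppAttempt.getMasterContainer().getId()
        .equals(containerStatus.getContainerId())
    && containerStatus.getState() == ContainerState.COMPLETE) {
  // handle the completed master container
}
{code}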
[jira] [Commented] (YARN-1852) Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs
[ https://issues.apache.org/jira/browse/YARN-1852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940822#comment-13940822 ] Jian He commented on YARN-1852: --- This is most likely because we are replaying the attempt's BaseFinalTransition logic, which sends a new FAILED/KILLED event while the RMApp has already moved to the FAILED/KILLED state. We covered the case for the FINISHED state, but it seems we missed this one. Application recovery throws InvalidStateTransitonException for FAILED and KILLED jobs - Key: YARN-1852 URL: https://issues.apache.org/jira/browse/YARN-1852 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0 Reporter: Rohith Assignee: Rohith Recovering a failed/killed application throws InvalidStateTransitonException. These are logged during recovery of applications. -- This message was sent by Atlassian JIRA (v6.2#6252)
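One way such a transition table is usually made tolerant, sketched with YARN's StateMachineFactory API; the eventual patch may fix this differently (for example, by not re-sending the event during recovery at all):
{code}
// Let a recovered app swallow the replayed attempt event in its
// terminal state instead of throwing InvalidStateTransitonException.
.addTransition(RMAppState.FAILED, RMAppState.FAILED,
    EnumSet.of(RMAppEventType.ATTEMPT_FAILED))
.addTransition(RMAppState.KILLED, RMAppState.KILLED,
    EnumSet.of(RMAppEventType.ATTEMPT_KILLED))
{code}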
[jira] [Created] (YARN-1856) cgroups based memory monitoring for containers
Karthik Kambatla created YARN-1856: -- Summary: cgroups based memory monitoring for containers Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1747) Better physical memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13940832#comment-13940832 ] Karthik Kambatla commented on YARN-1747: Re-purposing this JIRA to use cgroups for memory monitoring and assigning to myself. Better physical memory monitoring for containers Key: YARN-1747 URL: https://issues.apache.org/jira/browse/YARN-1747 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla YARN currently uses RSS to compute the physical memory being used by a container. This can lead to issues, as noticed in HDFS-5957. -- This message was sent by Atlassian JIRA (v6.2#6252)
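An illustrative sketch of what cgroups-based monitoring could read; the cgroup mount path, hierarchy name, and the containerId variable are assumptions, and the eventual patch may go through the NM's cgroups handler instead:
{code}
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Read usage straight from the container's memory cgroup rather than
// summing RSS from /proc, which over-counts pages shared between processes.
long usageBytes = Long.parseLong(Files.readAllLines(
    Paths.get("/sys/fs/cgroup/memory/hadoop-yarn/" + containerId
        + "/memory.usage_in_bytes"), StandardCharsets.UTF_8).get(0).trim());
{code}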
[jira] [Assigned] (YARN-1747) Better physical memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1747: -- Assignee: Karthik Kambatla Better physical memory monitoring for containers Key: YARN-1747 URL: https://issues.apache.org/jira/browse/YARN-1747 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla YARN currently uses RSS to compute the physical memory being used by a container. This can lead to issues, as noticed in HDFS-5957. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1855: -- Assignee: Cindy Li TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Cindy Li Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console : {code} testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover) Time elapsed: 5.39 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-2.patch Thinking more, thought we could benefit from better logging for the various null cases even if we are ignoring all of them. The new patch does that and factors handling the ContainerStatus to a different method. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-2.patch Cosmetic import fix. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940879#comment-13940879 ] Mayank Bansal commented on YARN-1809: - Thanks [~zjshen] for the patch. Here are some comments: 1. Rename ApplicationInformationProtocol to something like ApplicationBaseProtocol. 2. Why can't we move the delegation-token-related APIs to the base protocol? 3. ApplicationHistoryClientService - why are we removing the protocol handler? I think we should keep it as it was. 4. I am not sure why we removed ApplicationContext; I think it should be retained. Wouldn't it be good to have the following structure: ApplicationContext derives from ApplicationBaseProtocol? Thoughts? 5. There is a lot of refactoring in the patch, which is good, but we could have separated it into two JIRAs to keep each change focused on a specific issue. Thoughts? Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch After YARN-953, the web-UI of the generic history service provides more information than that of the RM: the details about app attempts and containers. It's good to provide similar web-UIs that retrieve the data from separate sources, i.e., the RM cache and the history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940936#comment-13940936 ] Hadoop QA commented on YARN-1849: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635622/yarn-1849-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1492 javac compiler warnings (more than the trunk's current 1491 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3398//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3398//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3398//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
Thomas Graves created YARN-1857: --- Summary: CapacityScheduler headroom doesn't account for other AM's running Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves It's possible to get an application to hang forever (or for a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit but doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit (100%) - user consumed) can be 0 even though the cluster is 100% full, because the remaining space is being used by ApplicationMasters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their ApplicationMasters started but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is based only on the user limit (which is 100% of the cluster capacity): it's using, let's say, 95% of the cluster for reduces, and the other 5% is being used by other users running ApplicationMasters. The MRAppMaster thinks it still has 5% headroom, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could make the application take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
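To make the arithmetic concrete, a small worked example with hypothetical numbers matching the scenario above:
{code}
// Cluster of 100 units, one queue, user limit = 100% of capacity.
int userLimit    = 100; // the limit the headroom calculation is based on
int userConsumed = 95;  // user 1's reduces
int otherAMs     = 5;   // other users' ApplicationMasters

int headroomSent = userLimit - userConsumed;            // 5: what the AM is told
int actuallyFree = userLimit - userConsumed - otherAMs; // 0: what is really left
// Seeing a headroom of 5, the MRAppMaster never kills a reduce to run the
// remaining maps, even though nothing is actually free.
{code}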
[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1857: Description: Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. was: Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit (100%) - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. 
The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Attachment: yarn-1849-3.patch Test failure is unrelated. New patch fixes javac warning. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940984#comment-13940984 ] Vinod Kumar Vavilapalli commented on YARN-1857: --- This is just one of the items tracked at YARN-1198. Will convert it as a sub-task. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1857: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1198 CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Its possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason why is that the headroom sent to the application is based on the user limit but it doesn't account for other Application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full because the other space is being used by application masters from other users. For instance if you have a cluster with 1 queue, user limit is 100%, you have multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps and starts running reducers. other users try to start applications and get their application masters started but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces. But at this point it needs to still finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity) its using lets say 95% of the cluster for reduces and then other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13940997#comment-13940997 ] Vinod Kumar Vavilapalli commented on YARN-1856: --- Duplicate of YARN-3? cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941008#comment-13941008 ] Hadoop QA commented on YARN-1849: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635635/yarn-1849-3.patch against trunk revision . {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3400//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941023#comment-13941023 ] Karthik Kambatla commented on YARN-1856: As discussed on YARN-3, using cgroups for memory isolation/enforcement can be problematic as it enforces an upper-bound on the amount of memory tasks can consume and hence doesn't tolerate any momentary spikes. Using it for monitoring, however, would help address YARN-1747. I haven't yet looked at the cgroups-related source closely enough. Can post an update once I do that. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941024#comment-13941024 ] Karthik Kambatla commented on YARN-1856: bq. When we use cgroups, we don't need (or want) explicit monitoring. If we set the limits much higher than what we want to enforce, we can use them for monitoring instead. The goal, again, is not to enforce. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
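A minimal sketch of what such monitoring could look like, assuming a memory cgroup mounted at the conventional /sys/fs/cgroup/memory path; the per-container group name is hypothetical and this is not the eventual patch:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CgroupMemorySampler {
  // Read the current memory usage of a container's cgroup, in bytes.
  // With the limit set far above the enforcement target, this file can be
  // polled for monitoring without the kernel ever killing the container.
  public static long usageBytes(String containerGroup) throws IOException {
    String path = "/sys/fs/cgroup/memory/" + containerGroup
        + "/memory.usage_in_bytes";
    return Long.parseLong(Files.readAllLines(
        Paths.get(path), StandardCharsets.UTF_8).get(0).trim());
  }
}
{code}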
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941018#comment-13941018 ] Vinod Kumar Vavilapalli commented on YARN-1856: --- When we use cgroups, we don't need (or want) explicit monitoring. Cgroups are going to constrain the memory usage of the process (and its tree) if the right values are set when creating the group. There were some discussions about this on YARN-3 and related JIRAs. In essence, the ContainersMonitor is really a monitor to be used only when such an OS feature is not available to properly constrain memory usage. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941037#comment-13941037 ] Jian He commented on YARN-1849: --- Hi, I want to take a look at the patch, can you wait for some time ? I'll do it today. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941007#comment-13941007 ] Alejandro Abdelnur commented on YARN-1849: -- +1 pending jenkins. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941011#comment-13941011 ] Karthik Kambatla commented on YARN-1856: Nope. YARN-3, IIUC, is just for CPU. Also, we don't want to enforce memory through cgroups - this is just for monitoring. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941004#comment-13941004 ] Karthik Kambatla commented on YARN-1849: Tested the newest patch on a secure cluster with UAM and RM HA. Failover works fine. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941062#comment-13941062 ] Hadoop QA commented on YARN-1849: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635635/yarn-1849-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3399//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3399//console This message is automatically generated. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1815) RM should recover only Managed AMs
[ https://issues.apache.org/jira/browse/YARN-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941073#comment-13941073 ] Karthik Kambatla commented on YARN-1815: Even if the UAM finishes successfully, there is no way for the RM to know. At least, not until YARN-556. Today, the RM tries to recover the app, but can't recover the UAM. The corresponding RMApp transitions to FAILED after a while. This JIRA only avoids those recovery attempts and marks the app FAILED early. RM should recover only Managed AMs -- Key: YARN-1815 URL: https://issues.apache.org/jira/browse/YARN-1815 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: Unmanaged AM recovery.png, yarn-1815-1.patch, yarn-1815-2.patch, yarn-1815-2.patch RM should not recover unmanaged AMs until YARN-1823 is fixed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-1051: --- Attachment: techreport.pdf Attaching an updated tech report which states more clearly what we intend to achieve, presents results from our P-o-C, and aligns with the design doc on how we propose to implement this in YARN. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-1051-design.pdf, curino_MSR-TR-2013-108.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, workflows, and helps for gang scheduling. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1775: -- Priority: Major (was: Minor) Fix Version/s: (was: 2.5.0) Started looking at the patch. But first up, please don't use fix-version to specify your intention. Target-version is what you should use, fix-version is set by committers at the time of commit. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941094#comment-13941094 ] Vinod Kumar Vavilapalli commented on YARN-1775: --- Some comments: - This new class is just an extension of ProcfsBasedProcessTree, so is better served by using the same util class (from the admin's point of view) but with an additional configuration option to better track RSS - Can you explain why we are doing this {code} total += Math.min(info.sharedDirty, info.pss) + info.privateDirty + info.privateClean; {code} Test - Most of the test-code is duplicating testProcfsBasedProcess. Can you avoid that? - Reuse at least some of MemoryMappingInfo, ProcessMemInfo etc from the regular code instead of duplicating in the test? Lots of white space in the patch, mostly empty lines. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
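For context on what such a process tree reads, a rough sketch of summing PSS from /proc/<pid>/smaps; this is illustrative only, and the patch's actual accounting (as the quoted line shows) combines private and shared-dirty pages differently:
{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SmapsPssReader {
  // Sum the Pss lines of /proc/<pid>/smaps. PSS charges shared pages to each
  // process proportionally, avoiding the double counting RSS suffers from.
  public static long pssKb(int pid) throws IOException {
    long total = 0;
    for (String line : Files.readAllLines(
        Paths.get("/proc/" + pid + "/smaps"), StandardCharsets.UTF_8)) {
      if (line.startsWith("Pss:")) {          // e.g. "Pss:        1024 kB"
        total += Long.parseLong(line.replaceAll("\\D", ""));
      }
    }
    return total;
  }
}
{code}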
[jira] [Commented] (YARN-1775) Create SMAPBasedProcessTree to get PSS information
[ https://issues.apache.org/jira/browse/YARN-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941120#comment-13941120 ] Chris Nauroth commented on YARN-1775: - [~rajesh.balamohan], thank you for explaining the testing. These results sound very promising! Also interesting would be confirming that containers still get killed for exceeding the limit with private/non-shared pages. I wonder then if counting RSS still has some potential advantages in certain deployments, or if the PSS approach is always superior. Your testing so far seems to indicate that PSS is always superior. Therefore, should this just be combined right into the current code? (This echoes Vinod's prior comment about folding the logic back into {{ProcfsBasedProcessTree}}.) A conservative approach is to introduce a config flag, try to get some experience running it in real-world clusters, and then we can flip the default in a later release if it goes well. Create SMAPBasedProcessTree to get PSS information -- Key: YARN-1775 URL: https://issues.apache.org/jira/browse/YARN-1775 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Attachments: yarn-1775-2.4.0.patch Create SMAPBasedProcessTree (by extending ProcfsBasedProcessTree), which will make use of PSS for computing the memory usage. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941234#comment-13941234 ] Mayank Bansal commented on YARN-1809: - I have tested this patch locally. It works OK with running apps; however, as soon as an app is finished, the URLs start giving errors when they should be redirected to the AHS URLs. Thoughts? Thanks, Mayank Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch After YARN-953, the web-UI of the generic history service provides more information than that of the RM: the details about app attempts and containers. It's good to provide similar web-UIs that retrieve the data from separate sources, i.e., the RM cache and the history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1640: Attachment: YARN-1640.1.patch Uploading the patch without a test case added. Will add a comment to show how I did the tests in a two-node secure cluster. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941248#comment-13941248 ] Cindy Li commented on YARN-1855: Seems YARN-1690 is the one that broke the test case. [~zjshen], can you take a look too? TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Cindy Li Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console : {code} testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover) Time elapsed: 5.39 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941261#comment-13941261 ] Xuan Gong commented on YARN-1640: - I did tests in a two-node secure YARN cluster. I did kdestroy on all the nodes and started the ResourceManagers and the NM, then ran kinit using admin (also tried rm, nm, dn, nn, http keytabs), transitioned rm1 to active, and verified that the NM can connect to rm1 successfully. Then I transitioned rm2 to active and verified that the NM can connect to rm2. Also successfully ran a MapReduce job and a distributedShell job. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941275#comment-13941275 ] Jian He commented on YARN-1849: --- Those NULL checks should be valid only for a UAM; they should not happen for a normal AM, and if they do, it's a bug. I suggest that instead of those NULL checks, which may hide bugs, we check whether it is a UAM and, if it is, do not send the container-finished events. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941277#comment-13941277 ] Xuan Gong commented on YARN-1640: - The reason it fails is that when we start the RM in manual failover, we still start the admin service using the configured RM principal. When we call transitionToActive using a different principal, the SASL client compares the principal from the admin server with its configured principal, and at this point the authentication passes. But since we are using a different principal to call transitionToActive, it actually creates the RPC and starts all active services with the second principal. So when the NM tries to connect to the RM, the authentication fails. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941283#comment-13941283 ] Karthik Kambatla commented on YARN-1849: Thanks [~jianhe]. I agree with you partially; in fact, I was thinking of doing that initially. However, if we do end up with these NULLs for managed AMs, not handling them brings the NM down. Logging the errors will let us know that things are wrong without taking the nodes down. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941282#comment-13941282 ] Xuan Gong commented on YARN-1640: - In this patch, we create an rmLoginUGI to save the UGI which is used to doSecureLogin, and use it to start the active services. In secure mode, the rmLoginUGI will be the login UGI, and in non-secure mode, it will be the current UGI. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
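A minimal sketch of the approach described above, using the standard UserGroupInformation API; the variable names and the startActiveServices hook are illustrative rather than copied from the patch:
{code}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: capture the login UGI once at startup and reuse it for every
// transition to active, regardless of which principal called transitionToActive.
void startActiveServicesAsRM() throws Exception {
  final UserGroupInformation rmLoginUGI =
      UserGroupInformation.isSecurityEnabled()
          ? UserGroupInformation.getLoginUser()     // secure: the RM principal
          : UserGroupInformation.getCurrentUser();  // non-secure: current user
  rmLoginUGI.doAs(new PrivilegedExceptionAction<Void>() {
    @Override
    public Void run() throws Exception {
      startActiveServices();  // assumed hook starting the RM's active services
      return null;
    }
  });
}
{code}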
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941309#comment-13941309 ] Hadoop QA commented on YARN-1640: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635698/YARN-1640.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3401//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3401//console This message is automatically generated. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941310#comment-13941310 ] Vinod Kumar Vavilapalli commented on YARN-1849: --- Haven't looked at the patch, but in general there is a constant tussle between keeping things up vs failing fast so as to be able to fix bugs. I would in general avoid null checks unless I am sure - failing the RM/NM at least uncovers the bug instead of limping along with it and then breaking somewhere else, at which point it becomes hard to root-cause. If possible, let's fix what is actually broken here instead of putting in a lot of null checks (if that is what the above comments are talking about). Sure, we may run into one more issue that we haven't foreseen, but we can at least take comfort in knowing that we are addressing the right corner cases. NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1855) TestRMFailover#testRMWebAppRedirect fails in trunk
[ https://issues.apache.org/jira/browse/YARN-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941312#comment-13941312 ] Vinod Kumar Vavilapalli commented on YARN-1855: --- Do we know the actual bug and the corresponding bug-fix? TestRMFailover#testRMWebAppRedirect fails in trunk -- Key: YARN-1855 URL: https://issues.apache.org/jira/browse/YARN-1855 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: Cindy Li Priority: Critical From https://builds.apache.org/job/Hadoop-Yarn-trunk/514/console : {code} testRMWebAppRedirect(org.apache.hadoop.yarn.client.TestRMFailover) Time elapsed: 5.39 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.client.TestRMFailover.testRMWebAppRedirect(TestRMFailover.java:269) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1858) Allow containers can be allocated in groups
Michael Lv created YARN-1858: Summary: Allow containers can be allocated in groups Key: YARN-1858 URL: https://issues.apache.org/jira/browse/YARN-1858 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Michael Lv Currently, when applications running on YARN send resource requests to the RM to allocate resources, there is no good way, after the response is received, to associate the results with the original requests. We propose to add a field in each request to identify the resource request, so the resources received can be grouped by resource request. This new field can be user managed, and YARN only needs to carry it forward into the responses so the user application can associate the received resources with the original request. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1858) Allow containers can be allocated in groups
[ https://issues.apache.org/jira/browse/YARN-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941320#comment-13941320 ] Michael Lv commented on YARN-1858: -- The concept is similar to an HTTP cookie: within each AM/App scope, resource requests can be tagged using the new field so that resources come back in groups when they are received/updated via the RM/AM heartbeat. Allow containers can be allocated in groups --- Key: YARN-1858 URL: https://issues.apache.org/jira/browse/YARN-1858 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.3.0 Reporter: Michael Lv Currently, when applications running on YARN send resource requests to the RM to allocate resources, there is no good way, after the response is received, to associate the results with the original requests. We propose to add a field in each request to identify the resource request, so the resources received can be grouped by resource request. This new field can be user managed, and YARN only needs to carry it forward into the responses so the user application can associate the received resources with the original request. -- This message was sent by Atlassian JIRA (v6.2#6252)
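Purely to illustrate the proposal (no such field exists in the current allocate API), a hypothetical AM-side use of the cookie-like tag:
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

// Hypothetical: tag each request with an opaque, client-managed group id and
// match allocated containers back to that group on the heartbeat response.
ResourceRequest req = ResourceRequest.newInstance(
    Priority.newInstance(1), ResourceRequest.ANY,
    Resource.newInstance(1024, 1), 4);
// req.setRequestGroupId("reduce-wave-1");  // proposed field, not in YARN today
// On allocation, a matching getRequestGroupId() on the container (also
// proposed) would identify which request group it satisfies.
{code}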
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941328#comment-13941328 ] Vinod Kumar Vavilapalli commented on YARN-1640: --- This looks good. Hard to write unit tests I guess. Good to know the manual tests that you have done. +1, checking this in. Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1640) Manual Failover does not work in secure clusters
[ https://issues.apache.org/jira/browse/YARN-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941344#comment-13941344 ] Hudson commented on YARN-1640: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5362 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5362/]) YARN-1640. Fixed manual failover of ResourceManagers to work correctly in secure clusters. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1579510) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java Manual Failover does not work in secure clusters Key: YARN-1640 URL: https://issues.apache.org/jira/browse/YARN-1640 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1640.1.patch NodeManager gets rejected after manually making one RM as active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith reassigned YARN-1854: Assignee: Rohith TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Rohith Priority: Blocker {noformat} testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 5.883 sec FAILURE! java.lang.AssertionError: Incorrect value for metric availableMB expected:2048 but was:4096 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160) Results : Failed tests: TestRMHA.testStartAndTransitions:160-verifyClusterMetrics:387-assertMetric:396 Incorrect value for metric availableMB expected:2048 but was:4096 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1854) TestRMHA#testStartAndTransitions Fails
[ https://issues.apache.org/jira/browse/YARN-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941364#comment-13941364 ] Rohith commented on YARN-1854: -- I will look into Test Case Failure. TestRMHA#testStartAndTransitions Fails -- Key: YARN-1854 URL: https://issues.apache.org/jira/browse/YARN-1854 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.4.0 Reporter: Mit Desai Assignee: Rohith Priority: Blocker {noformat} testStartAndTransitions(org.apache.hadoop.yarn.server.resourcemanager.TestRMHA) Time elapsed: 5.883 sec FAILURE! java.lang.AssertionError: Incorrect value for metric availableMB expected:2048 but was:4096 at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.assertMetric(TestRMHA.java:396) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.verifyClusterMetrics(TestRMHA.java:387) at org.apache.hadoop.yarn.server.resourcemanager.TestRMHA.testStartAndTransitions(TestRMHA.java:160) Results : Failed tests: TestRMHA.testStartAndTransitions:160-verifyClusterMetrics:387-assertMetric:396 Incorrect value for metric availableMB expected:2048 but was:4096 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1849) NPE in ResourceTrackerService#registerNodeManager for UAM
[ https://issues.apache.org/jira/browse/YARN-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1849: --- Summary: NPE in ResourceTrackerService#registerNodeManager for UAM (was: NPE in ResourceTrackerService#registerNodeManager for UAM on secure clusters) NPE in ResourceTrackerService#registerNodeManager for UAM - Key: YARN-1849 URL: https://issues.apache.org/jira/browse/YARN-1849 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1849-1.patch, yarn-1849-2.patch, yarn-1849-2.patch, yarn-1849-3.patch While running an UnmanagedAM on secure cluster, ran into an NPE on failover/restart. This is similar to YARN-1821. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1776) renewDelegationToken should survive RM failover
[ https://issues.apache.org/jira/browse/YARN-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1776: -- Attachment: YARN-1776.1.patch I created a patch: 1. Add updateRMDelegationTokenAndSequenceNumber to RMStateStore. 2. For MemoryRMStateStore, we don't need to make the method atomic, as the memory is lost when the RM fails. Therefore, it is just a simple wrapper around storeRMDelegationTokenAndSequenceNumber and removeRMDelegationToken. 3. For ZKRMStateStore, I make use of opList to group the delete and store operations together, to ensure that either all operations succeed or none do. 4. FileSystemRMStateStore is a difficult case: since we're not just touching a single file, it's hard to make the fs operations all-or-nothing. Therefore, I just leave it as what I've done for MemoryRMStateStore. Meanwhile, storeRMDelegationTokenAndSequenceNumber itself is not atomic either. The good thing is that RM failover is supposed to work with the ZK impl. Hopefully it is still OK. Thoughts? 5. RMDelegationTokenSecretManager#updateStoredToken then calls updateRMDelegationTokenAndSequenceNumber. 6. Add a test for updateRMDelegationTokenAndSequenceNumber. renewDelegationToken should survive RM failover --- Key: YARN-1776 URL: https://issues.apache.org/jira/browse/YARN-1776 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1776.1.patch When a delegation token is renewed, two RMStateStore operations happen: 1) removing the old DT, and 2) storing the new DT. If the RM fails in between, there would be a problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
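A sketch of the opList idea for ZKRMStateStore using ZooKeeper's multi API; paths, versions, and ACLs are simplified here, and the real store wraps this differently:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: commit the delete of the old token node and the create of the new
// one as a single ZooKeeper transaction: all ops succeed or none do.
void updateTokenAtomically(ZooKeeper zkClient, String oldTokenPath,
    String newTokenPath, byte[] tokenData) throws Exception {
  List<Op> opList = new ArrayList<Op>();
  opList.add(Op.delete(oldTokenPath, -1));  // -1: ignore the node version
  opList.add(Op.create(newTokenPath, tokenData,
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
  zkClient.multi(opList);
}
{code}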
[jira] [Commented] (YARN-1776) renewDelegationToken should survive RM failover
[ https://issues.apache.org/jira/browse/YARN-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13941420#comment-13941420 ] Hadoop QA commented on YARN-1776: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12635722/YARN-1776.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3402//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3402//console This message is automatically generated. renewDelegationToken should survive RM failover --- Key: YARN-1776 URL: https://issues.apache.org/jira/browse/YARN-1776 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1776.1.patch When a delegation token is renewed, two RMStateStore operations: 1) removing the old DT, and 2) storing the new DT will happen. If RM fails in between. There would be problem. -- This message was sent by Atlassian JIRA (v6.2#6252)