[jira] [Updated] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3010: - Description: A new findbugs issue was reported recently: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html was: A new findbugs issue was reported recently: https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor A new findbugs issue was reported recently: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3010: - Attachment: YARN-3010.001.patch Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Attachments: YARN-3010.001.patch A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine some fs operations in FileSystemRMStateStore to improve performance
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265760#comment-14265760 ] Hadoop QA commented on YARN-2996: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690261/YARN-2996.002.patch against trunk revision 4cd66f7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6249//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6249//console This message is automatically generated. Refine some fs operations in FileSystemRMStateStore to improve performance -- Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-2996.001.patch, YARN-2996.002.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places that invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can merge them to save one RPC call: {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new"); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} {{updateFile}} is not ideal either: it writes the file to _output\_file_.tmp, renames it to _output\_file_.new, and then renames that to _output\_file_; we can eliminate one rename operation. There is also one unnecessary import that we can remove. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
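For the first point, the exists-then-getFileStatus pair can indeed be collapsed into a single RPC by calling getFileStatus directly and treating FileNotFoundException as the "does not exist" case. A minimal sketch, not the actual FileSystemRMStateStore code:
{code}
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleRpcStatusCheck {
  // One getFileStatus call replaces the exists()/getFileStatus() pair;
  // FileNotFoundException plays the role of the former exists() == false branch.
  static FileStatus statusIfExists(FileSystem fs, Path path) throws IOException {
    try {
      return fs.getFileStatus(path);
    } catch (FileNotFoundException e) {
      return null;
    }
  }
}
{code}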
[jira] [Created] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
Yi Liu created YARN-3010: Summary: Fix recent findbug issue in AbstractYarnScheduler Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor A new findbugs issue was reported recently: https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3010: - Description: A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html was: A new findbugs issue was reported recently: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3004) Fix missed synchronization in MemoryRMStateStore
[ https://issues.apache.org/jira/browse/YARN-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264437#comment-14264437 ] Hadoop QA commented on YARN-3004: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690050/YARN-3004.001.patch against trunk revision 21c6f01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6242//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6242//console This message is automatically generated. Fix missed synchronization in MemoryRMStateStore Key: YARN-3004 URL: https://issues.apache.org/jira/browse/YARN-3004 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3004.001.patch In {{MemoryRMStateStore}}, obviously {{state}} variable should be thread-safe, so we need to add _synchronized_ for {code} storeApplicationStateInternal updateApplicationStateInternal storeOrUpdateAMRMTokenSecretManagerState {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
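For context, the fix being proposed is simply to make these state-mutating methods hold the store's intrinsic lock, matching the internal methods that are already synchronized. A simplified sketch only; the class, fields, and method signatures below are placeholders, not the real MemoryRMStateStore API:
{code}
// Simplified, hypothetical store: every method touching the shared map is
// synchronized, so concurrent store/update calls cannot interleave.
class InMemoryStoreSketch {
  private final java.util.Map<String, byte[]> state =
      new java.util.HashMap<String, byte[]>();

  synchronized void storeApplicationStateInternal(String appId, byte[] data) {
    state.put(appId, data);
  }

  synchronized void updateApplicationStateInternal(String appId, byte[] data) {
    state.put(appId, data);
  }

  synchronized void storeOrUpdateAMRMTokenSecretManagerState(byte[] tokenState) {
    state.put("AMRMTokenSecretManagerState", tokenState);
  }
}
{code}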
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264500#comment-14264500 ] Rohith commented on YARN-2684: -- IIUC, to simulate this, the queue placement policy should be changed. I am able to simulate it using the rule name *reject*. Basically, as far as I can see in the FairScheduler code, no rule other than *reject* rejects applications. Will upload a patch to fix the issue soon. FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical YARN-2308 fixes this issue for CS, this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3004) Fix missed synchronization in MemoryRMStateStore
Yi Liu created YARN-3004: Summary: Fix missed synchronization in MemoryRMStateStore Key: YARN-3004 URL: https://issues.apache.org/jira/browse/YARN-3004 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu In {{MemoryRMStateStore}}, obviously {{state}} variable should be thread-safe, so we need to add _synchronized_ for {code} storeApplicationStateInternal updateApplicationStateInternal storeOrUpdateAMRMTokenSecretManagerState {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3004) Fix missed synchronization in MemoryRMStateStore
[ https://issues.apache.org/jira/browse/YARN-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-3004: - Attachment: YARN-3004.001.patch Fix missed synchronization in MemoryRMStateStore Key: YARN-3004 URL: https://issues.apache.org/jira/browse/YARN-3004 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-3004.001.patch In {{MemoryRMStateStore}}, obviously {{state}} variable should be thread-safe, so we need to add _synchronized_ for {code} storeApplicationStateInternal updateApplicationStateInternal storeOrUpdateAMRMTokenSecretManagerState {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3000) YARN_PID_DIR should be visible in yarn-env.sh
[ https://issues.apache.org/jira/browse/YARN-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264371#comment-14264371 ] Jeff Zhang commented on YARN-3000: -- It makes sense to mark YARN_PID_DIR as deprecated if HADOOP_PID_DIR is used for YARN in trunk. YARN_PID_DIR should be visible in yarn-env.sh - Key: YARN-3000 URL: https://issues.apache.org/jira/browse/YARN-3000 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 2.6.0 Reporter: Jeff Zhang Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3000.patch Currently YARN_PID_DIR only shows up in yarn-daemon.sh, which is not the place for users to set up environment variables. IMO, yarn-env.sh is the place for users to set up environment variables, just like hadoop-env.sh, so it's better to put YARN_PID_DIR into yarn-env.sh. (It can be put in a comment, just like YARN_RESOURCEMANAGER_HEAPSIZE.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3000) YARN_PID_DIR should be visible in yarn-env.sh
[ https://issues.apache.org/jira/browse/YARN-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264373#comment-14264373 ] Jeff Zhang commented on YARN-3000: -- BTW, what's the JIRA for making HADOOP_PID_DIR replace YARN_PID_DIR? As far as I know, hadoop-2.6 didn't do that. YARN_PID_DIR should be visible in yarn-env.sh - Key: YARN-3000 URL: https://issues.apache.org/jira/browse/YARN-3000 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 2.6.0 Reporter: Jeff Zhang Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3000.patch Currently YARN_PID_DIR only shows up in yarn-daemon.sh, which is not the place for users to set up environment variables. IMO, yarn-env.sh is the place for users to set up environment variables, just like hadoop-env.sh, so it's better to put YARN_PID_DIR into yarn-env.sh. (It can be put in a comment, just like YARN_RESOURCEMANAGER_HEAPSIZE.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3002) YARN documentation needs updating post-shell rewrite
[ https://issues.apache.org/jira/browse/YARN-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264502#comment-14264502 ] Steve Loughran commented on YARN-3002: -- +1 YARN documentation needs updating post-shell rewrite Key: YARN-3002 URL: https://issues.apache.org/jira/browse/YARN-3002 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 3.0.0 Reporter: Allen Wittenauer Attachments: YARN-3002-00.patch After HADOOP-9902, the YARN documentation is out of date. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264517#comment-14264517 ] Rohith commented on YARN-2684: -- Attached the patch fixing this issue. Kindly review the patch. FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS, this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2684: - Attachment: 0001-YARN-2684.patch FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS, this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2625) Problems with CLASSPATH in Job Submission REST API
[ https://issues.apache.org/jira/browse/YARN-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264577#comment-14264577 ] Doug Haigh commented on YARN-2625: -- I am not using MR. I am writing my own AppMaster which requires knowing the path to the Hadoop ecosystem's CLASSPATH. Unless you expect Hadoop to be rewritten in something other than Java, most AppMaster jars written in Java will be required to know the CLASSPATH of the Hadoop ecosystem they are expected to run under. If you want to add a REST API to specifically get the CLASSPATH of the Hadoop ecosystem a Java AppMaster will run under, that is fine with me. Problems with CLASSPATH in Job Submission REST API -- Key: YARN-2625 URL: https://issues.apache.org/jira/browse/YARN-2625 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.5.1 Reporter: Doug Haigh There are a couple of issues I have found specifying the CLASSPATH environment variable using the REST API. 1) In the Java client, the CLASSPATH environment is usually made up of either the value of yarn.application.classpath in yarn-site.xml or the default YARN classpath value as defined by YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH. REST API consumers have no method of telling the resource manager to use the default unless they hardcode the default value themselves. If the default ever changes, the code would need to change. 2) If any environment variables are used in the CLASSPATH environment 'value' field, they are evaluated while the values are NULL, resulting in bad values in the CLASSPATH. For example, if I had hardcoded the CLASSPATH value to the default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* the classpath passed to the application master is :/share/hadoop/common/*:/share/hadoop/common/lib/*:/share/hadoop/hdfs/*:/share/hadoop/hdfs/lib/*:/share/hadoop/yarn/*:/share/hadoop/yarn/lib/* These two problems require REST API consumers to always have the fully resolved path defined in the yarn.application.classpath value. If the property is missing or contains environment variables, the application created by the REST API will fail due to the CLASSPATH being incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
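For comparison, this is roughly what the Java client path amounts to (and what a REST consumer currently has to replicate by hand). A hedged sketch assuming a plain Configuration; the ":" separator is a simplification of the platform-aware separator real clients use:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmClasspathSketch {
  // Read yarn.application.classpath, falling back to the built-in
  // cross-platform default when the property is unset.
  static String buildAmClasspath(Configuration conf) {
    String[] entries = conf.getStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH);
    StringBuilder cp = new StringBuilder();
    for (String entry : entries) {
      if (cp.length() > 0) {
        cp.append(':');
      }
      cp.append(entry);
    }
    return cp.toString();
  }
}
{code}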
[jira] [Commented] (YARN-2684) FairScheduler should tolerate queue configuration changes across RM restarts
[ https://issues.apache.org/jira/browse/YARN-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264581#comment-14264581 ] Hadoop QA commented on YARN-2684: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690068/0001-YARN-2684.patch against trunk revision 21c6f01. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6243//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6243//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6243//console This message is automatically generated. FairScheduler should tolerate queue configuration changes across RM restarts Key: YARN-2684 URL: https://issues.apache.org/jira/browse/YARN-2684 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2684.patch YARN-2308 fixes this issue for CS, this JIRA is to fix it for FS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2360) Fair Scheduler: Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264567#comment-14264567 ] Karthik Kambatla commented on YARN-2360: Thanks for noticing the missing commit, Peng. Just committed to branch-2. It should be part of 2.7. Fair Scheduler: Display dynamic fair share for queues on the scheduler page --- Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Fix For: 2.6.0 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch, yarn-2360-6.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2360) Fair Scheduler: Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2360: --- Fix Version/s: (was: 2.6.0) 2.7.0 Fair Scheduler: Display dynamic fair share for queues on the scheduler page --- Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Fix For: 2.7.0 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch, yarn-2360-6.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2978) ResourceManager crashes with NPE while getting queue info
[ https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264589#comment-14264589 ] Varun Saxena commented on YARN-2978: Kindly review. ResourceManager crashes with NPE while getting queue info - Key: YARN-2978 URL: https://issues.apache.org/jira/browse/YARN-2978 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Varun Saxena Priority: Critical Labels: capacityscheduler, resourcemanager Attachments: YARN-2978.001.patch, YARN-2978.002.patch, YARN-2978.003.patch java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625) at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA reassigned YARN-1177: --- Assignee: Akira AJISAKA (was: Wei Yan) Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Akira AJISAKA Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3005) [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java
[ https://issues.apache.org/jira/browse/YARN-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3005: --- Assignee: (was: Varun Saxena) [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java Key: YARN-3005 URL: https://issues.apache.org/jira/browse/YARN-3005 Project: Hadoop YARN Issue Type: Improvement Reporter: Akira AJISAKA Priority: Trivial Labels: newbie Since we have moved to JDK7, we can refactor the below if-else statement for String. {code} // TODO JDK7 SWITCH if (REGISTRY_CLIENT_AUTH_KERBEROS.equals(auth)) { access = AccessPolicy.sasl; } else if (REGISTRY_CLIENT_AUTH_DIGEST.equals(auth)) { access = AccessPolicy.digest; } else if (REGISTRY_CLIENT_AUTH_ANONYMOUS.equals(auth)) { access = AccessPolicy.anon; } else { throw new ServiceStateException(E_UNKNOWN_AUTHENTICATION_MECHANISM + "\"" + auth + "\""); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
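For reference, the String switch that the TODO points at would look roughly like the fragment below; it reuses the constants and fields from the snippet above and is only an illustrative sketch, not the committed change:
{code}
// JDK7 switch on String; the REGISTRY_CLIENT_AUTH_* constants are
// compile-time String constants, so they are valid case labels.
switch (auth) {
case REGISTRY_CLIENT_AUTH_KERBEROS:
  access = AccessPolicy.sasl;
  break;
case REGISTRY_CLIENT_AUTH_DIGEST:
  access = AccessPolicy.digest;
  break;
case REGISTRY_CLIENT_AUTH_ANONYMOUS:
  access = AccessPolicy.anon;
  break;
default:
  throw new ServiceStateException(E_UNKNOWN_AUTHENTICATION_MECHANISM
      + "\"" + auth + "\"");
}
{code}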
[jira] [Created] (YARN-3005) [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java
Akira AJISAKA created YARN-3005: --- Summary: [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java Key: YARN-3005 URL: https://issues.apache.org/jira/browse/YARN-3005 Project: Hadoop YARN Issue Type: Improvement Reporter: Akira AJISAKA Priority: Trivial Since we have moved to JDK7, we can refactor the below if-else statement for String. {code} // TODO JDK7 SWITCH if (REGISTRY_CLIENT_AUTH_KERBEROS.equals(auth)) { access = AccessPolicy.sasl; } else if (REGISTRY_CLIENT_AUTH_DIGEST.equals(auth)) { access = AccessPolicy.digest; } else if (REGISTRY_CLIENT_AUTH_ANONYMOUS.equals(auth)) { access = AccessPolicy.anon; } else { throw new ServiceStateException(E_UNKNOWN_AUTHENTICATION_MECHANISM + "\"" + auth + "\""); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264785#comment-14264785 ] Akira AJISAKA commented on YARN-1177: - Hi [~ywskycn], how is the issue going? I want to take it over. Automatic failover is already supported; however, when executing a graceful failover from the command line, an UnsupportedOperationException is thrown. {code:title=RMHAServiceTarget.java} @Override public InetSocketAddress getZKFCAddress() { // TODO (YARN-1177): ZKFC implementation throw new UnsupportedOperationException("RMHAServiceTarget doesn't have" + " a corresponding ZKFC address"); } {code} Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Wei Yan Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264789#comment-14264789 ] Wei Yan commented on YARN-1177: --- [~ajisakaa], feel free to grab. thanks. Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Wei Yan Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3005) [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java
[ https://issues.apache.org/jira/browse/YARN-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3005: -- Assignee: Varun Saxena [JDK7] Use switch statement for String instead of if-else statement in RegistrySecurity.java Key: YARN-3005 URL: https://issues.apache.org/jira/browse/YARN-3005 Project: Hadoop YARN Issue Type: Improvement Reporter: Akira AJISAKA Assignee: Varun Saxena Priority: Trivial Labels: newbie Since we have moved to JDK7, we can refactor the below if-else statement for String. {code} // TODO JDK7 SWITCH if (REGISTRY_CLIENT_AUTH_KERBEROS.equals(auth)) { access = AccessPolicy.sasl; } else if (REGISTRY_CLIENT_AUTH_DIGEST.equals(auth)) { access = AccessPolicy.digest; } else if (REGISTRY_CLIENT_AUTH_ANONYMOUS.equals(auth)) { access = AccessPolicy.anon; } else { throw new ServiceStateException(E_UNKNOWN_AUTHENTICATION_MECHANISM + "\"" + auth + "\""); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2919) Potential race between renew and cancel in DelegationTokenRenewer
[ https://issues.apache.org/jira/browse/YARN-2919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265163#comment-14265163 ] Karthik Kambatla commented on YARN-2919: [~Naganarasimha] - I am on vacation and will not be able to look into this at least until next week. Sorry for the inconvenience. Potential race between renew and cancel in DelegationTokenRenewer - Key: YARN-2919 URL: https://issues.apache.org/jira/browse/YARN-2919 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Naganarasimha G R Priority: Critical Attachments: YARN-2919.20141209-1.patch YARN-2874 fixes a deadlock in DelegationTokenRenewer, but there is still a race because of which a renewal in flight isn't interrupted by a cancel. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3006) Improve the error message when attempting manual failover with auto-failover enabled
[ https://issues.apache.org/jira/browse/YARN-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265172#comment-14265172 ] Masatake Iwasaki commented on YARN-3006: Is YARN-2807 relevant? Improve the error message when attempting manual failover with auto-failover enabled Key: YARN-3006 URL: https://issues.apache.org/jira/browse/YARN-3006 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Minor When executing manual failover with automatic failover enabled, an UnsupportedOperationException is thrown. {code} # yarn rmadmin -failover rm1 rm2 Exception in thread "main" java.lang.UnsupportedOperationException: RMHAServiceTarget doesn't have a corresponding ZKFC address at org.apache.hadoop.yarn.client.RMHAServiceTarget.getZKFCAddress(RMHAServiceTarget.java:51) at org.apache.hadoop.ha.HAServiceTarget.getZKFCProxy(HAServiceTarget.java:94) at org.apache.hadoop.ha.HAAdmin.gracefulFailoverThroughZKFCs(HAAdmin.java:311) at org.apache.hadoop.ha.HAAdmin.failover(HAAdmin.java:282) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:449) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:378) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:482) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:622) {code} I think the above message is confusing to users. (Users may wonder whether ZKFC is configured correctly...) The command should output an error message to stderr instead of throwing an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
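Agreed that the raw stack trace is unfriendly. A minimal sketch of the behaviour the reporter asks for; the runFailover helper below is hypothetical and not the existing RMAdminCLI code:
{code}
// Translate the UnsupportedOperationException into a plain stderr message and
// a non-zero exit code instead of letting a stack trace reach the user.
static int runFailover(Runnable failoverAction) {
  try {
    failoverAction.run();
    return 0;
  } catch (UnsupportedOperationException e) {
    System.err.println("failover: manual failover is not supported when"
        + " automatic failover is enabled for this ResourceManager.");
    return -1;
  }
}
{code}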
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265186#comment-14265186 ] Jian He commented on YARN-2230: --- [~kawaa], thanks for your patch! Looks good overall. Would you mind fixing the doc for the memory and also for the min-allocation too? thx Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Priority: Minor Attachments: YARN-2230.001.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to the documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either the documentation or the code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side, e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same seems to apply to memory as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
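To make the documentation/code mismatch concrete: the capping behaviour that yarn-default.xml promises would amount to clamping the request instead of rejecting it. An illustrative sketch only, not a proposed patch:
{code}
// Clamp a requested vcore count into [0, maxVCores], i.e. the "will get
// capped to this value" semantics described in yarn-default.xml.
static int capVirtualCores(int requestedVCores, int maxVCores) {
  return Math.max(0, Math.min(requestedVCores, maxVCores));
}
{code}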
[jira] [Commented] (YARN-2360) Fair Scheduler: Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265208#comment-14265208 ] Hudson commented on YARN-2360: -- FAILURE: Integrated in Hadoop-trunk-Commit #6809 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6809/]) Move YARN-2360 from 2.6 to 2.7 in CHANGES.txt (kasha: rev 41d72cbd48e6df7be3d177eaf04d73e88cf38381) * hadoop-yarn-project/CHANGES.txt Fair Scheduler: Display dynamic fair share for queues on the scheduler page --- Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Fix For: 2.7.0 Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, Screen_Shot_v3.png, Screen_Shot_v4.png, Screen_Shot_v5.png, YARN-2360-v1.txt, YARN-2360-v2.txt, YARN-2360-v3.patch, YARN-2360-v4.patch, YARN-2360-v5.patch, yarn-2360-6.patch Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2881) Implement PlanFollower for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265259#comment-14265259 ] Karthik Kambatla commented on YARN-2881: I can remove the TODO at commit time. +1 for the patch, committing it now.. Implement PlanFollower for FairScheduler Key: YARN-2881 URL: https://issues.apache.org/jira/browse/YARN-2881 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2881.001.patch, YARN-2881.002.patch, YARN-2881.002.patch, YARN-2881.003.patch, YARN-2881.004.patch, YARN-2881.005.patch, YARN-2881.006.patch, YARN-2881.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265276#comment-14265276 ] Vijay Bhat commented on YARN-2230: -- Jian, I (Vijay Bhat) can definitely take care of that. Also, since this is my first submission, I wanted to clarify - is the protocol that I assign the JIRA to myself once I submit the patch? Apologies for any confusion. Thanks! Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Priority: Minor Attachments: YARN-2230.001.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to the documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either the documentation or the code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side, e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same seems to apply to memory as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3008) FairScheduler: Use lock for queuemanager instead of synchronized on FairScheduler
Anubhav Dhoot created YARN-3008: --- Summary: FairScheduler: Use lock for queuemanager instead of synchronized on FairScheduler Key: YARN-3008 URL: https://issues.apache.org/jira/browse/YARN-3008 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Instead of a big monolithic lock on FairScheduler, we can have an explicit lock on the QueueManager and revisit all synchronized methods in FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
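One possible shape for such a lock, shown purely as an illustrative sketch; the class and method names are placeholders, not the actual FairScheduler/QueueManager code:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A dedicated read/write lock guards the queue map, so readers do not have to
// synchronize on the whole scheduler while writers still get exclusive access.
class QueueManagerSketch<Q> {
  private final Map<String, Q> queues = new HashMap<String, Q>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  Q getQueue(String name) {
    lock.readLock().lock();
    try {
      return queues.get(name);
    } finally {
      lock.readLock().unlock();
    }
  }

  void addQueue(String name, Q queue) {
    lock.writeLock().lock();
    try {
      queues.put(name, queue);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}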
[jira] [Commented] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265254#comment-14265254 ] Karthik Kambatla commented on YARN-2217: Can we also add error cases to the tests? I understand the methods are very simple, but it would be nice to catch changes in behavior. Shared cache client side changes Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch Implement the client side changes for the shared cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2978) ResourceManager crashes with NPE while getting queue info
[ https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265261#comment-14265261 ] Hadoop QA commented on YARN-2978: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690124/YARN-2978.004.patch against trunk revision dfd2589. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6246//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6246//console This message is automatically generated. ResourceManager crashes with NPE while getting queue info - Key: YARN-2978 URL: https://issues.apache.org/jira/browse/YARN-2978 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Varun Saxena Priority: Critical Labels: capacityscheduler, resourcemanager Attachments: YARN-2978.001.patch, YARN-2978.002.patch, YARN-2978.003.patch, YARN-2978.004.patch java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625) at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2427) Add support for moving apps between queues in RM web services
[ https://issues.apache.org/jira/browse/YARN-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265310#comment-14265310 ] Zhijie Shen commented on YARN-2427: --- Thanks for the patch, Varun! Some comments: 1. Is a GET operation necessary for the queue info alone? We have already provided the get-app API, which contains the queue information. It may be arguable that this response is smaller in terms of message size. However, the full app metadata should not be that bad, should it? {code} + @GET + @Path("/apps/{appid}/queue") + @Produces({ MediaType.APPLICATION_JSON, MediaType.APPLICATION_XML }) + public AppQueue getAppQueue(@Context HttpServletRequest hsr, + @PathParam("appid") String appId) throws AuthorizationException { {code} 2. I assume similar logic should be done in ClientRMService#moveApplicationAcrossQueues. If not, that method needs to be fixed. Here, we may not want to do the sanity check twice. {code} +String userName = callerUGI.getUserName(); +RMApp app = null; +try { + app = getRMAppForAppId(appId); +} catch (NotFoundException e) { + RMAuditLogger.logFailure(userName, AuditConstants.KILL_APP_REQUEST, + "UNKNOWN", "RMWebService", "Trying to move an absent application " + + appId); + throw e; +} + +if (!app.getQueue().equals(targetQueue.getQueue())) { + // user is attempting to change queue. + return moveApp(app, callerUGI, targetQueue.getQueue()); +} {code} 3. If we avoided the logic in (2), we may have to handle ApplicationNotFoundException from ClientRMService#moveApplicationAcrossQueues and map it to NotFoundException around the following code. {code} + if (ue.getCause() instanceof YarnException) { +YarnException ye = (YarnException) ue.getCause(); {code} 4. Make it protected static? {code} + String appQueueToJSON(AppQueue targetQueue) throws Exception { {code} 5. JAXBContextResolver needs to add AppQueue. 6. Is the code change in TestFifoScheduler and testAppSubmit necessary? Add support for moving apps between queues in RM web services - Key: YARN-2427 URL: https://issues.apache.org/jira/browse/YARN-2427 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2427.0.patch, apache-yarn-2427.1.patch, apache-yarn-2427.2.patch, apache-yarn-2427.3.patch Support for moving apps from one queue to another is now present in CapacityScheduler and FairScheduler. We should expose the functionality via RM web services as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2574) Add support for FairScheduler to the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265318#comment-14265318 ] Hudson commented on YARN-2574: -- FAILURE: Integrated in Hadoop-trunk-Commit #6812 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6812/]) YARN-2881. [YARN-2574] Implement PlanFollower for FairScheduler. (Anubhav Dhoot via kasha) (kasha: rev 0c4b11267717eb451fa6ed4c586317f2db32fbd5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractSchedulerPlanFollower.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/ReservationQueueConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestFairReservationSystem.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractReservationSystem.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/ReservationSystemTestUtil.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestFairSchedulerPlanFollower.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestCapacitySchedulerPlanFollower.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/FairSchedulerPlanFollower.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerDynamicBehavior.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/ReservationConstants.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/PlanQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java Add support for FairScheduler to the ReservationSystem -- Key: YARN-2574 URL: https://issues.apache.org/jira/browse/YARN-2574 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Subru Krishnan Assignee: Anubhav Dhoot YARN-1051 introduces the ReservationSystem and the current implementation is based on CapacityScheduler. This JIRA proposes adding support for FairScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2881) Implement PlanFollower for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265317#comment-14265317 ] Hudson commented on YARN-2881: -- FAILURE: Integrated in Hadoop-trunk-Commit #6812 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6812/]) YARN-2881. [YARN-2574] Implement PlanFollower for FairScheduler. (Anubhav Dhoot via kasha) (kasha: rev 0c4b11267717eb451fa6ed4c586317f2db32fbd5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractSchedulerPlanFollower.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/ReservationQueueConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestFairReservationSystem.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/AbstractReservationSystem.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/ReservationSystemTestUtil.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestFairSchedulerPlanFollower.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerPreemption.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/TestCapacitySchedulerPlanFollower.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/FairSchedulerPlanFollower.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AllocationConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/QueueManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerDynamicBehavior.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/reservation/ReservationConstants.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/PlanQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java Implement PlanFollower for FairScheduler Key: YARN-2881 URL: https://issues.apache.org/jira/browse/YARN-2881 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-2881.001.patch, YARN-2881.002.patch, YARN-2881.002.patch, YARN-2881.003.patch, YARN-2881.004.patch, YARN-2881.005.patch, YARN-2881.006.patch, YARN-2881.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265154#comment-14265154 ] Hudson commented on YARN-2958: -- FAILURE: Integrated in Hadoop-trunk-Commit #6808 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6808/]) YARN-2958. Made RMStateStore not update the last sequence number when updating the delegation token. Contributed by Varun Saxena. (zjshen: rev 562a701945be3a672f9cb5a52cc6db2c1589ba2b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreRMDTEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/LeveldbRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMDelegationTokenSecretManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/NullRMStateStore.java RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch, YARN-2958.004.patch It seems that RMStateStore updates last sequence number when storing or updating each individual DT, to recover the latest sequence number when RM restarting. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info(State store is in Fenced state. 
Can't update RM Delegation Token.); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info(updating RMDelegation token with sequence number: + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error(Error in updating persisted RMDelegationToken with sequence number: + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2( seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. It makes the next created DT after RM restarting will conflict with DT 2 on sequence num. Second, the aforementioned bug doesn't happen actually, because the recovered last sequence num has been overwritten at by the correctly one. {code} public void recover(RMState rmState) throws Exception { LOG.info(recovering RMDelegationTokenSecretManager.); //
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265409#comment-14265409 ] Varun Saxena commented on YARN-2958: Thanks [~jianhe] and [~zjshen] for the review. RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Fix For: 2.7.0 Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch, YARN-2958.004.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if (isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1, so the next DT created after the RM restarts will conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number is overwritten by the correct one. {code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store, which could be wrong. Fortunately, it is then updated to the right number as the recovered tokens are added back: {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} All the stored identifiers are gone through, and delegationTokenSequenceNumber is set to the largest sequence number among them. Therefore, a new DT will always be assigned a sequence number larger than that of any recovered DT. 
To sum up, two negatives make a positive, but it's good to fix the issue. Please let me know if I've missed something here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
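The recovery behaviour discussed in YARN-2958 above can be sketched in a few lines of standalone Java (class and method names here are made up for illustration; this is not the Hadoop implementation): the sequence counter is rebuilt from the recovered tokens themselves, so a stale, separately stored "last sequence number" cannot cause a collision after restart.
{code}
import java.util.Map;

// Minimal standalone sketch (not the Hadoop code): recover the delegation-token
// sequence counter from the recovered tokens themselves, instead of trusting a
// separately persisted "last sequence number".
class DtSequenceRecoverySketch {

  private int delegationTokenSeqNum = 0;

  // recoveredTokens maps a token's sequence number to its renew date,
  // mimicking the token state loaded from the state store.
  void recover(Map<Integer, Long> recoveredTokens) {
    for (Map.Entry<Integer, Long> e : recoveredTokens.entrySet()) {
      int seq = e.getKey();
      // Same idea as the snippet quoted above: keep the largest sequence
      // number seen among all persisted tokens, so tokens issued after a
      // restart can never collide with a recovered one.
      if (seq > delegationTokenSeqNum) {
        delegationTokenSeqNum = seq;
      }
    }
  }

  int nextSequenceNumber() {
    return ++delegationTokenSeqNum;
  }
}
{code}
Because the counter ends up as the maximum over all persisted tokens, the next token issued after recovery cannot reuse an existing sequence number, which is exactly the "two negatives make a positive" effect noted above.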
[jira] [Commented] (YARN-2978) ResourceManager crashes with NPE while getting queue info
[ https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265407#comment-14265407 ] Varun Saxena commented on YARN-2978: Test failure unrelated. Passing in local ResourceManager crashes with NPE while getting queue info - Key: YARN-2978 URL: https://issues.apache.org/jira/browse/YARN-2978 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Varun Saxena Priority: Critical Labels: capacityscheduler, resourcemanager Attachments: YARN-2978.001.patch, YARN-2978.002.patch, YARN-2978.003.patch, YARN-2978.004.patch java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625) at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3000) YARN_PID_DIR should be visible in yarn-env.sh
[ https://issues.apache.org/jira/browse/YARN-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265426#comment-14265426 ] Jeff Zhang commented on YARN-3000: -- [~aw] Thanks for the clarification; then it makes sense to deprecate YARN_PID_DIR. YARN_PID_DIR should be visible in yarn-env.sh - Key: YARN-3000 URL: https://issues.apache.org/jira/browse/YARN-3000 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 2.6.0 Reporter: Jeff Zhang Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3000.patch Currently I only see YARN_PID_DIR in yarn-daemon.sh, which is not supposed to be the place for users to set up environment variables. IMO, yarn-env.sh is the place for users to set up environment variables, just like hadoop-env.sh, so it's better to put YARN_PID_DIR into yarn-env.sh (it can be put in a comment just like YARN_RESOURCEMANAGER_HEAPSIZE). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay Bhat updated YARN-2230: - Attachment: YARN-2230.002.patch Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Priority: Minor Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to the documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either the documentation or the code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side, e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same seems to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
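For illustration, the gap described in YARN-2230 above can be reduced to two small helpers (a hypothetical sketch; the names below are made up and this is not the SchedulerUtils code): one implements the capping behaviour that yarn-default.xml promises, the other the validate-and-throw behaviour the scheduler actually performs.
{code}
// Illustrative only: the yarn-default.xml wording ("will get capped to this
// value") versus the validate-and-throw behaviour described in the issue.
// All names here are made up for the sketch.
final class VcoreRequestSketch {

  static int capToMax(int requestedVcores, int maxVcores) {
    // Documented behaviour: silently clamp the ask to the configured maximum.
    return Math.min(requestedVcores, maxVcores);
  }

  static void validateOrThrow(int requestedVcores, int maxVcores) {
    // Behaviour the issue reports the scheduler actually has: reject the ask.
    if (requestedVcores < 0 || requestedVcores > maxVcores) {
      throw new IllegalArgumentException(
          "Invalid resource request, requestedVirtualCores=" + requestedVcores
              + ", maxVirtualCores=" + maxVcores);
    }
  }
}
{code}
Whichever way the JIRA is resolved, only one of these two behaviours should remain documented.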
[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2637: -- Attachment: YARN-2637.26.patch Reformatted some sections of TestLeafQueue; commenting out the null check for rmContext.getScheduler in FiCaSchedulerApp to see how widespread that condition is in the tests. maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, the number of AMs in a leaf queue is calculated in the following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when a new application is submitted to the RM, it checks whether the app can be activated in the following way: {code} for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() >= getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info("Application " + application.getApplicationId() + " from user: " + application.getUser() + " activated in queue: " + getQueueName()); } } {code} For example, if a queue has capacity = 1G and max_am_resource_percent = 0.2, the maximum resource that AMs can use is 200M. Assuming minimum_allocation = 1M, the number of AMs that can be launched is 200, and if each AM actually uses 5M (> minimum_allocation), all apps can still be activated and they will occupy all the resources of the queue instead of only max_am_resource_percent of it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
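The problem described in YARN-2637 above boils down to charging each AM at minimum_allocation instead of at its real request. A rough, hypothetical sketch of a resource-based limit (illustrative names only; this is not the actual patch) looks like the following.
{code}
import java.util.Iterator;
import java.util.List;

// Rough sketch of the idea behind the fix (not the actual patch): activate
// applications while tracking the resource their AMs actually ask for, and
// stop when the am-resource budget is exhausted, instead of counting AMs as
// if each one used only minimum_allocation.
final class AmLimitSketch {

  // pendingAmRequestsMb: memory (MB) each pending AM container asks for.
  // Returns the number of applications that could be activated.
  static int activate(List<Integer> pendingAmRequestsMb,
                      int queueCapacityMb,
                      double maxAmResourcePercent) {
    int maxAmResourceMb = (int) (queueCapacityMb * maxAmResourcePercent);
    int usedAmResourceMb = 0;
    int activated = 0;

    for (Iterator<Integer> it = pendingAmRequestsMb.iterator(); it.hasNext();) {
      int amMb = it.next();
      // Charge the AM's real request against the budget, not minimum_allocation.
      if (usedAmResourceMb + amMb > maxAmResourceMb) {
        break;
      }
      usedAmResourceMb += amMb;
      activated++;
      it.remove();
    }
    return activated;
  }
}
{code}
With the numbers from the example above (1G queue, max_am_resource_percent = 0.2, 5M per AM), this activates at most 40 AMs instead of 200, keeping AM usage within the 200M budget.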
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265572#comment-14265572 ] Chengbing Liu commented on YARN-2997: - {quote} I think this is not possible given that we are looping this.context.getContainers() which is based on containerId to Container map. Or we can just use a list. {quote} We are looping over {{context.getContainers()}}, plus possible remainders from the previous heartbeat (in case of a lost heartbeat). If a previously completed container has its status changed somehow, there would be two different ContainerStatus with the same ID reported. That's why I use a map, and use {{pendingCompletedContainers.put(containerId, containerStatus)}} instead of {{containerStatuses.add(containerStatus)}} directly, in order to prevent such duplication. {quote} then we should send the pendingCompletedContainers in getNMContainerStatuses method too {quote} We may not need to change {{getNMContainerStatuses}}, as it will send all container statuses in the NM context, except for containers whose application is not in the NM context. I think that will cover all elements in {{pendingCompletedContainers}}. And a lost heartbeat is not a problem for {{getNMContainerStatuses}}. {quote} or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method. {quote} That's a good idea. I will change this in the next patch. Thanks for your advice! NM keeps sending finished containers to RM until app is finished Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
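The de-duplication point made in the YARN-2997 discussion above can be shown with a small standalone sketch (simplified identifiers; not the NodeStatusUpdater code): keying pending completed-container reports by container id means a status that is tracked twice is still reported only once per heartbeat.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the map-based de-duplication idea (not the NM code):
// a status added twice, e.g. once from the live context and once as a
// leftover from a lost heartbeat, goes out only once.
final class PendingCompletedContainersSketch {

  static final class ContainerStatus {
    final String containerId;
    final int exitStatus;
    ContainerStatus(String containerId, int exitStatus) {
      this.containerId = containerId;
      this.exitStatus = exitStatus;
    }
  }

  private final Map<String, ContainerStatus> pendingCompleted = new HashMap<>();

  void track(ContainerStatus status) {
    // put() overwrites an earlier entry for the same id, so at most one
    // status per container is reported in the next heartbeat.
    pendingCompleted.put(status.containerId, status);
  }

  List<ContainerStatus> drainForHeartbeat() {
    List<ContainerStatus> out = new ArrayList<>(pendingCompleted.values());
    pendingCompleted.clear();
    return out;
  }
}
{code}
A plain list would report both copies; the map keeps only the latest status for each container id.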
[jira] [Assigned] (YARN-3008) FairScheduler: Use lock for queuemanager instead of synchronized on FairScheduler
[ https://issues.apache.org/jira/browse/YARN-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3008: -- Assignee: Varun Saxena FairScheduler: Use lock for queuemanager instead of synchronized on FairScheduler - Key: YARN-3008 URL: https://issues.apache.org/jira/browse/YARN-3008 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Varun Saxena Instead of a big monolithic lock on FairScheduler, we can have an explicit lock on queuemanager and revisit all synchronized methods in FairScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
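As a rough illustration of the refactoring proposed in YARN-3008 (hypothetical names; not the FairScheduler code), queue-manager state could be guarded by its own read/write lock so that readers of the queue hierarchy no longer contend on the scheduler-wide monitor:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the direction proposed above (not the actual code): give the
// queue manager its own lock instead of synchronizing on the scheduler.
final class QueueManagerLockSketch {

  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Integer> queueWeights = new HashMap<>();

  Integer getWeight(String queueName) {
    lock.readLock().lock();          // many concurrent readers allowed
    try {
      return queueWeights.get(queueName);
    } finally {
      lock.readLock().unlock();
    }
  }

  void setWeight(String queueName, int weight) {
    lock.writeLock().lock();         // writers are exclusive
    try {
      queueWeights.put(queueName, weight);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}
Read-mostly callers (queue lookups, UI/REST queries) could then proceed concurrently, while configuration reloads take the exclusive write lock.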
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265480#comment-14265480 ] Craig Welch commented on YARN-2637: --- bq. Regarding null checks in FiCaSchedulerApp. Since scheduler assumes application is in running state when adding FiCaSchedulerApp. It is a big issue if RMApp cannot be found at that time. So comparing to just ignore such error, I think you need throw exception (if that exception will not cause RM shutdown) and log such error. I'm not quite sure how to phrase this differently to get the point across - it is already the case throughout the many mocking points which interact with this code that the rmapp may be null at this point (if it were not the case it would not be necessary to check for it). As I mentioned previously, the ResourceManager itself checks for this case. I am not introducing the mocking which resulted in this state, or even existing checks for it in non-test code, I'm receiving this state and carrying it forward in the same way as it has been done elsewhere (and, again, not simply in tests). Changing this is not something which belongs in the scope of this jira because it represents a rationalization/overhaul of mocking throughout this area (resource manager, schedulers), it is non-trivial and not specific to or properly within the scope of this change. Feel free to create a separate jira to improve the mocking throughout the code. The separate null-check for the amresourcerequest is necessitated by the apparently intentional behavior of unmanaged am's. bq. And when this is possible? + if (rmContext.getScheduler() != null) again, in existing test paths, and existing code is tolerant of this as well, I'm merely carrying it forward - it would belong in the new jira as well, were one opened bq. \t in leafqueue - I've checked and the spacing is consistent with the existing spacing in the file. 
maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (IteratorFiCaSchedulerApp i=pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() = getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info(Application + application.getApplicationId() + from user: + application.getUser() + activated in queue: + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM ( minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265411#comment-14265411 ] Wangda Tan commented on YARN-2637: -- Hi [~cwelch], Thanks for updating, the latest patch looks much cleaner to me. Regarding null checks in FiCaSchedulerApp. Since scheduler assumes application is in running state when adding FiCaSchedulerApp. It is a big issue if RMApp cannot be found at that time. So comparing to just ignore such error, I think you need throw exception (if that exception will not cause RM shutdown) and log such error. And when this is possible? {code} + if (rmContext.getScheduler() != null) { +amResource = rmContext.getScheduler().getMinimumResourceCapability(); + } {code} If this is to address mock issue in tests, I suggest to modify test logic to avoid such changes. TestLeafQueue has some \t. Could you fix that please? Wangda maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (IteratorFiCaSchedulerApp i=pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() = getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info(Application + application.getApplicationId() + from user: + application.getUser() + activated in queue: + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM ( minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265516#comment-14265516 ] Hadoop QA commented on YARN-2230: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690202/YARN-2230.002.patch against trunk revision 0c4b112. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6247//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6247//console This message is automatically generated. Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Priority: Minor Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. 
{code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265530#comment-14265530 ] Wangda Tan commented on YARN-2637: -- Regarding null check, I agree we can just go ahead and file a separated ticket to address them, it is not caused by your patch -- As you said, it is in the past the mocked tests forgot to set some fields but no one triggered that. But I think it will be helpful to solve at least the getScheduler() problem with this patch together. Since additional check will make people in the future maintaining the code spending more time think about why such checks exist. And I've just checked the TestLeafQueue has some existing \t, but not much lines (about 20 lines), can you reformat them in your patch? We shouldn't make our code consistent with previous bad code style. :) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (IteratorFiCaSchedulerApp i=pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() = getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info(Application + application.getApplicationId() + from user: + application.getUser() + activated in queue: + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM ( minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1729) TimelineWebServices always passes primary and secondary filters as strings
[ https://issues.apache.org/jira/browse/YARN-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265474#comment-14265474 ] Chris K Wensel commented on YARN-1729: -- Just want to point out that if a value passed to TimelineWebServices#parsePairStr starts with a number, it is parsed as a number, causing the filter to fail if the value is stored as a string. That is, the rules for parsing the query string (String vs Object) are not consistent with the put methods that parse the JSON entity. ckw TimelineWebServices always passes primary and secondary filters as strings -- Key: YARN-1729 URL: https://issues.apache.org/jira/browse/YARN-1729 Project: Hadoop YARN Issue Type: Sub-task Reporter: Billie Rinaldi Assignee: Billie Rinaldi Fix For: 2.4.0 Attachments: YARN-1729.1.patch, YARN-1729.2.patch, YARN-1729.3.patch, YARN-1729.4.patch, YARN-1729.5.patch, YARN-1729.6.patch, YARN-1729.7.patch Primary filters and secondary filter values can be arbitrary json-compatible Object. The web services should determine if the filters specified as query parameters are objects or strings before passing them to the store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
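The inconsistency described for YARN-1729 above (and YARN-3009 below) can be illustrated with a tiny sketch (made-up helper names; not the TimelineWebServices code): a lenient parser that consumes a leading number turns a value like 7CCA... into the Number 7, whereas a strict parser only converts the value when the whole string is numeric and otherwise keeps it as a String, matching how the value was stored on PUT.
{code}
// Illustration only; names are made up. The strict variant treats the query
// parameter as a number only if the entire string parses, so values that
// merely start with a digit stay Strings.
final class FilterValueParseSketch {

  static Object parseStrict(String raw) {
    try {
      return Long.valueOf(raw);          // "42"     -> 42L
    } catch (NumberFormatException e) {
      return raw;                        // "7CCA1B" -> "7CCA1B"
    }
  }

  public static void main(String[] args) {
    System.out.println(parseStrict("42").getClass().getSimpleName());     // Long
    System.out.println(parseStrict("7CCA1B").getClass().getSimpleName()); // String
  }
}
{code}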
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265556#comment-14265556 ] Jian He commented on YARN-2230: --- [~vijaysbhat], you can assign to yourself once you start working on it. I'm adding you to the contributor list. Also, when you submit the patch, it'll be good to summarize what the patch does so that others can understand the patch more easily. Summarizing the patch is also useful for history tracking. Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Priority: Minor Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be violated when resource of AM is minimumAllocation
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265625#comment-14265625 ] Hadoop QA commented on YARN-2637: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690221/YARN-2637.26.patch against trunk revision 0c4b112. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.reservation.TestCapacitySchedulerPlanFollower org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.TestFifoScheduler org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6248//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6248//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6248//console This message is automatically generated. 
maximum-am-resource-percent could be violated when resource of AM is minimumAllocation Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (IteratorFiCaSchedulerApp i=pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() = getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info(Application + application.getApplicationId() + from user: + application.getUser() + activated in queue: + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM ( minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2881) Implement PlanFollower for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265695#comment-14265695 ] Subru Krishnan commented on YARN-2881: -- Thanks [~adhoot] for making sure that the reservation system works with FS as well, and [~kasha] for reviewing/committing it. Just for the record, I did look at the last version of the patch with [~adhoot] this afternoon and it looked good to me. Implement PlanFollower for FairScheduler Key: YARN-2881 URL: https://issues.apache.org/jira/browse/YARN-2881 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.7.0 Attachments: YARN-2881.001.patch, YARN-2881.002.patch, YARN-2881.002.patch, YARN-2881.003.patch, YARN-2881.004.patch, YARN-2881.005.patch, YARN-2881.006.patch, YARN-2881.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265663#comment-14265663 ] Chengbing Liu commented on YARN-2997: - [~jianhe] Can we perhaps deal with {{getNMContainerStatuses}} issue in another JIRA? This one has not changed anything for RM restart yet. If so, the only thing left is the {{pendingCompletedContainers.clear()}} thing. What do you think? NM keeps sending finished containers to RM until app is finished Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3009) TimelineWebServices always parses primary and secondary filters as numbers if first char is a number
[ https://issues.apache.org/jira/browse/YARN-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-3009: --- Assignee: Naganarasimha G R TimelineWebServices always parses primary and secondary filters as numbers if first char is a number Key: YARN-3009 URL: https://issues.apache.org/jira/browse/YARN-3009 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Chris K Wensel Assignee: Naganarasimha G R If you pass a filter value that starts with a number (7CCA...), the filter value will be parsed into the Number '7' causing the filter to fail the search. Should be noted the actual value as stored via a PUT operation is properly parsed and stored as a String. This manifests as a very hard to identify issue with DAGClient in Apache Tez and naming dags/vertices with alphanumeric guid values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3009) TimelineWebServices always parses primary and secondary filters as numbers if first char is a number
[ https://issues.apache.org/jira/browse/YARN-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265672#comment-14265672 ] Naganarasimha G R commented on YARN-3009: - Hi [~cwensel], I have assigned this issue to myself; if you want to work on it or are already working on it, please feel free to reassign. TimelineWebServices always parses primary and secondary filters as numbers if first char is a number Key: YARN-3009 URL: https://issues.apache.org/jira/browse/YARN-3009 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Chris K Wensel Assignee: Naganarasimha G R If you pass a filter value that starts with a number (7CCA...), the filter value will be parsed into the Number '7' causing the filter to fail the search. Should be noted the actual value as stored via a PUT operation is properly parsed and stored as a String. This manifests as a very hard to identify issue with DAGClient in Apache Tez and naming dags/vertices with alphanumeric guid values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265677#comment-14265677 ] Vijay Bhat commented on YARN-2230: -- Thanks Jian! I've submitted the updated patch. Patch summary: I've updated the text descriptions of the following properties in yarn-default.xml to reflect the actual behavior in the code. * yarn.scheduler.minimum-allocation-vcores * yarn.scheduler.maximum-allocation-vcores * yarn.scheduler.minimum-allocation-mb * yarn.scheduler.maximum-allocation-mb This is a documentation patch and does not change any code behavior. Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to the documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either the documentation or the code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side, e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) .
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is non-obvious to discover why a job does not make any progress. The same seems to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
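To make the documented behavior concrete, here is a minimal, hypothetical sketch of what "capped to this value" would look like if the code were changed to match yarn-default.xml; this is not the actual SchedulerUtils logic (which throws InvalidResourceRequestException instead), and the class and method names are made up for illustration.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

public final class CapVsRejectSketch {
  // Clamp the ask to the configured maximums instead of rejecting it outright.
  static void capToMaximum(Resource ask, Resource maximum) {
    if (ask.getVirtualCores() > maximum.getVirtualCores()) {
      ask.setVirtualCores(maximum.getVirtualCores());
    }
    if (ask.getMemory() > maximum.getMemory()) {
      ask.setMemory(maximum.getMemory());
    }
  }
}
{code}
The attached patches take the other direction: they keep the throwing behavior and fix the yarn-default.xml description instead.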
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265637#comment-14265637 ] Jian He commented on YARN-2997: --- bq. except the containers whose application is not in NM context. I think we should send containers whose application is not in the NMContext too, for recovery. NM keeps sending finished containers to RM until app is finished Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch We have seen a lot of the following in the RM log: {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by the NM sending completed containers repeatedly until the app is finished. On the RM side, the container has already been released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay Bhat reassigned YARN-2230: Assignee: Vijay Bhat Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to the documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either the documentation or the code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side, e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) .
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is non-obvious to discover why a job does not make any progress. The same seems to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine some fs operations in FileSystemRMStateStore to improve performance
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265707#comment-14265707 ] Yi Liu commented on YARN-2996: -- Thanks [~zjshen] for the review. You are right: for the *.new* and *.tmp* files, the existing code uses them for some checks. But actually the incompatibility issue you mentioned is really rare, and it's not a big issue. {{checkAndResumeUpdateOperation}} exists because we write the state to a *.tmp* file, then rename it to a *.new* file, and finally rename it to _output\_file_. If we remove the step of renaming to the *.new* file, we can remove this function too. Anyway, I will revert this modification. So in the new patch, I only keep #1 described in the description. I add two new fixes in the new patch: *1.* we missed *synchronized* for {{updateRMDelegationTokenState}} *2.* Add the fix of YARN-3004 to this patch, since {{MemoryRMStateStore}} is only used in tests and we can fix them in this patch too. Refine some fs operations in FileSystemRMStateStore to improve performance -- Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-2996.001.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places that invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new"); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good either: it writes the file to _output\_file_.tmp, then renames it to _output\_file_.new, then renames it to _output\_file_; we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
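As a rough illustration of the #1 merge kept in the new patch: a single {{getFileStatus}} call can both test existence and fetch the status, so the {{exists}} + {{getFileStatus}} pair collapses into one RPC. The helper below is a minimal sketch with an illustrative name, not necessarily what the patch itself does.
{code}
import java.io.FileNotFoundException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class FsStatusSketch {
  // One RPC instead of two: a null return plays the role of the old
  // fs.exists(path) == false branch.
  static FileStatus getFileStatusOrNull(FileSystem fs, Path path) throws Exception {
    try {
      return fs.getFileStatus(path);
    } catch (FileNotFoundException e) {
      return null;
    }
  }
}
{code}
With such a helper, the version-node check becomes a null check on the returned {{FileStatus}} instead of the {{if (fs.exists(versionNodePath))}} guard shown above.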
[jira] [Created] (YARN-3009) TimelineWebServices always parses primary and secondary filters as numbers if first char is a number
Chris K Wensel created YARN-3009: Summary: TimelineWebServices always parses primary and secondary filters as numbers if first char is a number Key: YARN-3009 URL: https://issues.apache.org/jira/browse/YARN-3009 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Chris K Wensel If you pass a filter value that starts with a number (7CCA...), the filter value will be parsed into the Number '7', causing the filter to fail the search. It should be noted that the actual value as stored via a PUT operation is properly parsed and stored as a String. This manifests as a very hard-to-identify issue with DAGClient in Apache Tez when naming dags/vertices with alphanumeric guid values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2997: -- Assignee: Chengbing Liu NM keeps sending finished containers to RM until app is finished Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch We have seen a lot of the following in the RM log: {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by the NM sending completed containers repeatedly until the app is finished. On the RM side, the container has already been released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2996) Refine some fs operations in FileSystemRMStateStore to improve performance
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-2996: - Attachment: YARN-2996.002.patch Refine some fs operations in FileSystemRMStateStore to improve performance -- Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-2996.001.patch, YARN-2996.002.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places that invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new"); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good either: it writes the file to _output\_file_.tmp, then renames it to _output\_file_.new, then renames it to _output\_file_; we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2217) Shared cache client side changes
[ https://issues.apache.org/jira/browse/YARN-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264827#comment-14264827 ] Chris Trezzo commented on YARN-2217: Test failure seems unrelated. Shared cache client side changes Key: YARN-2217 URL: https://issues.apache.org/jira/browse/YARN-2217 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2217-trunk-v1.patch, YARN-2217-trunk-v2.patch, YARN-2217-trunk-v3.patch, YARN-2217-trunk-v4.patch, YARN-2217-trunk-v5.patch, YARN-2217-trunk-v6.patch Implement the client side changes for the shared cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264831#comment-14264831 ] Varun Saxena commented on YARN-2958: bq. If you take a look at the old addStoreOrUpdateOps. Storing DT and writing last sequence number is put in the same opList, hence both are executed or neither. [~zjshen], I had actually made it conditional in the patch, i.e. the sequence number is put in the opList only if isUpdateSeqNo is enabled. Anyway, if we go by the assumption that if the znode doesn't exist when updating, we suspect the DT is not written and neither is the sequence number, then we do not need the isUpdateSeqNo flag. I will make the change and upload a new patch. Thanks for the review. RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. This makes the next DT created after the RM restart conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number has been overwritten by the correct one. {code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store. It could be wrong.
Fortunately, delegationTokenSequenceNumber is updated to the right number. {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} All the stored identifiers will be gone through, and delegationTokenSequenceNumber will be set to the largest sequence number among these identifiers. Therefore, a new DT will be assigned a sequence number that is always larger than that of all the recovered DTs. To sum up, two negatives make a positive, but it's good to fix the issue. Please let me know if I've missed something here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-1177: Assignee: (was: Akira AJISAKA) Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264823#comment-14264823 ] Zhijie Shen commented on YARN-2958: --- bq. And if it doesn't exist we store it as a new token (not update it). In this case, I think we should not overwrite the sequence number. If you take a look at the old addStoreOrUpdateOps: storing the DT and writing the last sequence number are put in the same opList, hence both are executed or neither. If the znode doesn't exist when updating, we suspect the DT is not written, and neither is the sequence number. RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. This makes the next DT created after the RM restart conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number has been overwritten by the correct one. {code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store. It could be wrong. Fortunately, delegationTokenSequenceNumber is updated to the right number.
{code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} All the stored identifiers will be gone through, and delegationTokenSequenceNumber will be set to the largest sequence number among these identifiers. Therefore, a new DT will be assigned a sequence number that is always larger than that of all the recovered DTs. To sum up, two negatives make a positive, but it's good to fix the issue. Please let me know if I've missed something here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
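For readers unfamiliar with the opList being referenced, here is a minimal sketch of the all-or-nothing idea using the plain ZooKeeper multi API; the paths and method name are illustrative, not the actual ZKRMStateStore code.
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public final class AtomicDtStoreSketch {
  // Both znodes are written in one multi() call, so either the token and the
  // latest sequence number are both persisted or neither is.
  static void storeTokenAndSeqNum(ZooKeeper zk, String tokenPath, byte[] tokenData,
      String seqNumPath, byte[] seqNumData) throws Exception {
    List<Op> opList = new ArrayList<Op>();
    opList.add(Op.create(tokenPath, tokenData,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
    opList.add(Op.create(seqNumPath, seqNumData,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
    zk.multi(opList);
  }
}
{code}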
[jira] [Commented] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264842#comment-14264842 ] Akira AJISAKA commented on YARN-1177: - After rethinking this, I don't need ZKFC for RM failover. What I really want to do is improve the error message when attempting graceful failover without ZKFC. Feel free to take it over, thanks. Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Assignee: Akira AJISAKA Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265018#comment-14265018 ] Zhijie Shen commented on YARN-2958: --- The last patch looks good to me. [~jianhe], do you have any further comments? Otherwise, I'll commit the patch late today. In AbstractDelegationTokenSecretManager, the following code should no longer be useful in the YARN scope. However, in case another impl of AbstractDelegationTokenSecretManager doesn't store the last sequence number separately, let's still keep this logic. {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch, YARN-2958.004.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. This makes the next DT created after the RM restart conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number has been overwritten by the correct one. {code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store. It could be wrong.
Fortunately, delegationTokenSequenceNumber is updated to the right number. {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} All the stored identifiers will be gone through, and delegationTokenSequenceNumber will be set to the largest sequence number among these identifiers. Therefore, a new DT will be assigned a sequence number that is always larger than that of all the recovered DTs. To sum up, two negatives make a positive, but it's good to fix the issue. Please let me know if I've missed something here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265043#comment-14265043 ] Hadoop QA commented on YARN-2958: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690110/YARN-2958.004.patch against trunk revision dfd2589. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6244//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6244//console This message is automatically generated. RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch, YARN-2958.004.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. This makes the next DT created after the RM restart conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number has been overwritten by the correct one.
{code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store. It could be wrong. Fortunately, delegationTokenSequenceNumber is updated to the right number. {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code}
[jira] [Assigned] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter reassigned YARN-2716: --- Assignee: (was: Robert Kanter) Sure, go ahead. Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3006) Improve the error message when attempting manual failover with auto-failover enabled
Akira AJISAKA created YARN-3006: --- Summary: Improve the error message when attempting manual failover with auto-failover enabled Key: YARN-3006 URL: https://issues.apache.org/jira/browse/YARN-3006 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Minor When executing manual failover with automatic failover enabled, UnsupportedOperationException is thrown. {code} # yarn rmadmin -failover rm1 rm2 Exception in thread "main" java.lang.UnsupportedOperationException: RMHAServiceTarget doesn't have a corresponding ZKFC address at org.apache.hadoop.yarn.client.RMHAServiceTarget.getZKFCAddress(RMHAServiceTarget.java:51) at org.apache.hadoop.ha.HAServiceTarget.getZKFCProxy(HAServiceTarget.java:94) at org.apache.hadoop.ha.HAAdmin.gracefulFailoverThroughZKFCs(HAAdmin.java:311) at org.apache.hadoop.ha.HAAdmin.failover(HAAdmin.java:282) at org.apache.hadoop.ha.HAAdmin.runCmd(HAAdmin.java:449) at org.apache.hadoop.ha.HAAdmin.run(HAAdmin.java:378) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:482) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:622) {code} I think the above message is confusing to users. (Users may wonder whether ZKFC is configured correctly...) The command should output an error message to stderr instead of throwing an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
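A minimal sketch of the suggested behavior, with made-up names (this is not the actual RMAdminCLI/HAAdmin code): report the condition on stderr and return a non-zero exit code instead of letting the raw exception reach the console.
{code}
public final class FailoverErrorSketch {
  // 'gracefulFailover' stands in for whatever call currently throws
  // UnsupportedOperationException when no ZKFC address is available.
  static int runFailover(Runnable gracefulFailover) {
    try {
      gracefulFailover.run();
      return 0;
    } catch (UnsupportedOperationException e) {
      System.err.println("failover: automatic failover is enabled for this ResourceManager; "
          + "manual (graceful) failover is not supported in this mode (" + e.getMessage() + ")");
      return -1;
    }
  }
}
{code}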
[jira] [Commented] (YARN-1177) Support automatic failover using ZKFC
[ https://issues.apache.org/jira/browse/YARN-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264897#comment-14264897 ] Akira AJISAKA commented on YARN-1177: - Filed YARN-3006 to improve the error message. Support automatic failover using ZKFC - Key: YARN-1177 URL: https://issues.apache.org/jira/browse/YARN-1177 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Karthik Kambatla Attachments: yarn-1177-ancient-version.patch Prior to embedding leader election and failover controller in the RM (YARN-1029), it might be a good idea to use ZKFC for a first-cut automatic failover implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine some fs operations in FileSystemRMStateStore to improve performance
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264901#comment-14264901 ] Zhijie Shen commented on YARN-2996: --- bq. we can merge them to save one RPC call It sounds like a good idea. bq. we can reduce one rename operation. This will affect the mechanism where we use the *.new* file to recover the actual state file when recovering the RM. It needs to be taken care of, too. Perhaps we can't simply remove the logic that recovers the actual state file from the *.new* file, and I can think of a rare incompatibility issue. See the following procedure: 1. The old FS RMStateStore writes a state file and fails after the .new file is created. 2. RM stops. 3. RM is upgraded and so is the FS RMStateStore. 4. RM starts again. 5. The new FS RMStateStore will not recover the *.new* file, and may mistake it for a normal file. Refine some fs operations in FileSystemRMStateStore to improve performance -- Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-2996.001.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places that invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new"); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good either: it writes the file to _output\_file_.tmp, then renames it to _output\_file_.new, then renames it to _output\_file_; we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
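To make the upgrade concern concrete, here is a rough sketch of the kind of recovery-time check being discussed; the class and helper names are illustrative only, and in the real store it is {{checkAndResumeUpdateOperation}} that plays this role today.
{code}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class NewFileRecoverySketch {
  // A leftover "<name>.new" file marks an interrupted update; recovery should not
  // read it back as if it were an ordinary state file.
  static boolean isInterruptedUpdate(FileStatus status) {
    return status.getPath().getName().endsWith(".new");
  }

  // One possible resume step: promote the .new file to the real file name
  // (an alternative would be to discard it).
  static void resumeUpdate(FileSystem fs, FileStatus status) throws Exception {
    Path newPath = status.getPath();
    String name = newPath.getName();
    Path realPath = new Path(newPath.getParent(),
        name.substring(0, name.length() - ".new".length()));
    fs.rename(newPath, realPath);
  }
}
{code}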
[jira] [Updated] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2958: --- Attachment: YARN-2958.004.patch RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch, YARN-2958.004.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. This makes the next DT created after the RM restart conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number has been overwritten by the correct one. {code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store. It could be wrong. Fortunately, delegationTokenSequenceNumber is updated to the right number. {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} All the stored identifiers will be gone through, and delegationTokenSequenceNumber will be set to the largest sequence number among these identifiers. Therefore, a new DT will be assigned a sequence number that is always larger than that of all the recovered DTs. To sum up, two negatives make a positive, but it's good to fix the issue.
Please let me know if I've missed something here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2978) ResourceManager crashes with NPE while getting queue info
[ https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264912#comment-14264912 ] Jian He commented on YARN-2978: --- Looks good. I think we may just synchronize AbstractQueue#getQueueInfo, because the caller is synchronized anyway, or use the AbstractQueue#getUsedCapacity method so as to avoid adding usedCapacity to the exclusion list? ResourceManager crashes with NPE while getting queue info - Key: YARN-2978 URL: https://issues.apache.org/jira/browse/YARN-2978 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Varun Saxena Priority: Critical Labels: capacityscheduler, resourcemanager Attachments: YARN-2978.001.patch, YARN-2978.002.patch, YARN-2978.003.patch java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625) at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
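As a minimal illustration of the first option above (the class and fields below are simplified stand-ins, not the real capacity-scheduler queue classes): making {{getQueueInfo}} synchronized means fields such as usedCapacity are read under the same lock the writers already hold, so the reader sees a consistent snapshot.
{code}
public class QueueInfoSketch {
  private float usedCapacity;            // updated by other synchronized queue methods
  private final String queueName = "default";

  synchronized void setUsedCapacity(float used) {
    this.usedCapacity = used;
  }

  // Reading under the queue lock gives a consistent snapshot, which is what
  // synchronizing getQueueInfo would buy over excluding the field from checks.
  synchronized String getQueueInfo() {
    return queueName + ": usedCapacity=" + usedCapacity;
  }
}
{code}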
[jira] [Updated] (YARN-2978) ResourceManager crashes with NPE while getting queue info
[ https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2978: --- Attachment: YARN-2978.004.patch ResourceManager crashes with NPE while getting queue info - Key: YARN-2978 URL: https://issues.apache.org/jira/browse/YARN-2978 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Varun Saxena Priority: Critical Labels: capacityscheduler, resourcemanager Attachments: YARN-2978.001.patch, YARN-2978.002.patch, YARN-2978.003.patch, YARN-2978.004.patch java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625) at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264975#comment-14264975 ] Jian He commented on YARN-2997: --- bq. we may run into a situation where we report two different ContainerStatus with same ID I think this is not possible given that we are looping over {{this.context.getContainers()}}, which is a containerId-to-Container map. Or we can just use a list. bq. So in my opinion, that could be a potential leak. I see. Then we should send the pendingCompletedContainers in the getNMContainerStatuses method too. And {{pendingCompletedContainers.clear()}} should be put after {{if (response.getNodeAction() == NodeAction.RESYNC) }}, or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as to avoid the newly added method. NM keeps sending finished containers to RM until app is finished Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch We have seen a lot of the following in the RM log: {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by the NM sending completed containers repeatedly until the app is finished. On the RM side, the container has already been released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
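A small, self-contained sketch of the point about looping over {{this.context.getContainers()}} (plain strings stand in for the NM's ContainerId/ContainerStatus types here): a map keyed by container id can contribute at most one completed status per id, so duplicate ids cannot appear in the report.
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class CompletedContainersSketch {
  static List<String> completedStatuses(Map<String, String> containersById) {
    List<String> completed = new ArrayList<String>();
    for (Map.Entry<String, String> e : containersById.entrySet()) {
      if ("COMPLETE".equals(e.getValue())) {
        completed.add(e.getKey()); // one entry per container id by construction
      }
    }
    return completed;
  }

  public static void main(String[] args) {
    Map<String, String> containers = new HashMap<String, String>();
    containers.put("container_1420000000000_0001_01_000002", "COMPLETE");
    containers.put("container_1420000000000_0001_01_000003", "RUNNING");
    System.out.println(completedStatuses(containers)); // prints only the completed id
  }
}
{code}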
[jira] [Commented] (YARN-3000) YARN_PID_DIR should be visible in yarn-env.sh
[ https://issues.apache.org/jira/browse/YARN-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264709#comment-14264709 ] Allen Wittenauer commented on YARN-3000: It was part of HADOOP-9902, which is why it isn't in branch-2 at all, much less 2.6. YARN_PID_DIR should be visible in yarn-env.sh - Key: YARN-3000 URL: https://issues.apache.org/jira/browse/YARN-3000 Project: Hadoop YARN Issue Type: Bug Components: scripts Affects Versions: 2.6.0 Reporter: Jeff Zhang Assignee: Rohith Priority: Minor Attachments: 0001-YARN-3000.patch Currently YARN_PID_DIR only shows up in yarn-daemon.sh, which is not the place where users are supposed to set up environment variables. IMO, yarn-env.sh is the place for users to set up environment variables, just like hadoop-env.sh, so it's better to put YARN_PID_DIR into yarn-env.sh (it can be added as a comment, just like YARN_RESOURCEMANAGER_HEAPSIZE). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2978) ResourceManager crashes with NPE while getting queue info
[ https://issues.apache.org/jira/browse/YARN-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265074#comment-14265074 ] Hadoop QA commented on YARN-2978: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690124/YARN-2978.004.patch against trunk revision dfd2589. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6245//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6245//console This message is automatically generated. ResourceManager crashes with NPE while getting queue info - Key: YARN-2978 URL: https://issues.apache.org/jira/browse/YARN-2978 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Varun Saxena Priority: Critical Labels: capacityscheduler, resourcemanager Attachments: YARN-2978.001.patch, YARN-2978.002.patch, YARN-2978.003.patch, YARN-2978.004.patch java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto.isInitialized(YarnProtos.java:29625) at org.apache.hadoop.yarn.proto.YarnProtos$QueueInfoProto$Builder.build(YarnProtos.java:29939) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.mergeLocalToProto(QueueInfoPBImpl.java:290) at org.apache.hadoop.yarn.api.records.impl.pb.QueueInfoPBImpl.getProto(QueueInfoPBImpl.java:157) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.convertToProtoFormat(GetQueueInfoResponsePBImpl.java:128) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToBuilder(GetQueueInfoResponsePBImpl.java:104) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.mergeLocalToProto(GetQueueInfoResponsePBImpl.java:111) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetQueueInfoResponsePBImpl.getProto(GetQueueInfoResponsePBImpl.java:53) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:235) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3007) TestNMWebServices#testContainerLogs fails intermittently
Sangjin Lee created YARN-3007: - Summary: TestNMWebServices#testContainerLogs fails intermittently Key: YARN-3007 URL: https://issues.apache.org/jira/browse/YARN-3007 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Minor TestNMWebServices#testContainerLogs fails intermittently with JDK 7: {noformat} java.lang.AssertionError: Failed to create log dir at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices.testContainerLogs(TestNMWebServices.java:336) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2958) RMStateStore seems to unnecessarily and wrongly store sequence number separately
[ https://issues.apache.org/jira/browse/YARN-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265118#comment-14265118 ] Jian He commented on YARN-2958: --- lgtm, thanks [~varun_saxena] and [~zjshen]! RMStateStore seems to unnecessarily and wrongly store sequence number separately Key: YARN-2958 URL: https://issues.apache.org/jira/browse/YARN-2958 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Zhijie Shen Assignee: Varun Saxena Priority: Blocker Attachments: YARN-2958.001.patch, YARN-2958.002.patch, YARN-2958.003.patch, YARN-2958.004.patch It seems that RMStateStore updates the last sequence number when storing or updating each individual DT, in order to recover the latest sequence number when the RM restarts. First, the current logic seems to be problematic: {code} public synchronized void updateRMDelegationTokenAndSequenceNumber( RMDelegationTokenIdentifier rmDTIdentifier, Long renewDate, int latestSequenceNumber) { if(isFencedState()) { LOG.info("State store is in Fenced state. Can't update RM Delegation Token."); return; } try { updateRMDelegationTokenAndSequenceNumberInternal(rmDTIdentifier, renewDate, latestSequenceNumber); } catch (Exception e) { notifyStoreOperationFailed(e); } } {code} {code} @Override protected void updateStoredToken(RMDelegationTokenIdentifier id, long renewDate) { try { LOG.info("updating RMDelegation token with sequence number: " + id.getSequenceNumber()); rmContext.getStateStore().updateRMDelegationTokenAndSequenceNumber(id, renewDate, id.getSequenceNumber()); } catch (Exception e) { LOG.error("Error in updating persisted RMDelegationToken with sequence number: " + id.getSequenceNumber()); ExitUtil.terminate(1, e); } } {code} According to the code above, even when renewing a DT, the last sequence number is updated in the store, which is wrong. For example, we have the following sequence: 1. Get DT 1 (seq = 1) 2. Get DT 2 (seq = 2) 3. Renew DT 1 (seq = 1) 4. Restart RM The stored and then recovered last sequence number is 1. This makes the next DT created after the RM restart conflict with DT 2 on sequence number. Second, the aforementioned bug doesn't actually happen, because the recovered last sequence number has been overwritten by the correct one. {code} public void recover(RMState rmState) throws Exception { LOG.info("recovering RMDelegationTokenSecretManager."); // recover RMDTMasterKeys for (DelegationKey dtKey : rmState.getRMDTSecretManagerState() .getMasterKeyState()) { addKey(dtKey); } // recover RMDelegationTokens Map<RMDelegationTokenIdentifier, Long> rmDelegationTokens = rmState.getRMDTSecretManagerState().getTokenState(); this.delegationTokenSequenceNumber = rmState.getRMDTSecretManagerState().getDTSequenceNumber(); for (Map.Entry<RMDelegationTokenIdentifier, Long> entry : rmDelegationTokens .entrySet()) { addPersistedDelegationToken(entry.getKey(), entry.getValue()); } } {code} The code above recovers delegationTokenSequenceNumber by reading the last sequence number in the store. It could be wrong. Fortunately, delegationTokenSequenceNumber is updated to the right number. {code} if (identifier.getSequenceNumber() > getDelegationTokenSeqNum()) { setDelegationTokenSeqNum(identifier.getSequenceNumber()); } {code} All the stored identifiers will be gone through, and delegationTokenSequenceNumber will be set to the largest sequence number among these identifiers. Therefore, a new DT will be assigned a sequence number that is always larger than that of all the recovered DTs.
To sum up, two negatives make a positive, but it's good to fix the issue. Please let me know if I've missed something here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3007) TestNMWebServices#testContainerLogs fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-3007. --- Resolution: Invalid This issue is not reproducible in 2.7.0 or in trunk. Closing. TestNMWebServices#testContainerLogs fails intermittently Key: YARN-3007 URL: https://issues.apache.org/jira/browse/YARN-3007 Project: Hadoop YARN Issue Type: Bug Components: test Affects Versions: 2.4.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Minor TestNMWebServices#testContainerLogs fails intermittently with JDK 7: {noformat} java.lang.AssertionError: Failed to create log dir at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.assertTrue(Assert.java:43) at org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices.testContainerLogs(TestNMWebServices.java:336) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)