[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170587#comment-14170587 ] Rohith commented on YARN-2588: -- Hi [~jianhe], could you please review the patch whenever you get time? Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2588.patch Consider a scenario where the standby RM fails to transition to Active because of a ZK exception (ConnectionLoss or SessionExpired). Then any further transitionToActive for the same RM does not move the RM to the Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2682: Attachment: YARN-2682.000.patch WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170629#comment-14170629 ] Hadoop QA commented on YARN-2682: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674709/YARN-2682.000.patch against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5384//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5384//console This message is automatically generated. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-570: --- Attachment: YARN-570.4.patch Thanks [~rchiang] for trying the patch. Updated the patch to make the format uniform: EEE MMM dd HH:mm:ss Z. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT. If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
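For illustration only, a minimal Java sketch of the uniform pattern mentioned above (EEE MMM dd HH:mm:ss Z) with the timezone pinned so every page renders the same string; this is not the code from YARN-570.4.patch, just the idea:
{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeFormatDemo {
  public static void main(String[] args) {
    long ts = 1365582596000L; // Wed, 10 Apr 2013 08:29:56 GMT, from the description

    // Pattern from the comment above; the timezone is pinned so the output does
    // not depend on the server's default zone.
    SimpleDateFormat fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss Z");
    fmt.setTimeZone(TimeZone.getTimeZone("GMT"));

    System.out.println(fmt.format(new Date(ts))); // e.g. "Wed Apr 10 08:29:56 +0000"
  }
}
{code}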
[jira] [Created] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
Beckham007 created YARN-2686: Summary: CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate the resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container tasks file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
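As a rough illustration of the problem described above (not YARN code): splitting a comma-joined option cuts short a cgroup path whose mount directory itself contains a comma. The container id and the '%' separator are only placeholders for illustration, not what YARN actually uses.
{code}
import java.util.Arrays;

public class CgroupPathSplitDemo {
  public static void main(String[] args) {
    // On Redhat/CentOS 7 the cpu controller is mounted under a directory whose
    // name itself contains a comma:
    String cpuTasks = "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_01/tasks";
    String memTasks = "/sys/fs/cgroup/memory/hadoop-yarn/container_01/tasks";

    // Joining multiple paths with "," and splitting them back cuts the first
    // path short at the comma inside "cpu,cpuacct".
    String commaJoined = cpuTasks + "," + memTasks;
    System.out.println(Arrays.toString(commaJoined.split(",")));
    // -> [/sys/fs/cgroup/cpu, cpuacct/hadoop-yarn/container_01/tasks, /sys/fs/cgroup/memory/...]

    // A separator that cannot appear in a mount path round-trips correctly.
    String percentJoined = cpuTasks + "%" + memTasks;
    System.out.println(Arrays.toString(percentJoined.split("%")));
  }
}
{code}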
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170667#comment-14170667 ] Hadoop QA commented on YARN-570: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674718/YARN-570.4.patch against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5385//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5385//console This message is automatically generated. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch Time strings on different page are displayed in different timezone. If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56 Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-90: -- Attachment: apache-yarn-90.9.patch Uploaded a new patch to address comments by [~mingma] and [~zxu]. bq. Nit: For Set<String> postCheckFullDirs = new HashSet<String>(fullDirs);. It doesn't have to create postCheckFullDirs. It can directly refer to fullDirs later. It was just to ease lookups - instead of searching through a list, look up a set. If you feel strongly about it, I can change it. {quote} can change if (!postCheckFullDirs.contains(dir) && postCheckOtherDirs.contains(dir)) { to if (postCheckOtherDirs.contains(dir)) { {quote} Fixed. {quote} change if (!postCheckOtherDirs.contains(dir) && postCheckFullDirs.contains(dir)) { to if (postCheckFullDirs.contains(dir)) { {quote} Fixed. {quote} 3. in verifyDirUsingMkdir: Can we add an int variable to the file name to avoid looping forever (although it is a very small chance), like the following? long i = 0L; while (target.exists()) \{ randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++; target = new File(dir, randomDirName); } {quote} Fixed. {quote} 4. in disksTurnedBad: Can we add a break in the loop when disksFailed is true so we exit the loop earlier? if (!preCheckDirs.contains(dir)) \{ disksFailed = true; break; } {quote} Fixed. {quote} 5. in disksTurnedGood, same as item 4: Can we add a break in the loop when disksTurnedGood is true? {quote} Fixed. {quote} In function verifyDirUsingMkdir, target.exists(), target.mkdir() and FileUtils.deleteQuietly(target) are not atomic. What happens if another thread tries to create the same directory (target)? {quote} verifyDirUsingMkdir is called by testDirs, which is called by checkDirs(), which is synchronized. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs a restart. This JIRA is to improve NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
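For readers following the verifyDirUsingMkdir discussion above, a small self-contained sketch of the counter-based name-picking loop suggested in the review; it assumes commons-lang and commons-io on the classpath (as in the snippet quoted above) and approximates, rather than reproduces, the actual patch:
{code}
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.RandomStringUtils;

public class DirCheck {
  /**
   * Probe a directory by creating and deleting a uniquely named child.
   * The counter suffix guarantees the name-picking loop terminates even if
   * randomAlphanumeric keeps colliding with existing entries.
   */
  static void verifyDirUsingMkdir(File dir) throws IOException {
    String randomDirName = RandomStringUtils.randomAlphanumeric(5);
    File target = new File(dir, randomDirName);
    long i = 0L;
    while (target.exists()) {
      randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++;
      target = new File(dir, randomDirName);
    }
    try {
      if (!target.mkdir()) {
        throw new IOException("Cannot create directory " + target);
      }
    } finally {
      FileUtils.deleteQuietly(target);
    }
  }
}
{code}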
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170711#comment-14170711 ] Hadoop QA commented on YARN-90: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674722/apache-yarn-90.9.patch against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5386//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5386//console This message is automatically generated. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170716#comment-14170716 ] Remus Rusanu commented on YARN-2682: WSCE should behave the same as DCE. If getFirstApplicationDir() was removed from DCE and getApplicationDir() is used instead, then WSCE should also use getApplicationDir(), w/o a need to define getFirstApplicationDir() in WSCE. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170809#comment-14170809 ] Hudson commented on YARN-2651: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/CHANGES.txt Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
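A hedged sketch of the decoupling described in the YARN-2651 summary above: a per-NM timer wakes up on its own schedule and uploads whatever logs have rolled, independent of any per-app interval. The class and method names here are hypothetical illustrations, not YARN's actual AppLogAggregatorImpl.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RollingLogUploader {
  // Hypothetical per-NM setting standing in for the NM-wide configuration described above.
  private final long wakeUpIntervalSecs;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public RollingLogUploader(long wakeUpIntervalSecs) {
    this.wakeUpIntervalSecs = wakeUpIntervalSecs;
  }

  /** Wake up on a fixed, NM-wide schedule and upload whatever logs have rolled. */
  public void start() {
    scheduler.scheduleWithFixedDelay(
        this::uploadRolledLogs, wakeUpIntervalSecs, wakeUpIntervalSecs, TimeUnit.SECONDS);
  }

  private void uploadRolledLogs() {
    // Placeholder: scan per-app log dirs and aggregate files older than the last upload.
    System.out.println("uploading rolled log files");
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}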
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170808#comment-14170808 ] Hudson commented on YARN-2641: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170818#comment-14170818 ] Hudson commented on YARN-2566: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170816#comment-14170816 ] Hudson commented on YARN-2377: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/CHANGES.txt Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
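To illustrate the difference the YARN-2377 description points at (generic Java, not the actual SerializedException change): passing only getMessage() as diagnostics loses everything but "ha-nn-uri-0", while stringifying the throwable preserves the full stack trace.
{code}
import java.io.PrintWriter;
import java.io.StringWriter;

public class DiagnosticsDemo {
  /** Turn a throwable into a diagnostics string that keeps the full stack trace. */
  static String stringify(Throwable t) {
    StringWriter sw = new StringWriter();
    t.printStackTrace(new PrintWriter(sw));
    return sw.toString();
  }

  public static void main(String[] args) {
    Exception cause = new java.net.UnknownHostException("ha-nn-uri-0");
    // cause.getMessage() would yield only "ha-nn-uri-0";
    // stringify(cause) keeps the trace for the diagnostic info.
    System.out.println(stringify(cause));
  }
}
{code}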
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170812#comment-14170812 ] Hudson commented on YARN-2308: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Fix For: 2.6.0 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
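A minimal sketch of the kind of guard the commit message above refers to (explicitly failing instead of hitting an NPE when a recovered application's queue is gone); names are hypothetical and the real CapacityScheduler change may differ:
{code}
import java.util.Map;

public class QueueLookupDemo {
  static class Queue { }

  /**
   * Fail with a descriptive error instead of an NPE when an application being
   * recovered refers to a queue that no longer exists in the new configuration.
   */
  static Queue getQueueOrFail(Map<String, Queue> queues, String queueName, String appId) {
    Queue queue = queues.get(queueName);
    if (queue == null) {
      throw new IllegalStateException("Queue " + queueName
          + " referenced by recovered application " + appId
          + " does not exist in the current scheduler configuration");
    }
    return queue;
  }
}
{code}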
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170860#comment-14170860 ] Sunil G commented on YARN-2495: --- Hi all. I have a couple of doubts here. 1. In a distributed configuration, each NM can specify labels in register/heartbeat (update). I am not sure whether the check for a valid label should happen in the RM or the NM. As per the current design, it looks like all validity checks happen at the RM. If a node label is invalid as per the RM, how will this be reported back to the NM? Error handling? 2. If labels can be changed at run time from the NM, I think the same existing interfaces (heartbeat) are used. Do you feel this check may happen more frequently in the RM than in a centralized configuration? In a centralized config, some command will be fired by the admin to change labels, which may not be frequent. But imagine a 1000-node cluster with labels changing per heartbeat: will this be a bottleneck? Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
[ https://issues.apache.org/jira/browse/YARN-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170898#comment-14170898 ] Wei Yan commented on YARN-2686: --- According to the Redhat 7 documentation, it is not recommended to use libcgroup for cpu isolation, so YARN-2194 is working on a systemd-based solution. Will update a patch for that jira soon. CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 - Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate the resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container tasks file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2552) Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only
[ https://issues.apache.org/jira/browse/YARN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170907#comment-14170907 ] Remus Rusanu commented on YARN-2552: Copying here the patch's apt.vm update: 'yarn.nodemanager.windows-secure-container-executor.local-dirs' should contain the nodemanager local dirs. hadoopwinutilsvc will allow only file operations under these directories. This should contain the same values as '${yarn.nodemanager.local-dirs}, ${yarn.nodemanager.log-dirs}', but note that hadoopwinutilsvc XML configuration processing does not do substitutions, so the value must be the final value. All paths must be absolute and no environment variable substitution will be performed. The paths are compared using a LOCAL_INVARIANT case-insensitive string comparison; the file path being validated must start with one of the paths listed in the local-dirs configuration. Use comma as the path separator. Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only - Key: YARN-2552 URL: https://issues.apache.org/jira/browse/YARN-2552 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2552.1.patch YARN-2458 added file manipulation operations executed in an elevated context by hadoopwinutilsvc. W/o any constraint, the NM (or a hijacker that takes over the NM) can manipulate arbitrary OS files under the highest possible privileges, an easy elevation attack vector. The service should only allow operations on files/directories that are under the configured NM localdirs. It should read this value from wsce-site.xml, as yarn-site.xml cannot be trusted, being writable by Hadoop admins (YARN-2551 ensures wsce-site.xml is only writable by system Administrators, not Hadoop admins). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
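A small Java sketch of the validation rule described above: the configuration value is a comma-separated list of absolute local dirs, and a path is accepted only if it starts with one of them, compared case-insensitively (Locale.ROOT lowercasing stands in for the locale-invariant comparison). This is illustrative only, not the hadoopwinutilsvc implementation.
{code}
import java.util.Locale;

public class LocalDirCheckDemo {
  /**
   * Accept a path only if it starts with one of the configured local dirs.
   * The comparison is case-insensitive, mirroring the description above.
   */
  static boolean isUnderLocalDirs(String localDirsConf, String path) {
    String candidate = path.toLowerCase(Locale.ROOT);
    for (String dir : localDirsConf.split(",")) {
      String prefix = dir.trim().toLowerCase(Locale.ROOT);
      if (!prefix.isEmpty() && candidate.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String conf = "D:\\hadoop\\local,D:\\hadoop\\logs";
    System.out.println(isUnderLocalDirs(conf, "D:\\Hadoop\\local\\usercache\\foo")); // true
    System.out.println(isUnderLocalDirs(conf, "C:\\Windows\\system32"));             // false
  }
}
{code}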
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201410141330.txt I'm sorry. The previous patch was bad. This one compiles cleanly. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170953#comment-14170953 ] Hudson commented on YARN-2667: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2667. Fix the release audit warning caused by hadoop-yarn-registry. Contributed by Yi Liu (jlowe: rev 344a10ad5e26c25abd62eda65eec2820bb808a74) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/pom.xml Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170955#comment-14170955 ] Hudson commented on YARN-2651: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170962#comment-14170962 ] Hudson commented on YARN-2377: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170958#comment-14170958 ] Hudson commented on YARN-2308: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Fix For: 2.6.0 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170964#comment-14170964 ] Hudson commented on YARN-2566: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170954#comment-14170954 ] Hudson commented on YARN-2641: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-2314: Assignee: Jason Lowe bq. Basically the cache doesn't have more functionalities other than just cache the connection. It doesn't even do that, because if we cache the connection to the NM then we leak threads. When a cache entry is purged the RPC Client thread (tied to the NM socket connection) can linger because the RPC layer doesn't provide a way to force a connection to be closed due to protocol refcounting. We need to set the RPC idle timeout to 0 as a workaround to force the connections to close so we don't leak threads. Therefore all the cache is doing is caching the proxy objects with no connection behind them. Those objects will reconnect to the NM each time we make a call. Not sure saving the proxy objects themselves is worth it -- would be interesting to prove this cache helps in a meaningful way before we assume we need it. But I can update the patch to provide a config property to keep it anyway, hope to have that up later today. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
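A hedged sketch of the workaround mentioned in the comment above, assuming the standard ipc.client.connection.maxidletime setting is the RPC idle timeout being referred to; the method name is purely illustrative:
{code}
import org.apache.hadoop.conf.Configuration;

public class NMProxyConfDemo {
  /**
   * Workaround sketched above: force RPC connections to close as soon as they
   * go idle so cached NM proxies do not pin client threads.
   * Assumes ipc.client.connection.maxidletime is the idle timeout meant here.
   */
  static Configuration withZeroIdleTimeout(Configuration base) {
    Configuration conf = new Configuration(base);
    conf.setInt("ipc.client.connection.maxidletime", 0);
    return conf;
  }
}
{code}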
[jira] [Created] (YARN-2687) WindowsSecureContainerExecutor hadoopwinutilsvc is difficult to troubleshoot
Remus Rusanu created YARN-2687: -- Summary: WindowsSecureContainerExecutor hadoopwinutilsvc is difficult to troubleshoot Key: YARN-2687 URL: https://issues.apache.org/jira/browse/YARN-2687 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu The hadoopwinutilsvc logs using the NT service logging infrastructure (ie. Event Viewer). Ideally it should log within the Hadoop logging expected location/format, and be configured via the same parameters. As native C++ code it cannot leverage directly the log4j (and log4c++ is rather different config etc). I'm thinking that the hadoopwinutilsvc could establish a communication channel with NM itself and log via the NM. We already have the infrastructure in place (RPC, IDL etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171018#comment-14171018 ] Hadoop QA commented on YARN-2056: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674768/YARN-2056.201410141330.txt against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5387//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5387//console This message is automatically generated. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171027#comment-14171027 ] Naganarasimha G R commented on YARN-2495: - Thanks [~aw], [~wangda] and [~sunilg]. *For [~aw] comments:* bq. I don't fully understand your question, but I'll admit I've been distracted with other JIRAs lately. You got the first part of my question right, and thanks for detailing the scenario. bq. If we are rolling out a new version of the JDK, I shouldn't have to tell the system that it's ok to broadcast that JDK version first. I understood the use case, but what I did not understand is how this would restrict/deter a user, since they can simply do one more update and add one more label to the central valid-label list, such as a java version or jdk version. As the script will anyway be written/updated to return a specific set of labels, I feel that in most cases the admin can know which labels will be coming in the cluster. Is there any other use case where it would be difficult for the admin to list the labels beforehand? *For [~wangda] comments:* bq. they will be reported to RM when NM registration. We may not need to persist any of them, but RM should know these labels existence to do scheduling. Does the RM need the full list of valid labels even before all nodes have registered? How will it impact scheduling, and how is it different from the centralized configuration? In the centralized config the user needs to add the new labels and then send the node-to-labels mapping; similarly, in the distributed config we can first discover the new labels, update the superset list of labels, and then update the label mapping for any node that wants to add or modify labels. bq. Another question is if we need check labels when they registering, I prefer to pre-set them because this affects scheduling behavior. For example, the maximum-resource and minimum-resource are setup in RM side, and RackResolver is also run in RM side Maybe I did not get this correctly. Do you mean that when the NM registers for the first time after startup, you want labels to be preset apart from what is read from the NM's yarn-site.xml/script? I did not get this clearly; please elaborate. bq. At least, the label checking should be kept configurable in distributed mode. – just ignore all the labels for that node if invalid labels exists might be a good way when it enabled. In your earlier statement you said it affects scheduling; if so, and if the check is kept configurable, how will that solve the problem? But what was clear was * Support to add and remove valid labels at the centralized level is required * RM will do the label validation on NM registration/heartbeat * If, while validating (during NM registration/heartbeat), one of the labels fails for a given node, then we will just ignore all the labels for that node. *For [~sunilg] comments:* bq. If any such node label is invalid as per RM, then how this will be reported back to NM? Error Handling? I have the same doubt, and feel that usability will suffer if the script is executed in one place, the validation happens in another, and the error is not propagated back to the NM. bq. But imagine a 1000 node cluster, and then with changing labels per heartbeat, will this be a bottleneck? We will not be changing labels on every heartbeat; I will try to ensure that, during a heartbeat, the NM sends the updated label set only if the labels have changed from the node's previous set.
But there will still be a contention issue: suppose a script is modified and all 2000 nodes want to update their labels at the same time. Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
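A minimal sketch of the "report only on change" idea described in the comment above (illustrative only, not the attached work; the class and method names are invented): the NM remembers the last label set it reported and includes labels in a heartbeat only when the current set differs.
{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the NM status updater: remember the last label set
// that was reported and include labels in a heartbeat only when they change.
class NodeLabelReporter {
  private Set<String> lastReported = Collections.emptySet();

  /** Returns the labels to send in this heartbeat, or null if unchanged. */
  Set<String> labelsToReport(Set<String> current) {
    if (current.equals(lastReported)) {
      return null;
    }
    lastReported = new HashSet<>(current);
    return lastReported;
  }
}
{code}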
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171042#comment-14171042 ] Hudson commented on YARN-2377: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/CHANGES.txt Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only the {{java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
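The intent here is that the NM put the whole serialized exception, not just its terminal message, into the container diagnostics. A rough sketch of that idea (illustrative only, not the committed change; the helper class is invented):
{code}
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper: build a diagnostics string that carries the full stack
// trace of a localization failure instead of only throwable.getMessage().
class LocalizationDiagnostics {
  static String fullStackTrace(Throwable throwable) {
    StringWriter out = new StringWriter();
    throwable.printStackTrace(new PrintWriter(out, true));
    return out.toString();
  }
}
{code}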
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171038#comment-14171038 ] Hudson commented on YARN-2308: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Fix For: 2.6.0 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
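The commit message above hints at the shape of the fix: make the CapacityScheduler fail the recovered application with an explicit error when its queue no longer exists, rather than dereferencing a null queue. A minimal sketch of such a guard (illustrative only; the class and message are invented, not the committed code):
{code}
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

// Hypothetical guard for the recovery path: fail loudly with a clear message
// when an application's queue was removed from capacity-scheduler.xml,
// instead of hitting a NullPointerException later.
class QueueExistenceCheck {
  static void checkQueueExists(Object queue, String queueName, String appId) {
    if (queue == null) {
      throw new YarnRuntimeException("Application " + appId
          + " was submitted to queue '" + queueName
          + "', which no longer exists in the scheduler configuration");
    }
  }
}
{code}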
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171034#comment-14171034 ] Hudson commented on YARN-2641: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch Improve node decommission latency in the RM. Currently the node decommission only happens after the RM receives a nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of in nodeHeartbeat (ResourceTrackerService). A more serious case: after the RM is refreshed (refreshNodes), if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode will never be decommissioned in the RM; it will only expire after yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
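A rough sketch of the behaviour the change description implies (a code fragment for illustration only, assuming the usual RM event plumbing; the isExcluded() helper is invented): when -refreshNodes is processed, immediately send a DECOMMISSION event for every running node that is now excluded, instead of waiting for that node's next heartbeat.
{code}
// Illustrative fragment, not the committed patch: decommission newly excluded
// nodes as part of handling -refreshNodes.
for (RMNode node : rmContext.getRMNodes().values()) {
  if (isExcluded(node.getHostName())) {   // isExcluded(): invented helper
    rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeEvent(node.getNodeID(), RMNodeEventType.DECOMMISSION));
  }
}
{code}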
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171035#comment-14171035 ] Hudson commented on YARN-2651: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
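The description boils down to a single NM-wide timer driving uploads of already-rolled log files. A minimal sketch under that reading (illustrative only; the class is invented and is not the actual AppLogAggregatorImpl code):
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical per-NM rolling log uploader: one node-wide interval, decoupled
// from any per-application setting, periodically uploads old log files.
class RollingLogUploader {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long intervalSeconds, Runnable uploadOldLogFiles) {
    scheduler.scheduleAtFixedRate(
        uploadOldLogFiles, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdown();
  }
}
{code}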
[jira] [Commented] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171033#comment-14171033 ] Hudson commented on YARN-2667: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2667. Fix the release audit warning caused by hadoop-yarn-registry. Contributed by Yi Liu (jlowe: rev 344a10ad5e26c25abd62eda65eec2820bb808a74) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/pom.xml Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171044#comment-14171044 ] Hudson commented on YARN-2566: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir because it does not have enough disk space, localization fails even though there is plenty of disk space in the other localDirs.
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
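The title states the direction of the fix: stop hard-coding the first local directory. A toy sketch of the simplest form of that idea (illustrative only; the committed patch is more involved): pick one of the configured local dirs at random.
{code}
import java.util.List;
import java.util.Random;

// Toy illustration: choose one of the configured local directories at random
// instead of always using localDirs.get(0).
class WorkingDirPicker {
  private final Random random = new Random();

  String pick(List<String> localDirs) {
    if (localDirs.isEmpty()) {
      throw new IllegalStateException("No local directories configured");
    }
    return localDirs.get(random.nextInt(localDirs.size()));
  }
}
{code}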
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171325#comment-14171325 ] Wangda Tan commented on YARN-2314: -- Hi [~jlowe], Thanks for your comment. I also agree that caching the proxy object itself may not be necessary. The behavior in my mind should be that the admin can configure whether the container-management proxy cache is disabled. - If it is disabled, IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY will be set to 0 and all the cache logic will be disabled, as you have done in your patch. - If it is enabled, we should keep the existing behavior (or improve the LRU cache as in the other patch in this JIRA); basically, it's better to keep it. I'm a little doubt about if there is any other potential bug if we completely remove it. Thanks, ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171332#comment-14171332 ] Steve Loughran commented on YARN-2689: -- Example of one failing test. {code} Running org.apache.hadoop.registry.secure.TestSecureRegistry Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 10.562 sec FAILURE! - in org.apache.hadoop.registry.secure.TestSecureRegistry testZookeeperCanWrite(org.apache.hadoop.registry.secure.TestSecureRegistry) Time elapsed: 0.344 sec ERROR! org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.zookeeper.server.ServerCnxnFactory.configureSaslLogin(ServerCnxnFactory.java:207) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:87) at org.apache.hadoop.registry.server.services.MicroZookeeperService.serviceStart(MicroZookeeperService.java:237) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.registry.secure.AbstractSecureRegistryTest.startSecureZK(AbstractSecureRegistryTest.java:352) at org.apache.hadoop.registry.secure.TestSecureRegistry.testZookeeperCanWrite(TestSecureRegistry.java:83) {code} Searching for the string "Unable to obtain password from user" suggests that a common cause is a misspelled principal name, so the lookup against whatever Kerberos state exists fails. TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171336#comment-14171336 ] Jian He commented on YARN-2588: --- Rohith, thanks for the patch. I have a couple of comments: - the stopActiveServices() call may not be necessary? The AbstractService class internally should call stop if any exception occurs. {code} } catch (Exception e) { stopActiveServices(); {code} - maybe we can invoke the following right before each time we start the active services? {code} createAndInitActiveServices(); {code} - fix the following code comment {code} // @Test(timeout = 3) {code} Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2588.patch Consider scenario where, StandBy RM is failed to transition to Active because of ZK exception(connectionLoss or SessionExpired). Then any further transition to Active for same RM does not move RM to Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171340#comment-14171340 ] Li Lu commented on YARN-2314: - Hi [~wangda], maybe we want to leave a note in the config, saying that enabling the RPC cache may cause problems for large cluster (so that people would know the possible side-effect of enabling this)? ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171343#comment-14171343 ] Steve Loughran commented on YARN-2689: -- output {code} 2014-10-14 11:31:50,756 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: classTeardown entered state INITED 2014-10-14 11:31:50,772 [main] DEBUG service.CompositeService (CompositeService.java:serviceInit(104)) - classTeardown: initing services, size=0 2014-10-14 11:31:50,772 [main] DEBUG service.CompositeService (CompositeService.java:serviceStart(115)) - classTeardown: starting services, size=0 2014-10-14 11:31:50,772 [main] DEBUG service.AbstractService (AbstractService.java:start(197)) - Service classTeardown is started 2014-10-14 11:31:50,772 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: registrySecurity entered state INITED 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(259)) - Configuration: 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(260)) - --- 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - debug: true 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - transport: TCP 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.ticket.lifetime: 8640 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.name: EXAMPLE 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.port: 0 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.domain: COM 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.renewable.lifetime: 60480 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - instance: DefaultKrbServer 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.bind.address: localhost 2014-10-14 11:31:50,803 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(264)) - --- 2014-10-14 11:31:58,772 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(480)) - MiniKdc listening at port: 49764 2014-10-14 11:31:58,772 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(481)) - MiniKdc setting JVM krb5.conf to: C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\1413311510788\krb5.conf 2014-10-14 11:31:59,334 [main] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:setupKDCAndPrincipals(218)) - zookeeper { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab principal=zookeeper useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; ZOOKEEPER_SERVER { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab principal=zookeeper/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; alice { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\alice.keytab principal=alice/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; bob { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\bob.keytab 
principal=bob/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; 2014-10-14 11:31:59,479 [JUnit] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:login(325)) - Logging in as zookeeper/localhost in context ZOOKEEPER_SERVER with keytab target\kdc\zookeeper.keytab Debug is true storeKey true useTicketCache true useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab refreshKrb5Config is true principal is zookeeper/localhost tryFirstPass is false useFirstPass is false storePass is false clearPass is false Refreshing Kerberos configuration Acquire TGT from Cache Principal is zookeeper/localh...@example.com null credentials from Ticket Cache principal is zookeeper/localh...@example.com Will use keytab Commit Succeeded 2014-10-14 11:31:59,693 [JUnit] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: test-testZookeeperCanWrite entered state INITED 2014-10-14 11:31:59,693 [JUnit] INFO secure.AbstractSecureRegistryTest
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171348#comment-14171348 ] Wangda Tan commented on YARN-2314: -- [~gtCarrera9], Agreed, and disabled should be the default behavior. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171354#comment-14171354 ] Zhijie Shen commented on YARN-2656: --- Kick the jenkins again, as HADOOP-11181 is already committed. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2314: - Attachment: YARN-2314.patch Attaching a patch that allows the existing yarn.client.max-nodemanagers-proxies to be zero to indicate the proxy cache is disabled. Also per Wangda's comment the default is 0 (i.e.: cache is disabled). If disabled it sets the idle timeout to zero, otherwise it leaves it untouched and caches the proxy objects. The comment for the property was updated to also mention the issue with lingering connection threads and the potential for the cache to cause problems on large clusters. This patch also includes my earlier prototype fix to keep the cache from accidentally increasing in size if connections are busy. bq. I'm a little doubt about if there is any other potential bug if we completely remove it. I'm on the other side of that fence, since we ran for a long time on Hadoop 0.23 without this cache and did not see issues. We've already found two issues with the cache (grows above the specified size and accumulates lingering connection threads), and I have yet to see evidence it is needed. If anything there's some evidence to the contrary from us and Sangjin. But in case someone running on a smaller cluster really is depending upon this cache for some use case, the patch tries to let large clusters work yet small cluster users can turn on this cache. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
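A minimal sketch of the configuration behaviour described above (illustrative only, not the attached patch; it uses the property strings named in this thread): treat a value of 0 for yarn.client.max-nodemanagers-proxies as "cache disabled" and, in that case, force the IPC connection idle timeout to 0 so NM connections are torn down promptly after use.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustration only: when the NM proxy cache is disabled (size 0), also drop
// the IPC connection idle timeout to zero.
class ProxyCacheSetupSketch {
  static void configure(Configuration conf) {
    int maxProxies = conf.getInt("yarn.client.max-nodemanagers-proxies", 0);
    if (maxProxies == 0) {
      conf.setInt("ipc.client.connection.maxidletime", 0);
    }
  }
}
{code}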
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171408#comment-14171408 ] Bikas Saha commented on YARN-2314: -- Folks, this is something that would be of interest in Tez since it uses the ContainerManagementProtocolProxy. My summary understanding is that the default is to turn this proxy cache off, and this improves things for large-scale clusters. So when Tez moves to 2.6, it will automatically pick up the defaults (which turn caching off) and benefit on large clusters. Is that correct? ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171420#comment-14171420 ] Jason Lowe commented on YARN-2314: -- Yes, the patch sets the default to off since that allows all cluster sizes to work. If it's crucial to default to enabled for small clusters then those with large clusters will have to manually configure the cache off. Again I have yet to see evidence this cache is necessary, so defaulting to something that doesn't fail for all cluster sizes seemed like a better choice than one which would work for some but not others. If you have evidence where Tez absolutely has to have this cache enabled that would be good to share. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2688) Better diagnostics on Container Launch failures
[ https://issues.apache.org/jira/browse/YARN-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171426#comment-14171426 ] Gera Shegalov commented on YARN-2688: - Localizer diagnostics was improved by YARN-2377. Better diagnostics on Container Launch failures --- Key: YARN-2688 URL: https://issues.apache.org/jira/browse/YARN-2688 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy We need better diagnostics on container launch failures due to errors like localizations issues, wrong command for container launch etc. Currently, if the container doesn't launch, we get nothing - not even container logs since there are no logs to aggregate either. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171429#comment-14171429 ] Bikas Saha commented on YARN-2314: -- To be clear, my question was only to clarify if Tez would get the benefits without doing anything because the defaults are correct. Looks like that is the case. Thanks! ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171443#comment-14171443 ] Jason Lowe commented on YARN-2314: -- So Tez will automatically benefit on large clusters because the default is to not use the cache. However if we've found empirically that Tez needs the proxy cache to perform well then this patch would be a performance hit for Tez by default on clusters where the cache issues weren't a problem. I wasn't sure which default benefit you were referring to above (running faster because cache is enabled or working on a large cluster because cache is disabled). If Tez shows significant improvements with this cache turned on then I could see an argument to have the cache on by default since small clusters are common and large clusters are rare. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171456#comment-14171456 ] Hadoop QA commented on YARN-2656: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674303/YARN-2656.4.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.ha.TestZKFailoverControllerStress {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5388//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5388//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5388//console This message is automatically generated. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171460#comment-14171460 ] Vinod Kumar Vavilapalli commented on YARN-2496: --- More comments ParentQueue - assignToQueue(): We are only checking that at least one label is within maximum capacity. Bug? - assignToQueue() - canAssignToThisQueue - Not related to your patch, but removeApplication() can be private. Similarly assignContainersToChildQueues, printChildQueues. - Can avoid multiple calls to labelManager.getLabelsOnNode(node.getNodeID()) inside assignContainers. - getACLs() should be pushed up to AbstractQueue. - I think reservations are still not handled per accessible node-labels in the patch. We can fix it separately though. - Sorting queues doesn't take node-labels into account. Again, we can fix it separately. - Explicitly mark calls to allocateResource() and releaseResource with super for better readability. - We should change printChildQueues() and getChildQueuesToPrint() to print node-label associations too. - The following check in LeafQueue needs to be present in ParentQueue too? {code} // if our queue cannot access this node, just return if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels, labelManager.getLabelsOnNode(node.getNodeID( { return NULL_ASSIGNMENT; } {code} AbstractQueue - queueComparator should be pushed down to ParentQueue. - releaseResource() should be protected More to come. Changes for capacity scheduler to support allocate resource respect labels -- Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496-20141009-1.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171459#comment-14171459 ] Bikas Saha commented on YARN-2314: -- My understanding from the comments was that in most cases this cache was adding overhead without benefit since the RPC layer was not controlled by the cache. We have no empirical evidence either way about the performance. If you know of cases where this change of default might cause issues, then it would be helpful if they were enumerated in a comment. Then Tez/other users could test for those cases when they upgrade to 2.6 and make their own choices. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171470#comment-14171470 ] Jason Lowe commented on YARN-2312: -- Sorry for the late reply. +1 lgtm as well. I noticed the patch doesn't apply cleanly to branch-2 because TestCheckpointPreemptionPolicy.java is missing from that branch. That made me wonder if there were any changes needed for branch-2 that aren't on trunk, but I didn't find any from a simple search. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch, YARN-2312.7.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
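For call sites being updated under this JIRA, the substitution looks roughly like this (illustrative only): move from the deprecated int-valued getId() to the long-valued getContainerId(), which keeps the epoch introduced by YARN-2229.
{code}
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative call-site migration: prefer the long-valued accessor that
// preserves the epoch over the deprecated int-valued one.
class ContainerIdMigration {
  static long idOf(ContainerId containerId) {
    // old, deprecated (loses the epoch): containerId.getId()
    return containerId.getContainerId();
  }
}
{code}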
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171475#comment-14171475 ] Jason Lowe commented on YARN-2314: -- The only issue I can think of is the idle timeout change that goes along with the cache being disabled. Since we disable the cache by default we also, by default, set the cm proxy connection idle timeouts to zero. That means for each cm proxy RPC call we will create a new connection to the NM. That sounds expensive, and probably was the motivation for the creation of the cache, but in practice it doesn't seem to matter (at least for the loads we tested which didn't include Tez). For our case we were comparing 2.x against 0.23, and 0.23 was slightly faster in the AM scalability test than 2.x despite 2.x having this cache and 0.23 lacking it. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171476#comment-14171476 ] Hadoop QA commented on YARN-2314: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674827/YARN-2314.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.TestRMFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5389//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5389//console This message is automatically generated. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171512#comment-14171512 ] Steve Loughran commented on YARN-2689: -- low-level stack replicated {code} javax.security.auth.login.LoginException: Unable to obtain password from user at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:856) at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:719) at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:584) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at javax.security.auth.login.LoginContext.invoke(LoginContext.java:762) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:690) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:688) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:687) at javax.security.auth.login.LoginContext.login(LoginContext.java:595) at org.apache.zookeeper.Login.login(Login.java:292) at org.apache.zookeeper.Login.init(Login.java:93) at org.apache.hadoop.registry.secure.TestSecureRegistry.testLowlevelZKSaslLogin(TestSecureRegistry.java:81) {code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171535#comment-14171535 ] Ray Chiang commented on YARN-570: - One bug. The RM UI application table ends up with times like: Tue Oct 14 14:4:27 -0700 2014 (for 2:04 PM) Tue Oct 14 14:5:6 -0700 2014 (for 2:05 PM) One comment. The RM About section always shows time local to the node the RM is running on. The RM UI application table always shows the local time of the machine/browser. That fits given the Javascript/Java discrepancy, but it could be confusing in a completely different way. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch Time strings on different pages are displayed in different timezones. If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56 Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
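The broken strings above ("14:4:27", "14:5:6") point at missing zero-padding in the minute and second fields. A tiny Java-side illustration of the expected formatting (the actual table is rendered in JavaScript by yarn.dt.plugins.js, so this only shows the intended output):
{code}
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustration only: HH/mm/ss in SimpleDateFormat are zero-padded, so 2:04 PM
// renders as "14:04:27", never "14:4:27".
class TimeFormatSketch {
  static String format(long millis) {
    return new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy")
        .format(new Date(millis));
  }
}
{code}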
[jira] [Updated] (YARN-1542) Add unit test for public resource on viewfs
[ https://issues.apache.org/jira/browse/YARN-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1542: Attachment: YARN-1542.v05.patch v05: rebasing the patch again. Add unit test for public resource on viewfs --- Key: YARN-1542 URL: https://issues.apache.org/jira/browse/YARN-1542 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1542.v01.patch, YARN-1542.v02.patch, YARN-1542.v03.patch, YARN-1542.v04.patch, YARN-1542.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171545#comment-14171545 ] Steve Loughran commented on YARN-2689: -- login is looking for domained principal {code} JVM krb5.conf to: C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\1413321345247\krb5.conf 2014-10-14 14:15:53,044 [main] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:setupKDCAndPrincipals(219)) - zookeeper { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab debug=true principal=zookeeper useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; ZOOKEEPER_SERVER { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab debug=true principal=zookeeper/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; alice { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\alice.keytab debug=true principal=alice/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; bob { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\bob.keytab debug=true principal=bob/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is C:Workhadoop-trunkhadoop-yarn-projecthadoop-yarnhadoop-yarn-registry argetkdczookeeper.keytab refreshKrb5Config is false principal is zookeeper/localhost tryFirstPass is false useFirstPass is false storePass is false clearPass is false Key for the principal zookeeper/localh...@example.com not available in C:Workhadoop-trunkhadoop-yarn-projecthadoop-yarnhadoop-yarn-registry argetkdczookeeper.keytab [Krb5LoginModule] authentication failed Unable to obtain password from user {code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171561#comment-14171561 ] Steve Loughran commented on YARN-2689: -- correction: login is looking for a keytab string that needs to be escaped in the jaas conf file TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
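A minimal sketch of the kind of escaping fix being described here, assuming the JAAS entry is generated from a java.io.File keytab (the class and method names are hypothetical, not the actual test code); the later trace in this thread shows the generated entries ending up with forward slashes, which is what this normalization produces:
{code}
import java.io.File;

// Illustrative only: normalize the keytab path before writing it into the
// generated JAAS entry, so backslashes in a Windows path are not swallowed
// as escape sequences (\t, \z, ...) by the JAAS config parser.
public class JaasEntryBuilder {
  static String jaasEntry(String context, File keytab, String principal) {
    String keytabPath = keytab.getAbsolutePath().replace('\\', '/');
    return context + " {\n"
        + "  com.sun.security.auth.module.Krb5LoginModule required\n"
        + "  keyTab=\"" + keytabPath + "\"\n"
        + "  principal=\"" + principal + "\"\n"
        + "  useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true;\n"
        + "};\n";
  }
}
{code}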
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171587#comment-14171587 ] Steve Loughran commented on YARN-2689: -- After fix ZK comes up registry not playing nice, permissions? {code} testUserZookeeperHomePathAccess(org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations) Time elapsed: 1.067 s ec ERROR! org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.registry.client.exceptions.AuthenticationFailedExcept ion: `/registry': Authentication Failed: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = Aut hFailed for /registry: KeeperErrorCode = AuthFailed for /registry at org.apache.zookeeper.KeeperException.create(KeeperException.java:123) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688) at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672) at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668) at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:423) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44) at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:544) at org.apache.hadoop.registry.client.impl.zk.CuratorService.maybeCreate(CuratorService.java:431) at org.apache.hadoop.registry.server.services.RegistryAdminService.createRootRegistryPaths(RegistryAdminService. 
java:246) at org.apache.hadoop.registry.server.services.RegistryAdminService.serviceStart(RegistryAdminService.java:215) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:10 5) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:97 ) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations.startRMRegistryOperations(TestSecureRMRegist ryOperations.java:96) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations.testUserZookeeperHomePathAccess(TestSecureRM RegistryOperations.java:226) {code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1542) Add unit test for public resource on viewfs
[ https://issues.apache.org/jira/browse/YARN-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171594#comment-14171594 ] Hadoop QA commented on YARN-1542: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674850/YARN-1542.v05.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5390//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5390//console This message is automatically generated. Add unit test for public resource on viewfs --- Key: YARN-1542 URL: https://issues.apache.org/jira/browse/YARN-1542 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1542.v01.patch, YARN-1542.v02.patch, YARN-1542.v03.patch, YARN-1542.v04.patch, YARN-1542.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171625#comment-14171625 ] Steve Loughran commented on YARN-2689: -- more traces; {code} 2014-10-14 15:07:41,573 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: classTeardown entered state INITED 2014-10-14 15:07:41,573 [main] DEBUG service.CompositeService (CompositeService.java:serviceInit(104)) - classTeardown: initing services, size=0 2014-10-14 15:07:41,573 [main] DEBUG service.CompositeService (CompositeService.java:serviceStart(115)) - classTeardown: starting services, size=0 2014-10-14 15:07:41,573 [main] DEBUG service.AbstractService (AbstractService.java:start(197)) - Service classTeardown is started 2014-10-14 15:07:41,587 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: registrySecurity entered state INITED 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(259)) - Configuration: 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(260)) - --- 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - debug: true 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - transport: TCP 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.ticket.lifetime: 8640 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.name: EXAMPLE 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.port: 0 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.domain: COM 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.renewable.lifetime: 60480 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - instance: DefaultKrbServer 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.bind.address: localhost 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(264)) - --- 2014-10-14 15:07:48,259 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(480)) - MiniKdc listening at port: 50351 2014-10-14 15:07:48,259 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(481)) - MiniKdc setting JVM krb5.conf to: C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\1413324461603\krb5.conf 2014-10-14 15:07:48,775 [main] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:setupKDCAndPrincipals(219)) - zookeeper { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/zookeeper.keytab debug=true principal=zookeeper useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; ZOOKEEPER_SERVER { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/zookeeper.keytab debug=true principal=zookeeper/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; alice { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/alice.keytab debug=true principal=alice/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; bob { com.sun.security.auth.module.Krb5LoginModule required 
keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/bob.keytab debug=true principal=bob/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; 2014-10-14 15:07:48,900 [JUnit] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:login(328)) - Logging in as zookeeper/localhost in context ZOOKEEPER_SERVER with keytab target\kdc\zookeeper.keytab Debug is true storeKey true useTicketCache true useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab refreshKrb5Config is true principal is zookeeper/localhost tryFirstPass is false useFirstPass is false storePass is false clearPass is false Refreshing Kerberos configuration Acquire TGT from Cache Principal is zookeeper/localh...@example.com null credentials from Ticket Cache principal is zookeeper/localh...@example.com Will use keytab Commit Succeeded 2014-10-14 15:07:49,165 [JUnit] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: test-testUserZookeeperHomePathAccess entered state INITED 2014-10-14 15:07:49,165 [JUnit]
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171628#comment-14171628 ] Steve Loughran commented on YARN-2689: -- key section {code} 15:07:49 PDT 2014 Entered Krb5Context.initSecContext with state=STATE_NEW Found ticket for zookee...@example.com to go to krbtgt/example@example.com expiring on Wed Oct 15 15:07:49 PDT 2014 Service ticket not found in the subject KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database at sun.security.krb5.KrbTgsRep.init(KrbTgsRep.java:73) at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:192) at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:203) at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:309) at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:115) at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:454) at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:641) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193) at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:366) at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:363) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:362) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:348) at org.apache.zookeeper.client.ZooKeeperSaslClient.sendSaslPacket(ZooKeeperSaslClient.java:420) at org.apache.zookeeper.client.ZooKeeperSaslClient.initialize(ZooKeeperSaslClient.java:458) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1013) Caused by: KrbException: Identifier doesn't match expected value (906) at sun.security.krb5.internal.KDCRep.init(KDCRep.java:143) at sun.security.krb5.internal.TGSRep.init(TGSRep.java:66) at sun.security.krb5.internal.TGSRep.init(TGSRep.java:61) at sun.security.krb5.KrbTgsRep.init(KrbTgsRep.java:55) ... 18 more 2014-10-14 15:07:49,816 [JUnit-SendThread(127.0.0.1:50366)] ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(384)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - Server not found in Kerberos database)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. 2014-10-14 15:07:49,816 [JUnit-SendThread(127.0.0.1:50366)] ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1015)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - Server not found in Kerberos database)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. 
{code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171700#comment-14171700 ] Hadoop QA commented on YARN-2689: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674868/YARN-2689-001.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5391//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5391//console This message is automatically generated. TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2689-001.patch the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171717#comment-14171717 ] Li Lu commented on YARN-2673: - After checking through the code, I'm planning to separate this into two steps. In the first step, we may want to add retry mechanisms to the jersey client that posts timeline entities and domains. After this step, timeline clients for non-secured clusters will be able to retry when the timeline server is down. Then, on top of this, we can add a retry mechanism to the delegation token calls for secured clusters. I'll focus on the non-secured part in this Jira. All secured cluster/token related retry mechanisms will be handled in a separate Jira. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171718#comment-14171718 ] zhihai xu commented on YARN-2682: - Hi [~rusanu], thanks for the information, I will call getWorkingDir in WSCE. So the WSCE will also randomly pick the local directory. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2682: Attachment: YARN-2682.001.patch WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch, YARN-2682.001.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171733#comment-14171733 ] zhihai xu commented on YARN-2682: - I attached a new patch YARN-2682.001.patch to call getWorkingDir instead of getFirstApplicationDir in WSCE and remove the getFirstApplicationDir function in DCE. This patch doesn't need a new unit test because the unit test for getWorkingDir is already done in TestDefaultContainerExecutor. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch, YARN-2682.001.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
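A rough sketch of the "randomly pick the local directory" behavior referred to above. All names here are hypothetical (the real getWorkingDir in DefaultContainerExecutor may weight the choice by available space rather than picking uniformly):
{code}
import java.util.List;
import java.util.Random;
import org.apache.hadoop.fs.Path;

// Illustrative only: choose one NM local dir at random for the container
// working dir instead of always taking the first one.
class WorkingDirPicker {
  private final Random rand = new Random();

  Path pickWorkingDir(List<String> localDirs, String user, String appId) {
    String base = localDirs.get(rand.nextInt(localDirs.size()));
    return new Path(base, "usercache/" + user + "/appcache/" + appId);
  }
}
{code}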
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171756#comment-14171756 ] Xuan Gong commented on YARN-2656: - Overall looks good. One comment:
{code}
+  public void doFilter(ServletRequest request, ServletResponse response,
+      FilterChain filterChain) throws IOException, ServletException {
+    HttpServletRequest req = (HttpServletRequest) request;
+    // For backward compatibility, allow use of the old header field
+    final String oldHeader = req.getHeader(OLD_HEADER);
+    if (oldHeader != null && !oldHeader.isEmpty()) {
+      String newHeader =
+          req.getHeader(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER);
+      if (newHeader == null || newHeader.isEmpty()) {
+        HttpServletRequestWrapper wrapper = new HttpServletRequestWrapper(req) {
+          @Override
+          public String getHeader(String name) {
+            if (name
+                .equals(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER)) {
+              return oldHeader;
+            }
+            return super.getHeader(name);
+          }
+        };
+        super.doFilter(wrapper, response, filterChain);
+      }
+    } else {
+      super.doFilter(request, response, filterChain);
+    }
{code}
Here, we handled two cases: 1) when oldHeader is null/empty, and 2) when oldHeader is not null/empty and newHeader is null/empty. Do we need to handle the case when oldHeader is not null/empty and newHeader is not null/empty here as well? So maybe we could check newHeader first. As I understand it, if newHeader is not null/empty, it will be used no matter whether oldHeader is set or not. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
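A sketch of the reordering suggested in this comment, written as a replacement body for the same doFilter method above (it reuses OLD_HEADER, req, and super.doFilter from that snippet); this is illustrative only, not the committed fix:
{code}
// Prefer the new header when present; only fall back to wrapping the request
// when just the old header is set; otherwise pass the request through as-is.
String newHeader =
    req.getHeader(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER);
if (newHeader != null && !newHeader.isEmpty()) {
  // New header wins regardless of whether the old one is set.
  super.doFilter(request, response, filterChain);
} else {
  final String oldHeader = req.getHeader(OLD_HEADER);
  if (oldHeader != null && !oldHeader.isEmpty()) {
    HttpServletRequestWrapper wrapper = new HttpServletRequestWrapper(req) {
      @Override
      public String getHeader(String name) {
        if (name.equals(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER)) {
          return oldHeader;
        }
        return super.getHeader(name);
      }
    };
    super.doFilter(wrapper, response, filterChain);
  } else {
    super.doFilter(request, response, filterChain);
  }
}
{code}
This ordering also covers the case Xuan raised: when both headers are set, the filter chain is still invoked, using the new header.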
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171766#comment-14171766 ] Hadoop QA commented on YARN-2682: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674880/YARN-2682.001.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5392//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5392//console This message is automatically generated. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch, YARN-2682.001.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171802#comment-14171802 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674883/YARN-2496-20141014-1.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5393//console This message is automatically generated. Changes for capacity scheduler to support allocate resource respect labels -- Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496-20141009-1.patch, YARN-2496-20141014-1.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171803#comment-14171803 ] Varun Vasudev commented on YARN-2656: - If both headers are specified, we should use the new one. That code is only there for backwards compatibility. If the new header is specified, there's no need for us to do anything. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171813#comment-14171813 ] Xuan Gong commented on YARN-2656: - No even need to call super.doFilter() ??? RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171816#comment-14171816 ] Varun Vasudev commented on YARN-2656: - bq. No even need to call super.doFilter() ??? I completely missed that. Great catch! [~zjshen], we need to handle the case Xuan pointed out. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2656: Assignee: Zhijie Shen (was: Varun Vasudev) RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2656: -- Attachment: YARN-2656.5.patch Good catch! Fix the issue in the newest patch. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
[ https://issues.apache.org/jira/browse/YARN-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171851#comment-14171851 ] Beckham007 commented on YARN-2686: -- [~ywskycn] looking forward to the patch. Should we also support libcgroup on Redhat 7? CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 - Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container task file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
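A small sketch of why the comma separator is problematic. This is hypothetical parsing code, not the actual container-executor logic; it only shows that a comma-separated resources option cannot carry a RHEL 7-style mount point that itself contains a comma:
{code}
// Illustration only: splitting a comma-separated resources option breaks a
// mount point such as /sys/fs/cgroup/cpu,cpuacct in two.
public class ResourcesOptionSplit {
  public static void main(String[] args) {
    String resourcesOption =
        "cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_01/tasks";
    for (String piece : resourcesOption.split(",")) {
      System.out.println(piece);
    }
    // Prints:
    //   cgroups=/sys/fs/cgroup/cpu
    //   cpuacct/hadoop-yarn/container_01/tasks
    // i.e. the executor ends up looking at /sys/fs/cgroup/cpu instead of the
    // full cpu,cpuacct path described in this issue.
  }
}
{code}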
[jira] [Updated] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2423: Attachment: YARN-2423.patch The patch adds the GET APIs. I modeled them after the get methods in TimelineReader. It also fixes a bug I ran into in the MemoryTimelineStore where the related entities information was not getting stored. Besides the unit tests, I also verified it in an actual cluster. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
[ https://issues.apache.org/jira/browse/YARN-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171890#comment-14171890 ] Wei Yan commented on YARN-2686: --- [~beckham007], Redhat doesn't recommend that: "To avoid conflicts, do not use libcgroup tools for default resource controllers (listed in Available Controllers in Red Hat Enterprise Linux 7) that are now an exclusive domain of systemd." So we plan to move the cpu part to use systemd. CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 - Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container task file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171918#comment-14171918 ] Xuan Gong commented on YARN-2656: - +1 LGTM RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414.patch Uploaded a patch for this issue. TimelineClient will by default retry for a given amount of time before throwing the exception on posting to the server. There are a few notes: 1. Retrying vs. discarding timeline data: if we do not add this retry, the timeline client will drop the posted data if the first attempt fails. Had an offline discussion with [~vinodkv]; we agreed that blocking the timeline client for a short while is better, since we may not want to drop some critical timeline data. 2. Retry behavior configurations: users can define the maximum retry count and the time interval between consecutive retries. We may want to have two levels of retry settings: a cluster-wide setting managed by yarn-site.xml, and a per-application customized setting. For the cluster setting, I've added two configuration properties, yarn.timeline-service.client.max-retries (default 30) and yarn.timeline-service.client.retry-interval-ms (default 1000). I've also provided a customizeRetrySettings method for application-specific retry settings. 3. Retry implementation: the timeline client does not use RPC, but RESTful APIs, so I'm implementing retry as a jersey filter in this patch. 4. Tests: I added two new unit tests, one to test the customizeRetrySettings API and the other to test that the retry actually happens when we try to post timeline entities. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
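A minimal sketch of the kind of Jersey 1.x client filter described in note 3, under the assumption that the configured max-retries and retry-interval-ms values are passed in by the caller; the class name TimelineRetryFilter and the exact failure handling are illustrative, not the actual patch:
{code}
import com.sun.jersey.api.client.ClientHandlerException;
import com.sun.jersey.api.client.ClientRequest;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.filter.ClientFilter;

// Illustrative bounded-retry filter: resend the request up to maxRetries
// times, sleeping retryIntervalMs between attempts, before giving up.
public class TimelineRetryFilter extends ClientFilter {
  private final int maxRetries;        // e.g. yarn.timeline-service.client.max-retries (30)
  private final long retryIntervalMs;  // e.g. yarn.timeline-service.client.retry-interval-ms (1000)

  public TimelineRetryFilter(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  @Override
  public ClientResponse handle(ClientRequest request) throws ClientHandlerException {
    ClientHandlerException lastError = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return getNext().handle(request);
      } catch (ClientHandlerException e) { // connection refused, timeouts, ...
        lastError = e;
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    throw lastError;
  }
}
{code}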
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171932#comment-14171932 ] Hadoop QA commented on YARN-2656: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674896/YARN-2656.5.patch against trunk revision 0260231. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5394//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5394//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5394//console This message is automatically generated. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171940#comment-14171940 ] Hadoop QA commented on YARN-2673: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674907/YARN-2673-101414.patch against trunk revision 0260231. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.api.impl.TestTimelineClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5396//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5396//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5396//console This message is automatically generated. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-1.patch Address the comments from findbugs, and retry the unit test failure. Could not reproduce the UT failure locally. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2603: - Attachment: YARN-2603-01.patch Add HADOOP_MAPRED_HOME to the list of environment variables. ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
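The change described is small; roughly, one more constant in the {{ApplicationConstants.Environment}} enum alongside the existing HADOOP_*_HOME entries. The sketch below is abridged and illustrative (the surrounding entries and accessor are simplified; only the HADOOP_MAPRED_HOME addition is what this patch is about):
{code}
// Abridged sketch of org.apache.hadoop.yarn.api.ApplicationConstants.Environment;
// only the HADOOP_MAPRED_HOME entry is the addition described in this patch.
public enum Environment {
  HADOOP_COMMON_HOME("HADOOP_COMMON_HOME"),
  HADOOP_HDFS_HOME("HADOOP_HDFS_HOME"),
  HADOOP_YARN_HOME("HADOOP_YARN_HOME"),
  HADOOP_MAPRED_HOME("HADOOP_MAPRED_HOME"); // newly listed environment variable

  private final String variable;

  Environment(String variable) {
    this.variable = variable;
  }

  public String key() {
    return variable;
  }
}
{code}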
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171975#comment-14171975 ] Hadoop QA commented on YARN-2673: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674913/YARN-2673-101414-1.patch against trunk revision 0260231. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.api.impl.TestTimelineClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5397//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5397//console This message is automatically generated. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171993#comment-14171993 ] Hudson commented on YARN-2656: -- FAILURE: Integrated in Hadoop-trunk-Commit #6262 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6262/]) YARN-2656. Made RM web services authentication filter support proxy user. Contributed by Varun Vasudev and Zhijie Shen. (zjshen: rev 1220bb72d452521c6f09cebe1dd77341054ee9dd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/web/DelegationTokenAuthenticationHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Debugging the UT failure. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-2183: -- Attachment: YARN-2183-trunk-v5.patch Patch v.5 posted. To help see the overall diffs, you can use this github diff: https://github.com/ctrezzo/hadoop/compare/apache:trunk...sharedcache-3-YARN-2183-cleaner Changes between v.4 and v.5: - https://github.com/ctrezzo/hadoop/commit/1be0a159a739578a0f5d89e6881f6fb63aeccfa6 - https://github.com/ctrezzo/hadoop/commit/b229d3ba5f592f231455526829b09db244264fcb Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172007#comment-14172007 ] Sangjin Lee commented on YARN-2183: --- Thanks Karthik for the review. We have addressed most of your review comments in the updated patch. Items that need more discussion are below. (CleanerService) {quote} runCleanerTask: Instead of checking if there is a scheduled-cleaner-task running here, why not just rely on the check in CleanerTask#run(). Agree, we might be doing a little more work here unnecessarily, but not sure the savings are worth an extra check and an extra parameter in the CleanerTask constructor. {quote} We do have the on-demand cleaner feature (YARN-2189; see below for more), and this check is needed to prevent a race (i.e. not allow an on-demand run when a scheduled run is in progress). {quote} How does a user use runCleanerTask? Instantiate another SCM? The SCM isn't listening to any requests. I can see the SCM being run in the RM, and one could potentially add yarn rmadmin -clean-shared-cache. In any case, given there is no way to reach a running SCM, I would remove runCleanerTask altogether for now, and add it back later when we need it? Thoughts? {quote} As we discussed offline, we do have a YARN admin command implemented that lets you run the cleaner task on demand (YARN-2189). The admin command implements a proper ACL check (based on a YARN admin credential), and sends an RPC request to the running SCM. Since patches are organized this way, it may not have been very obvious by looking at this patch alone. {quote} Should we worry about users starting SCMs with roots at different levels that can lead to multiple cleaners? {quote} In theory it is possible. However, in reality checking that might be being bit too cautious. I think it might be fine without this check. Let me know what you think. (CleanerTask) {quote} Should cleanResourceReferences be moved to SCMStore? {quote} That’s an interesting suggestion. In fact, with the InMemorySCMStore (which already has a reference to an AppChecker to clean up the initial apps) it may be OK to create SCMStore.cleanResourceReferences(). However, it’s not clear to me whether a dependency from an SCMStore to an AppChecker is always a fine requirement for other types of stores. In that sense, I would be hesitant to create this coupling by introducing SCMStore.cleanResourceReferences(). What do you think? {quote} For the race condition (YARN-2663), would it help to handle the delete files on HDFS in the store#remove? {quote} That’s a possibility. However, I think there can be an easier way to fix this race condition. This is partially due to the way the cleaner task is deleting unused files. Currently it deletes the entire directory as opposed to the specific file. The race can be fixed by avoiding deleting the directory. We’ll add the proper fix later on YARN-2663. (CleanerMetrics) {quote} Make initSingleton private and call it in getInstance if the instance is null? {quote} We looked into that, but the difficulty is initSingleton() needs the configuration which getInstance() does not provide. {quote} How about using MutableRate or MutableStat for the rates? {quote} We have removed the rate-specific metrics altogether as they can be derived from the original metrics. {quote} Do we need CleanerMetricsCollector, wouldn't CleanerMetrics extending MetricsSource suffice? {quote} For making the metrics available in JMX, indeed CleanerMetrics would suffice. 
CleanerMetricsCollector was introduced to back a web UI that selects a handful of metrics to show as HTML. That was to come in YARN-2203. Having said that, I think we can remove CleanerMetricsCollector for now and not introduce the web UI. That UI is minimal anyway, and we can consider introducing a more refined web UI at a later point once this version is released.
Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
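To make the race discussed above concrete, here is a minimal sketch of a guard that rejects an on-demand cleaner run while a scheduled run is in progress. The class, method names, and logging are hypothetical and are not taken from the YARN-2183 patch; the actual patch may implement the check differently.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Minimal sketch of a cleaner service that refuses to start an on-demand
 * run while a scheduled run is already in progress. All names here are
 * hypothetical and do not reflect the actual YARN-2183 patch.
 */
public class CleanerServiceSketch {

  // Single flag shared by the scheduled and on-demand entry points.
  private final AtomicBoolean cleanerRunning = new AtomicBoolean(false);

  /** Invoked by the scheduler at a fixed interval. */
  public void runScheduledCleaner() {
    runIfIdle("scheduled");
  }

  /** Invoked by an admin RPC (e.g. the on-demand command from YARN-2189). */
  public boolean runOnDemandCleaner() {
    return runIfIdle("on-demand");
  }

  private boolean runIfIdle(String trigger) {
    // compareAndSet lets only one caller win the flag at a time, so an
    // on-demand request arriving during a scheduled run is simply rejected.
    if (!cleanerRunning.compareAndSet(false, true)) {
      System.out.println("Skipping " + trigger + " run: cleaner already running");
      return false;
    }
    try {
      cleanAllResources(); // the actual cleaning pass
      return true;
    } finally {
      cleanerRunning.set(false);
    }
  }

  private void cleanAllResources() {
    // walk the cache and evict stale entries (omitted)
  }
}
{code}
With a single atomic flag the check lives in one place, which is why a separate per-run check in the task constructor would be redundant.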
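The YARN-2663 point above (delete the specific file rather than the whole directory) could look roughly like the following. The eviction flow and the path handling are hypothetical; only FileSystem#delete with a non-recursive flag is taken from the Hadoop FileSystem API.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the narrower delete discussed for YARN-2663: remove only the
 * specific cached resource file instead of recursively deleting its parent
 * directory. The surrounding flow is hypothetical.
 */
public class NarrowDeleteSketch {

  public static void evict(Configuration conf, Path resourceFile) throws IOException {
    FileSystem fs = resourceFile.getFileSystem(conf);

    // Deleting only the file avoids racing with an uploader that may be
    // writing a sibling entry into the same directory at the same time.
    boolean deleted = fs.delete(resourceFile, /* recursive = */ false);

    // Previously the entire parent directory would have been removed, e.g.
    //   fs.delete(resourceFile.getParent(), true);
    // which is where the race window came from.
    if (!deleted) {
      System.out.println("Nothing to delete at " + resourceFile);
    }
  }
}
{code}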
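The CleanerMetrics point about initSingleton() versus getInstance() comes down to construction needing a Configuration. A compact sketch of that constraint, with illustrative names rather than the actual CleanerMetrics code:
{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of why getInstance() alone cannot lazily build the metrics
 * singleton: construction needs a Configuration, which is only available
 * at service-init time. Names and the config key are illustrative only.
 */
public final class CleanerMetricsSketch {

  private static volatile CleanerMetricsSketch instance;

  private final long periodMs;

  private CleanerMetricsSketch(Configuration conf) {
    // Construction depends on configuration, so it cannot happen inside an
    // arbitrary getInstance() call that has no Configuration to pass in.
    this.periodMs = conf.getLong("sketch.cleaner.period-ms", 600_000L);
  }

  /** Called once during service initialization, where the conf is at hand. */
  public static synchronized void initSingleton(Configuration conf) {
    if (instance == null) {
      instance = new CleanerMetricsSketch(conf);
    }
  }

  /** Callers elsewhere can only retrieve the already-built instance. */
  public static CleanerMetricsSketch getInstance() {
    if (instance == null) {
      throw new IllegalStateException("initSingleton(conf) has not been called");
    }
    return instance;
  }

  public long getPeriodMs() {
    return periodMs;
  }
}
{code}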
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101414-2.patch) Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
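The retry mechanism the description asks for could be as simple as a bounded retry with a fixed backoff around the timeline call. The helper below is a hedged sketch, not the API from the attached patches; the method name, parameters, and defaults are made up for illustration.
{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;

/**
 * Illustrative retry wrapper of the kind YARN-2673 discusses for the
 * timeline client: retry a failed call a bounded number of times with a
 * fixed backoff. Names and defaults are hypothetical.
 */
public final class RetrySketch {

  public static <T> T callWithRetries(Callable<T> call,
                                      int maxAttempts,
                                      long backoffMs) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        // Typical symptom when the ATS is down or restarting: connection
        // refused or reset. Remember the failure and retry after a pause.
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(backoffMs);
        }
      }
    }
    if (last == null) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Usage sketch: the "timeline call" fails twice, then succeeds.
    final int[] calls = {0};
    String result = callWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("ATS not reachable yet");
      }
      return "put succeeded on attempt " + calls[0];
    }, 5, 200L);
    System.out.println(result);
  }
}
{code}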
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172026#comment-14172026 ] Hadoop QA commented on YARN-2673: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674921/YARN-2673-101414-2.patch against trunk revision 1220bb7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5399//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5399//console This message is automatically generated. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172038#comment-14172038 ] Hadoop QA commented on YARN-2183: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674922/YARN-2183-trunk-v5.patch against trunk revision 1220bb7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5400//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5400//console This message is automatically generated. Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)