[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170587#comment-14170587 ] Rohith commented on YARN-2588: -- Hi [~jianhe], could you please review the patch whenever you get time? Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2588.patch Consider a scenario where the standby RM fails to transition to Active because of a ZK exception (ConnectionLoss or SessionExpired). Then any further transitionToActive for the same RM does not move the RM to the Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2682: Attachment: YARN-2682.000.patch WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170629#comment-14170629 ] Hadoop QA commented on YARN-2682: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674709/YARN-2682.000.patch against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5384//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5384//console This message is automatically generated. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-570: --- Attachment: YARN-570.4.patch Thanks [~rchiang] for trying the patch. Updated the patch to make the format uniform: EEE MMM dd HH:mm:ss Z. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT. If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
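For illustration only, a minimal Java sketch of the uniform pattern mentioned above (EEE MMM dd HH:mm:ss Z) with the timezone pinned so every page renders the same string; this is not the code from YARN-570.4.patch, just the idea:
{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeFormatDemo {
  public static void main(String[] args) {
    long ts = 1365582596000L; // Wed, 10 Apr 2013 08:29:56 GMT, from the description

    // Pattern from the comment above; the timezone is pinned so the output does
    // not depend on the server's default zone.
    SimpleDateFormat fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss Z");
    fmt.setTimeZone(TimeZone.getTimeZone("GMT"));

    System.out.println(fmt.format(new Date(ts))); // e.g. "Wed Apr 10 08:29:56 +0000"
  }
}
{code}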
[jira] [Created] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
Beckham007 created YARN-2686: Summary: CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate the resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container tasks file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
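As a rough illustration of the problem described above (not YARN code): splitting a comma-joined option cuts short a cgroup path whose mount directory itself contains a comma. The container id and the '%' separator are only placeholders for illustration, not what YARN actually uses.
{code}
import java.util.Arrays;

public class CgroupPathSplitDemo {
  public static void main(String[] args) {
    // On Redhat/CentOS 7 the cpu controller is mounted under a directory whose
    // name itself contains a comma:
    String cpuTasks = "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_01/tasks";
    String memTasks = "/sys/fs/cgroup/memory/hadoop-yarn/container_01/tasks";

    // Joining multiple paths with "," and splitting them back cuts the first
    // path short at the comma inside "cpu,cpuacct".
    String commaJoined = cpuTasks + "," + memTasks;
    System.out.println(Arrays.toString(commaJoined.split(",")));
    // -> [/sys/fs/cgroup/cpu, cpuacct/hadoop-yarn/container_01/tasks, /sys/fs/cgroup/memory/...]

    // A separator that cannot appear in a mount path round-trips correctly.
    String percentJoined = cpuTasks + "%" + memTasks;
    System.out.println(Arrays.toString(percentJoined.split("%")));
  }
}
{code}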
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170667#comment-14170667 ] Hadoop QA commented on YARN-570: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674718/YARN-570.4.patch against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5385//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5385//console This message is automatically generated. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch Time strings on different page are displayed in different timezone. If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56 Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-90: -- Attachment: apache-yarn-90.9.patch Uploaded a new patch to address comments by [~mingma] and [~zxu]. bq. Nit: For Set<String> postCheckFullDirs = new HashSet<String>(fullDirs);. It doesn't have to create postCheckFullDirs. It can directly refer to fullDirs later. It was just to ease lookups - instead of searching through a list, look up a set. If you feel strongly about it, I can change it. {quote} can change if (!postCheckFullDirs.contains(dir) && postCheckOtherDirs.contains(dir)) { to if (postCheckOtherDirs.contains(dir)) { {quote} Fixed. {quote} change if (!postCheckOtherDirs.contains(dir) && postCheckFullDirs.contains(dir)) { to if (postCheckFullDirs.contains(dir)) { {quote} Fixed. {quote} 3. in verifyDirUsingMkdir: Can we add an int variable to the file name to avoid looping forever (although it is a very small chance), like the following? long i = 0L; while (target.exists()) \{ randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++; target = new File(dir, randomDirName); } {quote} Fixed. {quote} 4. in disksTurnedBad: Can we add a break in the loop when disksFailed is true so we exit the loop earlier? if (!preCheckDirs.contains(dir)) \{ disksFailed = true; break; } {quote} Fixed. {quote} 5. in disksTurnedGood, same as item 4: Can we add a break in the loop when disksTurnedGood is true? {quote} Fixed. {quote} In function verifyDirUsingMkdir, target.exists(), target.mkdir() and FileUtils.deleteQuietly(target) are not atomic. What happens if another thread tries to create the same directory (target)? {quote} verifyDirUsingMkdir is called by testDirs, which is called by checkDirs(), which is synchronized. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs a restart. This JIRA is to improve NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
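For readers following the verifyDirUsingMkdir discussion above, a small self-contained sketch of the counter-based name-picking loop suggested in the review; it assumes commons-lang and commons-io on the classpath (as in the snippet quoted above) and approximates, rather than reproduces, the actual patch:
{code}
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.RandomStringUtils;

public class DirCheck {
  /**
   * Probe a directory by creating and deleting a uniquely named child.
   * The counter suffix guarantees the name-picking loop terminates even if
   * randomAlphanumeric keeps colliding with existing entries.
   */
  static void verifyDirUsingMkdir(File dir) throws IOException {
    String randomDirName = RandomStringUtils.randomAlphanumeric(5);
    File target = new File(dir, randomDirName);
    long i = 0L;
    while (target.exists()) {
      randomDirName = RandomStringUtils.randomAlphanumeric(5) + i++;
      target = new File(dir, randomDirName);
    }
    try {
      if (!target.mkdir()) {
        throw new IOException("Cannot create directory " + target);
      }
    } finally {
      FileUtils.deleteQuietly(target);
    }
  }
}
{code}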
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good back again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170711#comment-14170711 ] Hadoop QA commented on YARN-90: --- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674722/apache-yarn-90.9.patch against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5386//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5386//console This message is automatically generated. NodeManager should identify failed disks becoming good back again - Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 makes NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), NodeManager needs restart. This JIRA is to improve NodeManager to reuse good disks(which could be bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170716#comment-14170716 ] Remus Rusanu commented on YARN-2682: WSCE should behave the same as DCE. If getFirstApplicationDir() was removed from DCE and getApplicationDir() is used instead, then WSCE should also use getApplicationDir(), w/o a need to define getFirstApplicationDir() in WSCE. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170809#comment-14170809 ] Hudson commented on YARN-2651: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/CHANGES.txt Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
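A hedged sketch of the decoupling described in the YARN-2651 summary above: a per-NM timer wakes up on its own schedule and uploads whatever logs have rolled, independent of any per-app interval. The class and method names here are hypothetical illustrations, not YARN's actual AppLogAggregatorImpl.
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RollingLogUploader {
  // Hypothetical per-NM setting standing in for the NM-wide configuration described above.
  private final long wakeUpIntervalSecs;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public RollingLogUploader(long wakeUpIntervalSecs) {
    this.wakeUpIntervalSecs = wakeUpIntervalSecs;
  }

  /** Wake up on a fixed, NM-wide schedule and upload whatever logs have rolled. */
  public void start() {
    scheduler.scheduleWithFixedDelay(
        this::uploadRolledLogs, wakeUpIntervalSecs, wakeUpIntervalSecs, TimeUnit.SECONDS);
  }

  private void uploadRolledLogs() {
    // Placeholder: scan per-app log dirs and aggregate files older than the last upload.
    System.out.println("uploading rolled log files");
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}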
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170808#comment-14170808 ] Hudson commented on YARN-2641: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170818#comment-14170818 ] Hudson commented on YARN-2566: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170816#comment-14170816 ] Hudson commented on YARN-2377: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/CHANGES.txt Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
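To illustrate the difference the YARN-2377 description points at (generic Java, not the actual SerializedException change): passing only getMessage() as diagnostics loses everything but "ha-nn-uri-0", while stringifying the throwable preserves the full stack trace.
{code}
import java.io.PrintWriter;
import java.io.StringWriter;

public class DiagnosticsDemo {
  /** Turn a throwable into a diagnostics string that keeps the full stack trace. */
  static String stringify(Throwable t) {
    StringWriter sw = new StringWriter();
    t.printStackTrace(new PrintWriter(sw));
    return sw.toString();
  }

  public static void main(String[] args) {
    Exception cause = new java.net.UnknownHostException("ha-nn-uri-0");
    // cause.getMessage() would yield only "ha-nn-uri-0";
    // stringify(cause) keeps the trace for the diagnostic info.
    System.out.println(stringify(cause));
  }
}
{code}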
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170812#comment-14170812 ] Hudson commented on YARN-2308: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #711 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/711/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Fix For: 2.6.0 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
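A minimal sketch of the kind of guard the commit message above refers to (explicitly failing instead of hitting an NPE when a recovered application's queue is gone); names are hypothetical and the real CapacityScheduler change may differ:
{code}
import java.util.Map;

public class QueueLookupDemo {
  static class Queue { }

  /**
   * Fail with a descriptive error instead of an NPE when an application being
   * recovered refers to a queue that no longer exists in the new configuration.
   */
  static Queue getQueueOrFail(Map<String, Queue> queues, String queueName, String appId) {
    Queue queue = queues.get(queueName);
    if (queue == null) {
      throw new IllegalStateException("Queue " + queueName
          + " referenced by recovered application " + appId
          + " does not exist in the current scheduler configuration");
    }
    return queue;
  }
}
{code}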
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170860#comment-14170860 ] Sunil G commented on YARN-2495: --- Hi all. I have a couple of doubts here. 1. In a distributed configuration, each NM can specify labels in register/heartbeat (update). I am not sure whether the check for a valid label should happen in the RM or the NM. As per the current design, it looks like all validity checks happen at the RM. If a node label is invalid as per the RM, how will this be reported back to the NM? Error handling? 2. If labels can be changed at run time from the NM, I think the same existing interfaces (heartbeat) are used. Do you feel this check may happen more frequently in the RM than in a centralized configuration? In a centralized config, some command will be fired by the admin to change labels, which may not be frequent. But imagine a 1000-node cluster with labels changing per heartbeat: will this be a bottleneck? Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
[ https://issues.apache.org/jira/browse/YARN-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170898#comment-14170898 ] Wei Yan commented on YARN-2686: --- According to the Redhat 7 documentation, it is not recommended to use libcgroup for cpu isolation, so YARN-2194 is working on a systemd-based solution. Will update a patch for that jira soon. CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 - Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate the resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container tasks file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2552) Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only
[ https://issues.apache.org/jira/browse/YARN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170907#comment-14170907 ] Remus Rusanu commented on YARN-2552: Copying here the patch's apt.vm update: 'yarn.nodemanager.windows-secure-container-executor.local-dirs' should contain the nodemanager local dirs. hadoopwinutilsvc will allow only file operations under these directories. This should contain the same values as '${yarn.nodemanager.local-dirs}, ${yarn.nodemanager.log-dirs}', but note that hadoopwinutilsvc XML configuration processing does not do substitutions, so the value must be the final value. All paths must be absolute and no environment variable substitution will be performed. The paths are compared using a LOCAL_INVARIANT case-insensitive string comparison; the file path being validated must start with one of the paths listed in the local-dirs configuration. Use comma as the path separator. Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only - Key: YARN-2552 URL: https://issues.apache.org/jira/browse/YARN-2552 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2552.1.patch YARN-2458 added file manipulation operations executed in an elevated context by hadoopwinutilsvc. W/o any constraint, the NM (or a hijacker that takes over the NM) can manipulate arbitrary OS files under the highest possible privileges, an easy elevation attack vector. The service should only allow operations on files/directories that are under the configured NM localdirs. It should read this value from wsce-site.xml, as yarn-site.xml cannot be trusted, being writable by Hadoop admins (YARN-2551 ensures wsce-site.xml is only writable by system Administrators, not Hadoop admins). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
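A small Java sketch of the validation rule described above: the configuration value is a comma-separated list of absolute local dirs, and a path is accepted only if it starts with one of them, compared case-insensitively (Locale.ROOT lowercasing stands in for the locale-invariant comparison). This is illustrative only, not the hadoopwinutilsvc implementation.
{code}
import java.util.Locale;

public class LocalDirCheckDemo {
  /**
   * Accept a path only if it starts with one of the configured local dirs.
   * The comparison is case-insensitive, mirroring the description above.
   */
  static boolean isUnderLocalDirs(String localDirsConf, String path) {
    String candidate = path.toLowerCase(Locale.ROOT);
    for (String dir : localDirsConf.split(",")) {
      String prefix = dir.trim().toLowerCase(Locale.ROOT);
      if (!prefix.isEmpty() && candidate.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String conf = "D:\\hadoop\\local,D:\\hadoop\\logs";
    System.out.println(isUnderLocalDirs(conf, "D:\\Hadoop\\local\\usercache\\foo")); // true
    System.out.println(isUnderLocalDirs(conf, "C:\\Windows\\system32"));             // false
  }
}
{code}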
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201410141330.txt I'm sorry. The previous patch was bad. This one compiles cleanly. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170953#comment-14170953 ] Hudson commented on YARN-2667: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2667. Fix the release audit warning caused by hadoop-yarn-registry. Contributed by Yi Liu (jlowe: rev 344a10ad5e26c25abd62eda65eec2820bb808a74) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/pom.xml Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170955#comment-14170955 ] Hudson commented on YARN-2651: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170962#comment-14170962 ] Hudson commented on YARN-2377: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHos tException: ha-nn-uri-0 {code} And then only {{ java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170958#comment-14170958 ] Hudson commented on YARN-2308: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Fix For: 2.6.0 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170964#comment-14170964 ] Hudson commented on YARN-2566: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170954#comment-14170954 ] Hudson commented on YARN-2641: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1901 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1901/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch improve node decommission latency in RM. Currently the node decommission only happened after RM received nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable. The default value is 1 second. It will be better to do the decommission during RM Refresh(NodesListManager) instead of nodeHeartbeat(ResourceTrackerService). This will be a much more serious issue: After RM is refreshed (refreshNodes), If the NM to be decommissioned is killed before NM sent heartbeat to RM. The RMNode will never be decommissioned in RM. The RMNode will only expire in RM after yarn.nm.liveness-monitor.expiry-interval-ms(default value 10 minutes) time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-2314: Assignee: Jason Lowe bq. Basically the cache doesn't have more functionalities other than just cache the connection. It doesn't even do that, because if we cache the connection to the NM then we leak threads. When a cache entry is purged the RPC Client thread (tied to the NM socket connection) can linger because the RPC layer doesn't provide a way to force a connection to be closed due to protocol refcounting. We need to set the RPC idle timeout to 0 as a workaround to force the connections to close so we don't leak threads. Therefore all the cache is doing is caching the proxy objects with no connection behind them. Those objects will reconnect to the NM each time we make a call. Not sure saving the proxy objects themselves is worth it -- would be interesting to prove this cache helps in a meaningful way before we assume we need it. But I can update the patch to provide a config property to keep it anyway, hope to have that up later today. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
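A hedged sketch of the workaround mentioned in the comment above, assuming the standard ipc.client.connection.maxidletime setting is the RPC idle timeout being referred to; the method name is purely illustrative:
{code}
import org.apache.hadoop.conf.Configuration;

public class NMProxyConfDemo {
  /**
   * Workaround sketched above: force RPC connections to close as soon as they
   * go idle so cached NM proxies do not pin client threads.
   * Assumes ipc.client.connection.maxidletime is the idle timeout meant here.
   */
  static Configuration withZeroIdleTimeout(Configuration base) {
    Configuration conf = new Configuration(base);
    conf.setInt("ipc.client.connection.maxidletime", 0);
    return conf;
  }
}
{code}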
[jira] [Created] (YARN-2687) WindowsSecureContainerExecutor hadoopwinutilsvc is difficult to troubleshoot
Remus Rusanu created YARN-2687: -- Summary: WindowsSecureContainerExecutor hadoopwinutilsvc is difficult to troubleshoot Key: YARN-2687 URL: https://issues.apache.org/jira/browse/YARN-2687 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu The hadoopwinutilsvc logs using the NT service logging infrastructure (ie. Event Viewer). Ideally it should log within the Hadoop logging expected location/format, and be configured via the same parameters. As native C++ code it cannot leverage directly the log4j (and log4c++ is rather different config etc). I'm thinking that the hadoopwinutilsvc could establish a communication channel with NM itself and log via the NM. We already have the infrastructure in place (RPC, IDL etc). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171018#comment-14171018 ] Hadoop QA commented on YARN-2056: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674768/YARN-2056.201410141330.txt against trunk revision 5faaba0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5387//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5387//console This message is automatically generated. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171027#comment-14171027 ] Naganarasimha G R commented on YARN-2495: - Thanks [~aw], [~wangda] and [~sunilg]. *For [~aw] comments:* bq. I don't fully understand your question, but I'll admit I've been distracted with other JIRAs lately. You got the first part of my question right, and thanks for detailing the scenario. bq. If we are rolling out a new version of the JDK, I shouldn't have to tell the system that it's ok to broadcast that JDK version first. I understood the use case, but what I did not understand is how this would restrict/deter a user, since they can simply do one more update and add one more label to the central valid-label list, such as a java version or jdk version. As the script will anyway be written/updated to return a specific set of labels, I feel that in most cases the admin can know which labels will be coming in the cluster. Is there any other use case where it would be difficult for the admin to list the labels beforehand? *For [~wangda] comments:* bq. they will be reported to RM when NM registration. We may not need to persist any of them, but RM should know these labels existence to do scheduling. Does the RM need the full list of valid labels even before all nodes have registered? How will it impact scheduling, and how is it different from the centralized configuration? In the centralized config the user needs to add the new labels and then send the node-to-labels mapping; similarly, in the distributed config we can first discover the new labels, update the superset list of labels, and then update the label mapping for any node that wants to add or modify labels. bq. Another question is if we need check labels when they registering, I prefer to pre-set them because this affects scheduling behavior. For example, the maximum-resource and minimum-resource are setup in RM side, and RackResolver is also run in RM side Maybe I did not get this correctly. Do you mean that when the NM registers for the first time after startup, you want labels to be preset apart from what is read from the NM's yarn-site.xml/script? I did not get this clearly; please elaborate. bq. At least, the label checking should be kept configurable in distributed mode. – just ignore all the labels for that node if invalid labels exists might be a good way when it enabled. In your earlier statement you said it affects scheduling; if so, and if the check is kept configurable, how will that solve the problem? But what was clear was * Support to add and remove valid labels at the centralized level is required * RM will do the label validation on NM registration/heartbeat * If, while validating (during NM registration/heartbeat), one of the labels fails for a given node, then we will just ignore all the labels for that node. *For [~sunilg] comments:* bq. If any such node label is invalid as per RM, then how this will be reported back to NM? Error Handling? I have the same doubt, and feel that usability will suffer if the script is executed in one place, the validation happens in another, and the error is not propagated back to the NM. bq. But imagine a 1000 node cluster, and then with changing labels per heartbeat, will this be a bottleneck? We will not be changing labels on every heartbeat; I will try to ensure that, during a heartbeat, the NM sends the updated label set only if the labels have changed from the node's previous set.
But there will still be a contention issue: suppose a script is modified and all 2000 nodes want to update their labels at the same time. Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
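A minimal sketch of the "report only on change" idea described in the comment above (illustrative only, not the attached work; the class and method names are invented): the NM remembers the last label set it reported and includes labels in a heartbeat only when the current set differs.
{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the NM status updater: remember the last label set
// that was reported and include labels in a heartbeat only when they change.
class NodeLabelReporter {
  private Set<String> lastReported = Collections.emptySet();

  /** Returns the labels to send in this heartbeat, or null if unchanged. */
  Set<String> labelsToReport(Set<String> current) {
    if (current.equals(lastReported)) {
      return null;
    }
    lastReported = new HashSet<>(current);
    return lastReported;
  }
}
{code}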
[jira] [Commented] (YARN-2377) Localization exception stack traces are not passed as diagnostic info
[ https://issues.apache.org/jira/browse/YARN-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171042#comment-14171042 ] Hudson commented on YARN-2377: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2377. Localization exception stack traces are not passed as diagnostic info. Contributed by Gera Shegalov (jlowe: rev a56ea0100215ecf2e1471a18812b668658197239) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/SerializedException.java * hadoop-yarn-project/CHANGES.txt Localization exception stack traces are not passed as diagnostic info - Key: YARN-2377 URL: https://issues.apache.org/jira/browse/YARN-2377 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov Assignee: Gera Shegalov Fix For: 2.6.0 Attachments: YARN-2377.v01.patch, YARN-2377.v02.patch In the Localizer log one can only see this kind of message {code} 14/07/31 10:29:00 INFO localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://ha-nn-uri-0:8020/tmp/hadoop-yarn/staging/gshegalov/.staging/job_1406825443306_0004/job.jar, 1406827248944, PATTERN, (?:classes/|lib/).* }, java.net.UnknownHostException: ha-nn-uri-0 {code} And then only the {{java.net.UnknownHostException: ha-nn-uri-0}} message is propagated as diagnostics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
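The intent here is that the NM put the whole serialized exception, not just its terminal message, into the container diagnostics. A rough sketch of that idea (illustrative only, not the committed change; the helper class is invented):
{code}
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper: build a diagnostics string that carries the full stack
// trace of a localization failure instead of only throwable.getMessage().
class LocalizationDiagnostics {
  static String fullStackTrace(Throwable throwable) {
    StringWriter out = new StringWriter();
    throwable.printStackTrace(new PrintWriter(out, true));
    return out.toString();
  }
}
{code}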
[jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
[ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171038#comment-14171038 ] Hudson commented on YARN-2308: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2308. Changed CapacityScheduler to explicitly throw exception if the queue (jianhe: rev f9680d9a160ee527c8f2c1494584abf1a1f70f82) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java Missing Changes.txt for YARN-2308 (jianhe: rev 178bc505da5d06d591a19aac13c040c6a9cf28ad) * hadoop-yarn-project/CHANGES.txt NPE happened when RM restart after CapacityScheduler queue configuration changed - Key: YARN-2308 URL: https://issues.apache.org/jira/browse/YARN-2308 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: chang li Priority: Critical Fix For: 2.6.0 Attachments: YARN-2308.0.patch, YARN-2308.1.patch, jira2308.patch, jira2308.patch, jira2308.patch I encountered a NPE when RM restart {code} 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:744) {code} And RM will be failed to restart. This is caused by queue configuration changed, I removed some queues and added new queues. So when RM restarts, it tries to recover history applications, and when any of queues of these applications removed, NPE will be raised. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
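The commit message above hints at the shape of the fix: make the CapacityScheduler fail the recovered application with an explicit error when its queue no longer exists, rather than dereferencing a null queue. A minimal sketch of such a guard (illustrative only; the class and message are invented, not the committed code):
{code}
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

// Hypothetical guard for the recovery path: fail loudly with a clear message
// when an application's queue was removed from capacity-scheduler.xml,
// instead of hitting a NullPointerException later.
class QueueExistenceCheck {
  static void checkQueueExists(Object queue, String queueName, String appId) {
    if (queue == null) {
      throw new YarnRuntimeException("Application " + appId
          + " was submitted to queue '" + queueName
          + "', which no longer exists in the scheduler configuration");
    }
  }
}
{code}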
[jira] [Commented] (YARN-2641) Decommission nodes on -refreshNodes instead of next NM-RM heartbeat
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171034#comment-14171034 ] Hudson commented on YARN-2641: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2641. Decommission nodes on -refreshNodes instead of next NM-RM heartbeat. (Zhihai Xu via kasha) (kasha: rev da709a2eac7110026169ed3fc4d0eaf85488d3ef) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java Decommission nodes on -refreshNodes instead of next NM-RM heartbeat --- Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.7.0 Attachments: YARN-2641.000.patch, YARN-2641.001.patch, YARN-2641.002.patch, YARN-2641.003.patch Improve node decommission latency in the RM. Currently the node decommission only happens after the RM receives a nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of in nodeHeartbeat (ResourceTrackerService). A more serious case: after the RM is refreshed (refreshNodes), if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode will never be decommissioned in the RM; it will only expire after yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
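A rough sketch of the behaviour the change description implies (a code fragment for illustration only, assuming the usual RM event plumbing; the isExcluded() helper is invented): when -refreshNodes is processed, immediately send a DECOMMISSION event for every running node that is now excluded, instead of waiting for that node's next heartbeat.
{code}
// Illustrative fragment, not the committed patch: decommission newly excluded
// nodes as part of handling -refreshNodes.
for (RMNode node : rmContext.getRMNodes().values()) {
  if (isExcluded(node.getHostName())) {   // isExcluded(): invented helper
    rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeEvent(node.getNodeID(), RMNodeEventType.DECOMMISSION));
  }
}
{code}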
[jira] [Commented] (YARN-2651) Spin off the LogRollingInterval from LogAggregationContext
[ https://issues.apache.org/jira/browse/YARN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171035#comment-14171035 ] Hudson commented on YARN-2651: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2651. Spun off LogRollingInterval from LogAggregationContext. Contributed by Xuan Gong. (zjshen: rev 4aed2d8e91c7dccc78fbaffc409d3076c3316289) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/LogAggregationContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationContext.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManagerRecovery.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto Spin off the LogRollingInterval from LogAggregationContext -- Key: YARN-2651 URL: https://issues.apache.org/jira/browse/YARN-2651 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2651.1.1.patch, YARN-2651.1.patch Remove per-app rolling interval completely and then have nodemanager wake up every so often and upload old log files. The wake up time is per-NM configuration, and is decoupled with the actual app's log rolling interval. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
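The description boils down to a single NM-wide timer driving uploads of already-rolled log files. A minimal sketch under that reading (illustrative only; the class is invented and is not the actual AppLogAggregatorImpl code):
{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical per-NM rolling log uploader: one node-wide interval, decoupled
// from any per-application setting, periodically uploads old log files.
class RollingLogUploader {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start(long intervalSeconds, Runnable uploadOldLogFiles) {
    scheduler.scheduleAtFixedRate(
        uploadOldLogFiles, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  void stop() {
    scheduler.shutdown();
  }
}
{code}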
[jira] [Commented] (YARN-2667) Fix the release audit warning caused by hadoop-yarn-registry
[ https://issues.apache.org/jira/browse/YARN-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171033#comment-14171033 ] Hudson commented on YARN-2667: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2667. Fix the release audit warning caused by hadoop-yarn-registry. Contributed by Yi Liu (jlowe: rev 344a10ad5e26c25abd62eda65eec2820bb808a74) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/pom.xml Fix the release audit warning caused by hadoop-yarn-registry Key: YARN-2667 URL: https://issues.apache.org/jira/browse/YARN-2667 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2667.001.patch ? /home/jenkins/jenkins-slave/workspace/PreCommit-HADOOP-Build/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/resources/.keep Lines that start with ? in the release audit report indicate files that do not have an Apache license header. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) DefaultContainerExecutor should pick a working directory randomly
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171044#comment-14171044 ] Hudson commented on YARN-2566: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1926 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1926/]) YARN-2566. DefaultContainerExecutor should pick a working directory randomly. (Zhihai Xu via kasha) (kasha: rev cc93e7e683fa74eb1a7aa2b357a36667bd21086a) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt DefaultContainerExecutor should pick a working directory randomly - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Fix For: 2.6.0 Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch, YARN-2566.008.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir because it does not have enough disk space, localization fails even though there is plenty of disk space in the other localDirs.
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at
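The title states the direction of the fix: stop hard-coding the first local directory. A toy sketch of the simplest form of that idea (illustrative only; the committed patch is more involved): pick one of the configured local dirs at random.
{code}
import java.util.List;
import java.util.Random;

// Toy illustration: choose one of the configured local directories at random
// instead of always using localDirs.get(0).
class WorkingDirPicker {
  private final Random random = new Random();

  String pick(List<String> localDirs) {
    if (localDirs.isEmpty()) {
      throw new IllegalStateException("No local directories configured");
    }
    return localDirs.get(random.nextInt(localDirs.size()));
  }
}
{code}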
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171325#comment-14171325 ] Wangda Tan commented on YARN-2314: -- Hi [~jlowe], Thanks for your comment. I also agree that caching the proxy object itself may not be necessary. The behavior in my mind should be that the admin can configure whether the container-management proxy cache is disabled. - If it is disabled, IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY will be set to 0 and all the cache logic will be disabled, as you have done in your patch. - If it is enabled, we should keep the existing behavior (or improve the LRU cache as in the other patch in this JIRA); basically, it's better to keep it. I'm a little doubt about if there is any other potential bug if we completely remove it. Thanks, ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171332#comment-14171332 ] Steve Loughran commented on YARN-2689: -- Example of one failing test. {code} Running org.apache.hadoop.registry.secure.TestSecureRegistry Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 10.562 sec FAILURE! - in org.apache.hadoop.registry.secure.TestSecureRegistry testZookeeperCanWrite(org.apache.hadoop.registry.secure.TestSecureRegistry) Time elapsed: 0.344 sec ERROR! org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user at org.apache.zookeeper.server.ServerCnxnFactory.configureSaslLogin(ServerCnxnFactory.java:207) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:87) at org.apache.hadoop.registry.server.services.MicroZookeeperService.serviceStart(MicroZookeeperService.java:237) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.registry.secure.AbstractSecureRegistryTest.startSecureZK(AbstractSecureRegistryTest.java:352) at org.apache.hadoop.registry.secure.TestSecureRegistry.testZookeeperCanWrite(TestSecureRegistry.java:83) {code} Searching for the string "Unable to obtain password from user" suggests that a common cause is a misspelled principal name, so the lookup against whatever Kerberos state exists fails. TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2588) Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception.
[ https://issues.apache.org/jira/browse/YARN-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171336#comment-14171336 ] Jian He commented on YARN-2588: --- Rohith, thanks for the patch. I have a couple of comments: - the stopActiveServices() call may not be necessary? The AbstractService class internally should call stop if any exception occurs. {code} } catch (Exception e) { stopActiveServices(); {code} - maybe we can invoke the following right before each time we start the active services? {code} createAndInitActiveServices(); {code} - fix the following code comment {code} // @Test(timeout = 3) {code} Standby RM does not transitionToActive if previous transitionToActive is failed with ZK exception. -- Key: YARN-2588 URL: https://issues.apache.org/jira/browse/YARN-2588 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.6.0, 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2588.patch Consider scenario where, StandBy RM is failed to transition to Active because of ZK exception(connectionLoss or SessionExpired). Then any further transition to Active for same RM does not move RM to Active state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171340#comment-14171340 ] Li Lu commented on YARN-2314: - Hi [~wangda], maybe we want to leave a note in the config, saying that enabling the RPC cache may cause problems for large cluster (so that people would know the possible side-effect of enabling this)? ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171343#comment-14171343 ] Steve Loughran commented on YARN-2689: -- output {code} 2014-10-14 11:31:50,756 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: classTeardown entered state INITED 2014-10-14 11:31:50,772 [main] DEBUG service.CompositeService (CompositeService.java:serviceInit(104)) - classTeardown: initing services, size=0 2014-10-14 11:31:50,772 [main] DEBUG service.CompositeService (CompositeService.java:serviceStart(115)) - classTeardown: starting services, size=0 2014-10-14 11:31:50,772 [main] DEBUG service.AbstractService (AbstractService.java:start(197)) - Service classTeardown is started 2014-10-14 11:31:50,772 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: registrySecurity entered state INITED 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(259)) - Configuration: 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(260)) - --- 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - debug: true 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - transport: TCP 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.ticket.lifetime: 8640 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.name: EXAMPLE 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.port: 0 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.domain: COM 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.renewable.lifetime: 60480 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - instance: DefaultKrbServer 2014-10-14 11:31:50,788 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.bind.address: localhost 2014-10-14 11:31:50,803 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(264)) - --- 2014-10-14 11:31:58,772 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(480)) - MiniKdc listening at port: 49764 2014-10-14 11:31:58,772 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(481)) - MiniKdc setting JVM krb5.conf to: C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\1413311510788\krb5.conf 2014-10-14 11:31:59,334 [main] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:setupKDCAndPrincipals(218)) - zookeeper { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab principal=zookeeper useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; ZOOKEEPER_SERVER { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab principal=zookeeper/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; alice { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\alice.keytab principal=alice/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; bob { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\bob.keytab 
principal=bob/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; 2014-10-14 11:31:59,479 [JUnit] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:login(325)) - Logging in as zookeeper/localhost in context ZOOKEEPER_SERVER with keytab target\kdc\zookeeper.keytab Debug is true storeKey true useTicketCache true useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab refreshKrb5Config is true principal is zookeeper/localhost tryFirstPass is false useFirstPass is false storePass is false clearPass is false Refreshing Kerberos configuration Acquire TGT from Cache Principal is zookeeper/localh...@example.com null credentials from Ticket Cache principal is zookeeper/localh...@example.com Will use keytab Commit Succeeded 2014-10-14 11:31:59,693 [JUnit] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: test-testZookeeperCanWrite entered state INITED 2014-10-14 11:31:59,693 [JUnit] INFO secure.AbstractSecureRegistryTest
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171348#comment-14171348 ] Wangda Tan commented on YARN-2314: -- [~gtCarrera9], Agreed, and disabled should be the default behavior. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171354#comment-14171354 ] Zhijie Shen commented on YARN-2656: --- Kick the jenkins again, as HADOOP-11181 is already committed. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2314: - Attachment: YARN-2314.patch Attaching a patch that allows the existing yarn.client.max-nodemanagers-proxies to be zero to indicate the proxy cache is disabled. Also per Wangda's comment the default is 0 (i.e.: cache is disabled). If disabled it sets the idle timeout to zero, otherwise it leaves it untouched and caches the proxy objects. The comment for the property was updated to also mention the issue with lingering connection threads and the potential for the cache to cause problems on large clusters. This patch also includes my earlier prototype fix to keep the cache from accidentally increasing in size if connections are busy. bq. I'm a little doubt about if there is any other potential bug if we completely remove it. I'm on the other side of that fence, since we ran for a long time on Hadoop 0.23 without this cache and did not see issues. We've already found two issues with the cache (grows above the specified size and accumulates lingering connection threads), and I have yet to see evidence it is needed. If anything there's some evidence to the contrary from us and Sangjin. But in case someone running on a smaller cluster really is depending upon this cache for some use case, the patch tries to let large clusters work yet small cluster users can turn on this cache. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
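A minimal sketch of the configuration behaviour described above (illustrative only, not the attached patch; it uses the property strings named in this thread): treat a value of 0 for yarn.client.max-nodemanagers-proxies as "cache disabled" and, in that case, force the IPC connection idle timeout to 0 so NM connections are torn down promptly after use.
{code}
import org.apache.hadoop.conf.Configuration;

// Illustration only: when the NM proxy cache is disabled (size 0), also drop
// the IPC connection idle timeout to zero.
class ProxyCacheSetupSketch {
  static void configure(Configuration conf) {
    int maxProxies = conf.getInt("yarn.client.max-nodemanagers-proxies", 0);
    if (maxProxies == 0) {
      conf.setInt("ipc.client.connection.maxidletime", 0);
    }
  }
}
{code}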
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171408#comment-14171408 ] Bikas Saha commented on YARN-2314: -- Folks, this is something that would be of interest in Tez since it uses the ContainerManagementProtocolProxy. My summary understanding is that the default is to turn this proxy cache off, and this improves things for large-scale clusters. So when Tez moves to 2.6, it will automatically pick up the defaults (which turn caching off) and benefit on large clusters. Is that correct? ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171420#comment-14171420 ] Jason Lowe commented on YARN-2314: -- Yes, the patch sets the default to off since that allows all cluster sizes to work. If it's crucial to default to enabled for small clusters then those with large clusters will have to manually configure the cache off. Again I have yet to see evidence this cache is necessary, so defaulting to something that doesn't fail for all cluster sizes seemed like a better choice than one which would work for some but not others. If you have evidence where Tez absolutely has to have this cache enabled that would be good to share. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2688) Better diagnostics on Container Launch failures
[ https://issues.apache.org/jira/browse/YARN-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171426#comment-14171426 ] Gera Shegalov commented on YARN-2688: - Localizer diagnostics was improved by YARN-2377. Better diagnostics on Container Launch failures --- Key: YARN-2688 URL: https://issues.apache.org/jira/browse/YARN-2688 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy We need better diagnostics on container launch failures due to errors like localizations issues, wrong command for container launch etc. Currently, if the container doesn't launch, we get nothing - not even container logs since there are no logs to aggregate either. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171429#comment-14171429 ] Bikas Saha commented on YARN-2314: -- To be clear, my question was only to clarify if Tez would get the benefits without doing anything because the defaults are correct. Looks like that is the case. Thanks! ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171443#comment-14171443 ] Jason Lowe commented on YARN-2314: -- So Tez will automatically benefit on large clusters because the default is to not use the cache. However if we've found empirically that Tez needs the proxy cache to perform well then this patch would be a performance hit for Tez by default on clusters where the cache issues weren't a problem. I wasn't sure which default benefit you were referring to above (running faster because cache is enabled or working on a large cluster because cache is disabled). If Tez shows significant improvements with this cache turned on then I could see an argument to have the cache on by default since small clusters are common and large clusters are rare. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171456#comment-14171456 ] Hadoop QA commented on YARN-2656: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674303/YARN-2656.4.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.ha.TestZKFailoverControllerStress {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5388//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5388//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5388//console This message is automatically generated. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171460#comment-14171460 ] Vinod Kumar Vavilapalli commented on YARN-2496: --- More comments ParentQueue - assignToQueue(): We are only checking that at least one label is within maximum capacity. Bug? - assignToQueue() - canAssignToThisQueue - Not related to your patch, but removeApplication() can be private. Similarly assignContainersToChildQueues, printChildQueues. - Can avoid multiple calls to labelManager.getLabelsOnNode(node.getNodeID()) inside assignContainers. - getACLs() should be pushed up to AbstractQueue. - I think reservations are still not handled per accessible node-labels in the patch. We can fix it separately though. - Sorting queues doesn't take node-labels into account. Again, we can fix it separately. - Explicitly mark calls to allocateResource() and releaseResource with super for better readability. - We should change printChildQueues() and getChildQueuesToPrint() to print node-label associations too. - The following check in LeafQueue needs to be present in ParentQueue too? {code} // if our queue cannot access this node, just return if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels, labelManager.getLabelsOnNode(node.getNodeID( { return NULL_ASSIGNMENT; } {code} AbstractQueue - queueComparator should be pushed down to ParentQueue. - releaseResource() should be protected More to come. Changes for capacity scheduler to support allocate resource respect labels -- Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496-20141009-1.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171459#comment-14171459 ] Bikas Saha commented on YARN-2314: -- My understanding from the comments was that in most cases this cache was adding overhead without benefit since the RPC layer was not controlled by the cache. We have no empirical evidence either way about the performance. If you know of cases where this change of default might cause issues, then it would be helpful if they were enumerated in a comment. Then Tez/other users could test for those cases when they upgrade to 2.6 and make their own choices. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171470#comment-14171470 ] Jason Lowe commented on YARN-2312: -- Sorry for the late reply. +1 lgtm as well. I noticed the patch doesn't apply cleanly to branch-2 because TestCheckpointPreemptionPolicy.java is missing from that branch. That made me wonder if there were any changes needed for branch-2 that aren't on trunk, but I didn't find any from a simple search. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch, YARN-2312.6.patch, YARN-2312.7.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
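For call sites being updated under this JIRA, the substitution looks roughly like this (illustrative only): move from the deprecated int-valued getId() to the long-valued getContainerId(), which keeps the epoch introduced by YARN-2229.
{code}
import org.apache.hadoop.yarn.api.records.ContainerId;

// Illustrative call-site migration: prefer the long-valued accessor that
// preserves the epoch over the deprecated int-valued one.
class ContainerIdMigration {
  static long idOf(ContainerId containerId) {
    // old, deprecated (loses the epoch): containerId.getId()
    return containerId.getContainerId();
  }
}
{code}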
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171475#comment-14171475 ] Jason Lowe commented on YARN-2314: -- The only issue I can think of is the idle timeout change that goes along with the cache being disabled. Since we disable the cache by default we also, by default, set the cm proxy connection idle timeouts to zero. That means for each cm proxy RPC call we will create a new connection to the NM. That sounds expensive, and probably was the motivation for the creation of the cache, but in practice it doesn't seem to matter (at least for the loads we tested which didn't include Tez). For our case we were comparing 2.x against 0.23, and 0.23 was slightly faster in the AM scalability test than 2.x despite 2.x having this cache and 0.23 lacking it. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171476#comment-14171476 ] Hadoop QA commented on YARN-2314: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674827/YARN-2314.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.TestRMFailover {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5389//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5389//console This message is automatically generated. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171512#comment-14171512 ] Steve Loughran commented on YARN-2689: -- low-level stack replicated {code} javax.security.auth.login.LoginException: Unable to obtain password from user at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:856) at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:719) at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:584) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at javax.security.auth.login.LoginContext.invoke(LoginContext.java:762) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:203) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:690) at javax.security.auth.login.LoginContext$4.run(LoginContext.java:688) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:687) at javax.security.auth.login.LoginContext.login(LoginContext.java:595) at org.apache.zookeeper.Login.login(Login.java:292) at org.apache.zookeeper.Login.init(Login.java:93) at org.apache.hadoop.registry.secure.TestSecureRegistry.testLowlevelZKSaslLogin(TestSecureRegistry.java:81) {code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-570) Time strings are formated in different timezone
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171535#comment-14171535 ] Ray Chiang commented on YARN-570: - One bug. The RM UI application table ends up with times like: Tue Oct 14 14:4:27 -0700 2014 (for 2:04 PM) Tue Oct 14 14:5:6 -0700 2014 (for 2:05 PM) One comment. The RM About section always shows time local to the node the RM is running on. The RM UI application table always shows the local time of the machine/browser. That fits given the Javascript/Java discrepancy, but it could be confusing in a completely different way. Time strings are formated in different timezone --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0 Reporter: Peng Zhang Assignee: Akira AJISAKA Attachments: MAPREDUCE-5141.patch, YARN-570.2.patch, YARN-570.3.patch, YARN-570.4.patch Time strings on different pages are displayed in different timezones. If it is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT If it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56 Same value, but different timezone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
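The broken strings above ("14:4:27", "14:5:6") point at missing zero-padding in the minute and second fields. A tiny Java-side illustration of the expected formatting (the actual table is rendered in JavaScript by yarn.dt.plugins.js, so this only shows the intended output):
{code}
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustration only: HH/mm/ss in SimpleDateFormat are zero-padded, so 2:04 PM
// renders as "14:04:27", never "14:4:27".
class TimeFormatSketch {
  static String format(long millis) {
    return new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy")
        .format(new Date(millis));
  }
}
{code}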
[jira] [Updated] (YARN-1542) Add unit test for public resource on viewfs
[ https://issues.apache.org/jira/browse/YARN-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated YARN-1542: Attachment: YARN-1542.v05.patch v05: rebasing the patch again. Add unit test for public resource on viewfs --- Key: YARN-1542 URL: https://issues.apache.org/jira/browse/YARN-1542 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1542.v01.patch, YARN-1542.v02.patch, YARN-1542.v03.patch, YARN-1542.v04.patch, YARN-1542.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171545#comment-14171545 ] Steve Loughran commented on YARN-2689: -- login is looking for domained principal {code} JVM krb5.conf to: C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\1413321345247\krb5.conf 2014-10-14 14:15:53,044 [main] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:setupKDCAndPrincipals(219)) - zookeeper { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab debug=true principal=zookeeper useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; ZOOKEEPER_SERVER { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab debug=true principal=zookeeper/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; alice { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\alice.keytab debug=true principal=alice/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; bob { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\bob.keytab debug=true principal=bob/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is C:Workhadoop-trunkhadoop-yarn-projecthadoop-yarnhadoop-yarn-registry argetkdczookeeper.keytab refreshKrb5Config is false principal is zookeeper/localhost tryFirstPass is false useFirstPass is false storePass is false clearPass is false Key for the principal zookeeper/localh...@example.com not available in C:Workhadoop-trunkhadoop-yarn-projecthadoop-yarnhadoop-yarn-registry argetkdczookeeper.keytab [Krb5LoginModule] authentication failed Unable to obtain password from user {code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171561#comment-14171561 ] Steve Loughran commented on YARN-2689: -- correction: login is looking for a keytab string that needs to be escaped in the jaas conf file TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
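A minimal sketch of the kind of escaping fix being described here, assuming the JAAS entry is generated from a java.io.File keytab (the class and method names are hypothetical, not the actual test code); the later trace in this thread shows the generated entries ending up with forward slashes, which is what this normalization produces:
{code}
import java.io.File;

// Illustrative only: normalize the keytab path before writing it into the
// generated JAAS entry, so backslashes in a Windows path are not swallowed
// as escape sequences (\t, \z, ...) by the JAAS config parser.
public class JaasEntryBuilder {
  static String jaasEntry(String context, File keytab, String principal) {
    String keytabPath = keytab.getAbsolutePath().replace('\\', '/');
    return context + " {\n"
        + "  com.sun.security.auth.module.Krb5LoginModule required\n"
        + "  keyTab=\"" + keytabPath + "\"\n"
        + "  principal=\"" + principal + "\"\n"
        + "  useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true;\n"
        + "};\n";
  }
}
{code}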
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171587#comment-14171587 ] Steve Loughran commented on YARN-2689: -- After fix ZK comes up registry not playing nice, permissions? {code} testUserZookeeperHomePathAccess(org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations) Time elapsed: 1.067 s ec ERROR! org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.registry.client.exceptions.AuthenticationFailedExcept ion: `/registry': Authentication Failed: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = Aut hFailed for /registry: KeeperErrorCode = AuthFailed for /registry at org.apache.zookeeper.KeeperException.create(KeeperException.java:123) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688) at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672) at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668) at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:423) at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44) at org.apache.hadoop.registry.client.impl.zk.CuratorService.zkMkPath(CuratorService.java:544) at org.apache.hadoop.registry.client.impl.zk.CuratorService.maybeCreate(CuratorService.java:431) at org.apache.hadoop.registry.server.services.RegistryAdminService.createRootRegistryPaths(RegistryAdminService. 
java:246) at org.apache.hadoop.registry.server.services.RegistryAdminService.serviceStart(RegistryAdminService.java:215) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:10 5) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:97 ) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations.startRMRegistryOperations(TestSecureRMRegist ryOperations.java:96) at org.apache.hadoop.registry.secure.TestSecureRMRegistryOperations.testUserZookeeperHomePathAccess(TestSecureRM RegistryOperations.java:226) {code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1542) Add unit test for public resource on viewfs
[ https://issues.apache.org/jira/browse/YARN-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171594#comment-14171594 ] Hadoop QA commented on YARN-1542: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674850/YARN-1542.v05.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5390//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5390//console This message is automatically generated. Add unit test for public resource on viewfs --- Key: YARN-1542 URL: https://issues.apache.org/jira/browse/YARN-1542 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Gera Shegalov Assignee: Gera Shegalov Attachments: YARN-1542.v01.patch, YARN-1542.v02.patch, YARN-1542.v03.patch, YARN-1542.v04.patch, YARN-1542.v05.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171625#comment-14171625 ] Steve Loughran commented on YARN-2689: -- more traces; {code} 2014-10-14 15:07:41,573 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: classTeardown entered state INITED 2014-10-14 15:07:41,573 [main] DEBUG service.CompositeService (CompositeService.java:serviceInit(104)) - classTeardown: initing services, size=0 2014-10-14 15:07:41,573 [main] DEBUG service.CompositeService (CompositeService.java:serviceStart(115)) - classTeardown: starting services, size=0 2014-10-14 15:07:41,573 [main] DEBUG service.AbstractService (AbstractService.java:start(197)) - Service classTeardown is started 2014-10-14 15:07:41,587 [main] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: registrySecurity entered state INITED 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(259)) - Configuration: 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(260)) - --- 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - debug: true 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - transport: TCP 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.ticket.lifetime: 8640 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.name: EXAMPLE 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.port: 0 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - org.domain: COM 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - max.renewable.lifetime: 60480 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - instance: DefaultKrbServer 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(262)) - kdc.bind.address: localhost 2014-10-14 15:07:41,603 [main] INFO minikdc.MiniKdc (MiniKdc.java:init(264)) - --- 2014-10-14 15:07:48,259 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(480)) - MiniKdc listening at port: 50351 2014-10-14 15:07:48,259 [main] INFO minikdc.MiniKdc (MiniKdc.java:initKDCServer(481)) - MiniKdc setting JVM krb5.conf to: C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\1413324461603\krb5.conf 2014-10-14 15:07:48,775 [main] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:setupKDCAndPrincipals(219)) - zookeeper { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/zookeeper.keytab debug=true principal=zookeeper useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; ZOOKEEPER_SERVER { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/zookeeper.keytab debug=true principal=zookeeper/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; alice { com.sun.security.auth.module.Krb5LoginModule required keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/alice.keytab debug=true principal=alice/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; bob { com.sun.security.auth.module.Krb5LoginModule required 
keyTab=C:/Work/hadoop-trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/target/kdc/bob.keytab debug=true principal=bob/localhost useKeyTab=true useTicketCache=false doNotPrompt=true storeKey=true; }; 2014-10-14 15:07:48,900 [JUnit] INFO secure.AbstractSecureRegistryTest (AbstractSecureRegistryTest.java:login(328)) - Logging in as zookeeper/localhost in context ZOOKEEPER_SERVER with keytab target\kdc\zookeeper.keytab Debug is true storeKey true useTicketCache true useKeyTab true doNotPrompt true ticketCache is null isInitiator true KeyTab is C:\Work\hadoop-trunk\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-registry\target\kdc\zookeeper.keytab refreshKrb5Config is true principal is zookeeper/localhost tryFirstPass is false useFirstPass is false storePass is false clearPass is false Refreshing Kerberos configuration Acquire TGT from Cache Principal is zookeeper/localh...@example.com null credentials from Ticket Cache principal is zookeeper/localh...@example.com Will use keytab Commit Succeeded 2014-10-14 15:07:49,165 [JUnit] DEBUG service.AbstractService (AbstractService.java:enterState(452)) - Service: test-testUserZookeeperHomePathAccess entered state INITED 2014-10-14 15:07:49,165 [JUnit]
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171628#comment-14171628 ] Steve Loughran commented on YARN-2689: -- key section {code} 15:07:49 PDT 2014 Entered Krb5Context.initSecContext with state=STATE_NEW Found ticket for zookee...@example.com to go to krbtgt/example@example.com expiring on Wed Oct 15 15:07:49 PDT 2014 Service ticket not found in the subject KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database at sun.security.krb5.KrbTgsRep.init(KrbTgsRep.java:73) at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:192) at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:203) at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:309) at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:115) at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:454) at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:641) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248) at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193) at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:366) at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:363) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:362) at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:348) at org.apache.zookeeper.client.ZooKeeperSaslClient.sendSaslPacket(ZooKeeperSaslClient.java:420) at org.apache.zookeeper.client.ZooKeeperSaslClient.initialize(ZooKeeperSaslClient.java:458) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1013) Caused by: KrbException: Identifier doesn't match expected value (906) at sun.security.krb5.internal.KDCRep.init(KDCRep.java:143) at sun.security.krb5.internal.TGSRep.init(TGSRep.java:66) at sun.security.krb5.internal.TGSRep.init(TGSRep.java:61) at sun.security.krb5.KrbTgsRep.init(KrbTgsRep.java:55) ... 18 more 2014-10-14 15:07:49,816 [JUnit-SendThread(127.0.0.1:50366)] ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:createSaslToken(384)) - An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - Server not found in Kerberos database)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. 2014-10-14 15:07:49,816 [JUnit-SendThread(127.0.0.1:50366)] ERROR zookeeper.ClientCnxn (ClientCnxn.java:run(1015)) - SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - Server not found in Kerberos database)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. 
{code} TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows
[ https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171700#comment-14171700 ] Hadoop QA commented on YARN-2689: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674868/YARN-2689-001.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5391//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5391//console This message is automatically generated. TestSecureRMRegistryOperations failing on windows - Key: YARN-2689 URL: https://issues.apache.org/jira/browse/YARN-2689 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows server, Java 7, ZK 3.4.6 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2689-001.patch the micro ZK service used in the {{TestSecureRMRegistryOperations}} test doesnt start on windows, {code} org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could not configure server because SASL configuration did not allow the ZooKeeper server to authenticate itself properly: javax.security.auth.login.LoginException: Unable to obtain password from user {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171717#comment-14171717 ] Li Lu commented on YARN-2673: - After checking through the code, I'm planning to separate this into two steps. In the first step, we may want to add retry mechanisms to the jersey client that posts timeline entities and domains. After this step, timeline clients for non-secured clusters will be able to retry when the timeline server is down. Then, on top of this, we can add a retry mechanism to the delegation token calls for secured clusters. I'll focus on the non-secured part in this Jira. All secured cluster/token related retry mechanisms will be handled in a separate Jira. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171718#comment-14171718 ] zhihai xu commented on YARN-2682: - Hi [~rusanu], thanks for the information, I will call getWorkingDir in WSCE. So the WSCE will also randomly pick the local directory. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2682: Attachment: YARN-2682.001.patch WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch, YARN-2682.001.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171733#comment-14171733 ] zhihai xu commented on YARN-2682: - I attached a new patch YARN-2682.001.patch to call getWorkingDir instead of getFirstApplicationDir in WSCE and remove the getFirstApplicationDir function in DCE. This patch doesn't need a new unit test because the unit test for getWorkingDir is already done in TestDefaultContainerExecutor. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch, YARN-2682.001.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
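A rough sketch of the "randomly pick the local directory" behavior referred to above. All names here are hypothetical (the real getWorkingDir in DefaultContainerExecutor may weight the choice by available space rather than picking uniformly):
{code}
import java.util.List;
import java.util.Random;
import org.apache.hadoop.fs.Path;

// Illustrative only: choose one NM local dir at random for the container
// working dir instead of always taking the first one.
class WorkingDirPicker {
  private final Random rand = new Random();

  Path pickWorkingDir(List<String> localDirs, String user, String appId) {
    String base = localDirs.get(rand.nextInt(localDirs.size()));
    return new Path(base, "usercache/" + user + "/appcache/" + appId);
  }
}
{code}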
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171756#comment-14171756 ] Xuan Gong commented on YARN-2656: - Overall looks good. One comment:
{code}
+  public void doFilter(ServletRequest request, ServletResponse response,
+      FilterChain filterChain) throws IOException, ServletException {
+    HttpServletRequest req = (HttpServletRequest) request;
+    // For backward compatibility, allow use of the old header field
+    final String oldHeader = req.getHeader(OLD_HEADER);
+    if (oldHeader != null && !oldHeader.isEmpty()) {
+      String newHeader =
+          req.getHeader(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER);
+      if (newHeader == null || newHeader.isEmpty()) {
+        HttpServletRequestWrapper wrapper = new HttpServletRequestWrapper(req) {
+          @Override
+          public String getHeader(String name) {
+            if (name
+                .equals(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER)) {
+              return oldHeader;
+            }
+            return super.getHeader(name);
+          }
+        };
+        super.doFilter(wrapper, response, filterChain);
+      }
+    } else {
+      super.doFilter(request, response, filterChain);
+    }
{code}
Here, we handled two cases: 1) when oldHeader is null/empty, and 2) when oldHeader is not null/empty and newHeader is null/empty. Do we need to handle the case when oldHeader is not null/empty and newHeader is not null/empty here as well? So maybe we could check newHeader first. As I understand it, if newHeader is not null/empty, it will be used no matter whether oldHeader is set or not. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
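A sketch of the reordering suggested in this comment, written as a replacement body for the same doFilter method above (it reuses OLD_HEADER, req, and super.doFilter from that snippet); this is illustrative only, not the committed fix:
{code}
// Prefer the new header when present; only fall back to wrapping the request
// when just the old header is set; otherwise pass the request through as-is.
String newHeader =
    req.getHeader(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER);
if (newHeader != null && !newHeader.isEmpty()) {
  // New header wins regardless of whether the old one is set.
  super.doFilter(request, response, filterChain);
} else {
  final String oldHeader = req.getHeader(OLD_HEADER);
  if (oldHeader != null && !oldHeader.isEmpty()) {
    HttpServletRequestWrapper wrapper = new HttpServletRequestWrapper(req) {
      @Override
      public String getHeader(String name) {
        if (name.equals(DelegationTokenAuthenticator.DELEGATION_TOKEN_HEADER)) {
          return oldHeader;
        }
        return super.getHeader(name);
      }
    };
    super.doFilter(wrapper, response, filterChain);
  } else {
    super.doFilter(request, response, filterChain);
  }
}
{code}
This ordering also covers the case Xuan raised: when both headers are set, the filter chain is still invoked, using the new header.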
[jira] [Commented] (YARN-2682) WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir.
[ https://issues.apache.org/jira/browse/YARN-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171766#comment-14171766 ] Hadoop QA commented on YARN-2682: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674880/YARN-2682.001.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5392//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5392//console This message is automatically generated. WindowsSecureContainerExecutor should not depend on DefaultContainerExecutor#getFirstApplicationDir. - Key: YARN-2682 URL: https://issues.apache.org/jira/browse/YARN-2682 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2682.000.patch, YARN-2682.001.patch DefaultContainerExecutor won't use getFirstApplicationDir any more. But we can't delete getFirstApplicationDir in DefaultContainerExecutor because WindowsSecureContainerExecutor uses it. We should move getFirstApplicationDir function from DefaultContainerExecutor to WindowsSecureContainerExecutor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171802#comment-14171802 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674883/YARN-2496-20141014-1.patch against trunk revision cdce883. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5393//console This message is automatically generated. Changes for capacity scheduler to support allocate resource respect labels -- Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496-20141009-1.patch, YARN-2496-20141014-1.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA Includes: - Add/parse labels option to {{capacity-scheduler.xml}} similar to other options of queue like capacity/maximum-capacity, etc. - Include a default-label-expression option in queue config, if an app doesn't specify label-expression, default-label-expression of queue will be used. - Check if labels can be accessed by the queue when submit an app with labels-expression to queue or update ResourceRequest with label-expression - Check labels on NM when trying to allocate ResourceRequest on the NM with label-expression - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171803#comment-14171803 ] Varun Vasudev commented on YARN-2656: - If both headers are specified, we should use the new one. That code is only there for backwards compatibility. If the new header is specified, there's no need for us to do anything. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171813#comment-14171813 ] Xuan Gong commented on YARN-2656: - No even need to call super.doFilter() ??? RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171816#comment-14171816 ] Varun Vasudev commented on YARN-2656: - bq. No even need to call super.doFilter() ??? I completely missed that. Great catch! [~zjshen], we need to handle the case Xuan pointed out. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2656: Assignee: Zhijie Shen (was: Varun Vasudev) RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2656: -- Attachment: YARN-2656.5.patch Good catch! Fix the issue in the newest patch. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
[ https://issues.apache.org/jira/browse/YARN-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171851#comment-14171851 ] Beckham007 commented on YARN-2686: -- [~ywskycn] looking forward to the patch. Should we also support libcgroup on Redhat 7? CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 - Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container task file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
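A small sketch of why the comma separator is problematic. This is hypothetical parsing code, not the actual container-executor logic; it only shows that a comma-separated resources option cannot carry a RHEL 7-style mount point that itself contains a comma:
{code}
// Illustration only: splitting a comma-separated resources option breaks a
// mount point such as /sys/fs/cgroup/cpu,cpuacct in two.
public class ResourcesOptionSplit {
  public static void main(String[] args) {
    String resourcesOption =
        "cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_01/tasks";
    for (String piece : resourcesOption.split(",")) {
      System.out.println(piece);
    }
    // Prints:
    //   cgroups=/sys/fs/cgroup/cpu
    //   cpuacct/hadoop-yarn/container_01/tasks
    // i.e. the executor ends up looking at /sys/fs/cgroup/cpu instead of the
    // full cpu,cpuacct path described in this issue.
  }
}
{code}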
[jira] [Updated] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2423: Attachment: YARN-2423.patch The patch adds the GET APIs. I modeled them after the get methods in TimelineReader. It also fixes a bug I ran into in the MemoryTimelineStore where the related entities information was not getting stored. Besides the unit tests, I also verified it in an actual cluster. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Attachments: YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2686) CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7
[ https://issues.apache.org/jira/browse/YARN-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171890#comment-14171890 ] Wei Yan commented on YARN-2686: --- [~beckham007], Redhat doesn't recommend that: "To avoid conflicts, do not use libcgroup tools for default resource controllers (listed in Available Controllers in Red Hat Enterprise Linux 7) that are now an exclusive domain of systemd." So we plan to move the cpu part to use systemd. CgroupsLCEResourcesHandler does not support the default Redhat 7/CentOS 7 - Key: YARN-2686 URL: https://issues.apache.org/jira/browse/YARN-2686 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Beckham007 CgroupsLCEResourcesHandler uses ',' to separate resourcesOption. Redhat 7 uses /sys/fs/cgroup/cpu,cpuacct as the cpu mount dir, so container-executor would use the wrong path /sys/fs/cgroup/cpu as the container task file. It should be /sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/contain_id/tasks. We should use some other character instead of ','. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171918#comment-14171918 ] Xuan Gong commented on YARN-2656: - +1 LGTM RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414.patch Uploaded a patch for this issue. TimelineClient will by default retry for a given amount of time before throwing the exception on posting to the server. There are a few notes: 1. Retrying vs. discarding timeline data: if we do not add this retry, the timeline client will drop the posted data if the first attempt fails. Had an offline discussion with [~vinodkv]; we agreed that blocking the timeline client for a short while is better, since we may not want to drop some critical timeline data. 2. Retry behavior configurations: users can define the maximum retry count and the time interval between consecutive retries. We may want to have two levels of retry settings: a cluster-wide setting managed by yarn-site.xml, and a per-application customized setting. For the cluster setting, I've added two configuration properties, yarn.timeline-service.client.max-retries (default 30) and yarn.timeline-service.client.retry-interval-ms (default 1000). I've also provided a customizeRetrySettings method for application-specific retry settings. 3. Retry implementation: the timeline client does not use RPC, but RESTful APIs, so I'm implementing retry as a jersey filter in this patch. 4. Tests: I added two new unit tests, one to test the customizeRetrySettings API and the other to test that the retry actually happens when we try to post timeline entities. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
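A minimal sketch of the kind of Jersey 1.x client filter described in note 3, under the assumption that the configured max-retries and retry-interval-ms values are passed in by the caller; the class name TimelineRetryFilter and the exact failure handling are illustrative, not the actual patch:
{code}
import com.sun.jersey.api.client.ClientHandlerException;
import com.sun.jersey.api.client.ClientRequest;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.filter.ClientFilter;

// Illustrative bounded-retry filter: resend the request up to maxRetries
// times, sleeping retryIntervalMs between attempts, before giving up.
public class TimelineRetryFilter extends ClientFilter {
  private final int maxRetries;        // e.g. yarn.timeline-service.client.max-retries (30)
  private final long retryIntervalMs;  // e.g. yarn.timeline-service.client.retry-interval-ms (1000)

  public TimelineRetryFilter(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  @Override
  public ClientResponse handle(ClientRequest request) throws ClientHandlerException {
    ClientHandlerException lastError = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return getNext().handle(request);
      } catch (ClientHandlerException e) { // connection refused, timeouts, ...
        lastError = e;
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    throw lastError;
  }
}
{code}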
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171932#comment-14171932 ] Hadoop QA commented on YARN-2656: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674896/YARN-2656.5.patch against trunk revision 0260231. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5394//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5394//artifact/patchprocess/newPatchFindbugsWarningshadoop-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5394//console This message is automatically generated. RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171940#comment-14171940 ] Hadoop QA commented on YARN-2673: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674907/YARN-2673-101414.patch against trunk revision 0260231. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.api.impl.TestTimelineClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5396//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5396//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5396//console This message is automatically generated. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-1.patch Address the comments from findbugs, and retry the unit test failure. Could not reproduce the UT failure locally. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2603) ApplicationConstants missing HADOOP_MAPRED_HOME
[ https://issues.apache.org/jira/browse/YARN-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Chiang updated YARN-2603: - Attachment: YARN-2603-01.patch Add HADOOP_MAPRED_HOME to the list of environment variables. ApplicationConstants missing HADOOP_MAPRED_HOME --- Key: YARN-2603 URL: https://issues.apache.org/jira/browse/YARN-2603 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Labels: newbie Attachments: YARN-2603-01.patch The Environment enum should have HADOOP_MAPRED_HOME listed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
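The change described is small; roughly, one more constant in the {{ApplicationConstants.Environment}} enum alongside the existing HADOOP_*_HOME entries. The sketch below is abridged and illustrative (the surrounding entries and accessor are simplified; only the HADOOP_MAPRED_HOME addition is what this patch is about):
{code}
// Abridged sketch of org.apache.hadoop.yarn.api.ApplicationConstants.Environment;
// only the HADOOP_MAPRED_HOME entry is the addition described in this patch.
public enum Environment {
  HADOOP_COMMON_HOME("HADOOP_COMMON_HOME"),
  HADOOP_HDFS_HOME("HADOOP_HDFS_HOME"),
  HADOOP_YARN_HOME("HADOOP_YARN_HOME"),
  HADOOP_MAPRED_HOME("HADOOP_MAPRED_HOME"); // newly listed environment variable

  private final String variable;

  Environment(String variable) {
    this.variable = variable;
  }

  public String key() {
    return variable;
  }
}
{code}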
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171975#comment-14171975 ] Hadoop QA commented on YARN-2673: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674913/YARN-2673-101414-1.patch against trunk revision 0260231. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.client.api.impl.TestTimelineClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5397//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5397//console This message is automatically generated. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2656) RM web services authentication filter should add support for proxy user
[ https://issues.apache.org/jira/browse/YARN-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14171993#comment-14171993 ] Hudson commented on YARN-2656: -- FAILURE: Integrated in Hadoop-trunk-Commit #6262 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6262/]) YARN-2656. Made RM web services authentication filter support proxy user. Contributed by Varun Vasudev and Zhijie Shen. (zjshen: rev 1220bb72d452521c6f09cebe1dd77341054ee9dd) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/token/delegation/web/DelegationTokenAuthenticationHandler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMAdminService.java RM web services authentication filter should add support for proxy user --- Key: YARN-2656 URL: https://issues.apache.org/jira/browse/YARN-2656 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Varun Vasudev Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2656.3.patch, YARN-2656.4.patch, YARN-2656.5.patch, apache-yarn-2656.0.patch, apache-yarn-2656.1.patch, apache-yarn-2656.2.patch The DelegationTokenAuthenticationFilter adds support for doAs functionality. The RMAuthenticationFilter should expose this as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Debugging the UT failure. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-2183: -- Attachment: YARN-2183-trunk-v5.patch Patch v.5 posted. To help see the overall diffs, you can use this github diff: https://github.com/ctrezzo/hadoop/compare/apache:trunk...sharedcache-3-YARN-2183-cleaner Changes between v.4 and v.5: - https://github.com/ctrezzo/hadoop/commit/1be0a159a739578a0f5d89e6881f6fb63aeccfa6 - https://github.com/ctrezzo/hadoop/commit/b229d3ba5f592f231455526829b09db244264fcb Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172007#comment-14172007 ] Sangjin Lee commented on YARN-2183: --- Thanks Karthik for the review. We have addressed most of your review comments in the updated patch. Items that need more discussion are below. (CleanerService) {quote} runCleanerTask: Instead of checking if there is a scheduled-cleaner-task running here, why not just rely on the check in CleanerTask#run(). Agree, we might be doing a little more work here unnecessarily, but not sure the savings are worth an extra check and an extra parameter in the CleanerTask constructor. {quote} We do have the on-demand cleaner feature (YARN-2189; see below for more), and this check is needed to prevent a race (i.e. not allow an on-demand run when a scheduled run is in progress). {quote} How does a user use runCleanerTask? Instantiate another SCM? The SCM isn't listening to any requests. I can see the SCM being run in the RM, and one could potentially add yarn rmadmin -clean-shared-cache. In any case, given there is no way to reach a running SCM, I would remove runCleanerTask altogether for now, and add it back later when we need it? Thoughts? {quote} As we discussed offline, we do have a YARN admin command implemented that lets you run the cleaner task on demand (YARN-2189). The admin command implements a proper ACL check (based on a YARN admin credential), and sends an RPC request to the running SCM. Since patches are organized this way, it may not have been very obvious by looking at this patch alone. {quote} Should we worry about users starting SCMs with roots at different levels that can lead to multiple cleaners? {quote} In theory it is possible. However, in reality checking that might be being bit too cautious. I think it might be fine without this check. Let me know what you think. (CleanerTask) {quote} Should cleanResourceReferences be moved to SCMStore? {quote} That’s an interesting suggestion. In fact, with the InMemorySCMStore (which already has a reference to an AppChecker to clean up the initial apps) it may be OK to create SCMStore.cleanResourceReferences(). However, it’s not clear to me whether a dependency from an SCMStore to an AppChecker is always a fine requirement for other types of stores. In that sense, I would be hesitant to create this coupling by introducing SCMStore.cleanResourceReferences(). What do you think? {quote} For the race condition (YARN-2663), would it help to handle the delete files on HDFS in the store#remove? {quote} That’s a possibility. However, I think there can be an easier way to fix this race condition. This is partially due to the way the cleaner task is deleting unused files. Currently it deletes the entire directory as opposed to the specific file. The race can be fixed by avoiding deleting the directory. We’ll add the proper fix later on YARN-2663. (CleanerMetrics) {quote} Make initSingleton private and call it in getInstance if the instance is null? {quote} We looked into that, but the difficulty is initSingleton() needs the configuration which getInstance() does not provide. {quote} How about using MutableRate or MutableStat for the rates? {quote} We have removed the rate-specific metrics altogether as they can be derived from the original metrics. {quote} Do we need CleanerMetricsCollector, wouldn't CleanerMetrics extending MetricsSource suffice? {quote} For making the metrics available in JMX, indeed CleanerMetrics would suffice. 
CleanerMetricsCollector was introduced to back a web UI that selects a handful of metrics to show as HTML. That was to come in YARN-2203. Having said that, I think we can remove CleanerMetricsCollector for now and not introduce the web UI. That UI is minimal anyway, and we can consider introducing a more refined web UI at a later point once this version is released.
Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
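To make the race discussed above concrete, here is a minimal sketch of a guard that rejects an on-demand cleaner run while a scheduled run is in progress. The class, method names, and logging are hypothetical and are not taken from the YARN-2183 patch; the actual patch may implement the check differently.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Minimal sketch of a cleaner service that refuses to start an on-demand
 * run while a scheduled run is already in progress. All names here are
 * hypothetical and do not reflect the actual YARN-2183 patch.
 */
public class CleanerServiceSketch {

  // Single flag shared by the scheduled and on-demand entry points.
  private final AtomicBoolean cleanerRunning = new AtomicBoolean(false);

  /** Invoked by the scheduler at a fixed interval. */
  public void runScheduledCleaner() {
    runIfIdle("scheduled");
  }

  /** Invoked by an admin RPC (e.g. the on-demand command from YARN-2189). */
  public boolean runOnDemandCleaner() {
    return runIfIdle("on-demand");
  }

  private boolean runIfIdle(String trigger) {
    // compareAndSet lets only one caller win the flag at a time, so an
    // on-demand request arriving during a scheduled run is simply rejected.
    if (!cleanerRunning.compareAndSet(false, true)) {
      System.out.println("Skipping " + trigger + " run: cleaner already running");
      return false;
    }
    try {
      cleanAllResources(); // the actual cleaning pass
      return true;
    } finally {
      cleanerRunning.set(false);
    }
  }

  private void cleanAllResources() {
    // walk the cache and evict stale entries (omitted)
  }
}
{code}
With a single atomic flag the check lives in one place, which is why a separate per-run check in the task constructor would be redundant.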
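The YARN-2663 point above (delete the specific file rather than the whole directory) could look roughly like the following. The eviction flow and the path handling are hypothetical; only FileSystem#delete with a non-recursive flag is taken from the Hadoop FileSystem API.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the narrower delete discussed for YARN-2663: remove only the
 * specific cached resource file instead of recursively deleting its parent
 * directory. The surrounding flow is hypothetical.
 */
public class NarrowDeleteSketch {

  public static void evict(Configuration conf, Path resourceFile) throws IOException {
    FileSystem fs = resourceFile.getFileSystem(conf);

    // Deleting only the file avoids racing with an uploader that may be
    // writing a sibling entry into the same directory at the same time.
    boolean deleted = fs.delete(resourceFile, /* recursive = */ false);

    // Previously the entire parent directory would have been removed, e.g.
    //   fs.delete(resourceFile.getParent(), true);
    // which is where the race window came from.
    if (!deleted) {
      System.out.println("Nothing to delete at " + resourceFile);
    }
  }
}
{code}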
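The CleanerMetrics point about initSingleton() versus getInstance() comes down to construction needing a Configuration. A compact sketch of that constraint, with illustrative names rather than the actual CleanerMetrics code:
{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of why getInstance() alone cannot lazily build the metrics
 * singleton: construction needs a Configuration, which is only available
 * at service-init time. Names and the config key are illustrative only.
 */
public final class CleanerMetricsSketch {

  private static volatile CleanerMetricsSketch instance;

  private final long periodMs;

  private CleanerMetricsSketch(Configuration conf) {
    // Construction depends on configuration, so it cannot happen inside an
    // arbitrary getInstance() call that has no Configuration to pass in.
    this.periodMs = conf.getLong("sketch.cleaner.period-ms", 600_000L);
  }

  /** Called once during service initialization, where the conf is at hand. */
  public static synchronized void initSingleton(Configuration conf) {
    if (instance == null) {
      instance = new CleanerMetricsSketch(conf);
    }
  }

  /** Callers elsewhere can only retrieve the already-built instance. */
  public static CleanerMetricsSketch getInstance() {
    if (instance == null) {
      throw new IllegalStateException("initSingleton(conf) has not been called");
    }
    return instance;
  }

  public long getPeriodMs() {
    return periodMs;
  }
}
{code}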
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101414-2.patch) Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
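The retry mechanism the description asks for could be as simple as a bounded retry with a fixed backoff around the timeline call. The helper below is a hedged sketch, not the API from the attached patches; the method name, parameters, and defaults are made up for illustration.
{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;

/**
 * Illustrative retry wrapper of the kind YARN-2673 discusses for the
 * timeline client: retry a failed call a bounded number of times with a
 * fixed backoff. Names and defaults are hypothetical.
 */
public final class RetrySketch {

  public static <T> T callWithRetries(Callable<T> call,
                                      int maxAttempts,
                                      long backoffMs) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {
        // Typical symptom when the ATS is down or restarting: connection
        // refused or reset. Remember the failure and retry after a pause.
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(backoffMs);
        }
      }
    }
    if (last == null) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Usage sketch: the "timeline call" fails twice, then succeeds.
    final int[] calls = {0};
    String result = callWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new IOException("ATS not reachable yet");
      }
      return "put succeeded on attempt " + calls[0];
    }, 5, 200L);
    System.out.println(result);
  }
}
{code}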
[jira] [Updated] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-101414-2.patch Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2673) Add retry for timeline client
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172026#comment-14172026 ] Hadoop QA commented on YARN-2673: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674921/YARN-2673-101414-2.patch against trunk revision 1220bb7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5399//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5399//console This message is automatically generated. Add retry for timeline client - Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172038#comment-14172038 ] Hadoop QA commented on YARN-2183: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674922/YARN-2183-trunk-v5.patch against trunk revision 1220bb7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5400//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5400//console This message is automatically generated. Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)