[jira] [Commented] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270347#comment-14270347 ] Chengbing Liu commented on YARN-2997: - Thanks [~jianhe] ! NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.7.0 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
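For context, the quoted RM log line comes from the scheduler's completed-container path: when the NM re-reports a completion for a container the RM has already released, the {{getRMContainer}} lookup returns null and the event is logged and dropped. A minimal sketch of that check (the method shape and logging are illustrative, not the exact FairScheduler code):
{code}
// Illustrative RM-side handling of a completed-container report.
// A re-reported completion finds no live RMContainer and is ignored,
// producing the "Null container completed..." log line seen above.
private void completedContainer(ContainerStatus completedStatus) {
  ContainerId containerId = completedStatus.getContainerId();
  RMContainer rmContainer = getRMContainer(containerId);
  if (rmContainer == null) {
    LOG.info("Null container completed...");
    return;  // already released; nothing to do for this duplicate report
  }
  // ... normal release/accounting logic for a live container ...
}
{code}
The NM-side fix tracked here is to stop resending completions the RM has already acknowledged, rather than relying on the RM to discard the duplicates.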
[jira] [Commented] (YARN-3016) (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270518#comment-14270518 ] Rohith commented on YARN-3016: -- It make sense to me. (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager - Key: YARN-3016 URL: https://issues.apache.org/jira/browse/YARN-3016 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Now we have separated but similar implementations for add/remove/replace labels on node in CommonNodeLabelsManager, we should merge it to a single one for easier modify them and better readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
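A sketch of the proposed shape: one internal method parameterized by the operation, so add/remove/replace share the same node lookup and bookkeeping (the enum and method names below are illustrative, not the final CommonNodeLabelsManager API):
{code}
// Illustrative merge of internalAddLabels/internalRemoveLabels/
// internalReplaceLabels into a single operation-driven helper.
enum NodeLabelOp { ADD, REMOVE, REPLACE }

private void internalUpdateLabelsOnNodes(
    Map<NodeId, Set<String>> nodeToLabels, NodeLabelOp op) {
  for (Map.Entry<NodeId, Set<String>> entry : nodeToLabels.entrySet()) {
    NodeId nodeId = entry.getKey();
    Set<String> requested = entry.getValue();
    Set<String> current = getLabelsByNode(nodeId);  // shared lookup, assumed helper
    switch (op) {
      case ADD:
        current.addAll(requested);
        break;
      case REMOVE:
        current.removeAll(requested);
        break;
      case REPLACE:
        current.clear();
        current.addAll(requested);
        break;
    }
    // shared bookkeeping: persist the new mapping and fire the update event
  }
}
{code}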
[jira] [Commented] (YARN-2807) Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive
[ https://issues.apache.org/jira/browse/YARN-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270522#comment-14270522 ] Akira AJISAKA commented on YARN-2807: - +1, thank you [~iwasakims]. Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive Key: YARN-2807 URL: https://issues.apache.org/jira/browse/YARN-2807 Project: Hadoop YARN Issue Type: Sub-task Components: documentation, resourcemanager Reporter: Wangda Tan Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-2807.1.patch, YARN-2807.2.patch, YARN-2807.3.patch Currently the help message of yarn rmadmin -transitionToActive is: {code} transitionToActive: incorrect number of arguments Usage: HAAdmin [-transitionToActive <serviceId> [--forceactive]] {code} But --forceactive does not work as expected. When transitioning the RM state with --forceactive: {code} yarn rmadmin -transitionToActive rm2 --forceactive Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@64c9f31e Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state. If you are very sure you know what you are doing, please specify the forcemanual flag. {code} As shown above, we still cannot transitionToActive when automatic failover is enabled, even with --forceactive. The option that does work is {{--forcemanual}}, but the usage message does not describe this option. I think we should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2956) Some yarn-site index linked pages are difficult to discover because are not in the side bar
[ https://issues.apache.org/jira/browse/YARN-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270255#comment-14270255 ] Jian He commented on YARN-2956: --- [~iwasakims], thanks for working on this. Maybe add a link in the side bar to hadoop-yarn-site/index.html too? The link could be located in the YARN section of the side bar and be called Overview. I think it's fine to keep a full list of document indexes in the main index too? Some yarn-site index linked pages are difficult to discover because are not in the side bar --- Key: YARN-2956 URL: https://issues.apache.org/jira/browse/YARN-2956 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.6.0 Reporter: Remus Rusanu Assignee: Masatake Iwasaki Priority: Minor Labels: documentation Attachments: YARN-2956.1.patch The yarn-site index.apt.vm page is difficult to 'stumble upon' because the hadoop.apache.org/ sidebar navigation does not link to it. One needs to know the URL http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/ to land on it. The links from the index page do not match the links from the side bar, so only some pages are quickly accessible (from the sidebar). I propose that the links from the index.apt.vm match the links from the YARN side bar subsection (ideally through one single definition file, but I don't understand the APT generation process well enough to call out how this can be achieved). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine fs operations in FileSystemRMStateStore and few fixes
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270274#comment-14270274 ] Yi Liu commented on YARN-2996: -- Thanks [~zjshen] for review and commit. Refine fs operations in FileSystemRMStateStore and few fixes Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Fix For: 2.7.0 Attachments: YARN-2996.001.patch, YARN-2996.002.patch, YARN-2996.003.patch, YARN-2996.004.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places that invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new"); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} method is not optimal either: it writes the file to _output\_file_.tmp, then renames it to _output\_file_.new, then renames that to _output\_file_; we can reduce one rename operation. Also there is one unnecessary import, which we can remove. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
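For point 1, the two RPCs can usually be collapsed into a single {{getFileStatus}} call plus a {{FileNotFoundException}} catch — a sketch of the idea, not the exact patch:
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One RPC instead of two: getFileStatus() both checks existence and
// returns the status; a missing path surfaces as FileNotFoundException.
static FileStatus getFileStatusIfExists(FileSystem fs, Path path)
    throws IOException {
  try {
    return fs.getFileStatus(path);
  } catch (FileNotFoundException e) {
    return null;  // caller treats null as "path does not exist"
  }
}
{code}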
[jira] [Created] (YARN-3022) Expose Container resource information from NodeManager for monitoring
Anubhav Dhoot created YARN-3022: --- Summary: Expose Container resource information from NodeManager for monitoring Key: YARN-3022 URL: https://issues.apache.org/jira/browse/YARN-3022 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Along with exposing the resource consumption of each container (as in YARN-2141), it's worth exposing the actual resource limit associated with them, to get better insight into YARN allocation and consumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-810: - Attachment: YARN-810-6.patch Update a patch to fix the test failures. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810-5.patch, YARN-810-6.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit.
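For reference, the ceiling described above amounts to giving each container its vcore fraction of every CFS period. A rough sketch of the arithmetic (the method is illustrative, not the actual CgroupsLCEResourcesHandler code):
{code}
// Illustrative CFS quota computation for a container's cgroup.
// period is cpu.cfs_period_us (e.g. 100000 = 0.1s); the quota written to
// cpu.cfs_quota_us caps the container at its share of the node's cores.
static long cfsQuotaMicros(long periodMicros, int containerVcores,
                           int nodeVcores, int nodePhysicalCores) {
  double vcoreFraction = (double) containerVcores / nodeVcores;
  return (long) (periodMicros * vcoreFraction * nodePhysicalCores);
}

// Example: period=100000us, 1 of 4 vcores, 1 physical core
//   -> quota = 25000us, i.e. a hard 25% ceiling on that core,
//      instead of the soft minimum guarantee cpu.shares gives today.
{code}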
[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-313: Attachment: YARN-313-v3.patch Resync the patch to latest trunk. Add Admin API for supporting node resource configuration in command line Key: YARN-313 URL: https://issues.apache.org/jira/browse/YARN-313 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-313-sample.patch, YARN-313-v1.patch, YARN-313-v2.patch, YARN-313-v3.patch We should provide some admin interface, e.g. yarn rmadmin -refreshResources to support changes of node's resource specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3014: - Attachment: YARN-3014-2.patch Thanks [~jianhe]'s review, addressed comments and also updated test cases to cover them. Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch, YARN-3014-2.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
zhihai xu created YARN-3023: --- Summary: Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. {code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270475#comment-14270475 ] Hadoop QA commented on YARN-3014: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691003/YARN-3014-2.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6287//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6287//console This message is automatically generated. Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch, YARN-3014-2.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270492#comment-14270492 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691002/YARN-810-6.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6286//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6286//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810-5.patch, YARN-810-6.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. 
Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat
[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270510#comment-14270510 ] Hadoop QA commented on YARN-313: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691018/YARN-313-v3.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6288//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6288//console This message is automatically generated. Add Admin API for supporting node resource configuration in command line Key: YARN-313 URL: https://issues.apache.org/jira/browse/YARN-313 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-313-sample.patch, YARN-313-v1.patch, YARN-313-v2.patch, YARN-313-v3.patch We should provide some admin interface, e.g. yarn rmadmin -refreshResources to support changes of node's resource specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3022) Expose Container resource information from NodeManager for monitoring
[ https://issues.apache.org/jira/browse/YARN-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3022: Attachment: YARN-3022.001.patch Initial patch based on YARN-2984, which adds metrics for containers. Expose Container resource information from NodeManager for monitoring - Key: YARN-3022 URL: https://issues.apache.org/jira/browse/YARN-3022 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3022.001.patch Along with exposing the resource consumption of each container (as in YARN-2141), it's worth exposing the actual resource limit associated with them, to get better insight into YARN allocation and consumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
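Conceptually, the data to surface is just the configured limit next to the measured usage for each container. A minimal, framework-agnostic sketch of that pairing (field names are illustrative; the actual patch builds on the YARN-2984 container metrics):
{code}
// Illustrative holder pairing a container's allocated limit with its
// observed usage, so monitoring can compare allocation vs. consumption.
class ContainerResourceInfo {
  final long memoryLimitMB;      // from the container's Resource allocation
  final int vcoreLimit;
  volatile long memoryUsedMB;    // sampled, e.g. from the process tree
  volatile float vcoresUsed;

  ContainerResourceInfo(long memoryLimitMB, int vcoreLimit) {
    this.memoryLimitMB = memoryLimitMB;
    this.vcoreLimit = vcoreLimit;
  }

  double memoryUtilization() {
    return memoryLimitMB == 0 ? 0.0 : (double) memoryUsedMB / memoryLimitMB;
  }
}
{code}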
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270502#comment-14270502 ] Hadoop QA commented on YARN-2637: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691012/YARN-2637.36.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6289//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6289//console This message is automatically generated. maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (IteratorFiCaSchedulerApp i=pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() = getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info(Application + application.getApplicationId() + from user: + application.getUser() + activated in queue: + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM ( 
minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
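The direction of the fix, per the summary, is to gate activation on AM resource for both the queue and the user, not on an application count. A hedged sketch of a resource-based check (the names and the Resources helper calls are illustrative, not the actual LeafQueue patch):
{code}
// Illustrative: activate pending applications only while the sum of their
// AM container resources stays within the queue's AM limit (a per-user
// share of that limit would be checked the same way).
Resource amUsed = Resources.createResource(0, 0);
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  Resource amDemand = application.getAMResource();  // size of the AM container
  if (!Resources.fitsIn(Resources.add(amUsed, amDemand), maxAMResourceLimit)) {
    break;  // queue-level maximum-am-resource-percent would be exceeded
  }
  // a per-user check against a userAMResourceLimit would go here as well
  Resources.addTo(amUsed, amDemand);
  activeApplications.add(application);
  i.remove();
}
{code}
With this shape, the 5M AMs in the example above stop activating once their summed AM resource reaches the 200M limit, instead of all being admitted.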
[jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270526#comment-14270526 ] Rohith commented on YARN-3023: -- Which version of Hadoop are you using? In trunk this is handled, If node already exists then ZKRMStateStore wont throw NodeExists {code} catch (KeeperException ke) { if (ke.code() == Code.NODEEXISTS) { LOG.info(znode already exists!); return null; } // other code } {code} Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash - Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 
{code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. {code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
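Expanded a little, the handling Rohith quotes makes the create retry idempotent: a NodeExists hit on a retry is treated as success, because the first (connection-lost) attempt actually wrote the znode. A self-contained sketch of the pattern (not the exact ZKRMStateStore code):
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Retry a znode create, treating NodeExists as success: if an earlier
// attempt lost its connection after the server applied the create, the
// retry would otherwise fail even though the data is safely stored.
static void createWithRetries(ZooKeeper zk, String path, byte[] data,
                              int maxRetries) throws Exception {
  for (int retry = 0; ; retry++) {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      return;
    } catch (KeeperException.NodeExistsException e) {
      return;  // a previous, connection-lost attempt already created it
    } catch (KeeperException.ConnectionLossException e) {
      if (retry >= maxRetries) {
        throw e;  // maxed out retries; the caller decides whether this is fatal
      }
      Thread.sleep(1000);  // matches the one-second retry interval in the log
    }
  }
}
{code}
Treating NodeExists as success is safe here because the path is unique per application attempt, so an existing znode can only mean an earlier write of the same data.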
[jira] [Updated] (YARN-2807) Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive
[ https://issues.apache.org/jira/browse/YARN-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2807: Hadoop Flags: Reviewed Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive Key: YARN-2807 URL: https://issues.apache.org/jira/browse/YARN-2807 Project: Hadoop YARN Issue Type: Sub-task Components: documentation, resourcemanager Reporter: Wangda Tan Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-2807.1.patch, YARN-2807.2.patch, YARN-2807.3.patch Currently the help message of yarn rmadmin -transitionToActive is: {code} transitionToActive: incorrect number of arguments Usage: HAAdmin [-transitionToActive <serviceId> [--forceactive]] {code} But --forceactive does not work as expected. When transitioning the RM state with --forceactive: {code} yarn rmadmin -transitionToActive rm2 --forceactive Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@64c9f31e Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state. If you are very sure you know what you are doing, please specify the forcemanual flag. {code} As shown above, we still cannot transitionToActive when automatic failover is enabled, even with --forceactive. The option that does work is {{--forcemanual}}, but the usage message does not describe this option. I think we should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270581#comment-14270581 ] zhihai xu commented on YARN-3023: - Yes, you are right. The issue is the same as YARN-2721. Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash - Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. 
{code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3023. - Resolution: Duplicate Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash - Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. 
{code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3024) LocalizerRunner should give DIE action when all resources are localized
[ https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated YARN-3024: Attachment: YARN-3024.02.patch Fixed tests accordingly. LocalizerRunner should give DIE action when all resources are localized --- Key: YARN-3024 URL: https://issues.apache.org/jira/browse/YARN-3024 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3024.01.patch, YARN-3024.02.patch We have observed that {{LocalizerRunner}} always gives a LIVE action at the end of localization process. The problem is {{findNextResource()}} can return null even when {{pending}} was not empty prior to the call. This method removes localized resources from {{pending}}, therefore we should check the return value, and gives DIE action when it returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
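The gist of the fix: when {{findNextResource()}} returns null there is nothing left to hand out, so the heartbeat response should be DIE rather than LIVE. A simplified sketch of that decision (the enum and surrounding method are illustrative, not the exact LocalizerRunner code):
{code}
// Simplified heartbeat-action decision for a localizer.
// findNextResource() prunes already-localized entries from `pending` and
// may therefore return null even if `pending` was non-empty before the
// call, so the return value itself must be checked.
LocalizerAction decideAction() {
  LocalResource next = findNextResource();
  if (next == null) {
    return LocalizerAction.DIE;   // nothing left to localize: tell it to exit
  }
  return LocalizerAction.LIVE;    // keep the localizer alive and hand out `next`
}
{code}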
[jira] [Commented] (YARN-2141) [Umbrella] Capture container and node resource consumption
[ https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270535#comment-14270535 ] Vinod Kumar Vavilapalli commented on YARN-2141: --- One other related effort is YARN-2928 which is also planning to obtain and send information about container resource-usage to a per-application aggregator. We should try to unify these.. [Umbrella] Capture container and node resource consumption -- Key: YARN-2141 URL: https://issues.apache.org/jira/browse/YARN-2141 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Carlo Curino Priority: Minor Collecting per-container and per-node resource consumption statistics in a fairly granular manner, and making them available to both infrastructure code (e.g., schedulers) and users (e.g., AMs or directly users via webapps), can facilitate several performance work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270536#comment-14270536 ] Vinod Kumar Vavilapalli commented on YARN-2984: --- Linking related efforts. One other related effort is YARN-2928 which is also planning to obtain and send information about container resource-usage to a per-application aggregator. We should try to unify these.. Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2637: -- Attachment: YARN-2637.36.patch Should be down to one failing test, let's see maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() >= getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info("Application " + application.getApplicationId() + " from user: " + application.getUser() + " activated in queue: " + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM (> minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270087#comment-14270087 ] Allen Wittenauer commented on YARN-3019: I'm not in favor of this going into branch-2. It's a fundamental change to operating expectations that may have a significant impact on capacity. Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270100#comment-14270100 ] Jian He commented on YARN-3019: --- To clarify: this jira is to flip recovery mode to work-preserving recovery from non-work-preserving recovery. The feature itself remains disabled. i.e. yarn.resourcemanager.recovery.enabled remains false. updating the description. Further, I'm also thinking to enable the feature itself by default and use the local FS as the default file system. I'm OK to do this only on trunk. That will uncover bugs if any. I can open a separate jira for this. Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
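For clarity, the two settings involved, shown set explicitly (property names as given in the description and in the comment above; the snippet only illustrates the flags and is not part of the patch):
{code}
import org.apache.hadoop.conf.Configuration;

// Recovery itself stays opt-in; this JIRA only flips the *mode* used once
// recovery is enabled, from non-work-preserving to work-preserving.
Configuration conf = new Configuration();
conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);                  // still false by default
conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);  // proposed new default
{code}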
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270110#comment-14270110 ] Chen He commented on YARN-1680: --- Thank you for the comments, [~jlowe]. [~cwelch] created YARN-2848, which discusses blacklisted nodes and label scheduling. I will work on a patch that fixes the blacklisted node case. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because the headroom used for the reducer-preemption calculation still includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources it returns still counts the cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
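The direction for the patch is to subtract blacklisted nodes' free resources when computing the headroom returned to the AM. A rough sketch of that adjustment (the blacklist and node lookups are assumptions; Resources is YARN's standard resource-arithmetic helper):
{code}
// Illustrative headroom correction: remove the available capacity of nodes
// the application has blacklisted, so the AM no longer counts memory it can
// never be allocated there (letting MR reducer preemption trigger correctly).
Resource headroom = Resources.clone(baseHeadroom);
for (NodeId nodeId : application.getBlacklistedNodes()) {   // assumed accessor
  SchedulerNode node = getSchedulerNode(nodeId);            // assumed lookup
  if (node != null) {
    Resources.subtractFrom(headroom, node.getAvailableResource());
  }
}
// clamp at zero so a large blacklist cannot produce a negative headroom
headroom = Resources.componentwiseMax(headroom, Resources.none());
{code}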
[jira] [Created] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly
Harsh J created YARN-3021: - Summary: YARN's delegation-token handling disallows certain trust setups to operate properly Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
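The behavioural change being proposed is to make the submission-time renewal best-effort: attempt it, and on failure skip scheduling future renewals instead of failing the submission. A sketch of that shape (the renewal and scheduling methods are illustrative, not the DelegationTokenRenewer API):
{code}
// Illustrative best-effort renewal at app-submission time. A token the RM
// cannot renew (e.g. one issued by a realm that does not trust the RM's
// credentials) is logged and excluded from periodic renewal, instead of
// bubbling an error back to the client and failing the submission.
for (Token<?> token : applicationTokens) {
  try {
    long nextExpirationTime = renewToken(token);   // may fail across one-way trusts
    scheduleRenewal(token, nextExpirationTime);    // keep auto-renewing this token
  } catch (IOException e) {
    LOG.warn("Unable to renew token " + token + " for " + applicationId
        + "; skipping automatic renewal for it", e);
    // mirrors the 1.x JobTracker behaviour: cease renewals, don't fail the job
  }
}
{code}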
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270116#comment-14270116 ] Jian He commented on YARN-2637: --- Quick thing: YARN-3010 fixed the findbugs warning, so the findbugs exclusion in the patch may not be needed. maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() >= getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info("Application " + application.getApplicationId() + " from user: " + application.getUser() + " activated in queue: " + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM (> minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3019: -- Description: The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default to flip recovery mode to work-preserving recovery from non-work-preserving recovery. (was: The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. ) Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default to flip recovery mode to work-preserving recovery from non-work-preserving recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
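For clarity, a small illustrative snippet of what opting in looks like today, before the default flips; the property key is the literal string quoted in the description, and the surrounding class is just scaffolding.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: enable work-preserving recovery explicitly until the
// default is changed to true.
public class EnableWorkPreservingRecovery {
  public static Configuration create() {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);
    return conf;
  }
}
{code}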
[jira] [Commented] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270179#comment-14270179 ] Jian He commented on YARN-3014: --- addLabel and removeLabel on the Host should not do replaceLabel on the NM ? Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
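A hypothetical sketch of the intended semantics (the maps and method name are illustrative stand-ins, not the CommonNodeLabelsManager data structures): replacing labels on a host also overwrites the labels of every NM registered on that host.
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

public class HostLabelPropagation {
  // Illustrative state: NMs known per host, and labels per NM.
  private final Map<String, Set<NodeId>> nodeIdsByHost = new HashMap<>();
  private final Map<NodeId, Set<String>> labelsByNode = new HashMap<>();

  void replaceLabelsOnHost(String hostName, Set<String> newLabels) {
    Set<NodeId> nms = nodeIdsByHost.get(hostName);
    if (nms == null) {
      return;
    }
    for (NodeId nm : nms) {
      // Host-level replace wins over any label previously set on the NM directly.
      labelsByNode.put(nm, newLabels);
    }
  }
}
{code}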
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269217#comment-14269217 ] Hudson commented on YARN-3010: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
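As background on this class of findbugs warning, a generic sketch of the usual remedy (not the literal YARN-3010 patch): make every access to the shared field go through methods that take the same lock, so reads never happen outside synchronization.
{code}
// Hedged sketch; Object stands in for the real RMContext type.
public class SchedulerContextHolder {
  private Object rmContext;

  public synchronized void setRMContext(Object rmContext) {
    this.rmContext = rmContext;
  }

  public synchronized Object getRMContext() {
    // readers take the same lock as writers, so findbugs sees consistent locking
    return rmContext;
  }
}
{code}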
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269223#comment-14269223 ] Hudson commented on YARN-2230: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
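The SchedulerUtils excerpt in the description above lost its comparison operators and string quotes in the e-mail formatting; reconstructed, the check reads roughly as follows (a close paraphrase, not a verbatim copy of trunk).
{code}
if (resReq.getCapability().getVirtualCores() < 0
    || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) {
  throw new InvalidResourceRequestException("Invalid resource request"
      + ", requested virtual cores < 0"
      + ", or requested virtual cores > max configured"
      + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores()
      + ", maxVirtualCores=" + maximumResource.getVirtualCores());
}
{code}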
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269221#comment-14269221 ] Hudson commented on YARN-2880: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269220#comment-14269220 ] Hudson commented on YARN-2936: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
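A hedged sketch of the direction the fix takes per the commit message above (the builder field and helper name are assumptions): populate the proto builder from the identifier's fields inside getProto() itself, so identifiers built without the removed setters still serialize their contents.
{code}
// Illustrative fragment only; setBuilderFields() is an assumed helper that
// copies owner/renewer/realUser and the other identifier fields into builder.
public YARNDelegationTokenIdentifierProto getProto() {
  setBuilderFields();
  return builder.build();
}
{code}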
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269248#comment-14269248 ] Hudson commented on YARN-2880: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269250#comment-14269250 ] Hudson commented on YARN-2230: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269247#comment-14269247 ] Hudson commented on YARN-2936: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269244#comment-14269244 ] Hudson commented on YARN-3010: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel reassigned YARN-3018: --- Assignee: nijel Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1 (public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;), while in the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can the two be unified to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the defaults, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269058#comment-14269058 ] nijel commented on YARN-3018: - Please give your opinion. I prefer to have the value as -1 in the file as well. If that sounds good, I can upload a patch. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Priority: Trivial For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1 (public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;), while in the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can the two be unified to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the defaults, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
nijel created YARN-3018: --- Summary: Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Priority: Trivial For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1 (public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;), while in the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can the two be unified to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the defaults, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
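To make the mismatch concrete, a small illustrative fragment (the getInt call is a generic Configuration read, not the CapacityScheduler code): with the property absent the code default of -1 applies, while the shipped capacity-scheduler.xml sets 40.
{code}
import org.apache.hadoop.conf.Configuration;

// Code default quoted in the description:
public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;

// Illustrative read: yields -1 when the file omits the property, but 40 when
// the shipped capacity-scheduler.xml is present, hence the confusion.
static int nodeLocalityDelay(Configuration conf) {
  return conf.getInt("yarn.scheduler.capacity.node-locality-delay",
      DEFAULT_NODE_LOCALITY_DELAY);
}
{code}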
[jira] [Commented] (YARN-2996) Refine fs operations in FileSystemRMStateStore and few fixes
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269728#comment-14269728 ] Zhijie Shen commented on YARN-2996: --- +1. The test failure seems not to be related. Will commit the patch. Refine fs operations in FileSystemRMStateStore and few fixes Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-2996.001.patch, YARN-2996.002.patch, YARN-2996.003.patch, YARN-2996.004.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places invoke {{fs.exists}}, then {{fs.getFileStatus}}, we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + .new); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good too, it write file to _output\_file_.tmp, then rename to _output\_file_.new, then rename it to _output\_file_, we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
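A minimal sketch of refinement *1* above, assuming standard FileSystem semantics rather than the exact patch: a single getFileStatus() call replaces the exists() + getFileStatus() pair, saving one RPC.
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleRpcStatus {
  // Same information exists() + getFileStatus() would give, with one RPC instead of two.
  static FileStatus statusOrNull(FileSystem fs, Path versionNodePath) throws IOException {
    try {
      return fs.getFileStatus(versionNodePath);
    } catch (FileNotFoundException e) {
      return null;
    }
  }
}
{code}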
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269720#comment-14269720 ] Zhijie Shen commented on YARN-2880: --- Is https://builds.apache.org/job/PreCommit-YARN-Build/6274/testReport/org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager/TestAMRestart/testShouldNotCountFailureToMaxAttemptRetry/ related to the change here? Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine fs operations in FileSystemRMStateStore and few fixes
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269754#comment-14269754 ] Hudson commented on YARN-2996: -- FAILURE: Integrated in Hadoop-trunk-Commit #6830 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6830/]) YARN-2996. Improved synchronization and I/O operations of FS- and Mem- RMStateStore. Contributed by Yi Liu. (zjshen: rev dc2eaa26b20cfbbcdd5784bb8761d08a42f29605) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java Refine fs operations in FileSystemRMStateStore and few fixes Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Fix For: 2.7.0 Attachments: YARN-2996.001.patch, YARN-2996.002.patch, YARN-2996.003.patch, YARN-2996.004.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places invoke {{fs.exists}}, then {{fs.getFileStatus}}, we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + .new); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good too, it write file to _output\_file_.tmp, then rename to _output\_file_.new, then rename it to _output\_file_, we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269784#comment-14269784 ] Mayank Bansal commented on YARN-2933: - Thanks [~wangda] and Sunil for review. bq. In addition to previously comment, I think we put incorrect #container for each application when setLabelContainer=true. The usedResource or current in TestProportionalPreemptionPolicy actually means used resource of nodes without label. So if we want to have labeled container in an application, we should make it stay outside of usedResource. I don't think thats needed as the basic functionality for the test is to demonstrate we can skip labeled container, So I think it does not mater. bq. And testSkipLabeledContainer is fully covered by testIdealAllocationForLabels. Since we have already checked #container preempted in each application in testIdealAllocationForLabels, which implies labeled containers are ignored. Agreed bq. A minor suggest is rename setLabelContainer to setLabeledContainer Agreed bq. An application's(if not specified any labels during submission time) containers, may fall in to nodes where it can be labelled or not labelled. Am I correct? No , As of now containers with no labels can not go to labeled nodes. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
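A hypothetical sketch of the short-term rule being discussed (the class and method are illustrative, not the ProportionalCapacityPreemptionPolicy code): containers running on labeled nodes are simply not considered as preemption victims.
{code}
import java.util.Set;

public class LabeledContainerFilter {
  // Temporary policy sketch: only resources on nodes without labels are
  // candidates for preemption.
  boolean isPreemptionCandidate(Set<String> labelsOnContainersNode) {
    return labelsOnContainersNode == null || labelsOnContainersNode.isEmpty();
  }
}
{code}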
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-6.patch Attaching patch Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2421: -- Target Version/s: 2.7.0 (was: 2.6.0) CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.1 Reporter: Thomas Graves Assignee: chang li Attachments: yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
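A hypothetical sketch of the guard suggested in the description (the enum and method are illustrative, not RMAppAttempt's actual state machine): stop handing out containers once the attempt is in a finishing or terminal state.
{code}
import java.util.EnumSet;

public class AllocationGuard {
  enum AttemptState { RUNNING, FINISHING, FINISHED, FAILED, KILLED }

  private static final EnumSet<AttemptState> NO_ALLOCATION =
      EnumSet.of(AttemptState.FINISHING, AttemptState.FINISHED,
          AttemptState.FAILED, AttemptState.KILLED);

  // Check before giving the attempt any new containers.
  boolean mayAllocate(AttemptState state) {
    return !NO_ALLOCATION.contains(state);
  }
}
{code}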
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: (was: YARN-2800-20141118-1.patch) Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: YARN-2786-20150108-1-full.patch Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: YARN-2800-20141118-1.patch Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: YARN-2786-20150108-1-without-yarn.cmd.patch Updated patch addressed comments from [~aw], and fixed findbugs warning. Please kindly review, thanks. Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch, YARN-2786-20150108-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3016) (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269824#comment-14269824 ] Wangda Tan commented on YARN-3016: -- I meant the methods in CommonNodeLabelsManager that start with internal; we should have a way to make them simpler. (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager - Key: YARN-3016 URL: https://issues.apache.org/jira/browse/YARN-3016 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Now we have separate but similar implementations for add/remove/replace labels on node in CommonNodeLabelsManager; we should merge them into a single one to make them easier to modify and to improve readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
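A hypothetical sketch of what the merge could look like (the enum and helper names are illustrative, not the eventual patch): one internal entry point parameterized by the operation, replacing the three near-duplicate internal methods.
{code}
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

public class MergedLabelUpdate {
  enum LabelOp { ADD, REMOVE, REPLACE }

  void internalUpdateLabelsOnNodes(Map<NodeId, Set<String>> labelsByNode, LabelOp op) {
    for (Map.Entry<NodeId, Set<String>> e : labelsByNode.entrySet()) {
      switch (op) {
        case ADD:     addToExisting(e.getKey(), e.getValue());      break;
        case REMOVE:  removeFromExisting(e.getKey(), e.getValue()); break;
        case REPLACE: replaceExisting(e.getKey(), e.getValue());    break;
      }
    }
  }

  // Assumed helpers standing in for the per-operation bookkeeping.
  void addToExisting(NodeId node, Set<String> labels) { }
  void removeFromExisting(NodeId node, Set<String> labels) { }
  void replaceExisting(NodeId node, Set<String> labels) { }
}
{code}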
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269359#comment-14269359 ] Hudson commented on YARN-2230: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269353#comment-14269353 ] Hudson commented on YARN-3010: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269356#comment-14269356 ] Hudson commented on YARN-2936: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269357#comment-14269357 ] Hudson commented on YARN-2880: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269373#comment-14269373 ] Andrew Johnson commented on YARN-2893: -- Yeah, that definitely seems like it's worth a look. Is there anything specific I should look out for? AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269380#comment-14269380 ] Hudson commented on YARN-2880: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269376#comment-14269376 ] Hudson commented on YARN-3010: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/CHANGES.txt Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269382#comment-14269382 ] Hudson commented on YARN-2230: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same appears to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
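One possible client-side mitigation, until the documentation/code mismatch is resolved, is for the AM to clamp its ask to the maximum capability the RM advertises at registration. A minimal sketch under that assumption (the host name, memory size and priority below are illustrative, not taken from the issue):
{code}
// Sketch: cap requested vcores at the scheduler maximum advertised to the AM.
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class VcoreClampSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new YarnConfiguration());
    amClient.start();

    RegisterApplicationMasterResponse reg =
        amClient.registerApplicationMaster("am-host", 0, "");   // illustrative values
    Resource max = reg.getMaximumResourceCapability();

    int wantedVcores = 32;                                       // what the job asked for
    int vcores = Math.min(wantedVcores, max.getVirtualCores());  // clamp to the cluster max
    Resource capability = Resource.newInstance(1024, vcores);

    amClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));
  }
}
{code}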
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269411#comment-14269411 ] Hudson commented on YARN-3010: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
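For readers unfamiliar with the warning, the findbugs IS category flags a field that is usually accessed while holding a lock but is also touched on at least one unsynchronized path. A minimal, self-contained illustration of the pattern (not the actual AbstractYarnScheduler code):
{code}
// Illustration of the "inconsistent synchronization" (IS) pattern findbugs reports:
// most accesses to the field hold the instance lock, but one read path does not.
public class InconsistentSyncExample {
  private Object rmContextLike; // shared mutable field, stand-in for the flagged one

  public synchronized void set(Object value) {   // synchronized write
    this.rmContextLike = value;
  }

  public synchronized Object getLocked() {       // synchronized read
    return rmContextLike;
  }

  public Object getUnlocked() {                  // unsynchronized read -> IS warning
    return rmContextLike;
  }
}
{code}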
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269415#comment-14269415 ] Hudson commented on YARN-2880: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have such a test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269417#comment-14269417 ] Hudson commented on YARN-2230: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same appears to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269432#comment-14269432 ] Hudson commented on YARN-2230: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same appears to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269426#comment-14269426 ] Hudson commented on YARN-3010: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269379#comment-14269379 ] Hudson commented on YARN-2936: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing an object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() on it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using the proto to persist the DT in the state store, when generating the password. I think the setters were removed to avoid duplicating the setting of the fields when getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed in secretManager. For example, in the test case of YARN-2837, I spent time figuring out that we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
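To make the ordering dependency concrete, here is a self-contained sketch of the pattern described above, with a hypothetical identifier class (not the real YARNDelegationTokenIdentifier) whose builder is only populated inside getBytes():
{code}
// Hypothetical illustration: fields live on the identifier and are copied into the
// builder only when getBytes() runs, so getProto() before getBytes() sees nothing.
public class SketchTokenIdentifier {
  private final StringBuilder protoBuilder = new StringBuilder(); // stand-in for proto.builder
  private final String owner;

  public SketchTokenIdentifier(String owner) {
    this.owner = owner;                           // no setter pushes this into the builder
  }

  public byte[] getBytes() {
    protoBuilder.setLength(0);
    protoBuilder.append("owner=").append(owner);  // fields are materialized here
    return protoBuilder.toString().getBytes();
  }

  public String getProto() {
    return protoBuilder.toString();               // empty unless getBytes() was called first
  }

  public static void main(String[] args) {
    SketchTokenIdentifier id = new SketchTokenIdentifier("alice");
    System.out.println("before getBytes(): '" + id.getProto() + "'"); // prints ''
    id.getBytes();
    System.out.println("after getBytes():  '" + id.getProto() + "'"); // prints 'owner=alice'
  }
}
{code}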
[jira] [Commented] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269548#comment-14269548 ] Steve Loughran commented on YARN-2571: -- Vinod, having the RM create the user paths allows the registry to be set up with the correct permissions as YARN jobs are created. Without that, if there is no path for that user set up, the application is likely to fail post-launch with some error. For cleanup, automatic purging of records keeps the registry data somewhat under control, without applications having to go to the effort of writing these not-yet-implemented cleanup containers. It's not a particularly complex piece of code; there's tests for the distributed shell that verify it works in YARN-2646 RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
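As a concrete illustration of the user-path setup being discussed, here is a minimal Curator-based sketch; the /users/$username layout comes from the issue description, while the SASL principals, ACL bits and helper names are illustrative assumptions rather than the patch's actual code:
{code}
// Sketch: create /users/<username> so the user can create/read/delete child nodes.
// Paths and ACLs are illustrative assumptions, not the patch's implementation.
import java.util.Arrays;
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class RegistryUserPathSketch {
  public static void createUserPath(CuratorFramework client, String user) throws Exception {
    String path = "/users/" + user;
    List<ACL> acls = Arrays.asList(
        new ACL(ZooDefs.Perms.ALL, new Id("sasl", "yarn")),                      // system principal
        new ACL(ZooDefs.Perms.CREATE | ZooDefs.Perms.READ | ZooDefs.Perms.DELETE,
                new Id("sasl", user)));                                          // CRD for the user
    try {
      client.create().creatingParentsIfNeeded().withMode(CreateMode.PERSISTENT)
          .withACL(acls).forPath(path);
    } catch (KeeperException.NodeExistsException alreadyThere) {
      // idempotent: an existing path is fine
    }
  }
}
{code}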
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269414#comment-14269414 ] Hudson commented on YARN-2936: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269429#comment-14269429 ] Hudson commented on YARN-2936: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269430#comment-14269430 ] Hudson commented on YARN-2880: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269556#comment-14269556 ] Steve Loughran commented on YARN-2571: -- Xuan, 1. If you can show me an example of an active service to start with, I'll gladly make it active only. 2. We're relying on operations to be idempotent: whoever creates last wins, whoever deletes last wins. There are some race conditions on cleanup if there's a change between a read and a delete, but that's what you get in a world without transactions. RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
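The idempotent, last-writer-wins behaviour described above can be sketched on the cleanup side as well; this uses the same Curator types as the earlier sketch and is illustrative, not the patch's code:
{code}
// Sketch: removing a service record is a no-op if the node is already gone.
public static void removeRecord(org.apache.curator.framework.CuratorFramework client,
                                String path) throws Exception {
  try {
    client.delete().deletingChildrenIfNeeded().forPath(path);
  } catch (org.apache.zookeeper.KeeperException.NoNodeException alreadyGone) {
    // idempotent: another actor deleted it between our read and our delete; nothing to do
  }
}
{code}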
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269836#comment-14269836 ] Wangda Tan commented on YARN-2933: -- [~mayank_bansal], Thanks for updating the patch, bq. I don't think that's needed as the basic functionality for the test is to demonstrate we can skip labeled containers, so I think it does not matter. I think it matters, and this test should not only verify that we can skip labeled containers, but also demonstrate that ideal_allocation is computed correctly. I still suggest adding the one-line change I mentioned in https://issues.apache.org/jira/browse/YARN-2933?focusedCommentId=14268631page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14268631 to verify ideal_allocation is correctly computed. And for [~sunilg]'s comment, I think Mayank has already answered you. Thanks, Wangda Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269858#comment-14269858 ] Jian He commented on YARN-2933: --- one other comment: getNodeLabels is not used anywhere, we can remove. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269868#comment-14269868 ] Sunil G commented on YARN-2933: --- Hi [~mayank_bansal] Thank you for the clarification. I have one more small nit in a test case {code} if (setAMContainer && i == 0) { cLive.add(mockContainer(appAttId, cAlloc, unit, 0)); } else if (setLabeledContainer && i == 1) { cLive.add(mockContainer(appAttId, cAlloc, unit, 2)); } else { cLive.add(mockContainer(appAttId, cAlloc, unit, 1)); } {code} For *mockContainer*, the last parameter is an integer: 0 for an AM container, 1 for a normal container, and 2 for a labelled container. Could we make this more generic with named constants, for readability and so that new container types are easy to add in future? We can have an array of different container types that can be extended as needed later, and the array index can be used with an Enum to create a mock container. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
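One possible shape for the change [~sunilg] is suggesting, with the container kinds named by an enum instead of magic integers; the mockContainer signature and the surrounding flags are taken from the snippet above, while the enum itself is illustrative:
{code}
// Sketch of the suggestion: name the mock container kinds rather than passing 0/1/2.
enum MockContainerType {
  AM_CONTAINER(0), NORMAL_CONTAINER(1), LABELED_CONTAINER(2);

  private final int code;
  MockContainerType(int code) { this.code = code; }
  int code() { return code; }
}

// ...inside the test loop, replacing the if/else chain from the patch:
MockContainerType type;
if (setAMContainer && i == 0) {
  type = MockContainerType.AM_CONTAINER;
} else if (setLabeledContainer && i == 1) {
  type = MockContainerType.LABELED_CONTAINER;
} else {
  type = MockContainerType.NORMAL_CONTAINER;
}
cLive.add(mockContainer(appAttId, cAlloc, unit, type.code()));
{code}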
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269875#comment-14269875 ] Hadoop QA commented on YARN-2933: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690901/YARN-2933-6.patch against trunk revision 20625c8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-gridmix. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6279//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6279//artifact/patchprocess/newPatchFindbugsWarningshadoop-gridmix.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6279//console This message is automatically generated. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269878#comment-14269878 ] Hadoop QA commented on YARN-2786: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690906/YARN-2786-20150108-1-without-yarn.cmd.patch against trunk revision 20625c8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6281//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6281//console This message is automatically generated. Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch, YARN-2786-20150108-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are acked
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2997: -- Summary: NM keeps sending already-sent completed containers to RM until containers are acked (was: NM keeps sending finished containers to RM until app is finished) NM keeps sending already-sent completed containers to RM until containers are acked --- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2997: -- Summary: NM keeps sending already-sent completed containers to RM until containers are removed from context (was: NM keeps sending already-sent completed containers to RM until containers are acked) NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269885#comment-14269885 ] Jian He commented on YARN-2997: --- patch looks good to me. NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269918#comment-14269918 ] Hudson commented on YARN-2997: -- FAILURE: Integrated in Hadoop-trunk-Commit #6833 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6833/]) YARN-2997. Fixed NodeStatusUpdater to not send already-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.7.0 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3014: - Summary: Replaces labels on a host should update all NM's labels on that host (was: Changing labels on a host should update all NM's labels on that host) Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3014) Changing labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3014: - Attachment: YARN-3014-1.patch Attached patch, and kick jenkins. This patch contains refactoring described in YARN-3016, which merged internalAdd/Remove/ReplaceLabels to one method. Changing labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3015) yarn classpath command should support same options as hadoop classpath.
[ https://issues.apache.org/jira/browse/YARN-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269973#comment-14269973 ] Hadoop QA commented on YARN-3015: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690927/YARN-3015.001.patch against trunk revision cc2a745. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6285//console This message is automatically generated. yarn classpath command should support same options as hadoop classpath. --- Key: YARN-3015 URL: https://issues.apache.org/jira/browse/YARN-3015 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Chris Nauroth Assignee: Varun Saxena Priority: Minor Attachments: YARN-3015.001.patch HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional expansion of the wildcards and bundling the classpath into a jar file containing a manifest with the Class-Path attribute. The other classpath commands should do the same for consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
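For reference, the HADOOP-10903 options that this issue proposes mirroring are used roughly as follows; the flag spellings here are recalled from that change and should be confirmed against the shell scripts in the target release:
{code}
# default: print the classpath with wildcard entries left unexpanded
hadoop classpath

# expand wildcard entries into the individual jar paths
hadoop classpath --glob

# write a jar whose manifest Class-Path attribute carries the classpath, then print its location
hadoop classpath --jar /tmp/hadoop-classpath.jar
{code}
The proposal is simply that {{yarn classpath}} (and the other classpath subcommands) accept the same options.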
[jira] [Resolved] (YARN-712) RMDelegationTokenSecretManager shouldn't start in non-secure mode
[ https://issues.apache.org/jira/browse/YARN-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He resolved YARN-712. -- Resolution: Won't Fix RMDelegationTokenSecretManager shouldn't start in non-secure mode -- Key: YARN-712 URL: https://issues.apache.org/jira/browse/YARN-712 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He RM will just be doing useless work as no tokens are issued. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3019) Enable work-preserving restart by default
Jian He created YARN-3019: - Summary: Enable work-preserving restart by default Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
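For reference, the change amounts to flipping this yarn-site.xml / yarn-default.xml property (name taken from the description above); work-preserving behaviour only takes effect once RM recovery itself is enabled via yarn.resourcemanager.recovery.enabled and a configured state store:
{code}
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
{code}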
[jira] [Updated] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3019: -- Summary: Enable RM work-preserving restart by default (was: Enable work-preserving restart by default ) Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers
Peter D Kirchner created YARN-3020: -- Summary: n similar addContainerRequest()s produce n*(n+1)/2 containers Key: YARN-3020 URL: https://issues.apache.org/jira/browse/YARN-3020 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.5.2, 2.5.1, 2.6.0, 2.5.0 Reporter: Peter D Kirchner BUG: If the application master calls addContainerRequest() n times, but with the same priority, I get 1+2+3+...+n = n*(n+1)/2 containers. If the application master calls addContainerRequest() n times, but with a unique priority each time, I get n containers (as I intended). Analysis: There is a logic problem in AMRMClientImpl.java. Although AMRMClientImpl.allocate() does an ask.clear(), on subsequent calls to addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest and increments the container count rather than starting anew, and does an addResourceRequestToAsk() which defeats the ask.clear(). From documentation and code comments, it was hard for me to discern the intended behavior of the API, but the inconsistency reported in this issue suggests one case or the other is implemented incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
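A minimal sketch of the scenario being reported, using the public AMRMClient API (the resource size and priority values are illustrative):
{code}
// Sketch of the report: n addContainerRequest() calls at the *same* priority.
// Per the bug, these asks are merged into one ResourceRequest whose container count
// keeps growing across heartbeats (1+2+...+n), whereas a unique priority per request
// yields the intended n containers.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class SamePriorityAskSketch {
  public static void addRequests(AMRMClient<ContainerRequest> amClient, int n) {
    Resource capability = Resource.newInstance(1024, 1);  // illustrative sizing
    Priority samePriority = Priority.newInstance(1);
    for (int i = 0; i < n; i++) {
      amClient.addContainerRequest(
          new ContainerRequest(capability, null, null, samePriority));
    }
  }
}
{code}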
[jira] [Commented] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270069#comment-14270069 ] Hadoop QA commented on YARN-3014: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690922/YARN-3014-1.patch against trunk revision cc2a745. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6284//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6284//console This message is automatically generated. Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2141) [Umbrella] Capture container and node resource consumption
[ https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269919#comment-14269919 ] Vinod Kumar Vavilapalli commented on YARN-2141: --- Related to, and very likely a dup of, YARN-1012, which is part of a larger effort. [Umbrella] Capture container and node resource consumption -- Key: YARN-2141 URL: https://issues.apache.org/jira/browse/YARN-2141 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Carlo Curino Priority: Minor Collecting per-container and per-node resource consumption statistics in a fairly granular manner, and making them available to both infrastructure code (e.g., schedulers) and users (e.g., AMs, or users directly via webapps), can facilitate several kinds of performance work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269944#comment-14269944 ] Anubhav Dhoot commented on YARN-2984: - Seems good. I tested it and it shows up on the JMX page with its information. I would rename ContainerUsageMetrics to just ContainerMetrics, as I think it's a good place for all container-related metrics instead of having multiple ContainerMetrics classes. Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux
[ https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270004#comment-14270004 ] Hadoop QA commented on YARN-1798: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12640845/YARN-1798.1.patch against trunk revision cc2a745. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6283//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6283//console This message is automatically generated. TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux - Key: YARN-1798 URL: https://issues.apache.org/jira/browse/YARN-1798 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: TestContainerLaunch-output.txt, TestContainerLaunch.txt, YARN-1798.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269905#comment-14269905 ] Hadoop QA commented on YARN-2786: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690906/YARN-2786-20150108-1-without-yarn.cmd.patch against trunk revision 708b1aa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6282//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6282//console This message is automatically generated. Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch, YARN-2786-20150108-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269938#comment-14269938 ] Hadoop QA commented on YARN-2421:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663584/yarn2421.patch against trunk revision 20625c8.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1215 javac compiler warnings (more than the trunk's current 1214 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6280//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6280//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6280//console

This message is automatically generated.

CapacityScheduler still allocates containers to an app in the FINISHING state
Key: YARN-2421
URL: https://issues.apache.org/jira/browse/YARN-2421
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.4.1
Reporter: Thomas Graves
Assignee: chang li
Attachments: yarn2421.patch, yarn2421.patch, yarn2421.patch

I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
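The fix direction suggested in the description (skip allocation for apps whose attempt has reached a terminal state) can be illustrated with a small, self-contained guard. The state names come from RMAppAttemptState, but where and how yarn2421.patch actually applies the check may differ.

{code}
import java.util.EnumSet;

import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptState;

public final class TerminalAttemptGuard {
  // States in which an attempt should not receive any new containers.
  private static final EnumSet<RMAppAttemptState> TERMINAL = EnumSet.of(
      RMAppAttemptState.FINISHING, RMAppAttemptState.FINISHED,
      RMAppAttemptState.FAILED, RMAppAttemptState.KILLED);

  private TerminalAttemptGuard() {
  }

  /** Returns true if the scheduler should skip allocation for this attempt. */
  public static boolean shouldSkipAllocation(RMAppAttemptState state) {
    return TERMINAL.contains(state);
  }
}
{code}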
[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux
[ https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269940#comment-14269940 ] Jian He commented on YARN-1798:

I think the issue is outdated now; maybe we can close it?

TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux
Key: YARN-1798
URL: https://issues.apache.org/jira/browse/YARN-1798
Project: Hadoop YARN
Issue Type: Test
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: TestContainerLaunch-output.txt, TestContainerLaunch.txt, YARN-1798.1.patch

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3015) yarn classpath command should support same options as hadoop classpath.
[ https://issues.apache.org/jira/browse/YARN-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3015:

Attachment: YARN-3015.001.patch

yarn classpath command should support same options as hadoop classpath.
Key: YARN-3015
URL: https://issues.apache.org/jira/browse/YARN-3015
Project: Hadoop YARN
Issue Type: Bug
Components: scripts
Reporter: Chris Nauroth
Assignee: Varun Saxena
Priority: Minor
Attachments: YARN-3015.001.patch

HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional expansion of the wildcards and bundling the classpath into a jar file containing a manifest with the Class-Path attribute. The other classpath commands should do the same for consistency.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
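The "classpath jar" mechanism that HADOOP-10903 brought to the {{hadoop classpath}} command can be shown with plain java.util.jar: an otherwise empty jar whose manifest carries the expanded classpath in its Class-Path attribute. This is a generic sketch with made-up entry names, not the Hadoop shell code the patch touches.

{code}
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJarSketch {
  public static void main(String[] args) throws Exception {
    Manifest manifest = new Manifest();
    Attributes main = manifest.getMainAttributes();
    main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // Class-Path entries are space-separated and resolved relative to the jar.
    main.put(Attributes.Name.CLASS_PATH, "lib/example-a.jar lib/example-b.jar conf/");
    // No jar entries are needed; the manifest alone carries the classpath.
    try (JarOutputStream out =
        new JarOutputStream(new FileOutputStream("classpath.jar"), manifest)) {
      out.flush();
    }
  }
}
{code}

Putting classpath.jar on the JVM classpath then behaves as if every listed entry were there directly, which is what makes the bundling option useful on platforms with tight command-line length limits.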
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269921#comment-14269921 ] Vinod Kumar Vavilapalli commented on YARN-2965:

Linking all related efforts. Related to and very likely a dup of YARN-1012, which is part of a larger effort.

Enhance Node Managers to monitor and report the resource usage on machines
Key: YARN-2965
URL: https://issues.apache.org/jira/browse/YARN-2965
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager, resourcemanager
Reporter: Robert Grandl
Assignee: Robert Grandl
Attachments: ddoc_RT.docx

This JIRA is about augmenting Node Managers to monitor the resource usage on the machine, aggregate these reports, and expose them to the RM.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
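As a rough, generic illustration of the "monitor and report machine-level usage" idea (not the design in ddoc_RT.docx or YARN-1012, which use the NM's own resource-calculator plumbing), the sketch below samples host utilization on a schedule using standard JDK beans; in the real feature such a snapshot would be attached to the NM-to-RM heartbeat.

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NodeUtilizationSampler {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      // One-minute system load average; -1.0 if the platform cannot report it.
      double load = os.getSystemLoadAverage();
      int cores = os.getAvailableProcessors();
      // A real NM report would also cover memory, disk and network counters.
      System.out.printf("load=%.2f cores=%d%n", load, cores);
    }, 0, 5, TimeUnit.SECONDS);
  }
}
{code}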
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269926#comment-14269926 ] Vinod Kumar Vavilapalli commented on YARN-2745:

Haven't read the design doc yet. Linking all related efforts so there are no duplicates. Related to and very likely a dup of YARN-1011.

Extend YARN to support multi-resource packing of tasks
Key: YARN-2745
URL: https://issues.apache.org/jira/browse/YARN-2745
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager, resourcemanager, scheduler
Reporter: Robert Grandl
Assignee: Robert Grandl
Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf

In this umbrella JIRA we propose an extension to existing scheduling techniques that accounts for all resources used by a task (CPU, memory, disk, network) and is able to achieve three competing objectives: fairness, improved cluster utilization, and reduced average job completion time.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
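The packing heuristic in the attached Tetris material is commonly summarized as an alignment score: the dot product between a task's demand vector and a machine's free-resource vector, with the best-aligned feasible placement preferred. The vector layout and method below are illustrative assumptions, not the API of any attached patch.

{code}
public final class PackingScore {
  private PackingScore() {
  }

  /**
   * Alignment score for placing a task on a machine. Both vectors use the same
   * ordering, e.g. {cpu, memory, disk, network}, in comparable units. Returns a
   * negative value when the task does not fit on the machine at all.
   */
  public static double alignment(double[] demand, double[] free) {
    double score = 0.0;
    for (int i = 0; i < demand.length; i++) {
      if (demand[i] > free[i]) {
        return -1.0; // infeasible placement
      }
      score += demand[i] * free[i];
    }
    return score;
  }
}
{code}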