[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp

2015-04-17 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500368#comment-14500368
 ] 

Yongjun Zhang commented on YARN-3021:
-

Thanks also to [~ka...@cloudera.com] for the earlier discussions; we worked 
out the release notes, which I just updated.


> YARN's delegation-token handling disallows certain trust setups to operate 
> properly over DistCp
> ---
>
> Key: YARN-3021
> URL: https://issues.apache.org/jira/browse/YARN-3021
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.3.0
>Reporter: Harsh J
>Assignee: Yongjun Zhang
> Fix For: 2.8.0
>
> Attachments: YARN-3021.001.patch, YARN-3021.002.patch, 
> YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, 
> YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, 
> YARN-3021.007.patch, YARN-3021.patch
>
>
> Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON 
> and B trusts COMMON (both trusts are one-way), and both A and B run HDFS + 
> YARN clusters.
> Now if one logs in with a COMMON credential and runs a job on A's YARN that 
> needs to access B's HDFS (such as a DistCp), the operation fails in the RM, 
> as it attempts a renewDelegationToken(…) synchronously during application 
> submission (to validate the managed token before it adds it to a scheduler 
> for automatic renewal). The call obviously fails because realm B will not 
> trust A's credentials (here, the RM's principal is the renewer).
> In the 1.x JobTracker the same call is present, but it is done asynchronously, 
> and once the renewal attempt failed we simply stopped scheduling any further 
> renewal attempts, rather than failing the job immediately.
> We should change the logic so that we attempt the renewal but tolerate the 
> failure and merely skip scheduling further renewals, rather than bubble an 
> error back to the client and fail the app submission. This way the old 
> behaviour is retained.
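
A minimal, hypothetical sketch of the proposed behaviour (class and helper names 
are illustrative, not the actual DelegationTokenRenewer code): attempt the 
renewal once at submission time, and on failure log and skip scheduling further 
renewals instead of bubbling the error back to the submitter.

{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.token.Token;

// Illustrative sketch only; not the real RM/DelegationTokenRenewer implementation.
public class TolerantRenewalSketch {

  /** Called during app submission with a token shipped by the client. */
  static void validateToken(Token<?> token, Configuration conf) {
    try {
      long nextExpiry = token.renew(conf);   // synchronous renewal attempt
      scheduleFurtherRenewals(token, nextExpiry);
    } catch (IOException | InterruptedException e) {
      // Cross-realm case: realm B refuses the RM as renewer. Tolerate the
      // failure and simply stop scheduling renewals; do not fail the app.
      System.err.println("Token renewal failed; skipping automatic renewal: " + e);
    }
  }

  private static void scheduleFurtherRenewals(Token<?> token, long nextExpiry) {
    // placeholder: add the token to the periodic renewal timer
  }
}
{code}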



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.

2015-04-17 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500398#comment-14500398
 ] 

Sangjin Lee commented on YARN-3431:
---

I know [~zjshen]'s updating the patch, but I'll provide some feedback based on 
the current patch and the discussion here. Generally I agree with the approach 
of using fields in TimelineEntity to store/retrieve specialized information. 
That would definitely help with JSON's (lack of) support for polymorphism.

With regard to the parent-child relationship, and relationships in general, this 
might be a bit of a change, but would it be better to have some kind of key or 
label for a relationship? It would help locate a particular relationship 
(e.g. parent) quickly, and help other use cases identify exactly the 
relationship they need to retrieve. Thoughts?

On a related note, I have a problem with prohibiting hierarchical timeline 
entities from having any relationships other than parent-child. For example, 
frameworks (e.g. mapreduce) may use hierarchical timeline entities to describe 
their hierarchy (job => task => task attempts), and these entities would have 
dotted lines to YARN system entities (app, containers, etc.) and vice versa. It 
would be a pretty severe restriction to prohibit them. If we adopt the above 
approach, we should be able to allow both, right?
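
A rough sketch of the keyed-relationship idea, using plain Java collections 
rather than the actual TimelineEntity fields (Identifier and the label strings 
below are illustrative only):

{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only; not the real TimelineEntity API.
public class KeyedRelationsSketch {

  /** (entityType, entityId) pair identifying a related timeline entity. */
  public static final class Identifier {
    final String type;
    final String id;
    Identifier(String type, String id) { this.type = type; this.id = id; }
  }

  // e.g. "PARENT" -> the job, "CHILD" -> task attempts, "RELATES_TO" -> containers
  private final Map<String, Set<Identifier>> relations = new HashMap<>();

  public void addRelation(String label, String type, String id) {
    relations.computeIfAbsent(label, k -> new HashSet<>()).add(new Identifier(type, id));
  }

  /** Locate a particular relationship (e.g. the parent) quickly by its label. */
  public Set<Identifier> getRelations(String label) {
    return relations.getOrDefault(label, Collections.emptySet());
  }
}
{code}

This would let hierarchical entities keep a labelled parent/child link while 
still carrying other labelled relationships (e.g. to YARN system entities) side 
by side.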

(FlowEntity.java)
- l. 58: do we want to set the id once we calculate it from scratch?

(TimelineEntity.java)
- l.88: Some javadoc would be helpful in explaining this constructor. It 
doesn't come through as very obvious.

> Sub resources of timeline entity needs to be passed to a separate endpoint.
> ---
>
> Key: YARN-3431
> URL: https://issues.apache.org/jira/browse/YARN-3431
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch
>
>
> We have TimelineEntity and some other entities as subclasses that inherit from 
> it. However, we only have a single endpoint, which consumes TimelineEntity 
> rather than its sub-classes, and this endpoint checks that the incoming request 
> body contains exactly a TimelineEntity object. The JSON data serialized from a 
> sub-class object does not seem to be treated as a TimelineEntity object, and 
> won't be deserialized into the corresponding sub-class object, which causes a 
> deserialization failure, as discussed in YARN-3334: 
> https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat

2015-04-17 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500395#comment-14500395
 ] 

Sunil G commented on YARN-3482:
---

Hi [~elgoiri]
bq.  better to report the resources utilized by the machine.

Do you mean total CPU, total memory, etc.? 
Could you please elaborate on how this can help in doing better resource 
allotment. 

As I see it, if CPU affinity is not set, the distribution will be more generic 
and it may not be so easy to derive from that.

> Report NM available resources in heartbeat
> --
>
> Key: YARN-3482
> URL: https://issues.apache.org/jira/browse/YARN-3482
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Inigo Goiri
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> NMs are usually collocated with other processes like HDFS, Impala or HBase. 
> To manage this scenario correctly, YARN should be aware of the actual 
> available resources. The proposal is to have an interface to dynamically 
> change the available resources and report this to the RM in every heartbeat.
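
A hypothetical sketch of such an interface (none of these names exist in YARN; 
this only illustrates the shape of the proposal):

{code}
// Hypothetical interface the NM could expose; the values would be sampled and
// piggybacked on each RM heartbeat alongside the usual node status.
public interface NodeAvailableResourceProvider {

  /** Memory (in MB) currently left for YARN after co-located processes. */
  long getAvailableMemoryMB();

  /** Virtual cores currently left for YARN after co-located processes. */
  int getAvailableVCores();
}
{code}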



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.

2015-04-17 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500405#comment-14500405
 ] 

zhihai xu commented on YARN-3491:
-

I uploaded a new patch YARN-3491.001.patch for review.
Thinking about it a bit more, the old patch may introduce a big delay if multiple 
containers are submitted at the same time.
For example, the following log shows 4 containers submitted at very close times:
{code}
2015-04-07 21:42:22,071 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_e30_1426628374875_110648_01_078264 transitioned from NEW to 
LOCALIZING
2015-04-07 21:42:22,074 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_e30_1426628374875_110652_01_093777 transitioned from NEW to 
LOCALIZING
2015-04-07 21:42:22,076 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_e30_1426628374875_110668_01_049049 transitioned from NEW to 
LOCALIZING
2015-04-07 21:42:22,078 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: 
Container container_e30_1426628374875_110668_01_085183 transitioned from NEW to 
LOCALIZING
{code}
The new patch can overlap the delay with the public localization of the previous 
container, which is a little better and more consistent with the behavior of the 
old code.
It is also better for containers that have only private resources and no public 
resources; in that case, no delay is added to the Dispatcher thread.
Finally, the change in the new patch is a little smaller than in the first patch.
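
For reference, a minimal sketch of the idea (not the actual PublicLocalizer 
code; names are illustrative): keep addResource() cheap on the dispatcher thread 
and push the slow local-dir initialization and the download into the 
localization thread pool, so it overlaps with earlier containers' work.

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; not the real PublicLocalizer.
public class PublicLocalizerSketch {

  private final ExecutorService localizationPool = Executors.newFixedThreadPool(4);

  /** Called on the Dispatcher thread: only enqueues work, returns quickly. */
  public void addResource(String resourcePath) {
    localizationPool.submit(() -> {
      // The slow parts run here, overlapped with previous containers' localization.
      initializeLocalDirsIfNeeded();   // the ~10+ ms-per-dir checkLocalDir work
      download(resourcePath);
    });
  }

  private void initializeLocalDirsIfNeeded() { /* checkLocalDir-style validation */ }

  private void download(String resourcePath) { /* FSDownload-style fetch */ }
}
{code}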

> PublicLocalizer#addResource is too slow.
> 
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on profiling, the bottleneck in PublicLocalizer#addResource is 
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is 
> very slow and takes about 10+ ms.
> The total delay will be approximately (number of local dirs) * 10+ ms.
> This delay is added for each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
> utilized. Instead of doing public resource localization in parallel 
> (multithreading), public resource localization is serialized most of the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the 
> Dispatcher thread will be blocked by PublicLocalizer#addResource for a long 
> time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500417#comment-14500417
 ] 

Hadoop QA commented on YARN-2003:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726222/0007-YARN-2003.patch
  against trunk revision c6b5203.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7382//console

This message is automatically generated.

> Support to process Job priority from Submission Context in 
> AppAttemptAddedSchedulerEvent [RM side]
> --
>
> Key: YARN-2003
> URL: https://issues.apache.org/jira/browse/YARN-2003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 
> 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 
> 0006-YARN-2003.patch, 0007-YARN-2003.patch
>
>
> AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
> the Submission Context and store it.
> Later this can be used by the Scheduler.
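
A tiny, hypothetical sketch of the idea (not the real AppAttemptAddedSchedulerEvent 
class): the scheduler event simply carries the priority read from the submission 
context so the scheduler can store it for later use.

{code}
// Illustrative only; field and class names do not match the actual YARN event.
public class AppAttemptAddedEventSketch {

  private final String appAttemptId;
  private final int priority;   // taken from the ApplicationSubmissionContext

  public AppAttemptAddedEventSketch(String appAttemptId, int priority) {
    this.appAttemptId = appAttemptId;
    this.priority = priority;
  }

  /** The scheduler reads this when adding the attempt and stores it. */
  public int getPriority() {
    return priority;
  }

  public String getAppAttemptId() {
    return appAttemptId;
  }
}
{code}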



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500433#comment-14500433
 ] 

Hadoop QA commented on YARN-3491:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726221/YARN-3491.001.patch
  against trunk revision c6b5203.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7383//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7383//console

This message is automatically generated.

> PublicLocalizer#addResource is too slow.
> 
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on profiling, the bottleneck in PublicLocalizer#addResource is 
> getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, which is 
> very slow and takes about 10+ ms.
> The total delay will be approximately (number of local dirs) * 10+ ms.
> This delay is added for each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
> utilized. Instead of doing public resource localization in parallel 
> (multithreading), public resource localization is serialized most of the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the 
> Dispatcher thread will be blocked by PublicLocalizer#addResource for a long 
> time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]

2015-04-17 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-2003:
--
Attachment: 0008-YARN-2003.patch

Fixing a test issue.

> Support to process Job priority from Submission Context in 
> AppAttemptAddedSchedulerEvent [RM side]
> --
>
> Key: YARN-2003
> URL: https://issues.apache.org/jira/browse/YARN-2003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 
> 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 
> 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch
>
>
> AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
> the Submission Context and store it.
> Later this can be used by the Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500481#comment-14500481
 ] 

Hadoop QA commented on YARN-3136:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726204/00012-YARN-3136.patch
  against trunk revision c6b5203.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 2 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7379//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/7379//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7379//console

This message is automatically generated.

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, 
> 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 
> 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 
> 0009-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.
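
One way to avoid the lock, sketched with plain Java collections (this is 
illustrative, not the actual AbstractYarnScheduler change): keep the containers 
to transfer per application in a concurrent map so a registering AM can read 
them without taking the scheduler-wide lock.

{code}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch only.
public class TransferredContainersSketch {

  private final Map<String, List<String>> transferredByApp = new ConcurrentHashMap<>();

  /** Called when containers survive an AM restart and should be handed over. */
  public void recordTransferred(String appId, String containerId) {
    transferredByApp
        .computeIfAbsent(appId, k -> new CopyOnWriteArrayList<>())
        .add(containerId);
  }

  /** Called from AM registration; no scheduler-wide lock required. */
  public List<String> getTransferredContainers(String appId) {
    return transferredByApp.getOrDefault(appId, Collections.emptyList());
  }
}
{code}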



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500484#comment-14500484
 ] 

Hadoop QA commented on YARN-3487:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726214/YARN-3487.003.patch
  against trunk revision c6b5203.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1207 javac 
compiler warnings (more than the trunk's current 1181 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
  
org.apache.hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices
  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils
  
org.apache.hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens
  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

  The following test timeouts occurred in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodeLabels

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7380//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/7380//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7380//console

This message is automatically generated.

> CapacityScheduler scheduler lock obtained unnecessarily
> ---
>
> Key: YARN-3487
> URL: https://issues.apache.org/jira/browse/YARN-3487
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: YARN-3487.001.patch, YARN-3487.002.patch, 
> YARN-3487.003.patch
>
>
> Recently saw a significant slowdown of applications on a large cluster, and 
> we noticed there were a large number of blocked threads on the RM.  Most of 
> the blocked threads were waiting for the CapacityScheduler lock while calling 
> getQueueInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500501#comment-14500501
 ] 

Hadoop QA commented on YARN-3410:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726211/0004-YARN-3410.patch
  against trunk revision c6b5203.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1207 javac 
compiler warnings (more than the trunk's current 1181 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler
  
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
  
org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs
  org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
  
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7381//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/7381//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7381//console

This message is automatically generated.

> YARN admin should be able to remove individual application records from 
> RMStateStore
> 
>
> Key: YARN-3410
> URL: https://issues.apache.org/jira/browse/YARN-3410
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, yarn
>Reporter: Wangda Tan
>Assignee: Rohith
>Priority: Critical
> Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 
> 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 
> 0004-YARN-3410.patch
>
>
> When the RM state store enters an unexpected state (one example is YARN-2340, 
> where an attempt is not in a final state but the app has already completed), 
> the RM can never come up unless the RMStateStore is formatted.
> I think we should support removing individual application records from the 
> RMStateStore, to give the RM admin the choice of either waiting for a fix or 
> formatting the state store.
> In addition, the RM should be able to report all fatal errors (which will shut 
> down the RM) when doing app recovery; this can save the admin some time in 
> removing apps in a bad state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-04-17 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500506#comment-14500506
 ] 

Sunil G commented on YARN-3136:
---

Hi [~jianhe]

I used the suppression below:
{noformat}
  



  
{noformat}

But I still get the same problem. Could I try to suppress it at the field 
level? Please suggest.

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, 
> 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, 
> 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, 
> 0009-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500513#comment-14500513
 ] 

Hadoop QA commented on YARN-2003:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726239/0008-YARN-2003.patch
  against trunk revision c6b5203.

{color:red}-1 patch{color}.  Trunk compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7384//console

This message is automatically generated.

> Support to process Job priority from Submission Context in 
> AppAttemptAddedSchedulerEvent [RM side]
> --
>
> Key: YARN-2003
> URL: https://issues.apache.org/jira/browse/YARN-2003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, 
> 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 
> 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch
>
>
> AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from 
> the Submission Context and store it.
> Later this can be used by the Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore

2015-04-17 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500526#comment-14500526
 ] 

Rohith commented on YARN-3410:
--

All tests failed with BindException. Jenkins needs to be kicked off again to 
get another report!

> YARN admin should be able to remove individual application records from 
> RMStateStore
> 
>
> Key: YARN-3410
> URL: https://issues.apache.org/jira/browse/YARN-3410
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, yarn
>Reporter: Wangda Tan
>Assignee: Rohith
>Priority: Critical
> Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 
> 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 
> 0004-YARN-3410.patch
>
>
> When the RM state store enters an unexpected state (one example is YARN-2340, 
> where an attempt is not in a final state but the app has already completed), 
> the RM can never come up unless the RMStateStore is formatted.
> I think we should support removing individual application records from the 
> RMStateStore, to give the RM admin the choice of either waiting for a fix or 
> formatting the state store.
> In addition, the RM should be able to report all fatal errors (which will shut 
> down the RM) when doing app recovery; this can save the admin some time in 
> removing apps in a bad state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps

2015-04-17 Thread Junping Du (JIRA)
Junping Du created YARN-3505:


 Summary: Node's Log Aggregation Report with SUCCEED should not 
cached in RMApps
 Key: YARN-3505
 URL: https://issues.apache.org/jira/browse/YARN-3505
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Junping Du
Assignee: Xuan Gong
Priority: Critical


Per the discussions in YARN-1402, we shouldn't cache every node's log 
aggregation report in RMApps forever, especially for those that finished with 
SUCCEED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1402) Related Web UI, CLI changes on exposing client API to check log aggregation status

2015-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500580#comment-14500580
 ] 

Hudson commented on YARN-1402:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7606 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7606/])
YARN-1402. Update related Web UI and CLI with exposing client API to check log 
aggregation status. Contributed by Xuan Gong. (junping_du: rev 
1db355a875c3ecc40a244045c6812e00c8d36ef1)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationReport.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/MockRMApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationReportPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_protos.proto
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ProtoUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/records/LogAggregationStatus.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java


> Related Web UI, CLI changes on exposing client API to check log aggregation 
> status
> --
>
> Key: YARN-1402
> URL: https://issues.apache.org/jira/browse/YARN-1402
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xuan Gong
>Assignee: Xuan Gong
> Fix For: 2.8.0
>
> Attachments: YARN-1402.1.patch, YARN-1402.2.patch, 
> YARN-1402.3.1.patch, YARN-1402.3.2.patch, YARN-1402.3.patch, YARN-1402.4.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3493:
--
Attachment: YARN-3493.4.patch

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.had

[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label

2015-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500631#comment-14500631
 ] 

Hudson commented on YARN-2696:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7607 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7607/])
YARN-2696. Queue sorting in CapacityScheduler should consider node label. 
Contributed by Wangda Tan (jianhe: rev d573f09fb93dbb711d504620af5d73840ea063a6)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/PartitionedQueueComparator.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ReservationQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/ResourceUsage.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/QueueCapacities.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestNodeLabelContainerAllocation.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java


> Queue sorting in CapacityScheduler should consider node label
> -
>
> Key: YARN-2696
> URL: https://issues.apache.org/jira/browse/YARN-2696
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler, resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.8.0
>
> Attachments: YARN-2696.1.patch, YARN-2696.2.patch, YARN-2696.3.patch, 
> YARN-2696.4.patch
>
>
> In the past, when trying to allocate containers under a parent queue in 
> CapacityScheduler, the parent queue would choose child queues by used 
> resource, from smallest to largest. 
> Now we support node label in

[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.

2015-04-17 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500672#comment-14500672
 ] 

Zhijie Shen commented on YARN-3431:
---

[~sjlee0], how about we do this: instead of using relates_to/is_related_to 
to store the parent-child relationship, we put it into the info section. Then we 
can search for the parent-child relationship quickly, and we don't disturb the 
normal usage of relates_to/is_related_to. Does that sound good?

> Sub resources of timeline entity needs to be passed to a separate endpoint.
> ---
>
> Key: YARN-3431
> URL: https://issues.apache.org/jira/browse/YARN-3431
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch
>
>
> We have TimelineEntity and some other entities as subclasses that inherit from 
> it. However, we only have a single endpoint, which consumes TimelineEntity 
> rather than its sub-classes, and this endpoint checks that the incoming request 
> body contains exactly a TimelineEntity object. The JSON data serialized from a 
> sub-class object does not seem to be treated as a TimelineEntity object, and 
> won't be deserialized into the corresponding sub-class object, which causes a 
> deserialization failure, as discussed in YARN-3334: 
> https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-04-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3136:
--
Attachment: 00013-YARN-3136.patch

> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, 
> 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 
> 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 
> 0008-YARN-3136.patch, 0009-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration

2015-04-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500680#comment-14500680
 ] 

Jian He commented on YARN-3136:
---

Hi [~sunilg], we can suppress at the field level. Just uploaded a patch myself.


> getTransferredContainers can be a bottleneck during AM registration
> ---
>
> Key: YARN-3136
> URL: https://issues.apache.org/jira/browse/YARN-3136
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Sunil G
> Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, 
> 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, 
> 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, 
> 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, 
> 0008-YARN-3136.patch, 0009-YARN-3136.patch
>
>
> While examining RM stack traces on a busy cluster I noticed a pattern of AMs 
> stuck waiting for the scheduler lock trying to call getTransferredContainers. 
>  The scheduler lock is highly contended, especially on a large cluster with 
> many nodes heartbeating, and it would be nice if we could find a way to 
> eliminate the need to grab this lock during this call.  We've already done 
> similar work during AM allocate calls to make sure they don't needlessly grab 
> the scheduler lock, and it would be good to do so here as well, if possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.

2015-04-17 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500737#comment-14500737
 ] 

Sangjin Lee commented on YARN-3431:
---

Then would it be mapped as something like "PARENT" => parent entity identifier, 
etc.?

> Sub resources of timeline entity needs to be passed to a separate endpoint.
> ---
>
> Key: YARN-3431
> URL: https://issues.apache.org/jira/browse/YARN-3431
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch
>
>
> We have TimelineEntity and some other entities as subclasses that inherit from 
> it. However, we only have a single endpoint, which consumes TimelineEntity 
> rather than its sub-classes, and this endpoint checks that the incoming request 
> body contains exactly a TimelineEntity object. The JSON data serialized from a 
> sub-class object does not seem to be treated as a TimelineEntity object, and 
> won't be deserialized into the corresponding sub-class object, which causes a 
> deserialization failure, as discussed in YARN-3334: 
> https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.

2015-04-17 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500755#comment-14500755
 ] 

Zhijie Shen commented on YARN-3431:
---

Could be. Or addParent(entity_type, entity_id) could be translated to 
info.add("PARENT:" + entity_type, entity_id), and child similarly. I think the 
former is better for getting the set of children, while the latter would be good 
for searching for a single child/parent.
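
A small sketch of the two encodings being discussed, using a plain Map in place 
of the real TimelineEntity info section (helper names are illustrative only):

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative only; not the actual TimelineEntity API.
public class ParentChildInInfoSketch {

  private final Map<String, Object> info = new HashMap<>();

  // Former option: a single well-known key; fetching the parent (or a set of
  // children stored under a "CHILDREN" key) is a direct lookup.
  public void setParent(String entityType, String entityId) {
    info.put("PARENT", entityType + "/" + entityId);
  }

  // Latter option: the key carries the type, e.g. "PARENT:MAPREDUCE_JOB" -> jobId,
  // which makes it easy to look up a single parent/child of a known type.
  public void addParent(String entityType, String entityId) {
    info.put("PARENT:" + entityType, entityId);
  }

  public Object get(String key) {
    return info.get(key);
  }
}
{code}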

> Sub resources of timeline entity needs to be passed to a separate endpoint.
> ---
>
> Key: YARN-3431
> URL: https://issues.apache.org/jira/browse/YARN-3431
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch
>
>
> We have TimelineEntity and some other entities as subclasses that inherit from 
> it. However, we only have a single endpoint, which consumes TimelineEntity 
> rather than its sub-classes, and this endpoint checks that the incoming request 
> body contains exactly a TimelineEntity object. The JSON data serialized from a 
> sub-class object does not seem to be treated as a TimelineEntity object, and 
> won't be deserialized into the corresponding sub-class object, which causes a 
> deserialization failure, as discussed in YARN-3334: 
> https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500760#comment-14500760
 ] 

Hadoop QA commented on YARN-3493:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726253/YARN-3493.4.patch
  against trunk revision 1db355a.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7385//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7385//console

This message is automatically generated.

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-1

[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily

2015-04-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500767#comment-14500767
 ] 

Wangda Tan commented on YARN-3487:
--

Re-triggered Jenkins

> CapacityScheduler scheduler lock obtained unnecessarily
> ---
>
> Key: YARN-3487
> URL: https://issues.apache.org/jira/browse/YARN-3487
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: YARN-3487.001.patch, YARN-3487.002.patch, 
> YARN-3487.003.patch
>
>
> Recently saw a significant slowdown of applications on a large cluster, and 
> we noticed there were a large number of blocked threads on the RM.  Most of 
> the blocked threads were waiting for the CapacityScheduler lock while calling 
> getQueueInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500787#comment-14500787
 ] 

Wangda Tan commented on YARN-3493:
--

Some comments:
- validateResourceRequest -> normalizeAndValidateRequest, with isRecovery set 
to false, and the method should be package-visible
- normalizeNodeLabelForRequest -> normalizeNodeLabelExpressionInRequest, and it 
should be private

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceM

[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500795#comment-14500795
 ] 

Hadoop QA commented on YARN-3410:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726211/0004-YARN-3410.patch
  against trunk revision d573f09.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7386//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7386//console

This message is automatically generated.

> YARN admin should be able to remove individual application records from 
> RMStateStore
> 
>
> Key: YARN-3410
> URL: https://issues.apache.org/jira/browse/YARN-3410
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, yarn
>Reporter: Wangda Tan
>Assignee: Rohith
>Priority: Critical
> Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, 
> 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, 
> 0004-YARN-3410.patch
>
>
> When the RM state store enters an unexpected state (one example is YARN-2340, 
> where an attempt is not in a final state but the app has already completed), 
> the RM can never come up unless the RMStateStore is formatted.
> We should support removing individual application records from the 
> RMStateStore, so the RM admin can choose between waiting for a fix and 
> formatting the state store.
> In addition, the RM should report all fatal errors (which will shut down the 
> RM) during app recovery; this can save the admin time when removing apps in a 
> bad state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3493:
--
Attachment: YARN-3493.5.patch

Thanks for the review. 
Updated the patch accordingly.

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, YARN-3493.5.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.Abs

[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3493:
--
Attachment: (was: YARN-3493.5.patch)

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org

[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-3493:
--
Attachment: YARN-3493.5.patch

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, YARN-3493.5.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> a

[jira] [Commented] (YARN-3451) Add start time and Elapsed in ApplicationAttemptReport and display the same in RMAttemptBlock WebUI

2015-04-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500843#comment-14500843
 ] 

Jian He commented on YARN-3451:
---

looks good, +1

> Add start time and Elapsed in ApplicationAttemptReport and display the same 
> in RMAttemptBlock WebUI
> ---
>
> Key: YARN-3451
> URL: https://issues.apache.org/jira/browse/YARN-3451
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, webapp
>Reporter: Rohith
>Assignee: Rohith
> Attachments: 0001-YARN-3451.patch, 0001-YARN-3451.patch, Screen Shot 
> 2015-04-11 at 12.38.05 AM.png
>
>
> ApplicationReport and ApplicationBlock already show *Started:* and *Elapsed:* 
> times; it would be useful to include the start time and elapsed time in 
> ApplicationAttemptReport and display them in ApplicationAttemptBlock. 
> This gives more granular debugging ability when analyzing issues with 
> multiple attempt failures, such as an attempt timing out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3451) Add start time and Elapsed in ApplicationAttemptReport and display the same in RMAttemptBlock WebUI

2015-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500867#comment-14500867
 ] 

Hudson commented on YARN-3451:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7608 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7608/])
YARN-3451. Display attempt start time and elapsed time on the web UI. 
Contributed by Rohith Sharmaks (jianhe: rev 
6779467ab6fcc6a02d0e8c80b138cc9df1aa831e)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationAttemptReportPBImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationAttemptReport.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestYarnClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/ProtocolHATestBase.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java


> Add start time and Elapsed in ApplicationAttemptReport and display the same 
> in RMAttemptBlock WebUI
> ---
>
> Key: YARN-3451
> URL: https://issues.apache.org/jira/browse/YARN-3451
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: api, webapp
>Reporter: Rohith
>Assignee: Rohith
> Fix For: 2.8.0
>
> Attachments: 0001-YARN-3451.patch, 0001-YARN-3451.patch, Screen Shot 
> 2015-04-11 at 12.38.05 AM.png
>
>
> ApplicationReport and ApplicationBlock already show *Started:* and *Elapsed:* 
> times; it would be useful to include the start time and elapsed time in 
> ApplicationAttemptReport and display them in ApplicationAttemptBlock. 
> This gives more granular debugging ability when analyzing issues with 
> multiple attempt failures, such as an attempt timing out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3497) ContainerManagementProtocolProxy modifies IPC timeout conf without making a copy

2015-04-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500873#comment-14500873
 ] 

Jian He commented on YARN-3497:
---

Should we change the code below to use this.conf as well?
{code}
conf.setInt(
  CommonConfigurationKeysPublic.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY,
  0);
{code}
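
For reference, a minimal sketch of the copy-before-modify idea (a sketch only, 
assuming the standard Configuration copy constructor; the actual patch may 
differ):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;

// Sketch only: clone the caller's conf so the idle-time override stays local
// to the proxy and does not leak into other users of the shared conf.
public class ConfCopySketch {
  private final Configuration conf;

  public ConfCopySketch(Configuration callerConf) {
    this.conf = new Configuration(callerConf); // defensive copy
    this.conf.setInt(
        CommonConfigurationKeysPublic.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0);
  }
}
{code}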

> ContainerManagementProtocolProxy modifies IPC timeout conf without making a 
> copy
> 
>
> Key: YARN-3497
> URL: https://issues.apache.org/jira/browse/YARN-3497
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-3497.001.patch
>
>
> yarn-client's ContainerManagementProtocolProxy is updating 
> ipc.client.connection.maxidletime in the conf passed in without making a copy 
> of it. That modification "leaks" into other systems using the same conf and 
> can cause them to set up RPC connections with a timeout of zero as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500891#comment-14500891
 ] 

Hadoop QA commented on YARN-3487:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726214/YARN-3487.003.patch
  against trunk revision d573f09.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7387//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7387//console

This message is automatically generated.

> CapacityScheduler scheduler lock obtained unnecessarily
> ---
>
> Key: YARN-3487
> URL: https://issues.apache.org/jira/browse/YARN-3487
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: YARN-3487.001.patch, YARN-3487.002.patch, 
> YARN-3487.003.patch
>
>
> Recently saw a significant slowdown of applications on a large cluster, and 
> we noticed there were a large number of blocked threads on the RM.  Most of 
> the blocked threads were waiting for the CapacityScheduler lock while calling 
> getQueueInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat

2015-04-17 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500894#comment-14500894
 ] 

Inigo Goiri commented on YARN-3482:
---

Hi Sunil G, yes, I'm talking about Total CPU and Total Memory.
Combining this with YARN-3481, we can estimate the load on the node that is not 
caused by the containers (i.e., external processes). Right now, the server 
could be overloaded by HBase, for example, and we would still send more load 
there. (A rough sketch of this estimate follows below.)

As Karthik Kambatla mentions, this would be a very conservative scenario where 
the external processes have absolute priority. This might be the desired 
behavior for some users, but the proposal is to also add an interface to 
dynamically change the amount of available resources according to the behavior 
of the external processes. Both approaches target the same problem and are 
complementary/orthogonal.

I understand this other approach of sending node utilization might be a little 
out of scope for this JIRA, but I could open a new one for this functionality.
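
For illustration only, a rough sketch of the estimate mentioned above (all 
names are assumptions, not an existing YARN API; YARN-3481/YARN-3482 define 
the real plumbing):
{code}
// Sketch: estimate how much of the node external processes (e.g. HBase) use,
// and what is left over for YARN containers. All names are illustrative.
public final class NodeHeadroomSketch {
  private NodeHeadroomSketch() {}

  public static long availableForYarnMB(long nodeCapacityMB,
                                        long nodeUsedMB,
                                        long containersUsedMB) {
    // Load not attributable to containers = external processes.
    long externalUsedMB = Math.max(0, nodeUsedMB - containersUsedMB);
    // Conservative headroom: external processes get absolute priority.
    return Math.max(0, nodeCapacityMB - externalUsedMB);
  }
}
{code}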

> Report NM available resources in heartbeat
> --
>
> Key: YARN-3482
> URL: https://issues.apache.org/jira/browse/YARN-3482
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager, resourcemanager
>Affects Versions: 2.7.0
>Reporter: Inigo Goiri
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> NMs are usually collocated with other processes like HDFS, Impala or HBase. 
> To manage this scenario correctly, YARN should be aware of the actual 
> available resources. The proposal is to have an interface to dynamically 
> change the available resources and report this to the RM in every heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500934#comment-14500934
 ] 

Hadoop QA commented on YARN-3493:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12726282/YARN-3493.5.patch
  against trunk revision d573f09.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/7388//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7388//console

This message is automatically generated.

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, YARN-3493.5.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)

[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler

2015-04-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500937#comment-14500937
 ] 

Jian He commented on YARN-3463:
---


- Can {{this.schedulableEntities != null}} ever be true here? A new ordering 
policy object is created each time, so I think we can just initialize 
{{this.comparator}} and {{this.schedulableEntities}} inside the 
FifoOrderingPolicy constructor and remove the setComparator method (see the 
sketch at the end of this comment).
{code}
if (this.schedulableEntities != null) {
  schedulableEntities.addAll(this.schedulableEntities);
}
{code}
- Shouldn't this be inside the {{removed}} check? Otherwise the completion of 
any container the policy does not know about will cause a reordering.
{code}
orderingPolicy.containerReleased(application, rmContainer);
{code}
- getStatusMessage -> getInfo ?
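
To make the first point concrete, a minimal sketch of initializing both fields 
in the constructor (illustrative names only; the actual FifoOrderingPolicy in 
the patch may look different):
{code}
import java.util.Comparator;
import java.util.TreeSet;

// Sketch only: build the comparator and the backing set together in the
// constructor, so no separate setComparator() call (and no null check) is
// needed. Names are illustrative.
class FifoOrderingPolicySketch<S> {
  private final Comparator<S> comparator;
  private final TreeSet<S> schedulableEntities;

  FifoOrderingPolicySketch(Comparator<S> fifoComparator) {
    this.comparator = fifoComparator;
    this.schedulableEntities = new TreeSet<S>(this.comparator);
  }
}
{code}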

> Integrate OrderingPolicy Framework with CapacityScheduler
> -
>
> Key: YARN-3463
> URL: https://issues.apache.org/jira/browse/YARN-3463
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Craig Welch
>Assignee: Craig Welch
> Attachments: YARN-3463.50.patch, YARN-3463.61.patch, 
> YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, 
> YARN-3463.67.patch, YARN-3463.68.patch
>
>
> Integrate the OrderingPolicy Framework with the CapacityScheduler



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500940#comment-14500940
 ] 

Jian He commented on YARN-3493:
---

TestAMRestart failure unrelated, passing locally.

> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, YARN-3493.5.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208)
> 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:serviceStart(579)) - Failed to load/recover state
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.

[jira] [Updated] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue

2015-04-17 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3487:
-
Summary: CapacityScheduler scheduler lock obtained unnecessarily when 
calling getQueue  (was: CapacityScheduler scheduler lock obtained unnecessarily)

> CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue
> -
>
> Key: YARN-3487
> URL: https://issues.apache.org/jira/browse/YARN-3487
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: YARN-3487.001.patch, YARN-3487.002.patch, 
> YARN-3487.003.patch
>
>
> Recently saw a significant slowdown of applications on a large cluster, and 
> we noticed there were a large number of blocked threads on the RM.  Most of 
> the blocked threads were waiting for the CapacityScheduler lock while calling 
> getQueueInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed

2015-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500946#comment-14500946
 ] 

Hudson commented on YARN-3493:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7609 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7609/])
YARN-3493. RM fails to come up with error "Failed to load/recover state" when 
mem settings are changed. (Jian He via wangda) (wangda: rev 
f65eeb412d140a3808bcf99344a9f3a965918f70)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* hadoop-yarn-project/CHANGES.txt


> RM fails to come up with error "Failed to load/recover state" when  mem 
> settings are changed
> 
>
> Key: YARN-3493
> URL: https://issues.apache.org/jira/browse/YARN-3493
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.0
>Reporter: Sumana Sathish
>Assignee: Jian He
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, 
> YARN-3493.4.patch, YARN-3493.5.patch, yarn-yarn-resourcemanager.log.zip
>
>
> RM fails to come up for the following case:
> 1. Change yarn.nodemanager.resource.memory-mb and 
> yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml
> 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in 
> background and wait for the job to reach running state
> 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 
> before the above job completes
> 4. Restart RM
> 5. RM fails to come up with the below error
> {code:title= RM error for Mem settings changed}
>  - RM app submission failed in validating AM resource request for application 
> application_1429094976272_0008
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested memory < 0, or requested memory > max configured, 
> requestedMemory=3072, maxMemory=2048
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.j

[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue

2015-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500954#comment-14500954
 ] 

Hudson commented on YARN-3487:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7610 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7610/])
YARN-3487. CapacityScheduler scheduler lock obtained unnecessarily when calling 
getQueue (Jason Lowe via wangda) (wangda: rev 
f47a5763acd55cb0b3f16152c7f8df06ec0e09a9)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* hadoop-yarn-project/CHANGES.txt


> CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue
> -
>
> Key: YARN-3487
> URL: https://issues.apache.org/jira/browse/YARN-3487
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.6.0
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Fix For: 2.7.1
>
> Attachments: YARN-3487.001.patch, YARN-3487.002.patch, 
> YARN-3487.003.patch
>
>
> Recently saw a significant slowdown of applications on a large cluster, and 
> we noticed there were a large number of blocked threads on the RM.  Most of 
> the blocked threads were waiting for the CapacityScheduler lock while calling 
> getQueueInfo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3506) Error handling on NM reporting invalid NodeLabels in distributed Node Label configuration

2015-04-17 Thread Naganarasimha G R (JIRA)
Naganarasimha G R created YARN-3506:
---

 Summary: Error handling on NM reporting invalid NodeLabels in 
distributed Node Label configuration
 Key: YARN-3506
 URL: https://issues.apache.org/jira/browse/YARN-3506
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, resourcemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R


As per the 
[conclusion|https://issues.apache.org/jira/browse/YARN-2495?focusedCommentId=14358109&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14358109]
 in YARN-2495, the following error handling needs to be done:
* Show/log a diagnostic on the RM (nodes) page and the NM page saying the 
label is invalid. (Needs a web UI change; can be done in a separate task)
* Make the node's labels empty, so that applications can continue to use the 
node.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3504) TestRMRestart fails occasionally in trunk

2015-04-17 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500968#comment-14500968
 ] 

Naganarasimha G R commented on YARN-3504:
-

Seems like a duplicate of YARN-2871.

> TestRMRestart fails occasionally in trunk
> -
>
> Key: YARN-3504
> URL: https://issues.apache.org/jira/browse/YARN-3504
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Xuan Gong
>Priority: Minor
>
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
> Stacktrace
> org.mockito.exceptions.verification.TooLittleActualInvocations: 
> rMAppManager.logApplicationSummary(
> isA(org.apache.hadoop.yarn.api.records.ApplicationId)
> );
> Wanted 3 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969)
> But was 2 times:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled

2015-04-17 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500975#comment-14500975
 ] 

Naganarasimha G R commented on YARN-2740:
-

Thanks for the comments [~wangda], 

bq. I think it's not a big problem, NM doesn't need to know "x" being removed, 
the logic should be, NM reports label, and RM allocate according to label, NM 
should just move on if adding label failed 
Well, IIUC, based on your reply to the first point ??prevent admin remove 
clusterNodeLabel when distributed enabled??, do we even need to worry about 
this second point, given that the user will not be able to remove a cluster 
node label?
bq. as what we done in YARN-2495. My opinion here is not add extra RM->NM 
communicate.
As per the last discussion in YARN-2495, you concluded in this 
[comment|https://issues.apache.org/jira/browse/YARN-2495?focusedCommentId=14358109&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14358109]
 that 
{quote}
* Show/log a diagnostic on the RM (nodes) page and the NM page saying the 
label is invalid. (Needs a web UI change; can be done in a separate task)
* Make the node's labels empty, so that applications can continue to use the 
node.
{quote}
Based on this, I mentioned that an RM->NM communication/notification would be 
required, since labels are sent only on change from the NM side, and the NM 
otherwise cannot show that there was an error in reporting labels. By the way, 
I have raised a new JIRA, YARN-3506, for the error handling reported in 
YARN-2495.
The test failure is not related to this patch; I will work on {{prevent admin 
remove clusterNodeLabel when distributed enabled}} and resubmit the patch.

> ResourceManager side should properly handle node label modifications when 
> distributed node label configuration enabled
> --
>
> Key: YARN-2740
> URL: https://issues.apache.org/jira/browse/YARN-2740
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
> Fix For: 2.8.0
>
> Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, 
> YARN-2740.20150327-1.patch, YARN-2740.20150411-1.patch, 
> YARN-2740.20150411-2.patch, YARN-2740.20150411-3.patch, 
> YARN-2740.20150417-1.patch
>
>
> According to YARN-2495, when distributed node label configuration is enabled:
> - RMAdmin / REST API should reject change-labels-on-node operations.
> - CommonNodeLabelsManager shouldn't persist labels on nodes when NMs do 
> heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.

2015-04-17 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-3431:
--
Attachment: YARN-3431.4.patch

Uploaded a new patch:

1. Use the info map to store the parent/children relationship (see the sketch 
after this list).
2. Fix a couple of bugs that I found in local testing.
3. Change the prototype to the real implementation.
4. Set the entity Id from its parts.
5. Add javadoc describing the constructor for the proxy case.
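
For illustration only, a minimal sketch of what item 1 amounts to, assuming the 
v2 {{TimelineEntity}} exposes a generic info map via {{addInfo}}; the key names 
and identifier values below are made up, not taken from the patch:
{code}
import java.util.Arrays;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;

public class ParentChildInfoExample {
  public static TimelineEntity buildEntity() {
    TimelineEntity entity = new TimelineEntity();
    entity.setType("YARN_APPLICATION");
    entity.setId("application_1429000000000_0001");
    // Assumed key names: the patch may use different SYSTEM_INFO_* constants.
    entity.addInfo("SYSTEM_INFO_PARENT_ENTITY", "cluster_flow_run_1");
    entity.addInfo("SYSTEM_INFO_CHILD_ENTITIES",
        Arrays.asList("appattempt_1429000000000_0001_000001"));
    return entity;
  }
}
{code}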

> Sub resources of timeline entity needs to be passed to a separate endpoint.
> ---
>
> Key: YARN-3431
> URL: https://issues.apache.org/jira/browse/YARN-3431
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch, 
> YARN-3431.4.patch
>
>
> We have TimelineEntity and some other entities as subclasses that inherit from 
> it. However, we only have a single endpoint, which consumes TimelineEntity 
> rather than its subclasses, and this endpoint checks that the incoming request 
> body contains exactly a TimelineEntity object. The JSON data serialized from a 
> subclass object is not treated as a TimelineEntity object and won't be 
> deserialized into the corresponding subclass object, which causes 
> deserialization failures, as discussed in YARN-3334: 
> https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS

2015-04-17 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500998#comment-14500998
 ] 

Zhijie Shen commented on YARN-3046:
---

In the new patch, the task entity ID is set correctly. But can we use reflection, 
or make "getTaskId" an interface method of TaskEvent, to simplify the code change?

> [Event producers] Implement MapReduce AM writing some MR metrics to ATS
> ---
>
> Key: YARN-3046
> URL: https://issues.apache.org/jira/browse/YARN-3046
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Junping Du
> Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, 
> YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch, 
> YARN-3046-v3.patch
>
>
> Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes 
> written) and have the MR AM write the framework-specific metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.

2015-04-17 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501002#comment-14501002
 ] 

zhihai xu commented on YARN-3491:
-

I did more profiling in checkLocalDir, and the result really surprised me.
The most time-consuming call is status.getPermission(), not lfs.getFileStatus.
status.getPermission() takes 4 or 5 ms, and checkLocalDir calls 
status.getPermission() three times.
That is why checkLocalDir takes 10+ ms.
{code}
  private boolean checkLocalDir(String localDir) {

    Map<Path, FsPermission> pathPermissionMap =
        getLocalDirsPathPermissionsMap(localDir);

    for (Map.Entry<Path, FsPermission> entry : pathPermissionMap.entrySet()) {
      FileStatus status;
      try {
        status = lfs.getFileStatus(entry.getKey());
      } catch (Exception e) {
        String msg =
            "Could not carry out resource dir checks for " + localDir
                + ", which was marked as good";
        LOG.warn(msg, e);
        throw new YarnRuntimeException(msg, e);
      }

      if (!status.getPermission().equals(entry.getValue())) {
        String msg =
            "Permissions incorrectly set for dir " + entry.getKey()
                + ", should be " + entry.getValue() + ", actual value = "
                + status.getPermission();
        LOG.warn(msg);
        throw new YarnRuntimeException(msg);
      }
    }
    return true;
  }
{code}

Then I went deeper into the source code to find out why status.getPermission 
takes most of the time:
lfs.getFileStatus returns a RawLocalFileSystem#DeprecatedRawLocalFileStatus, 
whose getPermission is lazy:
{code}
public FsPermission getPermission() {
  if (!isPermissionLoaded()) {
    loadPermissionInfo();
  }
  return super.getPermission();
}
{code}

So status.getPermission calls loadPermissionInfo on first use.
Based on the following code, loadPermissionInfo is the bottleneck: it runs 
"ls -ld" in a child process to get the permission, which is really slow.
{code}
/// loads permissions, owner, and group from `ls -ld`
private void loadPermissionInfo() {
  IOException e = null;
  try {
String output = FileUtil.execCommand(new File(getPath().toUri()), 
Shell.getGetPermissionCommand());
StringTokenizer t =
new StringTokenizer(output, Shell.TOKEN_SEPARATOR_REGEX);
//expected format
//-rw-------    1     username    groupname ...
String permission = t.nextToken();
if (permission.length() > FsPermission.MAX_PERMISSION_LENGTH) {
  //files with ACLs might have a '+'
  permission = permission.substring(0,
FsPermission.MAX_PERMISSION_LENGTH);
}
setPermission(FsPermission.valueOf(permission));
t.nextToken();

String owner = t.nextToken();
// If on windows domain, token format is DOMAIN\\user and we want to
// extract only the user name
if (Shell.WINDOWS) {
  int i = owner.indexOf('\\');
  if (i != -1)
owner = owner.substring(i + 1);
}
setOwner(owner);

setGroup(t.nextToken());
  } catch (Shell.ExitCodeException ioe) {
if (ioe.getExitCode() != 1) {
  e = ioe;
} else {
  setPermission(null);
  setOwner(null);
  setGroup(null);
}
  } catch (IOException ioe) {
e = ioe;
  } finally {
if (e != null) {
  throw new RuntimeException("Error while running command to get " +
 "file permissions : " + 
 StringUtils.stringifyException(e));
}
  }
}
{code}

We should call getPermission as little as possible in the future :)
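
To make the cost easy to reproduce, here is a minimal, hypothetical timing 
sketch (not part of any patch) that compares getFileStatus with the first 
getPermission call on the local file system; on platforms where the permission 
is loaded lazily via "ls -ld", the second number should dominate:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PermissionTiming {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args[0]);  // e.g. one of the NM local dirs
    FileSystem fs = FileSystem.getLocal(new Configuration());

    long t0 = System.nanoTime();
    FileStatus status = fs.getFileStatus(dir);  // cheap: stat-like lookup
    long t1 = System.nanoTime();
    status.getPermission();                     // may fork "ls -ld" on first use
    long t2 = System.nanoTime();

    System.out.printf("getFileStatus: %.2f ms, first getPermission: %.2f ms%n",
        (t1 - t0) / 1e6, (t2 - t1) / 1e6);
  }
}
{code}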

> PublicLocalizer#addResource is too slow.
> 
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
>
> Based on the profiling, the bottleneck in PublicLocalizer#addResource is 
> getInitializedLocalDirs, which calls checkLocalDir.
> checkLocalDir is very slow and takes about 10+ ms.
> The total delay will be approximately the number of local dirs * 10+ ms,
> and this delay is added for each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully 
> utilized. Instead of doing public resource localization in 
> parallel (multithreaded), public resource localization is serialized most of 
> the time.
> PublicLocalizer#addResource also runs in the Dispatcher thread, 
> so the Dispatcher thread will be blocked by PublicLocalizer#addResource for a 
> long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS

2015-04-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501015#comment-14501015
 ] 

Junping Du commented on YARN-3046:
--

bq. In the new patch, the task entity ID is set correctly. But can we use 
reflection, or make "getTaskId" an interface method of TaskEvent, to simplify 
the code change?
I agree that the code could be more concise if we differentiate job entities 
from task entities with reflection or some other way. Coincidentally, I also put 
some ideas and comments on MAPREDUCE-6318. However, can we do this refactoring 
together with the v1 timeline service in MAPREDUCE-6318? The code here is more 
than acceptable as it stands, I believe.

> [Event producers] Implement MapReduce AM writing some MR metrics to ATS
> ---
>
> Key: YARN-3046
> URL: https://issues.apache.org/jira/browse/YARN-3046
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Junping Du
> Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, 
> YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch, 
> YARN-3046-v3.patch
>
>
> Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes 
> written) and have the MR AM write the framework-specific metrics to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS

2015-04-17 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501026#comment-14501026
 ] 

Junping Du commented on YARN-3044:
--

Thanks, [~Naganarasimha], for updating the patch! 
It looks like the current patch depends on YARN-3390, so I will review YARN-3390 
first and come back to your patch soon. 

> [Event producers] Implement RM writing app lifecycle events to ATS
> --
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Sangjin Lee
>Assignee: Naganarasimha G R
> Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, 
> YARN-3044.20150416-1.patch
>
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers

2015-04-17 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501037#comment-14501037
 ] 

Jun Gong commented on YARN-3474:


Any comments are appreciated.

> Add a way to let NM wait RM to come back, not kill running containers
> -
>
> Key: YARN-3474
> URL: https://issues.apache.org/jira/browse/YARN-3474
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.6.0
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-3474.01.patch
>
>
> When RM HA is enabled and the active RM shuts down, the standby RM will become 
> active and recover apps and attempts, so apps will not be affected. 
> However, there are cases or bugs that prevent both RMs from starting 
> normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or 
> the RM not being able to connect to ZK). The NM will kill the containers 
> running on it when it cannot heartbeat with the RM for some time (max retry 
> time is 15 mins by default), and then all apps will be killed. 
> In a production cluster we might come across such cases, and fixing these bugs 
> might take more than 15 mins. In order to keep apps from being affected and 
> killed by the NM, the YARN admin could set a flag (the flag is a znode 
> '/wait-rm-to-come-back/cluster-id' in our solution) to tell the NM to wait for 
> the RM to come back and not kill running containers. After the bugs are fixed 
> and the RM starts normally, the flag is cleared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

