[jira] [Commented] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly over DistCp
[ https://issues.apache.org/jira/browse/YARN-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500368#comment-14500368 ] Yongjun Zhang commented on YARN-3021: - Thanks also to [~ka...@cloudera.com] for the earlier discussions; we worked out release notes, which I just updated. > YARN's delegation-token handling disallows certain trust setups to operate > properly over DistCp > --- > > Key: YARN-3021 > URL: https://issues.apache.org/jira/browse/YARN-3021 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 2.3.0 >Reporter: Harsh J >Assignee: Yongjun Zhang > Fix For: 2.8.0 > > Attachments: YARN-3021.001.patch, YARN-3021.002.patch, > YARN-3021.003.patch, YARN-3021.004.patch, YARN-3021.005.patch, > YARN-3021.006.patch, YARN-3021.007.patch, YARN-3021.007.patch, > YARN-3021.007.patch, YARN-3021.patch > > > Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, > and B trusts COMMON (both are one-way trusts), and both A and B run HDFS + YARN > clusters. > Now if one logs in with a COMMON credential, and runs a job on A's YARN that > needs to access B's HDFS (such as a DistCp), the operation fails in the RM, > as it attempts a renewDelegationToken(…) synchronously during application > submission (to validate the managed token before it adds it to a scheduler > for automatic renewal). The call obviously fails because realm B will not trust > A's credentials (here, the RM's principal is the renewer). > In the 1.x JobTracker the same call was present, but it was done asynchronously, > and once the renewal attempt failed we simply ceased to schedule any further > renewal attempts, rather than failing the job immediately. > We should change the logic such that we attempt the renewal but tolerate the > failure and merely skip the scheduling, rather than bubbling an error back to > the client and failing the app submission. This way the old behaviour is > retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
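A minimal sketch of the tolerant-renewal logic described in the issue above. The class and method names are hypothetical (this is not the actual RM DelegationTokenRenewer code); only Token#renew(Configuration) is an existing Hadoop API.
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.token.Token;

class TolerantRenewalSketch {
  private final Configuration conf = new Configuration();

  // Called during app submission. Instead of propagating a renewal failure
  // (and failing the submission), log it and skip scheduling renewals,
  // mirroring the 1.x JobTracker behaviour.
  void validateAndScheduleRenewal(String appId, Token<?> token) {
    try {
      token.renew(conf);              // one-time validation of the token
      scheduleRenewal(appId, token);  // schedule periodic renewal only on success
    } catch (IOException | InterruptedException e) {
      // Realm B does not trust the RM (realm A) as a renewer; the token is
      // still usable by the app itself, so do not fail the submission.
      System.err.println("Skipping renewal for " + appId + ": " + e);
    }
  }

  private void scheduleRenewal(String appId, Token<?> token) {
    // placeholder: add the token to the renewal timer
  }
}
{code}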
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500398#comment-14500398 ] Sangjin Lee commented on YARN-3431: --- I know [~zjshen]'s updating the patch, but I'll provide some feedback based on the current patch and the discussion here. Generally I agree with the approach of using fields in TimelineEntity to store/retrieve specialized information. That would definitely help with JSON's lack of support for polymorphism. With regard to the parent-child relationship, and relationships in general, this might be a bigger change, but would it be better to have some kind of key or label for a relationship? It would help locate a particular relationship (e.g. parent) quickly, and help other use cases identify exactly the relationship they need to retrieve. Thoughts? On a related note, I have a problem with prohibiting hierarchical timeline entities from having any relationships other than parent-child. For example, frameworks (e.g. mapreduce) may use hierarchical timeline entities to describe their hierarchy (job => task => task attempts), and these entities would have dotted lines to YARN system entities (app, containers, etc.) and vice versa. It would be a pretty severe restriction to prohibit them. If we adopt the above approach, we should be able to allow both, right? (FlowEntity.java) - l. 58: do we want to set the id once we have calculated it, instead of computing it from scratch each time? (TimelineEntity.java) - l. 88: some javadoc would be helpful in explaining this constructor; its purpose doesn't come through as very obvious. > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch > > > We have TimelineEntity and some other entities as subclasses that inherit from > it. However, we only have a single endpoint, which consumes TimelineEntity > rather than its sub-classes, and this endpoint checks that the incoming request > body contains exactly a TimelineEntity object. However, the JSON data > serialized from a sub-class object is not treated as a TimelineEntity > object, and won't be deserialized into the corresponding sub-class object, > which causes deserialization failures, as discussed in YARN-3334: > https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
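To illustrate the key/label idea from the comment above, a sketch only; the class and methods below are hypothetical and not part of the actual TimelineEntity API.
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LabeledRelationships {
  // relationship label (e.g. "PARENT", "RELATES_TO") -> related entity ids
  private final Map<String, Set<String>> related =
      new HashMap<String, Set<String>>();

  void addRelationship(String label, String entityId) {
    Set<String> ids = related.get(label);
    if (ids == null) {
      ids = new HashSet<String>();
      related.put(label, ids);
    }
    ids.add(entityId);
  }

  // The label lets a reader jump straight to, say, the parent relationship
  // without scanning every relationship the entity has.
  Set<String> getRelationship(String label) {
    return related.get(label);
  }
}
{code}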
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500395#comment-14500395 ] Sunil G commented on YARN-3482: --- Hi [~elgoiri] bq. better to report the resources utilized by the machine. Do you mean total CPU, total memory, etc.? Could you please elaborate on how this can help in making better resource allotments? As I see it, if CPU affinity is not set, the distribution will be more generic and it may not be so easy to derive anything from it. > Report NM available resources in heartbeat > -- > > Key: YARN-3482 > URL: https://issues.apache.org/jira/browse/YARN-3482 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Affects Versions: 2.7.0 >Reporter: Inigo Goiri > Original Estimate: 504h > Remaining Estimate: 504h > > NMs are usually collocated with other processes like HDFS, Impala or HBase. > To manage this scenario correctly, YARN should be aware of the actual > available resources. The proposal is to have an interface to dynamically > change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
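For illustration, a sketch of what the proposal above might add to the node heartbeat. All names here are hypothetical; this is not the actual NM/RM heartbeat protocol.
{code}
// Hypothetical heartbeat payload: alongside the statically configured
// capability, the NM reports dynamically measured headroom after co-located
// processes (HDFS, HBase, Impala, ...) take their share.
class NodeStatusSketch {
  long totalMemoryMB;      // static capability, as reported today
  int totalVCores;
  long availableMemoryMB;  // measured before each heartbeat
  int availableVCores;
}
{code}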
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500405#comment-14500405 ] zhihai xu commented on YARN-3491: - I uploaded a new patch, YARN-3491.001.patch, for review. Thinking about it a bit more, the old patch may introduce a big delay if multiple containers are submitted at the same time. For example, the following log shows 4 containers submitted at nearly the same time: {code} 2015-04-07 21:42:22,071 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110648_01_078264 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,074 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110652_01_093777 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,076 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110668_01_049049 transitioned from NEW to LOCALIZING 2015-04-07 21:42:22,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e30_1426628374875_110668_01_085183 transitioned from NEW to LOCALIZING {code} The new patch can overlap the delay with the public localization of the previous container, which is a bit better and more consistent with the behavior of the old code. It is also better for a container that has only private resources and no public resources; in that case, no delay is added to the Dispatcher thread. Finally, the change in the new patch is a bit smaller than in the first patch. > PublicLocalizer#addResource is too slow. > > > Key: YARN-3491 > URL: https://issues.apache.org/jira/browse/YARN-3491 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3491.000.patch, YARN-3491.001.patch > > > Based on profiling, the bottleneck in PublicLocalizer#addResource is > getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, > which is very slow, taking about 10+ ms. > The total delay will be approximately (number of local dirs) * 10+ ms. > This delay is added for each public resource localization. > Because PublicLocalizer#addResource is slow, the thread pool can't be fully > utilized. Instead of doing public resource localization in > parallel (multithreading), public resource localization is serialized most of > the time. > PublicLocalizer#addResource also runs in the Dispatcher thread, > so the Dispatcher thread is blocked by PublicLocalizer#addResource for > a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
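A rough sketch of the idea in the comment above, under hypothetical names (this is not the actual PublicLocalizer code): do the slow directory initialization inside the localization thread pool instead of in the Dispatcher thread, so the delay overlaps with the previous container's public localization.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PublicLocalizerSketch {
  private final ExecutorService threadPool = Executors.newFixedThreadPool(4);

  void addResource(final String resource) {
    // Returns immediately; the Dispatcher thread is no longer blocked for
    // (number of local dirs) * 10+ ms per public resource.
    threadPool.submit(new Runnable() {
      @Override
      public void run() {
        initializeLocalDirs();   // the slow (10+ ms per dir) part
        download(resource);
      }
    });
  }

  private void initializeLocalDirs() { /* checkLocalDir per local dir */ }

  private void download(String resource) { /* fetch the public resource */ }
}
{code}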
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500417#comment-14500417 ] Hadoop QA commented on YARN-2003: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726222/0007-YARN-2003.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7382//console This message is automatically generated. > Support to process Job priority from Submission Context in > AppAttemptAddedSchedulerEvent [RM side] > -- > > Key: YARN-2003 > URL: https://issues.apache.org/jira/browse/YARN-2003 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, > 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, > 0006-YARN-2003.patch, 0007-YARN-2003.patch > > > AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from > Submission Context and store it. > Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500433#comment-14500433 ] Hadoop QA commented on YARN-3491: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726221/YARN-3491.001.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:red}-1 eclipse:eclipse{color}. The patch failed to build with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7383//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7383//console This message is automatically generated. > PublicLocalizer#addResource is too slow. > > > Key: YARN-3491 > URL: https://issues.apache.org/jira/browse/YARN-3491 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3491.000.patch, YARN-3491.001.patch > > > Based on profiling, the bottleneck in PublicLocalizer#addResource is > getInitializedLocalDirs. getInitializedLocalDirs calls checkLocalDir, > which is very slow, taking about 10+ ms. > The total delay will be approximately (number of local dirs) * 10+ ms. > This delay is added for each public resource localization. > Because PublicLocalizer#addResource is slow, the thread pool can't be fully > utilized. Instead of doing public resource localization in > parallel (multithreading), public resource localization is serialized most of > the time. > PublicLocalizer#addResource also runs in the Dispatcher thread, > so the Dispatcher thread is blocked by PublicLocalizer#addResource for > a long time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2003: -- Attachment: 0008-YARN-2003.patch Fixing a test issue. > Support to process Job priority from Submission Context in > AppAttemptAddedSchedulerEvent [RM side] > -- > > Key: YARN-2003 > URL: https://issues.apache.org/jira/browse/YARN-2003 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, > 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, > 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch > > > AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from > Submission Context and store it. > Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500481#comment-14500481 ] Hadoop QA commented on YARN-3136: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726204/00012-YARN-3136.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7379//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7379//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7379//console This message is automatically generated. > getTransferredContainers can be a bottleneck during AM registration > --- > > Key: YARN-3136 > URL: https://issues.apache.org/jira/browse/YARN-3136 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, > 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, > 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, > 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, > 0009-YARN-3136.patch > > > While examining RM stack traces on a busy cluster I noticed a pattern of AMs > stuck waiting for the scheduler lock trying to call getTransferredContainers. > The scheduler lock is highly contended, especially on a large cluster with > many nodes heartbeating, and it would be nice if we could find a way to > eliminate the need to grab this lock during this call. We've already done > similar work during AM allocate calls to make sure they don't needlessly grab > the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500484#comment-14500484 ] Hadoop QA commented on YARN-3487: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726214/YARN-3487.003.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1207 javac compiler warnings (more than the trunk's current 1181 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServices org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore org.apache.hadoop.yarn.server.resourcemanager.recovery.TestLeveldbRMStateStore org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerUtils org.apache.hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodeLabels Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7380//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7380//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7380//console This message is automatically generated. 
> CapacityScheduler scheduler lock obtained unnecessarily > --- > > Key: YARN-3487 > URL: https://issues.apache.org/jira/browse/YARN-3487 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-3487.001.patch, YARN-3487.002.patch, > YARN-3487.003.patch > > > Recently saw a significant slowdown of applications on a large cluster, and > we noticed there were a large number of blocked threads on the RM. Most of > the blocked threads were waiting for the CapacityScheduler lock while calling > getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
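One common way to avoid taking the scheduler lock for read-mostly calls like getQueueInfo is to publish an immutable snapshot through a volatile reference and let readers use it without locking. The sketch below uses assumed names and is not necessarily what the attached patches do.
{code}
class QueueInfoCache {
  private volatile QueueSnapshot snapshot = new QueueSnapshot("root", 0.0f);

  // Called under the scheduler lock whenever queue state changes.
  void republish(String name, float usedCapacity) {
    snapshot = new QueueSnapshot(name, usedCapacity);
  }

  // Readers (e.g. getQueueInfo) never block on the scheduler lock.
  QueueSnapshot read() {
    return snapshot;
  }

  static final class QueueSnapshot {
    final String name;
    final float usedCapacity;

    QueueSnapshot(String name, float usedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
    }
  }
}
{code}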
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500501#comment-14500501 ] Hadoop QA commented on YARN-3410: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726211/0004-YARN-3410.patch against trunk revision c6b5203. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1207 javac compiler warnings (more than the trunk's current 1181 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesFairScheduler org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7381//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7381//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7381//console This message is automatically generated. > YARN admin should be able to remove individual application records from > RMStateStore > > > Key: YARN-3410 > URL: https://issues.apache.org/jira/browse/YARN-3410 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Reporter: Wangda Tan >Assignee: Rohith >Priority: Critical > Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, > 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, > 0004-YARN-3410.patch > > > When the RM state store enters an unexpected state (one example is YARN-2340, > where an attempt is not in a final state but the app has already completed), the RM > can never come up unless the RMStateStore is formatted. > I think we should support removing individual application records from the > RMStateStore, so the RM admin can choose between waiting for a fix and > formatting the state store. > In addition, the RM should report all fatal errors (which will > shut down the RM) during app recovery; this can save the admin some time when > removing apps in a bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
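A minimal sketch of the admin operation being proposed above. The interface and method names are hypothetical; the real RMStateStore API may differ.
{code}
class StateStoreAdminSketch {
  interface StateStore {
    void removeApplication(String appId) throws Exception;
  }

  // Remove only the bad application's record; all other apps remain
  // recoverable, unlike a full state-store format.
  static void removeApp(StateStore store, String appId) throws Exception {
    store.removeApplication(appId);
    System.out.println("Removed " + appId + " from the RM state store");
  }
}
{code}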
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500506#comment-14500506 ] Sunil G commented on YARN-3136: --- Hi [~jianhe] I used the below suppression: {noformat} {noformat} But I still get the same problem. Could I try to suppress at the field level? Please suggest. > getTransferredContainers can be a bottleneck during AM registration > --- > > Key: YARN-3136 > URL: https://issues.apache.org/jira/browse/YARN-3136 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, > 00011-YARN-3136.patch, 00012-YARN-3136.patch, 0002-YARN-3136.patch, > 0003-YARN-3136.patch, 0004-YARN-3136.patch, 0005-YARN-3136.patch, > 0006-YARN-3136.patch, 0007-YARN-3136.patch, 0008-YARN-3136.patch, > 0009-YARN-3136.patch > > > While examining RM stack traces on a busy cluster I noticed a pattern of AMs > stuck waiting for the scheduler lock trying to call getTransferredContainers. > The scheduler lock is highly contended, especially on a large cluster with > many nodes heartbeating, and it would be nice if we could find a way to > eliminate the need to grab this lock during this call. We've already done > similar work during AM allocate calls to make sure they don't needlessly grab > the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
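The suppression XML in the {noformat} block above appears to have been stripped from the mail. For reference, a field-level FindBugs exclusion in findbugs-exclude.xml typically looks like the following; the class name, field name, and bug pattern are placeholders for whatever the warning actually reports, not a claim about this patch.
{code}
<Match>
  <!-- Hypothetical example: suppress a warning only for one field of one class -->
  <Class name="org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler" />
  <Field name="applications" />
  <Bug pattern="IS2_INCONSISTENT_SYNC" />
</Match>
{code}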
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500513#comment-14500513 ] Hadoop QA commented on YARN-2003: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726239/0008-YARN-2003.patch against trunk revision c6b5203. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7384//console This message is automatically generated. > Support to process Job priority from Submission Context in > AppAttemptAddedSchedulerEvent [RM side] > -- > > Key: YARN-2003 > URL: https://issues.apache.org/jira/browse/YARN-2003 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Sunil G >Assignee: Sunil G > Attachments: 0001-YARN-2003.patch, 0002-YARN-2003.patch, > 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, > 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch > > > AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from > Submission Context and store it. > Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500526#comment-14500526 ] Rohith commented on YARN-3410: -- All tests failed with BindException. Jenkins needs to be kicked off again to get another report. > YARN admin should be able to remove individual application records from > RMStateStore > > > Key: YARN-3410 > URL: https://issues.apache.org/jira/browse/YARN-3410 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Reporter: Wangda Tan >Assignee: Rohith >Priority: Critical > Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, > 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, > 0004-YARN-3410.patch > > > When the RM state store enters an unexpected state (one example is YARN-2340, > where an attempt is not in a final state but the app has already completed), the RM > can never come up unless the RMStateStore is formatted. > I think we should support removing individual application records from the > RMStateStore, so the RM admin can choose between waiting for a fix and > formatting the state store. > In addition, the RM should report all fatal errors (which will > shut down the RM) during app recovery; this can save the admin some time when > removing apps in a bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not be cached in RMApps
Junping Du created YARN-3505: Summary: Node's Log Aggregation Report with SUCCEED should not be cached in RMApps Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Per discussions in YARN-1402, we shouldn't cache every node's log aggregation report in RMApps forever, especially those that finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1402) Related Web UI, CLI changes on exposing client API to check log aggregation status
[ https://issues.apache.org/jira/browse/YARN-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500580#comment-14500580 ] Hudson commented on YARN-1402: -- FAILURE: Integrated in Hadoop-trunk-Commit #7606 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7606/]) YARN-1402. Update related Web UI and CLI with exposing client API to check log aggregation status. Contributed by Xuan Gong. (junping_du: rev 1db355a875c3ecc40a244045c6812e00c8d36ef1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/MockRMApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/logaggregationstatus/TestRMAppLogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/LogAggregationReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppLogAggregationStatusBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/ApplicationCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ProtoUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/LogAggregationStatus.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/records/LogAggregationStatus.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/LogAggregationReport.java > Related Web UI, CLI changes on exposing client API to check log aggregation > status > -- > > Key: YARN-1402 > URL: https://issues.apache.org/jira/browse/YARN-1402 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xuan Gong >Assignee: Xuan Gong > Fix For: 2.8.0 > > Attachments: YARN-1402.1.patch, YARN-1402.2.patch, > YARN-1402.3.1.patch, YARN-1402.3.2.patch, YARN-1402.3.patch, YARN-1402.4.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Attachment: YARN-3493.4.patch > RM fails to come up with error "Failed to load/recover state" when mem > settings are changed > > > Key: YARN-3493 > URL: https://issues.apache.org/jira/browse/YARN-3493 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.0 >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, > YARN-3493.4.patch, yarn-yarn-resourcemanager.log.zip > > > RM fails to come up for the following case: > 1. Change yarn.nodemanager.resource.memory-mb and > yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml > 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in > background and wait for the job to reach running state > 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 > before the above job completes > 4. Restart RM > 5. RM fails to come up with the below error > {code:title= RM error for Mem settings changed} > - RM app submission failed in validating AM resource request for application > application_1429094976272_0008 > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request, requested memory < 0, or requested memory > max configured, > requestedMemory=3072, maxMemory=2048 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208) > 2015-04-15 13:19:18,623 ERROR resourcemanager.ResourceManager > (ResourceManager.java:serviceStart(579)) - Failed to load/recover state > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request, requested memory < 0, or requested memory > max 
configured, > requestedMemory=3072, maxMemory=2048 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.had
[jira] [Commented] (YARN-2696) Queue sorting in CapacityScheduler should consider node label
[ https://issues.apache.org/jira/browse/YARN-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500631#comment-14500631 ] Hudson commented on YARN-2696: -- FAILURE: Integrated in Hadoop-trunk-Commit #7607 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7607/]) YARN-2696. Queue sorting in CapacityScheduler should consider node label. Contributed by Wangda Tan (jianhe: rev d573f09fb93dbb711d504620af5d73840ea063a6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/PartitionedQueueComparator.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ReservationQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/ResourceUsage.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/QueueCapacities.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestChildQueueOrder.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestNodeLabelContainerAllocation.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java > Queue sorting in CapacityScheduler should consider node label > - > > Key: YARN-2696 > URL: https://issues.apache.org/jira/browse/YARN-2696 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.8.0 > > Attachments: YARN-2696.1.patch, YARN-2696.2.patch, YARN-2696.3.patch, > YARN-2696.4.patch > > > In the past, when trying to allocate containers under a parent queue in > CapacityScheduler, the parent queue would choose child queues by their used > resource, from smallest to largest. > Now we support node label in
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500672#comment-14500672 ] Zhijie Shen commented on YARN-3431: --- [~sjlee0], how about we do this? Instead of using relates_to/is_related_to to store the parent-child relationship, we put it into the info section. Then we can search for the parent-child relationship quickly, and we don't disturb the normal usage of relates_to/is_related_to. Does that sound good? > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch > > > We have TimelineEntity and some other entities as subclasses that inherit from > it. However, we only have a single endpoint, which consumes TimelineEntity > rather than its sub-classes, and this endpoint checks that the incoming request > body contains exactly a TimelineEntity object. However, the JSON data > serialized from a sub-class object is not treated as a TimelineEntity > object, and won't be deserialized into the corresponding sub-class object, > which causes deserialization failures, as discussed in YARN-3334: > https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3136: -- Attachment: 00013-YARN-3136.patch > getTransferredContainers can be a bottleneck during AM registration > --- > > Key: YARN-3136 > URL: https://issues.apache.org/jira/browse/YARN-3136 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, > 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, > 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, > 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, > 0008-YARN-3136.patch, 0009-YARN-3136.patch > > > While examining RM stack traces on a busy cluster I noticed a pattern of AMs > stuck waiting for the scheduler lock trying to call getTransferredContainers. > The scheduler lock is highly contended, especially on a large cluster with > many nodes heartbeating, and it would be nice if we could find a way to > eliminate the need to grab this lock during this call. We've already done > similar work during AM allocate calls to make sure they don't needlessly grab > the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3136) getTransferredContainers can be a bottleneck during AM registration
[ https://issues.apache.org/jira/browse/YARN-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500680#comment-14500680 ] Jian He commented on YARN-3136: --- Hi [~sunilg], we can suppress at the field level. Just uploaded a patch myself. > getTransferredContainers can be a bottleneck during AM registration > --- > > Key: YARN-3136 > URL: https://issues.apache.org/jira/browse/YARN-3136 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Sunil G > Attachments: 0001-YARN-3136.patch, 00010-YARN-3136.patch, > 00011-YARN-3136.patch, 00012-YARN-3136.patch, 00013-YARN-3136.patch, > 0002-YARN-3136.patch, 0003-YARN-3136.patch, 0004-YARN-3136.patch, > 0005-YARN-3136.patch, 0006-YARN-3136.patch, 0007-YARN-3136.patch, > 0008-YARN-3136.patch, 0009-YARN-3136.patch > > > While examining RM stack traces on a busy cluster I noticed a pattern of AMs > stuck waiting for the scheduler lock trying to call getTransferredContainers. > The scheduler lock is highly contended, especially on a large cluster with > many nodes heartbeating, and it would be nice if we could find a way to > eliminate the need to grab this lock during this call. We've already done > similar work during AM allocate calls to make sure they don't needlessly grab > the scheduler lock, and it would be good to do so here as well, if possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500737#comment-14500737 ] Sangjin Lee commented on YARN-3431: --- Then would it be mapped as something like "PARENT" => parent entity identifier, etc.? > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch > > > We have TimelineEntity and some other entities as subclasses that inherit from > it. However, we only have a single endpoint, which consumes TimelineEntity > rather than its sub-classes, and this endpoint checks that the incoming request > body contains exactly a TimelineEntity object. However, the JSON data > serialized from a sub-class object is not treated as a TimelineEntity > object, and won't be deserialized into the corresponding sub-class object, > which causes deserialization failures, as discussed in YARN-3334: > https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500755#comment-14500755 ] Zhijie Shen commented on YARN-3431: --- Could be. Or addParent(entity_type, entity_id) is translated to info.add("PARENT:" + entity_type, entity_id), and child similarly. I think the former is better for getting a set of children, while the latter is better for searching for a single child/parent. > Sub resources of timeline entity needs to be passed to a separate endpoint. > --- > > Key: YARN-3431 > URL: https://issues.apache.org/jira/browse/YARN-3431 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-3431.1.patch, YARN-3431.2.patch, YARN-3431.3.patch > > > We have TimelineEntity and some other entities as subclasses that inherit from > it. However, we only have a single endpoint, which consumes TimelineEntity > rather than its sub-classes, and this endpoint checks that the incoming request > body contains exactly a TimelineEntity object. However, the JSON data > serialized from a sub-class object is not treated as a TimelineEntity > object, and won't be deserialized into the corresponding sub-class object, > which causes deserialization failures, as discussed in YARN-3334: > https://issues.apache.org/jira/browse/YARN-3334?focusedCommentId=14391059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14391059. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
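Sketching the second encoding discussed above, under hypothetical helper names (this is not the actual TimelineEntity API): one info key per relative, which is good for point lookups of a single parent or child.
{code}
import java.util.HashMap;
import java.util.Map;

class InfoSectionRelatives {
  private final Map<String, Object> info = new HashMap<String, Object>();

  // Variant 2 from the comment above: "PARENT:" + type -> id.
  // Variant 1 would instead keep one "CHILDREN" key holding a map of
  // type -> set of ids, which is better for fetching all children at once.
  void addParent(String entityType, String entityId) {
    info.put("PARENT:" + entityType, entityId);
  }

  Object getParent(String entityType) {
    return info.get("PARENT:" + entityType);
  }
}
{code}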
[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500760#comment-14500760 ] Hadoop QA commented on YARN-3493: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726253/YARN-3493.4.patch against trunk revision 1db355a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7385//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7385//console This message is automatically generated. > RM fails to come up with error "Failed to load/recover state" when mem > settings are changed > > > Key: YARN-3493 > URL: https://issues.apache.org/jira/browse/YARN-3493 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.0 >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, > YARN-3493.4.patch, yarn-yarn-resourcemanager.log.zip > > > RM fails to come up for the following case: > 1. Change yarn.nodemanager.resource.memory-mb and > yarn.scheduler.maximum-allocation-mb to 4000 in yarn-site.xml > 2. Start a randomtextwriter job with mapreduce.map.memory.mb=4000 in > background and wait for the job to reach running state > 3. Restore yarn-site.xml to have yarn.scheduler.maximum-allocation-mb to 2048 > before the above job completes > 4. Restart RM > 5. 
RM fails to come up with the below error > {code:title= RM error for Mem settings changed} > - RM app submission failed in validating AM resource request for application > application_1429094976272_0008 > org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid > resource request, requested memory < 0, or requested memory > max configured, > requestedMemory=3072, maxMemory=2048 > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:204) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:328) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:317) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:422) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1187) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:994) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1031) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1031) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1071) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1208) > 2015-04-1
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500767#comment-14500767 ] Wangda Tan commented on YARN-3487: -- Re-triggered Jenkins > CapacityScheduler scheduler lock obtained unnecessarily > --- > > Key: YARN-3487 > URL: https://issues.apache.org/jira/browse/YARN-3487 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-3487.001.patch, YARN-3487.002.patch, > YARN-3487.003.patch > > > Recently saw a significant slowdown of applications on a large cluster, and > we noticed there were a large number of blocked threads on the RM. Most of > the blocked threads were waiting for the CapacityScheduler lock while calling > getQueueInfo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500787#comment-14500787 ] Wangda Tan commented on YARN-3493: -- Some comments: - validateResourceRequest -> normalizeAndValidateRequest, making isRecovery false for the existing callers, and it should be package-visible - normalizeNodeLabelForRequest -> normalizeNodeLabelExpressionInRequest, and it should be private
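To illustrate the renames suggested above, a rough sketch of the resulting SchedulerUtils shape (parameter lists and bodies are assumptions for illustration only, not the actual patch):
{code:title=Sketch: normalizeAndValidateRequest}
// Package-visible entry point; recovery callers pass isRecovery = true so
// a tightened maximum-allocation no longer fails recovered apps.
static void normalizeAndValidateRequest(ResourceRequest resReq,
    Resource maximumResource, String queueName, YarnScheduler scheduler,
    boolean isRecovery) throws InvalidResourceRequestException {
  normalizeNodeLabelExpressionInRequest(resReq, queueName, scheduler);
  if (!isRecovery) {
    validateResourceRequest(resReq, maximumResource);
  }
}

// Private helper: only rewrites the node label expression on the request.
private static void normalizeNodeLabelExpressionInRequest(
    ResourceRequest resReq, String queueName, YarnScheduler scheduler) {
  // fill in the queue's default node label expression when none is set
}
{code}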
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500795#comment-14500795 ] Hadoop QA commented on YARN-3410: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726211/0004-YARN-3410.patch against trunk revision d573f09. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7386//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7386//console This message is automatically generated. > YARN admin should be able to remove individual application records from > RMStateStore > > > Key: YARN-3410 > URL: https://issues.apache.org/jira/browse/YARN-3410 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager, yarn >Reporter: Wangda Tan >Assignee: Rohith >Priority: Critical > Attachments: 0001-YARN-3410-v1.patch, 0001-YARN-3410.patch, > 0001-YARN-3410.patch, 0002-YARN-3410.patch, 0003-YARN-3410.patch, > 0004-YARN-3410.patch > > > When the RM state store enters an unexpected state (one example is YARN-2340: > an attempt is not in a final state but the app has already completed), the RM can > never come up unless the RMStateStore is formatted. > I think we should support removing individual application records from the > RMStateStore, so the RM admin can choose between waiting for a fix and formatting > the state store. > In addition, the RM should be able to report all fatal errors (which will > shut down the RM) when doing app recovery; this can save the admin some time when > removing apps in a bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Attachment: YARN-3493.5.patch Thanks for the review. Updated the patch accordingly.
[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Attachment: (was: YARN-3493.5.patch)
[jira] [Updated] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3493: -- Attachment: YARN-3493.5.patch
[jira] [Commented] (YARN-3451) Add start time and Elapsed in ApplicationAttemptReport and display the same in RMAttemptBlock WebUI
[ https://issues.apache.org/jira/browse/YARN-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500843#comment-14500843 ] Jian He commented on YARN-3451: --- Looks good, +1 > Add start time and Elapsed in ApplicationAttemptReport and display the same > in RMAttemptBlock WebUI > --- > > Key: YARN-3451 > URL: https://issues.apache.org/jira/browse/YARN-3451 > Project: Hadoop YARN > Issue Type: Improvement > Components: api, webapp >Reporter: Rohith >Assignee: Rohith > Attachments: 0001-YARN-3451.patch, 0001-YARN-3451.patch, Screen Shot > 2015-04-11 at 12.38.05 AM.png > > > ApplicationReport and ApplicationBlock already have *Started:* and *Elapsed:* > times; it would be useful if the start time and elapsed time were also sent in > ApplicationAttemptReport and displayed in ApplicationAttemptBlock. > This gives granular debugging ability when analyzing issues with multiple > attempt failures, such as an attempt timing out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
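As a rough illustration of what the new report fields enable on the client side (the accessor names are assumed from this JIRA's description, not verified against the patch):
{code:title=Sketch: computing Elapsed from an attempt report}
public static long elapsedMillis(ApplicationAttemptReport attempt) {
  long start = attempt.getStartTime();    // assumed accessor added by this JIRA
  long finish = attempt.getFinishTime();  // assumed; 0 while still running
  return (finish <= 0 ? System.currentTimeMillis() : finish) - start;
}
{code}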
[jira] [Commented] (YARN-3451) Add start time and Elapsed in ApplicationAttemptReport and display the same in RMAttemptBlock WebUI
[ https://issues.apache.org/jira/browse/YARN-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500867#comment-14500867 ] Hudson commented on YARN-3451: -- FAILURE: Integrated in Hadoop-trunk-Commit #7608 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7608/]) YARN-3451. Display attempt start time and elapsed time on the web UI. Contributed by Rohith Sharmaks (jianhe: rev 6779467ab6fcc6a02d0e8c80b138cc9df1aa831e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationAttemptReportPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationAttemptReport.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestYarnClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestYarnCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/ProtocolHATestBase.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java > Add start time and Elapsed in ApplicationAttemptReport and display the same > in RMAttemptBlock WebUI > --- > > Key: YARN-3451 > URL: https://issues.apache.org/jira/browse/YARN-3451 > Project: Hadoop YARN > Issue Type: Improvement > Components: api, webapp >Reporter: Rohith >Assignee: Rohith > Fix For: 2.8.0 > > Attachments: 0001-YARN-3451.patch, 0001-YARN-3451.patch, Screen Shot > 2015-04-11 at 12.38.05 AM.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3497) ContainerManagementProtocolProxy modifies IPC timeout conf without making a copy
[ https://issues.apache.org/jira/browse/YARN-3497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500873#comment-14500873 ] Jian He commented on YARN-3497: --- should we change below to use this.conf as well ? {code} conf.setInt( CommonConfigurationKeysPublic.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0); {code} > ContainerManagementProtocolProxy modifies IPC timeout conf without making a > copy > > > Key: YARN-3497 > URL: https://issues.apache.org/jira/browse/YARN-3497 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-3497.001.patch > > > yarn-client's ContainerManagementProtocolProxy is updating > ipc.client.connection.maxidletime in the conf passed in without making a copy > of it. That modification "leaks" into other systems using the same conf and > can cause them to setup RPC connections with a timeout of zero as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
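For illustration, the copy-then-modify direction looks roughly like this (the surrounding constructor is elided; only the copy semantics matter):
{code:title=Sketch: mutate a private copy of the conf}
// Copying first keeps the zero idle-timeout override from leaking into
// the Configuration instance the caller passed in.
this.conf = new Configuration(conf);
this.conf.setInt(
    CommonConfigurationKeysPublic.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0);
{code}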
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500891#comment-14500891 ] Hadoop QA commented on YARN-3487: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726214/YARN-3487.003.patch against trunk revision d573f09. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7387//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7387//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3482) Report NM available resources in heartbeat
[ https://issues.apache.org/jira/browse/YARN-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500894#comment-14500894 ] Inigo Goiri commented on YARN-3482: --- Hi Sunil G, yes, I'm talking about Total CPU and Total Memory. Combining this with YARN-3481, we can estimate the load in the node that is not caused by the containers (external processes). Right now, the server could be overloaded by HBase for example and we would be sending more load there. As Karthik Kambatla mentions, this would be a very conservative scenario where the external processes have absolute priority. This might be a desired behavior for some users but the proposal is to also add an interface to dynamically change the amount of available resources according to the behavior of the external processes. Both approaches target the same problem and are complementary/orthogonal. I understand this other approach of sending node utilization might be a little out of the scope of this JIRA but I could open a new one with this functionality. > Report NM available resources in heartbeat > -- > > Key: YARN-3482 > URL: https://issues.apache.org/jira/browse/YARN-3482 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Affects Versions: 2.7.0 >Reporter: Inigo Goiri > Original Estimate: 504h > Remaining Estimate: 504h > > NMs are usually collocated with other processes like HDFS, Impala or HBase. > To manage this scenario correctly, YARN should be aware of the actual > available resources. The proposal is to have an interface to dynamically > change the available resources and report this to the RM in every heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
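To make the estimation concrete, the idea reduces to subtracting container usage from total node usage (the variable names below are illustrative, not an existing API):
{code:title=Sketch: estimating external (non-container) load}
// Load from colocated processes (e.g. HBase, Impala) is whatever node
// utilization is not attributable to YARN containers; both inputs would
// come from NM monitoring.
long externalMemoryMB = nodeUsedMemoryMB - containersUsedMemoryMB;
float externalCpu = nodeCpuUsage - containersCpuUsage;
{code}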
[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500934#comment-14500934 ] Hadoop QA commented on YARN-3493: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12726282/YARN-3493.5.patch against trunk revision d573f09. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7388//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7388//console This message is automatically generated.
[jira] [Commented] (YARN-3463) Integrate OrderingPolicy Framework with CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500937#comment-14500937 ] Jian He commented on YARN-3463: --- - {{this.schedulableEntities != null}} can never be true? Since a new ordering-policy instance is re-created each time, I think we can just initialize {{this.comparator}} and {{this.schedulableEntities}} inside the FifoOrderingPolicy constructor and remove the setComparator method: {code} if (this.schedulableEntities != null) { schedulableEntities.addAll(this.schedulableEntities); } {code} - This should be inside the {removed} check? Otherwise any completed container that is unknown will cause reordering: {code} orderingPolicy.containerReleased(application, rmContainer); {code} - getStatusMessage -> getInfo? > Integrate OrderingPolicy Framework with CapacityScheduler > - > > Key: YARN-3463 > URL: https://issues.apache.org/jira/browse/YARN-3463 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Craig Welch >Assignee: Craig Welch > Attachments: YARN-3463.50.patch, YARN-3463.61.patch, > YARN-3463.64.patch, YARN-3463.65.patch, YARN-3463.66.patch, > YARN-3463.67.patch, YARN-3463.68.patch > > > Integrate the OrderingPolicy Framework with the CapacityScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332)
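A minimal sketch of the constructor-initialization suggestion (the comparator type and field names are assumed from the surrounding discussion, not taken from the patch):
{code:title=Sketch: initialize ordering state once in the constructor}
public FifoOrderingPolicy() {
  // Initialized here, so schedulableEntities is never null and the
  // separate setComparator() method is no longer needed.
  this.comparator = new FifoComparator();
  this.schedulableEntities = new TreeSet<S>(this.comparator);
}
{code}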
[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500940#comment-14500940 ] Jian He commented on YARN-3493: --- The TestAMRestart failure is unrelated; it passes locally.
[jira] [Updated] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3487: - Summary: CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue (was: CapacityScheduler scheduler lock obtained unnecessarily) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3493) RM fails to come up with error "Failed to load/recover state" when mem settings are changed
[ https://issues.apache.org/jira/browse/YARN-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500946#comment-14500946 ] Hudson commented on YARN-3493: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7609 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7609/]) YARN-3493. RM fails to come up with error "Failed to load/recover state" when mem settings are changed. (Jian He via wangda) (wangda: rev f65eeb412d140a3808bcf99344a9f3a965918f70) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMServerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/CHANGES.txt > RM fails to come up with error "Failed to load/recover state" when mem > settings are changed > > > Key: YARN-3493 > URL: https://issues.apache.org/jira/browse/YARN-3493 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.7.0 >Reporter: Sumana Sathish >Assignee: Jian He >Priority: Critical > Fix For: 2.8.0 > > Attachments: YARN-3493.1.patch, YARN-3493.2.patch, YARN-3493.3.patch, > YARN-3493.4.patch, YARN-3493.5.patch, yarn-yarn-resourcemanager.log.zip
[jira] [Commented] (YARN-3487) CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue
[ https://issues.apache.org/jira/browse/YARN-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500954#comment-14500954 ] Hudson commented on YARN-3487: -- FAILURE: Integrated in Hadoop-trunk-Commit #7610 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7610/]) YARN-3487. CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue (Jason Lowe via wangda) (wangda: rev f47a5763acd55cb0b3f16152c7f8df06ec0e09a9) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/CHANGES.txt > CapacityScheduler scheduler lock obtained unnecessarily when calling getQueue > - > > Key: YARN-3487 > URL: https://issues.apache.org/jira/browse/YARN-3487 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Fix For: 2.7.1 > > Attachments: YARN-3487.001.patch, YARN-3487.002.patch, > YARN-3487.003.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3506) Error handling on NM reporting invalid NodeLabels in distributed Node Label configuration
Naganarasimha G R created YARN-3506: --- Summary: Error handling on NM reporting invalid NodeLabels in distributed Node Label configuration Key: YARN-3506 URL: https://issues.apache.org/jira/browse/YARN-3506 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R As per the [conclusion|https://issues.apache.org/jira/browse/YARN-2495?focusedCommentId=14358109&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14358109] in YARN-2495, the following error handling needs to be done: * Show/log a diagnostic in the RM (nodes) page and the NM page, saying the label is invalid. (Needs web UI modification; can be done in a separate task) * Make the node's labels empty, so that applications can continue to use the node. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3504) TestRMRestart fails occasionally in trunk
[ https://issues.apache.org/jira/browse/YARN-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500968#comment-14500968 ] Naganarasimha G R commented on YARN-3504: - Seems like a duplicate of YARN-2871. > TestRMRestart fails occasionally in trunk > - > > Key: YARN-3504 > URL: https://issues.apache.org/jira/browse/YARN-3504 > Project: Hadoop YARN > Issue Type: Test >Reporter: Xuan Gong >Priority: Minor > > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > Stacktrace > org.mockito.exceptions.verification.TooLittleActualInvocations: > rMAppManager.logApplicationSummary( > isA(org.apache.hadoop.yarn.api.records.ApplicationId) > ); > Wanted 3 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969) > But was 2 times: > -> at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:969) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500975#comment-14500975 ] Naganarasimha G R commented on YARN-2740: - Thanks for the comments [~wangda], bq. I think it's not a big problem, NM doesn't need to know "x" being removed, the logic should be, NM reports label, and RM allocate according to label, NM should just move on if adding label failed Well, IIUC, based on your reply to the first point ??prevent admin remove clusterNodeLabel when distributed enabled??, we need not worry about this second point, right? As the user will not be able to remove a cluster node label. bq. as what we done in YARN-2495. My opinion here is not add extra RM->NM communicate. As per the last discussion in YARN-2495, you had concluded in this [comment|https://issues.apache.org/jira/browse/YARN-2495?focusedCommentId=14358109&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14358109] that {quote} * Show/log diagnostic in RM (nodes) page and NM page, saying label is invalid. (Need modify web UI, can be done in a separated task) * Make the node's labels to be empty, so that applications can continue use it. {quote} Based on this, I mentioned that an RM->NM communication/notification would be required, as labels are sent only on change from the NM side, and the NM will not be able to show that there was an error in reporting labels. By the way, I have raised a new JIRA, YARN-3506, for the error handling reported in YARN-2495. The test failure is not related to this patch; I will work on {{prevent admin remove clusterNodeLabel when distributed enabled}} and resubmit the patch. > ResourceManager side should properly handle node label modifications when > distributed node label configuration enabled > -- > > Key: YARN-2740 > URL: https://issues.apache.org/jira/browse/YARN-2740 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > Fix For: 2.8.0 > > Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, > YARN-2740.20150327-1.patch, YARN-2740.20150411-1.patch, > YARN-2740.20150411-2.patch, YARN-2740.20150411-3.patch, > YARN-2740.20150417-1.patch > > > According to YARN-2495, when distributed node label configuration is enabled: > - RMAdmin / REST API should reject change labels on node operations. > - CommonNodeLabelsManager shouldn't persist labels on nodes when the NM does > heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3431) Sub resources of timeline entity needs to be passed to a separate endpoint.
[ https://issues.apache.org/jira/browse/YARN-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-3431: -- Attachment: YARN-3431.4.patch Uploaded a new patch: 1. Use info to store the parent/children relationship. 2. Fix a couple of bugs that I found in local testing. 3. Change prototype to real. 4. Set the Id from the parts. 5. Add javadoc to describe the constructor for the proxy case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS
[ https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500998#comment-14500998 ] Zhijie Shen commented on YARN-3046: --- In the new patch, the task entity ID is set correctly. But can we use reflection or make "getTaskId" an interface method of TaskEvent to simplify the code change? > [Event producers] Implement MapReduce AM writing some MR metrics to ATS > --- > > Key: YARN-3046 > URL: https://issues.apache.org/jira/browse/YARN-3046 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Sangjin Lee >Assignee: Junping Du > Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, > YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch, > YARN-3046-v3.patch > > > Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes > written) and have the MR AM write the framework-specific metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
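A hypothetical sketch of the interface-method option (the interface and the writer snippet below are invented for illustration; nothing here is in the patch):
{code:title=Hypothetical: expose the task id through a common accessor}
// If every task-scoped event shared a common accessor, the ATS writer
// could build the task entity id without casting to each event subclass.
public interface TaskIdAware {
  TaskId getTaskId();
}

// writer side
if (event instanceof TaskIdAware) {
  entity.setId(((TaskIdAware) event).getTaskId().toString());
}
{code}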
[jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
[ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501002#comment-14501002 ] zhihai xu commented on YARN-3491: - I did more profiling in checkLocalDir. It really surprised me. The most time-consuming code is status.getPermission() not lfs.getFileStatus. status.getPermission() will take 4 or 5 ms. checkLocalDir will call status.getPermission() three times. That is why checkLocalDir take 10+ms. {code} private boolean checkLocalDir(String localDir) { Map pathPermissionMap = getLocalDirsPathPermissionsMap(localDir); for (Map.Entry entry : pathPermissionMap.entrySet()) { FileStatus status; try { status = lfs.getFileStatus(entry.getKey()); } catch (Exception e) { String msg = "Could not carry out resource dir checks for " + localDir + ", which was marked as good"; LOG.warn(msg, e); throw new YarnRuntimeException(msg, e); } if (!status.getPermission().equals(entry.getValue())) { String msg = "Permissions incorrectly set for dir " + entry.getKey() + ", should be " + entry.getValue() + ", actual value = " + status.getPermission(); LOG.warn(msg); throw new YarnRuntimeException(msg); } } return true; } {code} Then I go deeper into the source code I find out why status.getPermission take the most of time: lfs.getFileStatus will return RawLocalFileSystem#DeprecatedRawLocalFileStatus, {code} public FsPermission getPermission() { if (!isPermissionLoaded()) { loadPermissionInfo(); } return super.getPermission(); } {code} So status.getPermission will call loadPermissionInfo, Based on the following code, loadPermissionInfo is bottle neck, it will call run "ls -ld" to get the permission, which is really slow. {code} /// loads permissions, owner, and group from `ls -ld` private void loadPermissionInfo() { IOException e = null; try { String output = FileUtil.execCommand(new File(getPath().toUri()), Shell.getGetPermissionCommand()); StringTokenizer t = new StringTokenizer(output, Shell.TOKEN_SEPARATOR_REGEX); //expected format //-rw---1 username groupname ... String permission = t.nextToken(); if (permission.length() > FsPermission.MAX_PERMISSION_LENGTH) { //files with ACLs might have a '+' permission = permission.substring(0, FsPermission.MAX_PERMISSION_LENGTH); } setPermission(FsPermission.valueOf(permission)); t.nextToken(); String owner = t.nextToken(); // If on windows domain, token format is DOMAIN\\user and we want to // extract only the user name if (Shell.WINDOWS) { int i = owner.indexOf('\\'); if (i != -1) owner = owner.substring(i + 1); } setOwner(owner); setGroup(t.nextToken()); } catch (Shell.ExitCodeException ioe) { if (ioe.getExitCode() != 1) { e = ioe; } else { setPermission(null); setOwner(null); setGroup(null); } } catch (IOException ioe) { e = ioe; } finally { if (e != null) { throw new RuntimeException("Error while running command to get " + "file permissions : " + StringUtils.stringifyException(e)); } } } {code} We should call getPermission as least as possible in the future :) > PublicLocalizer#addResource is too slow. > > > Key: YARN-3491 > URL: https://issues.apache.org/jira/browse/YARN-3491 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 2.7.0 >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-3491.000.patch, YARN-3491.001.patch > > > Based on the profiling, The bottleneck in PublicLocalizer#addResource is > getInitializedLocalDirs. getInitializedLocalDirs call checkLocalDir. 
> PublicLocalizer#addResource is too slow.
> ---
>
> Key: YARN-3491
> URL: https://issues.apache.org/jira/browse/YARN-3491
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3491.000.patch, YARN-3491.001.patch
>
> Based on the profiling, the bottleneck in PublicLocalizer#addResource is getInitializedLocalDirs, which calls checkLocalDir.
> checkLocalDir is very slow, taking about 10+ ms, so the total delay is approximately (number of local dirs) * 10+ ms, and this delay is added for each public resource localization.
> Because PublicLocalizer#addResource is slow, the thread pool can't be fully utilized: instead of doing public resource localization in parallel (multithreading), public resource localization is serialized most of the time.
> Also, PublicLocalizer#addResource runs in the Dispatcher thread, so the Dispatcher thread is blocked by PublicLocalizer#addResource for a long time.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
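A hypothetical micro-benchmark to reproduce the 4-5 ms figure from the comment above; it assumes a platform and Hadoop version where RawLocalFileSystem uses the fork-based DeprecatedRawLocalFileStatus (as in the quoted trace), and the path /tmp is illustrative:
{code}
// Hypothetical micro-benchmark of the cost described in the comment above.
// The first getPermission() call forks `ls -ld`; the second returns the
// cached value, so the difference shows the fork overhead.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class PermissionCost {
  public static void main(String[] args) throws Exception {
    RawLocalFileSystem fs = new RawLocalFileSystem();
    fs.initialize(fs.getUri(), new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/tmp"));

    long t0 = System.nanoTime();
    status.getPermission();   // first call: loadPermissionInfo() forks ls -ld
    long t1 = System.nanoTime();
    status.getPermission();   // second call: permission is already loaded
    long t2 = System.nanoTime();

    System.out.printf("first: %.2f ms, second: %.2f ms%n",
        (t1 - t0) / 1e6, (t2 - t1) / 1e6);
  }
}
{code}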
[jira] [Commented] (YARN-3046) [Event producers] Implement MapReduce AM writing some MR metrics to ATS
[ https://issues.apache.org/jira/browse/YARN-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501015#comment-14501015 ] Junping Du commented on YARN-3046:
bq. In the new patch, the task entity ID is set correctly. But can we use reflection or make "getTaskId" an interface method of TaskEvent to simplify the code change?
I agree that the code could be more concise if we could differentiate job entities from task entities via reflection or some other way (a sketch of both options follows this message). Coincidentally, I also put some ideas and comments on MAPREDUCE-6318. However, can we do this refactoring together with the v1 timeline service in MAPREDUCE-6318? The code here is far more than acceptable as it stands, I believe.
> [Event producers] Implement MapReduce AM writing some MR metrics to ATS
> ---
>
> Key: YARN-3046
> URL: https://issues.apache.org/jira/browse/YARN-3046
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Sangjin Lee
> Assignee: Junping Du
> Attachments: YARN-3046-no-test-v2.patch, YARN-3046-no-test.patch, YARN-3046-v1-rebase.patch, YARN-3046-v1.patch, YARN-3046-v2.patch, YARN-3046-v3.patch
>
> Per design in YARN-2928, select a handful of MR metrics (e.g. HDFS bytes written) and have the MR AM write the framework-specific metrics to ATS.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
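A hypothetical sketch of the two options raised in the quoted question, reflection versus a shared interface; all class and method names below are illustrative and do not come from the actual patch:
{code}
// Hypothetical sketch, not the YARN-3046 patch. Option (a): probe an
// arbitrary event for a getTaskId() accessor via reflection. Option (b):
// have task-scoped events implement a small interface so no probing is
// needed when distinguishing task entities from job entities.
import java.lang.reflect.Method;

public class EntityIdResolver {

  // Option (b): task-scoped events advertise their task id directly.
  interface TaskScoped {
    String getTaskId();
  }

  // Option (a): reflection fallback for events without the interface.
  static String resolveTaskId(Object event) {
    if (event instanceof TaskScoped) {
      return ((TaskScoped) event).getTaskId();
    }
    try {
      Method m = event.getClass().getMethod("getTaskId");
      Object id = m.invoke(event);
      return id == null ? null : id.toString();
    } catch (ReflectiveOperationException e) {
      return null; // job-level event: no task id to publish
    }
  }
}
{code}
Either route removes per-subclass special-casing; the interface keeps reflective lookups off the hot path.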
[jira] [Commented] (YARN-3044) [Event producers] Implement RM writing app lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501026#comment-14501026 ] Junping Du commented on YARN-3044:
Thanks [~Naganarasimha] for updating the patch! It looks like the current patch depends on YARN-3390, so I will go ahead and review YARN-3390 first, then come back to your patch soon.
> [Event producers] Implement RM writing app lifecycle events to ATS
> ---
>
> Key: YARN-3044
> URL: https://issues.apache.org/jira/browse/YARN-3044
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Sangjin Lee
> Assignee: Naganarasimha G R
> Attachments: YARN-3044.20150325-1.patch, YARN-3044.20150406-1.patch, YARN-3044.20150416-1.patch
>
> Per design in YARN-2928, implement RM writing app lifecycle events to ATS.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
[ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501037#comment-14501037 ] Jun Gong commented on YARN-3474:
Any comments are appreciated.
> Add a way to let NM wait RM to come back, not kill running containers
> ---
>
> Key: YARN-3474
> URL: https://issues.apache.org/jira/browse/YARN-3474
> Project: Hadoop YARN
> Issue Type: New Feature
> Affects Versions: 2.6.0
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-3474.01.patch
>
> When RM HA is enabled and the active RM shuts down, the standby RM will become active and recover apps and attempts; apps will not be affected.
> However, some cases or bugs may prevent both RMs from starting normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or the RM cannot connect to ZK reliably). The NM will then kill the containers running on it once it has failed to heartbeat with the RM for some time (the max retry time is 15 minutes by default), and all apps will be killed.
> In a production cluster, we might come across the above cases, and fixing these bugs might take more than 15 minutes. To keep apps from being affected and killed by the NM, the YARN admin could set a flag (in our solution, the flag is a znode '/wait-rm-to-come-back/cluster-id') to tell the NM to wait for the RM to come back instead of killing running containers; a sketch of such a check follows this message. After the bugs are fixed and the RM starts normally, the admin clears the flag.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
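A minimal sketch of the znode flag check described above, using the raw ZooKeeper client; the znode path follows the description, while the class shape and wiring are illustrative and not taken from the attached patch:
{code}
// Hypothetical sketch of the flag described in YARN-3474, not the attached
// patch: before giving up on the RM and killing containers, the NM checks
// whether the admin has created the "wait for RM" znode for this cluster.
import org.apache.zookeeper.ZooKeeper;

public class WaitForRMFlag {
  private final ZooKeeper zk;
  private final String clusterId;

  WaitForRMFlag(ZooKeeper zk, String clusterId) {
    this.zk = zk;
    this.clusterId = clusterId;
  }

  /** True if the admin set the flag; the NM should keep containers alive. */
  boolean shouldWaitForRM() throws Exception {
    return zk.exists("/wait-rm-to-come-back/" + clusterId, false) != null;
  }
}
{code}
Because the flag lives in ZooKeeper rather than in NM configuration, the admin can set and clear it cluster-wide without restarting any NodeManager.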