[jira] [Commented] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126407#comment-17126407 ]

Tao Yang commented on YARN-10293:
---------------------------------

Thanks [~prabhujoseph] for this effort. I'm fine with it, please go ahead.
{quote}
Yes sure, YARN-9598 addresses many other issues. Will check how to contribute to the same and address any other optimization required.
{quote}
Good to hear that, thanks. The patch overall looks good; some suggestions about the UT:
* In TestCapacitySchedulerMultiNodes#testExcessReservationWillBeUnreserved, this patch changes the behavior of the second-to-last allocation and makes the last allocation unnecessary. Can you remove lines 261 to 267 to make it clearer?
{code}
 Assert.assertEquals(1, schedulerApp1.getLiveContainers().size());
 Assert.assertEquals(0, schedulerApp1.getReservedContainers().size());
-Assert.assertEquals(1, schedulerApp2.getLiveContainers().size());
-
-// Trigger scheduling to allocate a container on nm1 for app2.
-cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
-Assert.assertNull(cs.getNode(nm1.getNodeId()).getReservedContainer());
-Assert.assertEquals(1, schedulerApp1.getLiveContainers().size());
-Assert.assertEquals(0, schedulerApp1.getReservedContainers().size());
 Assert.assertEquals(2, schedulerApp2.getLiveContainers().size());
 Assert.assertEquals(7 * GB, cs.getNode(nm1.getNodeId()).getAllocatedResource().getMemorySize());
 Assert.assertEquals(12 * GB, cs.getRootQueue().getQueueResourceUsage().getUsed().getMemorySize());
{code}
* Can we remove the TestCapacitySchedulerMultiNodesWithPreemption#getFiCaSchedulerApp method and get the scheduler app by calling CapacityScheduler#getApplicationAttempt?
* There are many while loops, Thread#sleep calls, and async-thread creations for checking states in TestCapacitySchedulerMultiNodesWithPreemption#testAllocationOfReservationFromOtherNode; could you please call GenericTestUtils#waitFor, MockRM#waitForState, etc. to simplify it?

> Reserved Containers not allocated from available space of other nodes in
> CandidateNodeSet in MultiNodePlacement (YARN-10259)
> -------------------------------------------------------------------------
>
>             Key: YARN-10293
>             URL: https://issues.apache.org/jira/browse/YARN-10293
>         Project: Hadoop YARN
>      Issue Type: Bug
> Affects Versions: 3.3.0
>        Reporter: Prabhu Joseph
>        Assignee: Prabhu Joseph
>        Priority: Major
>     Attachments: YARN-10293-001.patch, YARN-10293-002.patch, YARN-10293-003-WIP.patch
>
> Reserved Containers are not allocated from the available space of other nodes
> in the CandidateNodeSet in MultiNodePlacement. YARN-10259 fixed two issues
> related to this:
> https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987
> I have found one more bug in the CapacityScheduler.java code which causes the
> same issue with a slight difference in the repro.
> *Repro:*
> *Nodes : Available : Used*
> Node1 - 8GB, 8vcores - 8GB, 8cores
> Node2 - 8GB, 8vcores - 8GB, 8cores
> Node3 - 8GB, 8vcores - 8GB, 8cores
> Queues -> A and B, both 50% capacity, 100% max capacity
> MultiNode enabled + Preemption enabled
> 1. JobA is submitted to queue A and uses the full cluster: 24GB and 24 vcores.
> 2. JobB is submitted to queue B with an AM size of 1GB.
> {code}
> 2020-05-21 12:12:27,313 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest IP=172.27.160.139 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 CALLERCONTEXT=CLI QUEUENAME=dummy
> {code}
> 3. Preemption happens and used capacity is less than 1.0f.
> {code}
> 2020-05-21 12:12:48,222 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: Non-AM container preempted, current appAttemptId=appattempt_1590046667304_0004_01, containerId=container_e09_1590046667304_0004_01_24, resource=
> {code}
> 4. JobB gets a Reserved Container as part of CapacityScheduler#allocateOrReserveNewContainer.
> {code}
> 2020-05-21 12:12:48,226 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to RESERVED
> 2020-05-21 12:12:48,226 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Reserved container=container_e09_1590046667304_0005_01_01, on node=host: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 available= used= with resource=
> {code}
> *Why RegularContainerAllocator reserve
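For the last suggestion above, a hand-rolled polling loop could collapse into a single GenericTestUtils#waitFor call along these lines. This is an illustrative sketch rather than part of any attached patch: the predicate reuses {{schedulerApp2}} from the snippet above, and the 100 ms poll interval and 10 s timeout are assumed values.

{code:java}
// org.apache.hadoop.test.GenericTestUtils: polls the predicate every 100 ms
// and throws TimeoutException if it is still false after 10 s, replacing
// while-loops built around Thread.sleep() and helper checking threads.
GenericTestUtils.waitFor(
    () -> schedulerApp2.getLiveContainers().size() == 2,
    100, 10000);
{code}

MockRM#waitForState covers app/attempt/container state transitions in the same way, so most of the custom checking threads should reduce to one-liners.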
[jira] [Commented] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
[ https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126357#comment-17126357 ]

Zhankun Tang commented on YARN-10307:
-------------------------------------

[~appleyuchi], IIRC, I don't think "Hive on Tez" depends on the timeline service. It seems more like an installation issue.

> /leveldb-timeline-store.ldb/LOCK not exist
> ------------------------------------------
>
>             Key: YARN-10307
>             URL: https://issues.apache.org/jira/browse/YARN-10307
>         Project: Hadoop YARN
>      Issue Type: Bug
>     Environment: Ubuntu 19.10
> Hadoop 3.1.2
> Tez 0.9.2
> Hbase 2.2.4
>        Reporter: appleyuchi
>        Priority: Blocker
>
> $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver
>
> in hadoop-appleyuchi-timelineserver-Desktop.out I get
>
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK: No such file or directory
> at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
> 2020-06-04 17:48:21,525 INFO [main] service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED
> java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist
> at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405)
> at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368)
> at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268)
> at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237)
> at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
> 2020-06-04 17:48:21,526 INFO [main] service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist
> at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177)
> at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187)
> Caused by: java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist
> at org.apache.commons.io.FileUtils
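If the directory really is missing, one plausible remedy (an assumption about this particular installation, not a verified fix) is to create it, or to point the store at an existing location writable by the user running the timeline server, via the standard property in yarn-site.xml:

{code:xml}
<!-- yarn-site.xml: location of the leveldb timeline store. The value below
     is the parent directory from the stack trace; the server creates
     leveldb-timeline-store.ldb underneath it, so the directory must exist
     and be writable by the timeline server user. -->
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.path</name>
  <value>/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline</value>
</property>
{code}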
[jira] [Commented] (YARN-6857) Support REST for Node Attributes configurations
[ https://issues.apache.org/jira/browse/YARN-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126161#comment-17126161 ]

Hadoop QA commented on YARN-6857:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 22s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | dupname | 0m 1s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 21m 34s | trunk passed |
| +1 | compile | 0m 47s | trunk passed |
| +1 | checkstyle | 0m 35s | trunk passed |
| +1 | mvnsite | 0m 51s | trunk passed |
| +1 | shadedclient | 16m 56s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 31s | trunk passed |
| 0 | spotbugs | 1m 41s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 1m 39s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 47s | the patch passed |
| +1 | compile | 0m 41s | the patch passed |
| +1 | javac | 0m 41s | the patch passed |
| +1 | checkstyle | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 44s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 19s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 28s | the patch passed |
| +1 | findbugs | 1m 42s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 91m 17s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 156m 8s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption |

|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26114/artifact/out/Dockerfile |
| JIRA Issue | YARN-6857 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004844/YARN-6857.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 7dc6b4b9cafc 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 5157118bd7f |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/2
[jira] [Commented] (YARN-9903) Support reservations continue looking for Node Labels
[ https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126160#comment-17126160 ]

Eric Payne commented on YARN-9903:
----------------------------------

[~Jim_Brennan] it looks like the patch applies cleanly all the way back to 2.10.

> Support reservations continue looking for Node Labels
> ------------------------------------------------------
>
>             Key: YARN-9903
>             URL: https://issues.apache.org/jira/browse/YARN-9903
>         Project: Hadoop YARN
>      Issue Type: Bug
>        Reporter: Tarun Parimi
>        Assignee: Jim Brennan
>        Priority: Major
>     Attachments: YARN-9903.001.patch, YARN-9903.002.patch
>
> YARN-1769 brought in the reservations-continue-looking feature, which improves
> several resource reservation scenarios. However, it is currently not applied
> when nodes have a label assigned to them. This is useful, and in many cases
> necessary, for Node Labels as well, so we should look at supporting it for
> node labels too.
> For example, in AbstractCSQueue.java, we have the below TODO.
> {code:java}
> // TODO, now only consider reservation cases when the node has no label
> if (this.reservationsContinueLooking && nodePartition.equals(
>     RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan(resourceCalculator,
>     clusterResource, resourceCouldBeUnreserved, Resources.none())) {
> {code}
> cc [~sunilg]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
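For reference, a minimal sketch of the direction such a change could take (an illustration, not the attached patch): drop the NO_LABEL guard from the quoted condition so that the continue-looking path is also evaluated for labeled partitions.

{code:java}
// Hypothetical revision of the AbstractCSQueue check quoted above: the
// nodePartition.equals(RMNodeLabelsManager.NO_LABEL) clause is removed so
// reservations-continue-looking also applies on labeled partitions.
if (this.reservationsContinueLooking && Resources.greaterThan(
    resourceCalculator, clusterResource, resourceCouldBeUnreserved,
    Resources.none())) {
  // ... existing unreserve-and-continue logic ...
}
{code}

The actual patch may need per-partition bookkeeping beyond this; the sketch only shows which guard is in question.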
[jira] [Comment Edited] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126100#comment-17126100 ]

Eric Payne edited comment on YARN-10283 at 6/4/20, 7:44 PM:
------------------------------------------------------------

-[~Jim_Brennan] it looks like the patch applies cleanly all the way back to 2.10.- Sorry, this was placed in the wrong JIRA.

was (Author: eepayne):
-[~Jim_Brennan], it looks like the patch applies cleanly all the way back to 2.10.] Sorry, this was placed in the wrong JIRA.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and
> node labels are used
> -----------------------------------------------------------------------------
>
>             Key: YARN-10283
>             URL: https://issues.apache.org/jira/browse/YARN-10283
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: capacity scheduler
>        Reporter: Peter Bacsko
>        Assignee: Peter Bacsko
>        Priority: Major
>     Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, YARN-10283-ReproTest2.patch
>
> Recently we've been investigating a scenario where applications submitted to
> a lower priority queue could not get scheduled because a higher priority
> queue in the same hierarchy could not satisfy the allocation request. Both
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcores per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were added to the partition
> * Both queues have a limit of
> * Using DominantResourceCalculator
> Setup:
> Submit a distributed shell application to highprio with switches
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per container.
> Chain of events:
> 1. The queue is filled with containers until it reaches usage vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller than the current limit resource
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an allocated container for
> 5. But we can't commit the resource request because we would have 9 vcores in total, violating the limit.
> The problem is that we always try to assign a container for the same
> application in each heartbeat from "highprio". Applications in "lowprio"
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
> if (rmContainer == null && reservationsContinueLooking
>     && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with
> the allocation if there's room for a container.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126100#comment-17126100 ]

Eric Payne edited comment on YARN-10283 at 6/4/20, 7:43 PM:
------------------------------------------------------------

-[~Jim_Brennan], it looks like the patch applies cleanly all the way back to 2.10.] Sorry, this was placed in the wrong JIRA.

was (Author: eepayne):
[~Jim_Brennan], it looks like the patch applies cleanly all the way back to 2.10.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and
> node labels are used
> -----------------------------------------------------------------------------
>
>             Key: YARN-10283
>             URL: https://issues.apache.org/jira/browse/YARN-10283
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: capacity scheduler
>        Reporter: Peter Bacsko
>        Assignee: Peter Bacsko
>        Priority: Major
>     Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, YARN-10283-ReproTest2.patch
>
> Recently we've been investigating a scenario where applications submitted to
> a lower priority queue could not get scheduled because a higher priority
> queue in the same hierarchy could not satisfy the allocation request. Both
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcores per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were added to the partition
> * Both queues have a limit of
> * Using DominantResourceCalculator
> Setup:
> Submit a distributed shell application to highprio with switches
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per container.
> Chain of events:
> 1. The queue is filled with containers until it reaches usage vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller than the current limit resource
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an allocated container for
> 5. But we can't commit the resource request because we would have 9 vcores in total, violating the limit.
> The problem is that we always try to assign a container for the same
> application in each heartbeat from "highprio". Applications in "lowprio"
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
> if (rmContainer == null && reservationsContinueLooking
>     && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with
> the allocation if there's room for a container.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10300) appMasterHost not set in RM ApplicationSummary when AM fails before first heartbeat
[ https://issues.apache.org/jira/browse/YARN-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126158#comment-17126158 ]

Eric Payne commented on YARN-10300:
-----------------------------------

Thanks for raising this issue and for providing the fix. The code changes look good. However, I have a couple of comments on the unit test. It succeeds both with and without the code changes: it tests whether {{app.getCurrentAppAttempt().getMasterContainer().getNodeId().getHost()}} returns the current host, not whether {{RMAppManager#createAppSummary}} fills in the AM host name prior to the node heartbeat. I don't know how hard it would be to call {{RMAppManager#createAppSummary}} directly, but if it's possible, I think the unit test should do that and then check that the SummaryBuilder has the host and port filled in. Do you think that's possible?

> appMasterHost not set in RM ApplicationSummary when AM fails before first
> heartbeat
> --------------------------------------------------------------------------
>
>             Key: YARN-10300
>             URL: https://issues.apache.org/jira/browse/YARN-10300
>         Project: Hadoop YARN
>      Issue Type: Bug
>        Reporter: Eric Badger
>        Assignee: Eric Badger
>        Priority: Major
>     Attachments: YARN-10300.001.patch, YARN-10300.002.patch
>
> {noformat}
> 2020-05-23 14:09:10,086 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1586003420099_12444961,name=job_name,user=username,queue=queuename,state=FAILED,trackingUrl=https://cluster:port/applicationhistory/app/application_1586003420099_12444961,appMasterHost=N/A,startTime=1590241207309,finishTime=1590242950085,finalStatus=FAILED,memorySeconds=13750,vcoreSeconds=67,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources= vCores:0>,applicationType=MAPREDUCE
> {noformat}
> {{appMasterHost=N/A}} should have the AM hostname instead of N/A

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
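A rough sketch of the suggested assertion (hypothetical: the exact signature and visibility of createAppSummary, and how the SummaryBuilder is rendered, may differ):

{code:java}
// Hypothetical test fragment: build the app summary directly and assert on
// its contents instead of on the attempt's master container. "app" is the
// RMApp under test and "expectedAmHost" is an assumed variable.
String summary = RMAppManager.createAppSummary(app).toString();
Assert.assertTrue("appMasterHost missing from app summary",
    summary.contains("appMasterHost=" + expectedAmHost));
Assert.assertFalse("appMasterHost should not be N/A",
    summary.contains("appMasterHost=N/A"));
{code}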
[jira] [Commented] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126128#comment-17126128 ]

Jim Brennan commented on YARN-10283:
------------------------------------

Thanks [~epayne]. Just to be clear, you are referring to the patch for YARN-9903?

> Capacity Scheduler: starvation occurs if a higher priority queue is full and
> node labels are used
> -----------------------------------------------------------------------------
>
>             Key: YARN-10283
>             URL: https://issues.apache.org/jira/browse/YARN-10283
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: capacity scheduler
>        Reporter: Peter Bacsko
>        Assignee: Peter Bacsko
>        Priority: Major
>     Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, YARN-10283-ReproTest2.patch
>
> Recently we've been investigating a scenario where applications submitted to
> a lower priority queue could not get scheduled because a higher priority
> queue in the same hierarchy could not satisfy the allocation request. Both
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcores per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were added to the partition
> * Both queues have a limit of
> * Using DominantResourceCalculator
> Setup:
> Submit a distributed shell application to highprio with switches
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per container.
> Chain of events:
> 1. The queue is filled with containers until it reaches usage vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller than the current limit resource
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an allocated container for
> 5. But we can't commit the resource request because we would have 9 vcores in total, violating the limit.
> The problem is that we always try to assign a container for the same
> application in each heartbeat from "highprio". Applications in "lowprio"
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
> if (rmContainer == null && reservationsContinueLooking
>     && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with
> the allocation if there's room for a container.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-9767) PartitionQueueMetrics Issues
[ https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved YARN-9767.
------------------------------
    Resolution: Duplicate

> PartitionQueueMetrics Issues
> ----------------------------
>
>             Key: YARN-9767
>             URL: https://issues.apache.org/jira/browse/YARN-9767
>         Project: Hadoop YARN
>      Issue Type: Sub-task
>        Reporter: Manikandan R
>        Assignee: Manikandan R
>        Priority: Major
>     Attachments: YARN-9767.001.patch
>
> The intent of this Jira is to capture the issues/observations encountered during
> YARN-6492 development separately, for ease of tracking.
> Observations:
> Please refer to
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info is extracted from both the request and the node, there is a
> problem. For example:
> Node N has been mapped to Label X (non-exclusive). Queue A has been configured
> with the ANY node label. App A requested resources from Queue A and its
> containers ran on Node N for some reason. During the
> AbstractCSQueue#allocateResource call, the node partition (from the SchedulerNode)
> gets used for the calculation. Let's say the allocate call has been fired for 3
> containers of 1 GB each; then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
> is the outcome. Because the app request was fired without any label
> specification, the #a metrics were derived. After allocation is over, pending
> resources usually get decreased; when this happens, the node partition info is
> used, hence the #b metrics were derived.
> Given this kind of situation, we will need to put some thought into getting the
> metrics right.
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would
> like to retain the existing Queue Metrics for backward compatibility (as you
> can see from the jira's discussion).
> With this patch and the YARN-9596 patch, queue metrics (per queue) would be
> overridden either with some specific partition values or with default partition
> values. It could be vice versa as well. For example, after a queue (say
> queue A) has been initialised with some min and max cap and also with a node
> label's min and max cap, QueueMetrics (availableMB) for queue A returns values
> based on the node label's cap config.
> I've been working on these observations to provide a fix and attached
> .005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and
> availableVcores are correct (please refer to observation #2 above). Added more
> asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure the fix
> for #2 is working properly.
> Also, one more thing to note: user metrics for availableMB and availableVcores
> at the root queue were not there even before; the same behaviour is retained.
> User metrics for availableMB and availableVcores are available only at the
> child queue level, and also with partitions.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Reopened] (YARN-9767) PartitionQueueMetrics Issues
[ https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne reopened YARN-9767:
------------------------------

Reopening so it can be closed as a duplicate of YARN-6492.

> PartitionQueueMetrics Issues
> ----------------------------
>
>             Key: YARN-9767
>             URL: https://issues.apache.org/jira/browse/YARN-9767
>         Project: Hadoop YARN
>      Issue Type: Sub-task
>        Reporter: Manikandan R
>        Assignee: Manikandan R
>        Priority: Major
>     Attachments: YARN-9767.001.patch
>
> The intent of this Jira is to capture the issues/observations encountered during
> YARN-6492 development separately, for ease of tracking.
> Observations:
> Please refer to
> https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027
> 1. Since partition info is extracted from both the request and the node, there is a
> problem. For example:
> Node N has been mapped to Label X (non-exclusive). Queue A has been configured
> with the ANY node label. App A requested resources from Queue A and its
> containers ran on Node N for some reason. During the
> AbstractCSQueue#allocateResource call, the node partition (from the SchedulerNode)
> gets used for the calculation. Let's say the allocate call has been fired for 3
> containers of 1 GB each; then
> a. PartitionDefault * queue A -> pending mb is 3 GB
> b. PartitionX * queue A -> pending mb is -3 GB
> is the outcome. Because the app request was fired without any label
> specification, the #a metrics were derived. After allocation is over, pending
> resources usually get decreased; when this happens, the node partition info is
> used, hence the #b metrics were derived.
> Given this kind of situation, we will need to put some thought into getting the
> metrics right.
> 2. Though the intent of this jira is to do Partition Queue Metrics, we would
> like to retain the existing Queue Metrics for backward compatibility (as you
> can see from the jira's discussion).
> With this patch and the YARN-9596 patch, queue metrics (per queue) would be
> overridden either with some specific partition values or with default partition
> values. It could be vice versa as well. For example, after a queue (say
> queue A) has been initialised with some min and max cap and also with a node
> label's min and max cap, QueueMetrics (availableMB) for queue A returns values
> based on the node label's cap config.
> I've been working on these observations to provide a fix and attached
> .005.WIP.patch. The focus of .005.WIP.patch is to ensure availableMB and
> availableVcores are correct (please refer to observation #2 above). Added more
> asserts in {{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure the fix
> for #2 is working properly.
> Also, one more thing to note: user metrics for availableMB and availableVcores
> at the root queue were not there even before; the same behaviour is retained.
> User metrics for availableMB and availableVcores are available only at the
> child queue level, and also with partitions.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
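Observation #1 above in schematic form (simplified pseudocode, not actual QueueMetrics code; {{metrics(queue, partition)}} is a hypothetical helper standing in for the partition-specific metrics lookup):

{code:java}
// The increment keys off the partition named in the *request*, while the
// decrement keys off the partition of the *node* the containers landed on.
Resource threeGB = Resource.newInstance(3 * 1024, 3);
metrics(queueA, "").incrPending(user, threeGB);   // request had no label -> DEFAULT
// ...the three 1 GB containers are then allocated on Node N (label X)...
metrics(queueA, "X").decrPending(user, threeGB);  // node partition -> X
// Net effect: PartitionDefault pending = +3 GB, PartitionX pending = -3 GB.
{code}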
[jira] [Commented] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126100#comment-17126100 ]

Eric Payne commented on YARN-10283:
-----------------------------------

[~Jim_Brennan], it looks like the patch applies cleanly all the way back to 2.10.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and
> node labels are used
> -----------------------------------------------------------------------------
>
>             Key: YARN-10283
>             URL: https://issues.apache.org/jira/browse/YARN-10283
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: capacity scheduler
>        Reporter: Peter Bacsko
>        Assignee: Peter Bacsko
>        Priority: Major
>     Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch, YARN-10283-ReproTest2.patch
>
> Recently we've been investigating a scenario where applications submitted to
> a lower priority queue could not get scheduled because a higher priority
> queue in the same hierarchy could not satisfy the allocation request. Both
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcores per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were added to the partition
> * Both queues have a limit of
> * Using DominantResourceCalculator
> Setup:
> Submit a distributed shell application to highprio with switches
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per container.
> Chain of events:
> 1. The queue is filled with containers until it reaches usage vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller than the current limit resource
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an allocated container for
> 5. But we can't commit the resource request because we would have 9 vcores in total, violating the limit.
> The problem is that we always try to assign a container for the same
> application in each heartbeat from "highprio". Applications in "lowprio"
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
> if (rmContainer == null && reservationsContinueLooking
>     && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with
> the allocation if there's room for a container.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
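One conceivable direction, sketched purely as an illustration (the real fix may need to treat labeled reservations more carefully than this), is to let the rejection logic quoted in the description run regardless of node labels:

{code:java}
// Hypothetical variant of the guard in
// RegularContainerAllocator.assignContainer(): the node.getLabels().isEmpty()
// clause is dropped so the rejection path also fires on labeled nodes.
if (rmContainer == null && reservationsContinueLooking) {
  // ... existing logic deciding whether this allocation must be rejected ...
}
{code}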
[jira] [Commented] (YARN-6857) Support REST for Node Attributes configurations
[ https://issues.apache.org/jira/browse/YARN-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126080#comment-17126080 ]

Bilwa S T commented on YARN-6857:
---------------------------------

Fixed checkstyle and UT.

> Support REST for Node Attributes configurations
> -----------------------------------------------
>
>             Key: YARN-6857
>             URL: https://issues.apache.org/jira/browse/YARN-6857
>         Project: Hadoop YARN
>      Issue Type: Sub-task
>      Components: api, capacityscheduler, client
>        Reporter: Naganarasimha G R
>        Assignee: Bilwa S T
>        Priority: Major
>     Attachments: YARN-6857-YARN-3409.001.patch, YARN-6857.002.patch, YARN-6857.003.patch
>
> This jira focuses on supporting the mapping of Nodes to Attributes through REST

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6857) Support REST for Node Attributes configurations
[ https://issues.apache.org/jira/browse/YARN-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bilwa S T updated YARN-6857:
----------------------------
    Attachment: YARN-6857.003.patch

> Support REST for Node Attributes configurations
> -----------------------------------------------
>
>             Key: YARN-6857
>             URL: https://issues.apache.org/jira/browse/YARN-6857
>         Project: Hadoop YARN
>      Issue Type: Sub-task
>      Components: api, capacityscheduler, client
>        Reporter: Naganarasimha G R
>        Assignee: Bilwa S T
>        Priority: Major
>     Attachments: YARN-6857-YARN-3409.001.patch, YARN-6857.002.patch, YARN-6857.003.patch
>
> This jira focuses on supporting the mapping of Nodes to Attributes through REST

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6857) Support REST for Node Attributes configurations
[ https://issues.apache.org/jira/browse/YARN-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126072#comment-17126072 ]

Hadoop QA commented on YARN-6857:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 26s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 23m 42s | trunk passed |
| +1 | compile | 1m 5s | trunk passed |
| +1 | checkstyle | 0m 44s | trunk passed |
| +1 | mvnsite | 1m 6s | trunk passed |
| +1 | shadedclient | 18m 21s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 32s | trunk passed |
| 0 | spotbugs | 1m 47s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 1m 46s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 47s | the patch passed |
| +1 | compile | 0m 44s | the patch passed |
| +1 | javac | 0m 44s | the patch passed |
| -0 | checkstyle | 0m 32s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 5 new + 12 unchanged - 0 fixed = 17 total (was 12) |
| +1 | mvnsite | 0m 45s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 20s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 29s | the patch passed |
| +1 | findbugs | 1m 49s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 91m 36s | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 | asflicense | 0m 29s | The patch does not generate ASF License warnings. |
| | | 160m 49s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes |

|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26113/artifact/out/Dockerfile |
| JIRA Issue | YARN-6857 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004821/YARN-6857.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 251c07ae8064 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 5157118bd7f |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| check
[jira] [Comment Edited] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126050#comment-17126050 ]

Benjamin Teke edited comment on YARN-10295 at 6/4/20, 4:19 PM:
---------------------------------------------------------------

[~adam.antal], yes, it is indeed blocked by that isDebugEnabled check; I forgot to mention in the description that the debug level must be set, so I updated it. Some lines are purposely omitted from the log extract above; that's why it doesn't show any debug-level logs. Also, the log is copied from our internal test system (the error is quite tricky to reproduce), hence the incorrect line numbers in the exception.

was (Author: bteke):
[~adam.antal], yes, it is indeed blocked by that isDebugEnabled check; I forgot to mention that in the description, so I updated it. Some lines are purposely omitted from the log extract above, and it is copied from our internal test system (the error is quite tricky to reproduce); that's why it doesn't show any debug-level logs and the line numbers in the exception are incorrect.

> CapacityScheduler NPE can cause apps to get stuck without resources
> --------------------------------------------------------------------
>
>             Key: YARN-10295
>             URL: https://issues.apache.org/jira/browse/YARN-10295
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: capacityscheduler
> Affects Versions: 3.1.0, 3.2.0
>        Reporter: Benjamin Teke
>        Assignee: Benjamin Teke
>        Priority: Major
>     Attachments: YARN-10295.001.branch-3.1.patch, YARN-10295.001.branch-3.2.patch
>
> When CapacityScheduler asynchronous scheduling is enabled and the log level
> is set to DEBUG, there is an edge case where a NullPointerException can cause
> the scheduler thread to exit and the apps to get stuck without allocated
> resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource=
> 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label=
> 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough
> memory, so after a short while it gets unreserved. However, because the
> scheduler thread is running asynchronously, it might have entered the
> following if block located at
> [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
> because at the time _node.getReservedContainer()_ wasn't null. Calling it a
> second time to get the ApplicationAttemptId would cause an NPE, as the
> container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping scheduling since node " + node.getNodeID()
>         + " is reserved by application " + node.getReservedContainer()
>         .getContainerId().getApplicationAttemptId());
>   }
>   return null;
> }
> {code}
> A fix would be to store the container object before the
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126050#comment-17126050 ]

Benjamin Teke commented on YARN-10295:
--------------------------------------

[~adam.antal], yes, it is indeed blocked by that isDebugEnabled check; I forgot to mention that in the description, so I updated it. Some lines are purposely omitted from the log extract above, and it is copied from our internal test system (the error is quite tricky to reproduce); that's why it doesn't show any debug-level logs and the line numbers in the exception are incorrect.

> CapacityScheduler NPE can cause apps to get stuck without resources
> --------------------------------------------------------------------
>
>             Key: YARN-10295
>             URL: https://issues.apache.org/jira/browse/YARN-10295
>         Project: Hadoop YARN
>      Issue Type: Bug
>      Components: capacityscheduler
> Affects Versions: 3.1.0, 3.2.0
>        Reporter: Benjamin Teke
>        Assignee: Benjamin Teke
>        Priority: Major
>     Attachments: YARN-10295.001.branch-3.1.patch, YARN-10295.001.branch-3.2.patch
>
> When CapacityScheduler asynchronous scheduling is enabled and the log level
> is set to DEBUG, there is an edge case where a NullPointerException can cause
> the scheduler thread to exit and the apps to get stuck without allocated
> resources. Consider the following log:
> {code:java}
> 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource=
> 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label=
> 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
> 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
> {code}
> A container gets allocated on a host, but the host doesn't have enough
> memory, so after a short while it gets unreserved. However, because the
> scheduler thread is running asynchronously, it might have entered the
> following if block located at
> [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
> because at the time _node.getReservedContainer()_ wasn't null. Calling it a
> second time to get the ApplicationAttemptId would cause an NPE, as the
> container got unreserved in the meantime.
> {code:java}
> // Do not schedule if there are any reservations to fulfill on the node
> if (node.getReservedContainer() != null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping scheduling since node " + node.getNodeID()
>         + " is reserved by application " + node.getReservedContainer()
>         .getContainerId().getApplicationAttemptId());
>   }
>   return null;
> }
> {code}
> A fix would be to store the container object before the if block.
> Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664,
> which indirectly fixed this.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
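The fix described in the last paragraph of the issue, rendered as a sketch (variable naming is ours; the attached patches are authoritative):

{code:java}
// Read the reserved container exactly once, so the debug-log branch cannot
// NPE when another thread unreserves the container between the two reads.
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + reservedContainer
            .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}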
[jira] [Updated] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Teke updated YARN-10295:
---------------------------------
    Description:

When CapacityScheduler asynchronous scheduling is enabled and the log level is set to DEBUG, there is an edge case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources. Consider the following log:
{code:java}
2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource=
2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label=
2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
{code}
A container gets allocated on a host, but the host doesn't have enough memory, so after a short while it gets unreserved. However, because the scheduler thread is running asynchronously, it might have entered the following if block located at [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], because at the time _node.getReservedContainer()_ wasn't null. Calling it a second time to get the ApplicationAttemptId would cause an NPE, as the container got unreserved in the meantime.
{code:java}
// Do not schedule if there are any reservations to fulfill on the node
if (node.getReservedContainer() != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + node.getReservedContainer()
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}
A fix would be to store the container object before the if block.
Only branch-3.1/3.2 is affected, because the newer branches have YARN-9664, which indirectly fixed this.

was:

When CapacityScheduler asynchronous scheduling and debug logging are enabled, there is an edge case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources. Consider the following log:
{code:java}
2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource=
2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label=
2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(
[jira] [Updated] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Teke updated YARN-10295: - Description: When CapacityScheduler asynchronous scheduling and debug logging are enabled, there is an edge case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources. Consider the following log: {code:java} 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource= 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label= 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593) {code} A container gets allocated on a host, but the host doesn't have enough memory, so after a short while it gets unreserved. However, because the scheduler thread runs asynchronously, it might already have entered the following if block located at [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], because at the time _node.getReservedContainer()_ wasn't null. Calling it a second time to get the ApplicationAttemptId then throws an NPE, as the container was unreserved in the meantime. {code:java} // Do not schedule if there are any reservations to fulfill on the node if (node.getReservedContainer() != null) { if (LOG.isDebugEnabled()) { LOG.debug("Skipping scheduling since node " + node.getNodeID() + " is reserved by application " + node.getReservedContainer() .getContainerId().getApplicationAttemptId()); } return null; } {code} A fix would be to store the container object in a local variable before the if block. Only branch-3.1 and branch-3.2 are affected, because the newer branches have YARN-9664, which indirectly fixed this. was: When CapacityScheduler asynchronous scheduling is enabled, there is an edge case where a NullPointerException can cause the scheduler thread to exit and the apps to get stuck without allocated resources.
Consider the following log: {code:java} 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used= with resource= 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved on node host: ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 available= used=, currently has 0 at priority 11; currentReservation on node-label= 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
[jira] [Commented] (YARN-10295) CapacityScheduler NPE can cause apps to get stuck without resources
[ https://issues.apache.org/jira/browse/YARN-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125968#comment-17125968 ] Adam Antal commented on YARN-10295: --- Thanks for the investigation [~bteke]. I have a question here: the {{getReservedContainer()}} call is guarded by a {{LOG.isDebugEnabled()}} block, so this issue should only happen when DEBUG logging is turned on. However, the log extract attached in the description does not show any debug-level lines. That makes me a little suspicious: are you sure the NPE happened at that exact line? If the DEBUG lines were simply omitted from the extract, that would reassure me. > CapacityScheduler NPE can cause apps to get stuck without resources > --- > > Key: YARN-10295 > URL: https://issues.apache.org/jira/browse/YARN-10295 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.1.0, 3.2.0 >Reporter: Benjamin Teke >Assignee: Benjamin Teke >Priority: Major > Attachments: YARN-10295.001.branch-3.1.patch, > YARN-10295.001.branch-3.2.patch > > > When CapacityScheduler asynchronous scheduling is enabled, there is an > edge case where a NullPointerException can cause the scheduler thread to exit > and the apps to get stuck without allocated resources. Consider the following > log: > {code:java} > 2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:apply(681)) - Reserved > container=container_e10_1590502305306_0660_01_000115, on node=host: > ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 > available= used= with > resource= > 2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:internalUnreserve(743)) - Application > application_1590502305306_0660 unreserved on node host: > ctr-e148-1588963324989-31443-01-02.hwx.site:25454 #containers=14 > available= used=, currently > has 0 at priority 11; currentReservation on node-label= > 2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted > 2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread > Thread[Thread-4953,5,main] threw an Exception. > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593) > {code} > A container gets allocated on a host, but the host doesn't have enough > memory, so after a short while it gets unreserved. However, because the > scheduler thread runs asynchronously, it might already have entered the > following if block located at > [CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602], > because at the time _node.getReservedContainer()_ wasn't null. Calling it a > second time to get the ApplicationAttemptId then throws an NPE, as the > container was unreserved in the meantime. > {code:java} > // Do not schedule if there are any reservations to fulfill on the node > if (node.getReservedContainer() != null) { > if (LOG.isDebugEnabled()) { > LOG.debug("Skipping scheduling since node " + node.getNodeID() > + " is reserved by application " + node.getReservedContainer() > .getContainerId().getApplicationAttemptId()); > } > return null; > } > {code} > A fix would be to store the container object in a local variable before the if block. > Only branch-3.1 and branch-3.2 are affected, because the newer branches have YARN-9664, > which indirectly fixed this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
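For reference, the fix described in the updates above (caching the result of {{node.getReservedContainer()}} in a local variable so that the second dereference cannot race with an unreservation) would look roughly like this. This is only a sketch against the branch-3.1 code path, not the attached patch itself; {{RMContainer}} is the type returned by {{SchedulerNode#getReservedContainer}}:

{code:java}
// Sketch of the proposed fix inside allocateContainerOnSingleNode():
// read the reserved container once and reuse the local reference, so a
// concurrent unreservation between the null check and the DEBUG log
// statement can no longer trigger an NPE.
RMContainer reservedContainer = node.getReservedContainer();
if (reservedContainer != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + reservedContainer
            .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}

Note that the second {{getReservedContainer()}} call in the original code sits inside the {{LOG.isDebugEnabled()}} guard, which is why the crash only reproduces with DEBUG logging enabled.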
[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125949#comment-17125949 ] Peter Bacsko commented on YARN-9930: OK, no more UT failures, other than "TestFairSchedulerPreemption", which is very likely unrelated. Ping [~snemeth] / [~sunilg] / [~epayne] / [~maniraj...@gmail.com] for some feedback. If you agree with the approach outlined in https://issues.apache.org/jira/browse/YARN-9930?focusedCommentId=17118899&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17118899 and https://issues.apache.org/jira/browse/YARN-9930?focusedCommentId=17124947&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17124947, then I'll continue the development on this course. > Support max running app logic for CapacityScheduler > --- > > Key: YARN-9930 > URL: https://issues.apache.org/jira/browse/YARN-9930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, capacityscheduler >Affects Versions: 3.1.0, 3.1.1 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9930-POC01.patch, YARN-9930-POC02.patch, > YARN-9930-POC03.patch, YARN-9930-POC04.patch, YARN-9930-POC05.patch > > > In FairScheduler there is a max-running-apps limit that makes excess > applications wait in a pending state. > CapacityScheduler has no such feature; it only has a max-applications limit, > and jobs beyond it are rejected directly on the client side. > In this jira I want to implement the same semantics for CapacityScheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
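For readers unfamiliar with the two limits being compared: FairScheduler's per-queue {{maxRunningApps}} (set in the allocation file) queues excess applications, while CapacityScheduler's {{maximum-applications}} rejects excess submissions outright. A minimal configuration sketch follows; the queue name is made up for illustration, while the property names are the standard ones:

{code:xml}
<!-- FairScheduler allocation file: with 10 apps running, the next
     submitted app simply waits in a pending (non-runnable) state. -->
<queue name="analytics">
  <maxRunningApps>10</maxRunningApps>
</queue>
{code}

{code:xml}
<!-- capacity-scheduler.xml: once 10 apps (running plus pending) exist
     in the queue, the next submission is rejected at the client. -->
<property>
  <name>yarn.scheduler.capacity.root.analytics.maximum-applications</name>
  <value>10</value>
</property>
{code}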
[jira] [Commented] (YARN-6857) Support REST for Node Attributes configurations
[ https://issues.apache.org/jira/browse/YARN-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125940#comment-17125940 ] Bilwa S T commented on YARN-6857: - Thanks [~Naganarasimha]. [~prabhujoseph] I have attached a patch, please review. > Support REST for Node Attributes configurations > --- > > Key: YARN-6857 > URL: https://issues.apache.org/jira/browse/YARN-6857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, capacityscheduler, client >Reporter: Naganarasimha G R >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-6857-YARN-3409.001.patch, YARN-6857.002.patch > > > This jira focuses on supporting the mapping of Nodes to Attributes through REST -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6857) Support REST for Node Attributes configurations
[ https://issues.apache.org/jira/browse/YARN-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-6857: Attachment: YARN-6857.002.patch > Support REST for Node Attributes configurations > --- > > Key: YARN-6857 > URL: https://issues.apache.org/jira/browse/YARN-6857 > Project: Hadoop YARN > Issue Type: Sub-task > Components: api, capacityscheduler, client >Reporter: Naganarasimha G R >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-6857-YARN-3409.001.patch, YARN-6857.002.patch > > > This jira focuses on supporting the mapping of Nodes to Attributes through REST -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125909#comment-17125909 ] Hadoop QA commented on YARN-9930: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 6 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 8s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 42s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 40s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 43s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 23 new + 749 unchanged - 0 fixed = 772 total (was 749) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 15s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 43s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 48s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}157m 56s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerPreemption | \\ \\ || Subsystem || Report/Notes || | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://builds.apache.org/job/PreCommit-YARN-Build/26111/artifact/out/Dockerfile | | JIRA Issue | YARN-9930 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/13004801/YARN-9930-POC05.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 43232e1fa6b3 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 5157118bd7f | | Default Java | Private Build-1.8.0_252-8u252-b09-1
[jira] [Commented] (YARN-10279) Avoid unnecessary QueueMappingEntity creations
[ https://issues.apache.org/jira/browse/YARN-10279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125861#comment-17125861 ] Adam Antal commented on YARN-10279: --- Thanks! > Avoid unnecessary QueueMappingEntity creations > -- > > Key: YARN-10279 > URL: https://issues.apache.org/jira/browse/YARN-10279 > Project: Hadoop YARN > Issue Type: Task >Reporter: Gergely Pollak >Assignee: Bilwa S T >Priority: Minor > > In the CS UserGroupMappingPlacementRule and AppNameMappingPlacementRule classes > we create new instances of the QueueMappingEntity class. In some cases we simply > copy the instance we already received, which is > unnecessary since the class is immutable. > This is just a minor improvement that probably doesn't have much impact, but > it still puts some unnecessary load on the GC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
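To illustrate the point, consider a hypothetical call site; the surrounding code and accessor names are paraphrased for illustration, and only the immutability of {{QueueMappingEntity}} is taken from the description:

{code:java}
// Hypothetical shape of the current code in the placement rules.
// Before: a defensive copy of an immutable mapping, which only
// produces extra garbage for the GC:
//   return new QueueMappingEntity(mapping.getSource(), mapping.getQueue());
// After: the class is immutable, so the received instance is safe to share.
return mapping;
{code}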
[jira] [Assigned] (YARN-10279) Avoid unnecessary QueueMappingEntity creations
[ https://issues.apache.org/jira/browse/YARN-10279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam Antal reassigned YARN-10279: - Assignee: Hudáky Márton Gyula (was: Bilwa S T) > Avoid unnecessary QueueMappingEntity creations > -- > > Key: YARN-10279 > URL: https://issues.apache.org/jira/browse/YARN-10279 > Project: Hadoop YARN > Issue Type: Task >Reporter: Gergely Pollak >Assignee: Hudáky Márton Gyula >Priority: Minor > > In the CS UserGroupMappingPlacementRule and AppNameMappingPlacementRule classes > we create new instances of the QueueMappingEntity class. In some cases we simply > copy the instance we already received, which is > unnecessary since the class is immutable. > This is just a minor improvement that probably doesn't have much impact, but > it still puts some unnecessary load on the GC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10275) CapacityScheduler QueuePath object should be able to parse paths
[ https://issues.apache.org/jira/browse/YARN-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125837#comment-17125837 ] Gergely Pollak commented on YARN-10275: --- This Jira is very likely to become invalid: during the resolution of YARN-10281 this class has been removed entirely. If that patch gets merged, I'll invalidate this Jira. > CapacityScheduler QueuePath object should be able to parse paths > > > Key: YARN-10275 > URL: https://issues.apache.org/jira/browse/YARN-10275 > Project: Hadoop YARN > Issue Type: Task >Reporter: Gergely Pollak >Assignee: Bilwa S T >Priority: Major > > Currently QueuePlacementRuleUtils has an extractQueuePath method, which is > used to split full paths into a parent path plus a leaf queue name; all instances of > QueuePath are created via this method, which suggests this behaviour should be > part of the QueuePath object. > We should create a constructor that implements this logic and remove the > extractQueuePath method. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
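A sketch of what the suggested constructor could look like; the field names and the handling of the {{.}} separator are assumptions for illustration, not the actual patch:

{code:java}
// Hypothetical QueuePath constructor absorbing the split logic from
// QueuePlacementRuleUtils#extractQueuePath: "root.a.leaf" becomes
// parent "root.a" plus leaf "leaf"; a bare name has no parent path.
public QueuePath(String fullPath) {
  int lastSeparator = fullPath.lastIndexOf('.');
  if (lastSeparator < 0) {
    this.parentQueue = null;          // e.g. a short name such as "root"
    this.leafQueue = fullPath;
  } else {
    this.parentQueue = fullPath.substring(0, lastSeparator);
    this.leafQueue = fullPath.substring(lastSeparator + 1);
  }
}
{code}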
[jira] [Updated] (YARN-9930) Support max running app logic for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9930: --- Attachment: YARN-9930-POC05.patch > Support max running app logic for CapacityScheduler > --- > > Key: YARN-9930 > URL: https://issues.apache.org/jira/browse/YARN-9930 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler, capacityscheduler >Affects Versions: 3.1.0, 3.1.1 >Reporter: zhoukang >Assignee: zhoukang >Priority: Major > Attachments: YARN-9930-POC01.patch, YARN-9930-POC02.patch, > YARN-9930-POC03.patch, YARN-9930-POC04.patch, YARN-9930-POC05.patch > > > In FairScheduler there is a max-running-apps limit that makes excess > applications wait in a pending state. > CapacityScheduler has no such feature; it only has a max-applications limit, > and jobs beyond it are rejected directly on the client side. > In this jira I want to implement the same semantics for CapacityScheduler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
[ https://issues.apache.org/jira/browse/YARN-10307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125766#comment-17125766 ] appleyuchi commented on YARN-10307: --- I ran the above on a two-node cluster (my laptop and my desktop). Do I need to install LevelDB separately before starting the timeline server? The jps output on my master (desktop) is: 15843 NodeManager 14868 NameNode 15655 ResourceManager 32684 Main 27196 Jps 15038 DataNode The jps output on my slave (laptop) is: 15796 DataNode 16331 Jps 15966 NodeManager > /leveldb-timeline-store.ldb/LOCK not exist > -- > > Key: YARN-10307 > URL: https://issues.apache.org/jira/browse/YARN-10307 > Project: Hadoop YARN > Issue Type: Bug > Environment: Ubuntu 19.10 > Hadoop 3.1.2 > Tez 0.9.2 > Hbase 2.2.4 >Reporter: appleyuchi >Priority: Blocker > > $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver > > in hadoop-appleyuchi-timelineserver-Desktop.out I get > > org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: > /home/appleyuchi/[file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:|file:///home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:] > No such file or directory > at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) > at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > 2020-06-04 17:48:21,525 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state > INITED > java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268) > at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at >
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > 2020-06-04 17:48:21,526 INFO [main] service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer > failed in state INITED > org.apache.hadoop.service.ServiceStateException: > java.io.FileNotFoundException: Source > 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' > does not exist > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) > Cause
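For reference: LevelDB itself ships with the timeline server through the bundled leveldbjni binding (visible in the stack trace above), so a separate install should not be required; the store directory just has to exist and be writable by the user running the daemon. Its location is controlled by a standard yarn-site.xml property; the value below simply mirrors the path from the trace:

{code:xml}
<!-- yarn-site.xml: where the timeline server keeps its LevelDB store.
     The default resolves under hadoop.tmp.dir; pointing it at a stable,
     writable directory may avoid the missing LOCK/FileNotFoundException
     seen above. -->
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.path</name>
  <value>/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline</value>
</property>
{code}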
[jira] [Created] (YARN-10307) /leveldb-timeline-store.ldb/LOCK not exist
appleyuchi created YARN-10307: - Summary: /leveldb-timeline-store.ldb/LOCK not exist Key: YARN-10307 URL: https://issues.apache.org/jira/browse/YARN-10307 Project: Hadoop YARN Issue Type: Bug Environment: Ubuntu 19.10 Hadoop 3.1.2 Tez 0.9.2 Hbase 2.2.4 Reporter: appleyuchi $HADOOP_HOME/sbin/yarn-daemon.sh start timelineserver in hadoop-appleyuchi-timelineserver-Desktop.out I get org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /home/appleyuchi/[file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:|file:///home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb/LOCK:] No such file or directory at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:246) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) 2020-06-04 17:48:21,525 INFO [main] service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore failed in state INITED java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368) at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268) at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) 2020-06-04 17:48:21,526 INFO [main] service.AbstractService (AbstractService.java:noteFailure(267)) - Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED org.apache.hadoop.service.ServiceStateException: java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) at
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:177) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:187) Caused by: java.io.FileNotFoundException: Source 'file:/home/appleyuchi/bigdata/hadoop_tmp/yarn/timeline/leveldb-timeline-store.ldb' does not exist at org.apache.commons.io.FileUtils.checkFileRequirements(FileUtils.java:1405) at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1368) at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1268) at org.apache.commons.io.FileUtils.copyDirectory(FileUtils.java:1237) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:253) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) ... 5 more 2020-06-04 17:48:21,527 INFO [main] impl.MetricsSystemImpl (Me