[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544742#comment-16544742 ]
Chen Yufei edited comment on YARN-8513 at 7/16/18 2:39 AM:
-----------------------------------------------------------

[~yuanbo] I've uploaded the jstack and top logs captured when the problem appeared yesterday.

jstack was captured 5 times, hence the 5 log files.

[^top-during-lock.log] was captured while RM was not responding to requests.

[^top-when-normal.log] was captured today, with RM running normally.


was (Author: cyfdecyf):
[~yuanbo] I've uploaded jstack and top log when the problem appears yesterday.

jstack log are captured for 5 times thus 5 log files.

[^top-during-lock.log] is captured when RM is not responding to requests.

[^top-when-normal.log] is captured today and RM is running normally.

> CapacityScheduler infinite loop when queue is near fully utilized
> -----------------------------------------------------------------
>
>                 Key: YARN-8513
>                 URL: https://issues.apache.org/jira/browse/YARN-8513
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, yarn
>    Affects Versions: 2.9.1
>         Environment: Ubuntu 14.04.5
> YARN is configured with one label and 5 queues.
>            Reporter: Chen Yufei
>            Priority: Major
>         Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log,
> jstack-5.log, top-during-lock.log, top-when-normal.log
>
>
> ResourceManager sometimes stops responding to all requests when a queue is
> nearly fully utilized. Sending SIGTERM won't stop RM; only SIGKILL can. After
> RM restarts, it recovers running jobs and starts accepting new ones.
>
> CapacityScheduler seems to be stuck in an infinite loop, printing the
> following log messages (more than 25,000 lines in a second):
>
> {{2018-07-10 17:16:29,227 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
> assignedContainer queue=root usedCapacity=0.99816763
> absoluteUsedCapacity=0.99816763 used=<memory:16170624, vCores:1577>
> cluster=<memory:29441544, vCores:5792>}}
> {{2018-07-10 17:16:29,227 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
> assignedContainer application attempt=appattempt_1530619767030_1652_000001
> container=null
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
> clusterResource=<memory:29441544, vCores:5792> type=NODE_LOCAL
> requestedPartition=}}
>
> I have encountered this problem several times since upgrading to YARN 2.9.1,
> while the same configuration works fine under version 2.7.3.
>
> YARN-4477 is an infinite loop bug in FairScheduler; I am not sure whether this
> is a similar problem.
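
For anyone checking whether their ResourceManager shows the same symptom, the log rate mentioned in the description (more than 25,000 lines in a second) can be measured with a small script. This is a minimal sketch in Python and not part of the original report. It assumes the RM log uses the timestamp/level/logger layout visible in the quoted messages; the function names and the default threshold of 10,000 lines per second are illustrative choices.

```python
import re
from collections import Counter

# Assumed layout, taken from the quoted messages:
# "2018-07-10 17:16:29,227 INFO <logger class>: <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (\w+)\s+([\w.$]+): "
)


def lines_per_second(lines):
    """Count log lines per (second, short logger name) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            second, _level, logger = m.groups()
            # Keep only the class name, e.g. "CapacityScheduler".
            counts[(second, logger.rsplit(".", 1)[-1])] += 1
    return counts


def busy_seconds(counts, threshold=10000):
    """Return (second, logger, count) entries at or above threshold, busiest first."""
    hits = [(sec, lg, n) for (sec, lg), n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda h: h[2], reverse=True)
```

To scan a log file, pass the open file object to `lines_per_second` and print whatever `busy_seconds` returns. A second in which `CapacityScheduler` or `ParentQueue` logs tens of thousands of lines would be consistent with the behaviour described in this issue, although it does not by itself prove a loop.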