[ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055413#comment-16055413 ]
Tao Yang edited comment on YARN-6714 at 6/20/17 9:38 AM:
---------------------------------------------------------

Thanks [~leftnoteasy] for reviewing the patch.
{quote}
Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well).
{quote}
I now remember why these test cases were not placed in TestCapacitySchedulerAsyncScheduling originally: they are complex and hard to reproduce when async-scheduling is enabled. Can I move these test cases to TestCapacitySchedulerAsyncScheduling without enabling async-scheduling?
{quote}
could you file a separate JIRA for that? (And welcome if you can work on that ).
{quote}
I'm glad to work on that :D


was (Author: tao yang):
Thanks [~leftnoteasy] for reviewing the patch.
{quote}
Could you move test case from TestCapacityScheduler to TestCapacitySchedulerAsyncScheduling (same comment to YARN-6678 as well).
{quote}
Sure, I will update the patch later for this and YARN-6678.
{quote}
could you file a separate JIRA for that? (And welcome if you can work on that ).
{quote}
I'm glad to work on that :D

> RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED
> event when async-scheduling enabled in CapacityScheduler
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6714
> URL: https://issues.apache.org/jira/browse/YARN-6714
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.9.0, 3.0.0-alpha3
> Reporter: Tao Yang
> Assignee: Tao Yang
> Attachments: YARN-6714.001.patch
>
>
> Currently, in the async-scheduling mode of CapacityScheduler, after AM failover
> and after all reserved containers have been unreserved, the scheduler still has
> a chance to get and commit an outdated reserve proposal of the failed app attempt.
> This problem happened on an app in our cluster: when the app stopped, it
> unreserved all reserved containers and compared their appAttemptId with the
> current appAttemptId; on a mismatch it threw an IllegalStateException, which
> crashed the RM.
> Error log:
> {noformat}
> 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Trying to unreserve for application appattempt_1495188831758_0121_000002 when currently reserved for application application_1495188831758_0121 on node host: node1:45454 #containers=2 available=... used=...
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> When async-scheduling is enabled, CapacityScheduler#doneApplicationAttempt and
> CapacityScheduler#tryCommit both need to acquire the write lock before
> executing, so we can check the app attempt state during the commit process to
> avoid committing outdated proposals.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
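
The fix idea in the issue description (check the app attempt state inside the commit path, under the same write lock that attempt removal takes) can be illustrated with a minimal sketch. This is not the attached patch and not real CapacityScheduler code: the class `CommitCheckSketch`, the `ReserveProposal` type, `addAttempt`, `reservedBy`, and the plain string IDs are all hypothetical stand-ins, and only the method names `doneApplicationAttempt` and `tryCommit` are borrowed from the description.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model; NOT the real CapacityScheduler code.
class CommitCheckSketch {

  // Hypothetical reserve proposal computed earlier by an async scheduling thread.
  static final class ReserveProposal {
    final String attemptId;
    final String node;

    ReserveProposal(String attemptId, String node) {
      this.attemptId = attemptId;
      this.node = node;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Set<String> liveAttempts = new HashSet<>();          // attempts not yet removed
  private final Map<String, String> reservations = new HashMap<>();  // node -> attemptId

  void addAttempt(String attemptId) {
    lock.writeLock().lock();
    try {
      liveAttempts.add(attemptId);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Removes the attempt and releases every reservation it holds.
  void doneApplicationAttempt(String attemptId) {
    lock.writeLock().lock();
    try {
      liveAttempts.remove(attemptId);
      reservations.values().removeIf(attemptId::equals);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Commits a proposal only if its attempt is still live. Because this takes the
  // same write lock as doneApplicationAttempt, the check cannot race with removal.
  boolean tryCommit(ReserveProposal proposal) {
    lock.writeLock().lock();
    try {
      if (!liveAttempts.contains(proposal.attemptId)) {
        return false; // outdated proposal from a removed attempt: drop it
      }
      reservations.put(proposal.node, proposal.attemptId);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  String reservedBy(String node) {
    lock.readLock().lock();
    try {
      return reservations.get(node);
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    CommitCheckSketch scheduler = new CommitCheckSketch();
    scheduler.addAttempt("attempt_1");
    // The proposal is computed while attempt_1 is live, but committed after it is removed.
    ReserveProposal stale = new ReserveProposal("attempt_1", "node1");
    scheduler.doneApplicationAttempt("attempt_1");
    scheduler.addAttempt("attempt_2");
    System.out.println("stale committed: " + scheduler.tryCommit(stale));
    System.out.println("fresh committed: "
        + scheduler.tryCommit(new ReserveProposal("attempt_2", "node1")));
  }
}
```

In this model a stale proposal never leaves a reservation behind for a removed attempt, so a later unreserve cannot hit the attempt-ID mismatch described above. How the real patch identifies a stopped attempt is not shown in this message.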