[ 
https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-6714:
---------------------------
    Attachment: YARN-6714.003.patch

Update the patch with adding comments for sanity check of attemptId. Thanks 
[~sunilg] for your suggestion.

> RM crashed with IllegalStateException while handling APP_ATTEMPT_REMOVED 
> event when async-scheduling enabled in CapacityScheduler
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6714
>                 URL: https://issues.apache.org/jira/browse/YARN-6714
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.9.0, 3.0.0-alpha3
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>         Attachments: YARN-6714.001.patch, YARN-6714.002.patch, 
> YARN-6714.003.patch
>
>
> Currently in async-scheduling mode of CapacityScheduler, after AM failover 
> and unreserve all reserved containers, it still have chance to get and commit 
> the outdated reserve proposal of the failed app attempt. This problem 
> happened on an app in our cluster, when this app stopped, it unreserved all 
> reserved containers and compared these appAttemptId with current 
> appAttemptId, if not match it will throw IllegalStateException and make RM 
> crashed.
> Error log:
> {noformat}
> 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type APP_ATTEMPT_REMOVED to the scheduler
> java.lang.IllegalStateException: Trying to unreserve  for application 
> appattempt_1495188831758_0121_000002 when currently reserved  for application 
> application_1495188831758_0121 on node host: node1:45454 #containers=2 
> available=... used=...
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822)
>         at java.lang.Thread.run(Thread.java:834)
> {noformat}
> When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and 
> CapacityScheduler#tryCommit both need to get write_lock before executing, so 
> we can check the app attempt state in commit process to avoid committing 
> outdated proposals.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to