[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015435#comment-17015435 ]
Wilfred Spiegelenburg commented on YARN-7913: --------------------------------------------- Thank you [~snemeth] for looking at the patch 1) yes it is a different approach: we now do a proper restore instead of failing the app and or crashing the RM 2) removed that line it does not make sense to throw errors and fail the app but not fail the RM 3) You are correct in your assumption about the flag. It is checked in the case you are mentioning. This is the full change in that area: {code} 487 if (!isAppRecovering) { 488 rejectApplicationWithMessage(applicationId, 489 queueName + " is not a leaf queue"); 490 return; 491 } 492 // app is recovering we do not want to fail the app now as it was there 493 // before we started the recovery. Add it to the recovery queue: 494 // dynamic queue directly under root, no ACL needed (auto clean up) 495 queueName = "root.recovery"; 496 queue = queueMgr.getLeafQueue(queueName, true, applicationId); {code} The flag is already checked in line 487 of the change. When we get to the create of the recovery queue (line 492-496) we are recovering, if we are not recovering we have rejected the app and left the method using the return in line 490. 4) will fix the comment 5) Yes we only want to check newly added apps that are _not_ recovering. When we recover there is a large chance that there are no NMs registered. When we use the % resource setup for the maximum size of a queue then the size check of the AM resource will fail. Since the AM was/is already running we need to only perform that check if we are not recovering the app. 6) The largest number of statements inside an if..else I can see is 6 lines. Not sure if refactoring the method will make it any clearer. I also don't see any consistency in the checks that would allow us to change the flow easily and improve clarity. > Improve error handling when application recovery fails with exception > --------------------------------------------------------------------- > > Key: YARN-7913 > URL: https://issues.apache.org/jira/browse/YARN-7913 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager > Affects Versions: 3.0.0 > Reporter: Gergo Repas > Assignee: Wilfred Spiegelenburg > Priority: Major > Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, > YARN-7913.002.patch, YARN-7913.003.patch > > > There are edge cases when the application recovery fails with an exception. > Example failure scenario: > * setup: a queue is a leaf queue in the primary RM's config and the same > queue is a parent queue in the secondary RM's config. > * When failover happens with this setup, the recovery will fail for > applications on this queue, and an APP_REJECTED event will be dispatched to > the async dispatcher. On the same thread (that handles the recovery), a > NullPointerException is thrown when the applicationAttempt is tried to be > recovered > (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494). > I don't see a good way to avoid the NPE in this scenario, because when the > NPE occurs the APP_REJECTED has not been processed yet, and we don't know > that the application recovery failed. > Currently the first exception will abort the recovery, and if there are X > applications, there will be ~X passive -> active RM transition attempts - the > passive -> active RM transition will only succeed when the last APP_REJECTED > event is processed on the async dispatcher thread. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org