[ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015009#comment-17015009 ]
Szilard Nemeth edited comment on YARN-7913 at 1/14/20 11:06 AM: ---------------------------------------------------------------- Hi [~wilfreds], Has read through all comments and checked your patch. 1. IIUC, your approach is a new one compared to Gergo's previous PoC patch, right? 2. If I was just checking the description: "The point of this ticket is to improve the error handling and reduce the number of passive -> active RM transition attempts (solving the above described failure scenario isn't in scope).", it was not in-line with the patch you uploaded. I think it would be better to remove this statement or add something to the desription that describes what has been done recently with your patch. This way, we could make a better experience of future readers of this jira. What do you think? 3. FairScheduler#addApplication is invoked by processing the SchedulerEventType#APP_ADDED event. This event type is used by AppAddedSchedulerEvent. The calls to constructors of this event class from RMAppImpl are happening when an application is added or recovered. I can see the following calls in RmAppImpl: {code} //1. app.handler.handle( new AppAddedSchedulerEvent(app.user, app.submissionContext, false, app.applicationPriority, app.placementContext)); //2. app.scheduler.handle( new AppAddedSchedulerEvent(app.user, app.submissionContext, false, app.applicationPriority, app.placementContext)); //3. // Add application to scheduler synchronously to guarantee scheduler // knows applications before AM or NM re-registers. app.scheduler.handle( new AppAddedSchedulerEvent(app.user, app.submissionContext, true, app.applicationPriority, app.placementContext)); {code} The boolean parameter is set to isAppRecovering field of the event, so this is why I came to the conclusion that apps being created or recovered are both sent to the event dispatcher. So based on the above statement, the code you added to FairScheduler#addApplication seems a bit incorrect: {code} // app is recovering we do not want to fail the app now as it was there // before we started the recovery. Add it to the recovery queue: // dynamic queue directly under root, no ACL needed (auto clean up) queueName = "root.recovery"; queue = queueMgr.getLeafQueue(queueName, true, applicationId); {code} Shouldn't we need to check the isAppRecovering flag and only run this code block if the flag is true, meaning the app is recovering? 4. Nit: The comment has a missing comma at least: {code} // Skip ACL check for recovering applications: they have been accepted // in the queue already recovery should not break that. {code} 5. One thing I don't fully understand: {code} // when recovering the NMs might not have registered and we could have // no resources in the queue, the app is already running and has thus // passed all these checks, skip them now. if (!isAppRecovering && rmApp != null && rmApp.getAMResourceRequests() != null) { .... {code} Is this condition for checking the apps that are NOT recovering? 6. I can imagine some improvement that could be filed as a follow-up jira: I was having a bit hard time to have an understanding of the conditions inside FairScheduler#addApplication. For example, the statements inside the if-else conditions could be extracted to methods with meaningful names, so code readers would understand what happens, right away by reading the code. If they are interested in details of a particular case, they could open the method anyway. Can you file a jira for this? thanks! was (Author: snemeth): Hi [~wilfreds], Has read through all comments and checked your patch. 1. IIUC, your approach is a new one compared to Gergo's previous PoC patch, right? 2. If I was just checking the description: "The point of this ticket is to improve the error handling and reduce the number of passive -> active RM transition attempts (solving the above described failure scenario isn't in scope).", it was not in-line with the patch you uploaded. I think it would be better to remove this statement or add something to the desription that describes what has been done recently with your patch. This way, we could make a better experience of future readers of this jira. What do you think? 3. FairScheduler#addApplication is invoked by processing the SchedulerEventType#APP_ADDED event. This event type is used by AppAddedSchedulerEvent. The calls to constructors of this event class from RMAppImpl are happening when an application is added or recovered. I can see the following calls in RmAppImpl: {code} 1. app.handler.handle( new AppAddedSchedulerEvent(app.user, app.submissionContext, false, app.applicationPriority, app.placementContext)); 2. app.scheduler.handle( new AppAddedSchedulerEvent(app.user, app.submissionContext, false, app.applicationPriority, app.placementContext)); 3. // Add application to scheduler synchronously to guarantee scheduler // knows applications before AM or NM re-registers. app.scheduler.handle( new AppAddedSchedulerEvent(app.user, app.submissionContext, true, app.applicationPriority, app.placementContext)); {code} The boolean parameter is set to isAppRecovering field of the event, so this is why I came to the conclusion that apps being created or recovered are both sent to the event dispatcher. So based on the above statement, the code you added to FairScheduler#addApplication seems a bit incorrect: {code} // app is recovering we do not want to fail the app now as it was there // before we started the recovery. Add it to the recovery queue: // dynamic queue directly under root, no ACL needed (auto clean up) queueName = "root.recovery"; queue = queueMgr.getLeafQueue(queueName, true, applicationId); {code} Shouldn't we need to check the isAppRecovering flag and only run this code block if the flag is true, meaning the app is recovering? 4. Nit: The comment has a missing comma at least: {code} // Skip ACL check for recovering applications: they have been accepted // in the queue already recovery should not break that. {code} 5. One thing I don't fully understand: {code} // when recovering the NMs might not have registered and we could have // no resources in the queue, the app is already running and has thus // passed all these checks, skip them now. if (!isAppRecovering && rmApp != null && rmApp.getAMResourceRequests() != null) { .... {code} Is this condition for checking the apps that are NOT recovering? 6. I can imagine some improvement that could be filed as a follow-up jira: I was having a bit hard time to have an understanding of the conditions inside FairScheduler#addApplication. For example, the statements inside the if-else conditions could be extracted to methods with meaningful names, so code readers would understand what happens, right away by reading the code. If they are interested in details of a particular case, they could open the method anyway. Can you file a jira for this? thanks! > Improve error handling when application recovery fails with exception > --------------------------------------------------------------------- > > Key: YARN-7913 > URL: https://issues.apache.org/jira/browse/YARN-7913 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager > Affects Versions: 3.0.0 > Reporter: Gergo Repas > Assignee: Wilfred Spiegelenburg > Priority: Major > Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, > YARN-7913.002.patch, YARN-7913.003.patch > > > There are edge cases when the application recovery fails with an exception. > Example failure scenario: > * setup: a queue is a leaf queue in the primary RM's config and the same > queue is a parent queue in the secondary RM's config. > * When failover happens with this setup, the recovery will fail for > applications on this queue, and an APP_REJECTED event will be dispatched to > the async dispatcher. On the same thread (that handles the recovery), a > NullPointerException is thrown when the applicationAttempt is tried to be > recovered > (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494). > I don't see a good way to avoid the NPE in this scenario, because when the > NPE occurs the APP_REJECTED has not been processed yet, and we don't know > that the application recovery failed. > Currently the first exception will abort the recovery, and if there are X > applications, there will be ~X passive -> active RM transition attempts - the > passive -> active RM transition will only succeed when the last APP_REJECTED > event is processed on the async dispatcher thread. > _The point of this ticket is to improve the error handling and reduce the > number of passive -> active RM transition attempts (solving the above > described failure scenario isn't in scope)._ -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org