[ https://issues.apache.org/jira/browse/YARN-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338813#comment-14338813 ]
Naganarasimha G R commented on YARN-3260:
-----------------------------------------

Hi [~jlowe],

I had a look at the code, and the approaches I can think of are:
* Call ApplicationMasterService.registerAppAttempt(ApplicationAttemptId) in RMAppAttemptImpl.AMLaunchedTransition instead of RMAppAttemptImpl.AttemptStartedTransition, so that the ClientToAMToken is generated and the attempt is registered with ApplicationMasterService in the same block. With this change we can throw InvalidApplicationMasterRequestException if the AM tries to register with the AMS before RMAppAttemptImpl processes the RMAppAttempt LAUNCHED event.
* Use a MultiThreadedDispatcher for processing App and AppAttempt events, similar to SystemMetricsPublisher.MultiThreadedDispatcher, with one modification: instead of selecting the dispatcher with {{(event.hashCode() & Integer.MAX_VALUE) % dispatchers.size()}}, select it based on the applicationId. This can speed up the processing of App events ...

I was not able to see any other cleaner, direct fix for this issue, so I was wondering whether we need to start looking at why the RM "was running behind on processing AsyncDispatcher events". Were these events getting delayed for any particular reason?

> NPE if AM attempts to register before RM processes launch event
> ---------------------------------------------------------------
>
>                 Key: YARN-3260
>                 URL: https://issues.apache.org/jira/browse/YARN-3260
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Naganarasimha G R
>
> The RM on one of our clusters was running behind on processing
> AsyncDispatcher events, and this caused AMs to fail to register due to an
> NPE. The AM was launched and attempting to register before the
> RMAppAttemptImpl had processed the LAUNCHED event, and the client to AM token
> had not been generated yet.
> The NPE occurred because the
> ApplicationMasterService tried to encode the missing token.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
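
For illustration only, the second approach in the comment (choosing the dispatcher by applicationId rather than by the event's own hash) might look roughly like the sketch below. It is not taken from the YARN code base: the class name AppIdDispatcherSelector, the method selectDispatcher, and the use of a plain String as a stand-in for the application id are all assumptions made to keep the example self-contained. It reuses the masking-and-modulo expression quoted in the comment, keyed on the application id.

```java
import java.util.Objects;

/**
 * Hypothetical helper (not part of YARN): maps an application id to a
 * dispatcher index so that all events of one application go to the same
 * dispatcher.
 */
public final class AppIdDispatcherSelector {

    private AppIdDispatcherSelector() {
    }

    /**
     * Returns an index in [0, dispatcherCount) derived from the application id.
     * A String stands in for the real application id type (assumption).
     */
    public static int selectDispatcher(String applicationId, int dispatcherCount) {
        Objects.requireNonNull(applicationId, "applicationId");
        if (dispatcherCount <= 0) {
            throw new IllegalArgumentException("dispatcherCount must be positive");
        }
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder is never negative.
        return (applicationId.hashCode() & Integer.MAX_VALUE) % dispatcherCount;
    }

    public static void main(String[] args) {
        // Made-up application id, used only to exercise the helper.
        String appId = "application_1424800000000_0001";
        System.out.println("dispatcher index: " + selectDispatcher(appId, 4));
    }
}
```

Because the index depends only on the application id, events for one application would be handled by a single dispatcher thread and keep their relative order, while events for different applications could be processed in parallel. Whether that ordering guarantee is sufficient for the RM state machines is a separate question that this sketch does not address.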