[ 
https://issues.apache.org/jira/browse/YARN-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338813#comment-14338813
 ] 

Naganarasimha G R commented on YARN-3260:
-----------------------------------------

Hi [~jlowe],
I had a look at the code, and the approaches I can think of are:
* Call ApplicationMasterService.registerAppAttempt(ApplicationAttemptId) in 
RMAppAttemptImpl.AMLaunchedTransition instead of 
RMAppAttemptImpl.AttemptStartedTransition, so that the ClientToAMToken is 
generated and the attempt is registered with ApplicationMasterService in the 
same block. With this change we can throw 
InvalidApplicationMasterRequestException if the AM tries to register with the 
AMS before RMAppAttemptImpl has processed the RMAppAttempt LAUNCHED event.
* Use a MultiThreadedDispatcher for processing App and AppAttempt events, 
similar to SystemMetricsPublisher.MultiThreadedDispatcher, with one 
modification: instead of selecting the dispatcher with 
{{"(event.hashCode() & Integer.MAX_VALUE) % dispatchers.size();"}}, select it 
based on the applicationId. This can speed up the processing of App events ...
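
To make the second approach concrete, here is a minimal sketch of the two 
selection schemes side by side. AppEventRouter and its method names are 
hypothetical and not part of Hadoop; the application is identified by a plain 
cluster timestamp and sequence number, standing in for the fields of a real 
ApplicationId, so that the example stays self-contained.

```java
/**
 * Hypothetical helper (not a Hadoop class): chooses which dispatcher in a
 * fixed-size pool should handle an event.
 */
final class AppEventRouter {
    private final int dispatcherCount;

    AppEventRouter(int dispatcherCount) {
        if (dispatcherCount <= 0) {
            throw new IllegalArgumentException("dispatcherCount must be positive");
        }
        this.dispatcherCount = dispatcherCount;
    }

    /** Scheme quoted in the comment: the slot depends on the event's own hash. */
    int indexByEventHash(Object event) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // remainder is never negative.
        return (event.hashCode() & Integer.MAX_VALUE) % dispatcherCount;
    }

    /**
     * Proposed variant: the slot depends only on the application identity
     * (assumed here to be a cluster timestamp plus a sequence number).
     */
    int indexByApplication(long clusterTimestamp, int appSequence) {
        int appHash = 31 * Long.hashCode(clusterTimestamp) + appSequence;
        return (appHash & Integer.MAX_VALUE) % dispatcherCount;
    }
}
```

Under these assumptions, every event for a given application maps to the same 
slot, so events of one application would still be handled in order by a single 
dispatcher thread while different applications can proceed in parallel.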

I was not able to see any other cleaner, direct fix for this issue, so I was 
wondering whether we need to start looking at why the RM "was running behind 
on processing AsyncDispatcher events". Were these events getting delayed for 
any particular reason? 

> NPE if AM attempts to register before RM processes launch event
> ---------------------------------------------------------------
>
>                 Key: YARN-3260
>                 URL: https://issues.apache.org/jira/browse/YARN-3260
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Assignee: Naganarasimha G R
>
> The RM on one of our clusters was running behind on processing 
> AsyncDispatcher events, and this caused AMs to fail to register due to an 
> NPE.  The AM was launched and attempting to register before the 
> RMAppAttemptImpl had processed the LAUNCHED event, and the client to AM token 
> had not been generated yet.  The NPE occurred because the 
> ApplicationMasterService tried to encode the missing token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
