[ https://issues.apache.org/jira/browse/YARN-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15642368#comment-15642368 ]
Varun Saxena commented on YARN-4355:
------------------------------------

bq. Also cant we pass the same tracker instance to getPathForLocalization from LocalizerRunner.processHeartbeat ?

That is what is being done. Previously, the tracker was fetched inside getPathForLocalization. Do you mean something else?

{code}
LocalResourcesTracker tracker = getLocalResourcesTracker(
    next.getVisibility(), user, applicationId);
if (tracker != null) {
  ResourceLocalizationSpec resource =
      NodeManagerBuilderUtils.newResourceLocalizationSpec(next,
          getPathForLocalization(next, tracker));
  rsrcs.add(resource);
}
{code}

bq. If cleanup has happened then are there chances of having pending LocalizerResourceRequestEvent in LocalizerRunner ?

It can, even if rarely. This section of code isn't really synchronized, so the events that clean up container resources and destroy application resources can be processed before the localizer HB is fully processed. The localizer will only DIE if it cannot be found in the list of localizers, which it is removed from when the container is cleaned up. But HB processing can carry on if it finds the localizer and then fail to find the tracker, because the application resources are destroyed before the HB is fully processed; the corresponding sections are not guarded by a lock.

Evidence of this is the NPE reported in this JIRA, which came from a real cluster. The NPE occurred while the NM was shutting down, so all the apps on the NM were being cleaned up as well. So yes, there can be pending events. And in any case, a null check is not a bad thing to have.

Why did I not synchronize this section of code as a solution? The reason is that this race is very rare.
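To illustrate the race and the null-check guard described above, here is a minimal, self-contained sketch. It is not the NodeManager code: `TrackerRaceSketch`, `Tracker`, `registerApp`, `destroyAppResources`, and the `/local/...` path layout are all hypothetical stand-ins, and the concurrent map is an assumption used only to model a tracker that can disappear between lookups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the race; none of these are the real NodeManager classes.
public class TrackerRaceSketch {

    /** Stand-in for a per-application local resources tracker. */
    static final class Tracker {
        private final String appId;

        Tracker(String appId) {
            this.appId = appId;
        }

        // Assumed path layout, for illustration only.
        String pathFor(String resource) {
            return "/local/" + appId + "/" + resource;
        }
    }

    // appId -> tracker; the entry is removed when application resources are destroyed.
    private final Map<String, Tracker> appTrackers = new ConcurrentHashMap<>();

    public void registerApp(String appId) {
        appTrackers.put(appId, new Tracker(appId));
    }

    /** Models the cleanup event that may run while a heartbeat is in flight. */
    public void destroyAppResources(String appId) {
        appTrackers.remove(appId);
    }

    /**
     * Models heartbeat processing: the tracker is looked up once per pending
     * resource, the resource is skipped if the tracker is already gone, and the
     * same tracker instance is used to compute the path.
     */
    public List<String> processHeartbeat(String appId, List<String> pending) {
        List<String> specs = new ArrayList<>();
        for (String next : pending) {
            Tracker tracker = appTrackers.get(appId);
            if (tracker != null) { // guard against cleanup having won the race
                specs.add(tracker.pathFor(next));
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        TrackerRaceSketch nm = new TrackerRaceSketch();
        nm.registerApp("app_1");
        List<String> pending = new ArrayList<>();
        pending.add("a.jar");

        // Tracker present: a localization spec is produced.
        System.out.println(nm.processHeartbeat("app_1", pending));

        // Cleanup ran first: the resource is skipped instead of throwing an NPE.
        nm.destroyAppResources("app_1");
        System.out.println(nm.processHeartbeat("app_1", pending));
    }
}
```

The sketch does not remove the race; it only makes the losing side harmless, which matches the reasoning above that locking the whole section is hard to justify for such a rare interleaving.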
> NPE while processing localizer heartbeat
> ----------------------------------------
>
>                 Key: YARN-4355
>                 URL: https://issues.apache.org/jira/browse/YARN-4355
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4355.01.patch, YARN-4355.02.patch,
> YARN-4355.03.patch, YARN-4355.04.patch
>
>
> While analyzing YARN-4354 I noticed a nodemanager was getting NPEs while
> processing a private localizer heartbeat. I think there's a race where we
> can cleanup resources for an application and therefore remove the app local
> resource tracker just as we are trying to handle the localizer heartbeat.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)