[ 
https://issues.apache.org/jira/browse/YARN-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15642368#comment-15642368
 ] 

Varun Saxena commented on YARN-4355:
------------------------------------

bq. Also can't we pass the same tracker instance to getPathForLocalization from 
LocalizerRunner.processHeartbeat?
That is what is being done. Previously the tracker was fetched inside 
getPathForLocalization. Do you mean something else?
{code}
LocalResourcesTracker tracker = getLocalResourcesTracker(
    next.getVisibility(), user, applicationId);
if (tracker != null) {
  ResourceLocalizationSpec resource =
      NodeManagerBuilderUtils.newResourceLocalizationSpec(next,
      getPathForLocalization(next, tracker));
  rsrcs.add(resource);
}
{code}

bq. If cleanup has happened then are there chances of having pending 
LocalizerResourceRequestEvent in LocalizerRunner?
It can, though rarely. This section of code isn't synchronized, so the events 
that clean up container resources and destroy application resources can be 
processed before the localizer HB is fully processed. The localizer is told to 
DIE only if it is not found in the list of localizers, and its entry is removed 
from that list when the container is cleaned up. However, the HB can find the 
localizer, carry on with processing, and then fail to find the tracker, because 
the application resources are destroyed before the HB is fully processed and 
the corresponding sections are not guarded by a lock. Evidence of this is the 
NPE reported in this JIRA, which came from a real cluster. The NPE occurred 
while the NM was shutting down, so all the apps on the NM were being cleaned up 
as well. So yes, there can be pending events. In any case, having a null check 
is not a bad thing.
Why did I not synchronize this section of code as the solution? Because the 
chance of this race happening is very low.
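
To illustrate the interleaving described above, here is a minimal sketch. It does not use the real NodeManager classes: HeartbeatRaceSketch, Tracker, registerApp, cleanupApp and pathFor are hypothetical stand-ins, and resources are plain strings. It only shows that when cleanup removes the tracker before the heartbeat looks it up, the null check skips the resource instead of throwing an NPE.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for the localization service.
// None of these names are the real Hadoop classes or methods.
class HeartbeatRaceSketch {

    // Hypothetical tracker: maps a resource name to a local path.
    static final class Tracker {
        String pathFor(String resource) {
            return "/local/" + resource;
        }
    }

    // appId -> tracker. Application cleanup removes the entry.
    private final Map<String, Tracker> trackersByApp = new ConcurrentHashMap<>();

    void registerApp(String appId) {
        trackersByApp.put(appId, new Tracker());
    }

    // Stands in for the event that destroys application resources.
    void cleanupApp(String appId) {
        trackersByApp.remove(appId);
    }

    // Stands in for heartbeat processing. The tracker is looked up once per
    // pending resource and passed on; a missing tracker means the app was
    // cleaned up concurrently, so the resource is skipped.
    List<String> processHeartbeat(String appId, List<String> pending) {
        List<String> rsrcs = new ArrayList<>();
        for (String next : pending) {
            Tracker tracker = trackersByApp.get(appId);
            if (tracker != null) {
                rsrcs.add(tracker.pathFor(next));
            }
        }
        return rsrcs;
    }

    public static void main(String[] args) {
        HeartbeatRaceSketch svc = new HeartbeatRaceSketch();
        svc.registerApp("app_1");
        System.out.println(svc.processHeartbeat("app_1", Arrays.asList("a.jar")));
        // Cleanup wins the race: the next heartbeat finds no tracker.
        svc.cleanupApp("app_1");
        System.out.println(svc.processHeartbeat("app_1", Arrays.asList("a.jar")));
    }
}
```

In this sketch the second heartbeat simply returns an empty list. Without the null check, the call on the missing tracker would throw the NPE.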

> NPE while processing localizer heartbeat
> ----------------------------------------
>
>                 Key: YARN-4355
>                 URL: https://issues.apache.org/jira/browse/YARN-4355
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Assignee: Varun Saxena
>         Attachments: YARN-4355.01.patch, YARN-4355.02.patch, 
> YARN-4355.03.patch, YARN-4355.04.patch
>
>
> While analyzing YARN-4354 I noticed a nodemanager was getting NPEs while 
> processing a private localizer heartbeat.  I think there's a race where we 
> can clean up resources for an application and therefore remove the app local 
> resource tracker just as we are trying to handle the localizer heartbeat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
