[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973352#comment-16973352
 ] 

Tarun Parimi edited comment on YARN-9968 at 11/13/19 1:56 PM:
--------------------------------------------------------------

[~snemeth], I was finally able to reproduce it artificially in my test cluster. 
I added the sleep and subsequent exception below to the FSDownload class to 
simulate HDFS not responding for a minute and then throwing an exception 
while trying to download. When the application that requested the resource 
is killed during the minute in which the thread sleeps, I get a 
NullPointerException and the public localizer exits.

{code:java}
      try {
        Thread.sleep(60000);
        throw new ExecutionException("Test", new IOException("Exception"));
      } catch (InterruptedException e) {
        throw new IOException(e);
      }
{code}
From this I understood that the issue occurs when the following sequence of 
events occurs:

1. The public localizer is waiting on the download of a file from HDFS for 
quite some time.
2. The application is killed or fails while the download is still 
waiting/sleeping. This triggers the app cleanup, which removes the 
LocalResourcesTracker for that app.

{code:java}
  private void handleDestroyApplicationResources(Application application) {
    String userName = application.getUser();
    ApplicationId appId = application.getAppId();
    String appIDStr = application.toString();
    LocalResourcesTracker appLocalRsrcsTracker =
      appRsrc.remove(appId.toString());
{code}

3. The download finally fails, throwing an exception from HDFS.
4. Since the tracker was removed when the app was killed, we get a 
NullPointerException in the code below because tracker is null. This causes 
the public localizer to exit and stop handling future localization requests.
{code:java}
              tracker.handle(new ResourceFailedLocalizationEvent(
                  assoc.getResource().getRequest(), diagnostics));
{code}

This issue was introduced by the changes in YARN-8403, where a failed 
localization is reported to the app for logging in the AM.

I think adding a null check to prevent this should be safe, as the AM is 
already killed in this scenario. I will provide an initial patch based on this.
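
To illustrate the idea of the null check, here is a minimal standalone sketch. 
It is not the actual patch and does not use the real NodeManager classes: 
Tracker, appRsrc and notifyFailure are hypothetical stand-ins for 
LocalResourcesTracker, the per-application tracker map and the failure 
notification path. It only shows that skipping the notification when the 
tracker has already been removed avoids the NullPointerException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TrackerNullCheckSketch {

  /** Hypothetical stand-in for LocalResourcesTracker. */
  interface Tracker {
    void handle(String event);
  }

  /** Hypothetical stand-in for the per-application tracker map. */
  static final Map<String, Tracker> appRsrc = new ConcurrentHashMap<>();

  /**
   * Notifies the tracker of a failed localization if it still exists.
   * Returns false when the tracker is gone (application already cleaned up),
   * instead of dereferencing null.
   */
  static boolean notifyFailure(String appId, String diagnostics) {
    Tracker tracker = appRsrc.get(appId);
    if (tracker == null) {
      // The app was killed while the download was pending; nothing to notify.
      return false;
    }
    tracker.handle(diagnostics);
    return true;
  }

  public static void main(String[] args) {
    StringBuilder received = new StringBuilder();
    appRsrc.put("app_1", event -> received.append(event));

    // Tracker present: the failure is delivered.
    System.out.println(notifyFailure("app_1", "download failed"));

    // Simulate app cleanup removing the tracker before the download fails.
    appRsrc.remove("app_1");
    System.out.println(notifyFailure("app_1", "download failed"));
  }
}
```

In this sketch the second call returns false rather than throwing, so the 
calling loop can continue with the next resource.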

cc [~prabhujoseph]





> Public Localizer is exiting in NodeManager due to NullPointerException
> ----------------------------------------------------------------------
>
>                 Key: YARN-9968
>                 URL: https://issues.apache.org/jira/browse/YARN-9968
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager still keeps on running. Subsequent localization events for 
> containers keep encountering the below error, resulting in failed 
> Localization of all new containers. 
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { 
> { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null 
> },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED}
>  for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 
> rejected from 
> org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated,
>  pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 
> 382286]
>         at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>         at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>         at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>         at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.
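
The RejectedExecutionException quoted above is standard java.util.concurrent 
behavior for a pool that has been shut down (the log reports it as 
Terminated). The following standalone sketch, which uses only JDK classes and 
a hypothetical class name, shows a submit through an ExecutorCompletionService 
being rejected once the underlying pool is shut down. It is an illustration 
of the symptom, not NodeManager code.

```java
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdownSketch {

  /** Returns true if a submit after shutdown is rejected. */
  static boolean submitAfterShutdown() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    ExecutorCompletionService<String> queue =
        new ExecutorCompletionService<>(pool);

    // Simulate the pool being shut down, as when the localizer thread exits.
    pool.shutdown();

    try {
      queue.submit(() -> "downloaded");
      return false;
    } catch (RejectedExecutionException e) {
      // The default AbortPolicy rejects new tasks on a shut-down pool.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(submitAfterShutdown());
  }
}
```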


