[ 
https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199034#comment-17199034
 ] 

shilongfei commented on YARN-9192:
----------------------------------

Hi [~rayman7718], thank you for the configuration. I have applied these settings, and 
containers no longer exit when the NM is restarted by supervisor.

Now I have run into a different situation. I set 
yarn.nodemanager.delete.debug-delay-sec=600 (seconds), and from the debug log I get the 
following sequence:

1 -> At the start, only one container of app1 (container1) is running on NM1.

2 -> After a while, container1 finishes, and the app1 dir (such as 
/xxx/nmPrivate/app1) is scheduled to be cleaned up after 600s.

(About 10s later, I restarted NM1; I don't know whether this has any effect on the 
result.)

3 -> After about 300s, container2 of app1 is allocated on NM1, and the file 
/xxx/nmPrivate/app1/container2/container2.pid is created.

4 -> 600s after container1 finished, the file 
/xxx/nmPrivate/app1/container2/container2.pid is deleted.

After that, container2 never exits: cleanupContainer gets stuck, ContainerManagerImpl 
gets stuck, and the NM gets stuck.
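
The timeline above can be reproduced with a toy model. This is only a sketch in Python, not NodeManager code: `FakeNodeManager`, `start_container`, `finish_container`, and `tick` are hypothetical names, the filesystem is a plain set of paths, and it assumes (as the debug log suggests) that the delayed deletion is keyed on the app dir rather than on the finished container.

```python
import heapq

DELAY_SEC = 600  # stands in for yarn.nodemanager.delete.debug-delay-sec


class FakeNodeManager:
    """Toy model: files are a set of paths, time is an integer in seconds."""

    def __init__(self):
        self.now = 0
        self.files = set()
        self.pending = []  # min-heap of (due_time, directory_to_delete)

    def start_container(self, app_dir, container):
        pid_file = f"{app_dir}/{container}/{container}.pid"
        self.files.add(pid_file)
        return pid_file

    def finish_container(self, app_dir):
        # Assumption: the cleanup is scheduled for the whole app dir.
        heapq.heappush(self.pending, (self.now + DELAY_SEC, app_dir))

    def tick(self, seconds):
        """Advance the clock and run every deletion that has become due."""
        self.now += seconds
        while self.pending and self.pending[0][0] <= self.now:
            _, directory = heapq.heappop(self.pending)
            self.files = {f for f in self.files
                          if not f.startswith(directory + "/")}


nm = FakeNodeManager()
app_dir = "/xxx/nmPrivate/app1"
nm.start_container(app_dir, "container1")         # step 1
nm.finish_container(app_dir)                      # step 2, t = 0
nm.tick(300)
pid2 = nm.start_container(app_dir, "container2")  # step 3, t = 300
alive_before = pid2 in nm.files
nm.tick(300)                                      # step 4, t = 600
alive_after = pid2 in nm.files
print(alive_before, alive_after)
```

In this model the pid file of container2 disappears at t = 600 even though container2 was started after the deletion was scheduled, which matches step 4.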

 

> Deletion Taks will be picked up to delete running containers
> ------------------------------------------------------------
>
>                 Key: YARN-9192
>                 URL: https://issues.apache.org/jira/browse/YARN-9192
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>    Affects Versions: 2.9.1
>            Reporter: Sihai Ke
>            Priority: Major
>
> I suspect there is a bug in the YARN deletion task service. Below are my repro 
> steps:
>  # First, set yarn.nodemanager.delete.debug-delay-sec=3600, which means that 
> when the app finishes, the binary/container folder will be deleted after 3600 
> seconds.
>  # While the application App1 (a long-running service) is running on machine 
> machine1, machine1 shuts down. ContainerManagerImpl#serviceStop() is 
> called -> ContainerManagerImpl#cleanUpApplicationsOnNMShutDown, and an 
> ApplicationFinishEvent is sent. Some deletion tasks are then created, stored 
> in the DB, and will be picked up for execution 3600 seconds later.
>  # 100 seconds later, machine1 comes back, and the same app is assigned to 
> run on this machine; a container is created and works well.
>  # Later, the deletion tasks created in step 2 are picked up and delete the 
> containers created in step 3.
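
The restart sequence in the quoted steps can be sketched with a toy model. This is Python for illustration only, not the NodeManager implementation: `state_store`, `shutdown`, `restart`, and `run_due_tasks` are hypothetical names, the path is made up, and a dict stands in for the DB that holds the deletion tasks.

```python
import json

DELAY_SEC = 3600  # stands in for yarn.nodemanager.delete.debug-delay-sec


def shutdown(state_store, app_dir, now):
    """On shutdown, persist a deletion task for the app's directory."""
    task = {"path": app_dir, "due": now + DELAY_SEC}
    state_store["deletion_tasks"] = json.dumps([task])


def restart(state_store):
    """On restart, recover the persisted tasks unchanged."""
    return json.loads(state_store.get("deletion_tasks", "[]"))


def run_due_tasks(tasks, files, now):
    """Delete everything under the path of each task that is due."""
    for task in tasks:
        if task["due"] <= now:
            files = {f for f in files
                     if not f.startswith(task["path"] + "/")}
    return files


state_store = {}
app_dir = "/data/app1"                               # hypothetical path
shutdown(state_store, app_dir, now=0)                # step 2
tasks = restart(state_store)                         # step 3, machine is back
files = {f"{app_dir}/container_new/launch.sh"}       # same app runs here again
files_early = run_due_tasks(tasks, files, now=100)   # task not due yet
files_late = run_due_tasks(tasks, files, now=3600)   # step 4
print(len(files_early), len(files_late))
```

Because the recovered task only records a path and a due time, the model cannot tell the new container's files from the old ones once the delay expires.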



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

