[ https://issues.apache.org/jira/browse/YARN-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199034#comment-17199034 ]
shilongfei commented on YARN-9192:
----------------------------------

Hi, [~rayman7718], thank you for the conf. I have set these, and the container no longer exits when the NM is restarted by supervisor.

Now I have encountered a different situation. I set yarn.nodemanager.delete.debug-delay-sec=600, and from the debug log I got the following information:

1 -> At the beginning, only one container (container1) of app1 is running on NM1.
2 -> After a while, container1 finishes, and the app1 dir (such as /xxx/nmPrivate/app1) is scheduled to be cleaned up after 600s. (About 10s later I restarted NM1; I don't know whether this has any effect on the result.)
3 -> After about 300s, container2 of app1 is allocated on NM1, and the file /xxx/nmPrivate/app1/container2/container2.pid is created.
4 -> 600s after container1 ended, the file /xxx/nmPrivate/app1/container2/container2.pid is deleted.

As a result, container2 never exits: cleanupContainer gets stuck, ContainerManagerImpl gets stuck, and the NM gets stuck.

> Deletion Tasks will be picked up to delete running containers
> -------------------------------------------------------------
>
>                 Key: YARN-9192
>                 URL: https://issues.apache.org/jira/browse/YARN-9192
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>    Affects Versions: 2.9.1
>            Reporter: Sihai Ke
>            Priority: Major
>
> I suspect there is a bug in the YARN deletion task service. Below are my repro steps:
> # First, set yarn.nodemanager.delete.debug-delay-sec=3600, which means that when the app finishes, the binary/container folder will be deleted after 3600 seconds.
> # When application App1 (a long-running service) is running on machine1 and machine1 shuts down, ContainerManagerImpl#serviceStop() is called -> ContainerManagerImpl#cleanUpApplicationsOnNMShutDown, and an ApplicationFinishEvent is sent. Some deletion tasks are then created, but they are stored in the DB and will only be picked up for execution 3600 seconds later.
> # 100 seconds later, machine1 comes back and the same app is assigned to run on this machine; a container is created and works well.
> # The deletion tasks created in step 2 are then picked up and delete the containers created in step 3.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
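
For anyone who wants to reason about the timeline reported in the comment without a cluster, here is a minimal sketch of the suspected race. It is a toy simulation in Python, not the NodeManager's actual deletion code: the class `DelayedDeleter`, its methods, and the simulated clock are all hypothetical names made up for illustration. It only assumes that a deletion task is keyed by a directory path and removes whatever is under that path when its delay expires.

```python
import heapq


class DelayedDeleter:
    """Toy stand-in for a delayed deletion service (hypothetical, not YARN code)."""

    def __init__(self):
        self.now = 0        # simulated clock, in seconds
        self.files = set()  # paths that currently exist
        self.pending = []   # min-heap of (due_time, directory)

    def create(self, path):
        self.files.add(path)

    def schedule_delete(self, directory, delay_sec):
        # The task remembers only the directory, not which files existed
        # at scheduling time.
        heapq.heappush(self.pending, (self.now + delay_sec, directory))

    def advance(self, seconds):
        """Move the clock forward and run every deletion task that is due."""
        self.now += seconds
        while self.pending and self.pending[0][0] <= self.now:
            _, directory = heapq.heappop(self.pending)
            # Removes everything under the directory, including files that
            # were created after the task was scheduled.
            self.files = {f for f in self.files
                          if not f.startswith(directory + "/")}


APP_DIR = "/xxx/nmPrivate/app1"
PID2 = APP_DIR + "/container2/container2.pid"

nm = DelayedDeleter()
nm.schedule_delete(APP_DIR, 600)   # step 2: container1 finished at t=0
nm.advance(300)                    # t=300
nm.create(PID2)                    # step 3: container2 starts on the same NM
pid_exists_before = PID2 in nm.files
nm.advance(300)                    # step 4: t=600, the old task fires
pid_exists_after = PID2 in nm.files

print(pid_exists_before, pid_exists_after)  # True False
```

In this sketch the pid file of container2 disappears at t=600 even though container2 is still running, which matches what the debug log showed. Whether the real DeletionService behaves this way after an NM restart is exactly the open question in this issue.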