[ https://issues.apache.org/jira/browse/YARN-9929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kyungwan nam updated YARN-9929:
-------------------------------
    Attachment: nm_heapdump.png

> NodeManager OOM because of stuck DeletionService
> ------------------------------------------------
>
>                 Key: YARN-9929
>                 URL: https://issues.apache.org/jira/browse/YARN-9929
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: kyungwan nam
>            Assignee: kyungwan nam
>            Priority: Major
>         Attachments: nm_heapdump.png
>
>
> NMs go through frequent Full GCs due to a lack of heap memory.
> We can find a lot of FileDeletionTask and DockerContainerDeletionTask objects in the heap dump (screenshot is attached).
> After analyzing the thread dump, we can see that _DeletionService_ gets stuck in _executeStatusCommand_, which runs 'docker inspect'.
> {code:java}
> "DeletionService #0" - Thread t@41
>    java.lang.Thread.State: RUNNABLE
>         at java.io.FileInputStream.readBytes(Native Method)
>         at java.io.FileInputStream.read(FileInputStream.java:255)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>         - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
>         at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>         at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>         at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>         - locked <3e45c938> (a java.io.InputStreamReader)
>         at java.io.InputStreamReader.read(InputStreamReader.java:184)
>         at java.io.BufferedReader.fill(BufferedReader.java:161)
>         at java.io.BufferedReader.read1(BufferedReader.java:212)
>         at java.io.BufferedReader.read(BufferedReader.java:286)
>         - locked <3e45c938> (a java.io.InputStreamReader)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
>         at org.apache.hadoop.util.Shell.run(Shell.java:902)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
>    Locked ownable synchronizers:
>         - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> Also, we found that 'docker inspect' processes had been running for a long time, as follows.
> {code:java}
> root      95637  0.0  0.0 2650984 35776 ?   Sl   Aug23   5:48 /usr/bin/docker inspect --format={{.State.Status}} container_e30_1555419799458_0014_01_000030
> root      95638  0.0  0.0 2773860 33908 ?   Sl   Aug23   5:33 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_25316_01_001455
> root      95641  0.0  0.0 2445924 34204 ?   Sl   Aug23   5:34 /usr/bin/docker inspect --format={{.State.Status}} container_e49_1560851258686_2107_01_000024
> root      95643  0.0  0.0 2642532 34428 ?   Sl   Aug23   5:30 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_8111_01_002657
> {code}
>
> I think this started happening after the docker daemon was restarted.
> The 'docker inspect' commands that were running while the docker daemon restarted stopped working, and they were never terminated.
> This can be considered a docker issue, but it could happen whenever 'docker inspect' hangs, for example because of a docker daemon restart or a docker bug.
> It would be good to set a timeout for 'docker inspect' to avoid this issue; a rough sketch follows below.
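> A minimal sketch of the timeout idea, assuming 'docker inspect' were run through Hadoop's Shell.ShellCommandExecutor, which already accepts a timeout interval. The real NodeManager path goes through PrivilegedOperationExecutor and container-executor, so the class name, the 30-second value, and the error handling below are only illustrative assumptions, not an actual patch.
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hadoop.util.Shell.ShellCommandExecutor;
>
> public class DockerInspectWithTimeout {
>
>   // Illustrative value; a real fix would likely read this from configuration.
>   private static final long INSPECT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);
>
>   public static String inspectStatus(String containerId) throws IOException {
>     String[] cmd = {
>         "/usr/bin/docker", "inspect",
>         "--format={{.State.Status}}", containerId
>     };
>
>     // The (command, working dir, env, timeout-in-ms) constructor destroys the
>     // child process once the timeout elapses instead of blocking forever.
>     ShellCommandExecutor executor =
>         new ShellCommandExecutor(cmd, null, null, INSPECT_TIMEOUT_MS);
>     try {
>       executor.execute();
>     } catch (IOException e) {
>       if (executor.isTimedOut()) {
>         // Fail the deletion task instead of tying up a DeletionService thread;
>         // the task could be retried later.
>         throw new IOException("docker inspect timed out for " + containerId, e);
>       }
>       throw e;
>     }
>     return executor.getOutput().trim();
>   }
> }
> {code}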