kyungwan nam created YARN-9929:
----------------------------------

             Summary: NodeManager OOM because of stuck DeletionService
                 Key: YARN-9929
                 URL: https://issues.apache.org/jira/browse/YARN-9929
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.1.2
            Reporter: kyungwan nam
            Assignee: kyungwan nam
NMs go through frequent full GCs due to a lack of heap memory. The heap dump (screenshot attached) contains a large number of FileDeletionTask and DockerContainerDeletionTask objects. After analyzing the thread dump, we found that _DeletionService_ is stuck in _executeStatusCommand_, which runs 'docker inspect':

{code:java}
"DeletionService #0" - Thread t@41
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <649fc0cf> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <3e45c938> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        - locked <3e45c938> (a java.io.InputStreamReader)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1240)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:995)
        at org.apache.hadoop.util.Shell.run(Shell.java:902)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:91)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:180)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:118)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:937)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - locked <4cc6fa2a> (a java.util.concurrent.ThreadPoolExecutor$Worker)
{code}

We also found 'docker inspect' processes that had been running for a long time:

{code:java}
root  95637  0.0  0.0 2650984 35776 ?  Sl  Aug23  5:48 /usr/bin/docker inspect --format={{.State.Status}} container_e30_1555419799458_0014_01_000030
root  95638  0.0  0.0 2773860 33908 ?  Sl  Aug23  5:33 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_25316_01_001455
root  95641  0.0  0.0 2445924 34204 ?  Sl  Aug23  5:34 /usr/bin/docker inspect --format={{.State.Status}} container_e49_1560851258686_2107_01_000024
root  95643  0.0  0.0 2642532 34428 ?  Sl  Aug23  5:30 /usr/bin/docker inspect --format={{.State.Status}} container_e50_1561100493387_8111_01_002657
{code}

I believe this started when the docker daemon was restarted: the 'docker inspect' processes that were running during the restart stopped making progress, yet they were never terminated. That part can be considered a docker issue.
However, it can happen whenever 'docker inspect' hangs, whether due to a docker daemon restart or a docker bug. It would be good to set a timeout for 'docker inspect' to avoid this issue.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
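A minimal sketch of the proposed direction, using only plain JDK APIs rather than Hadoop's Shell/PrivilegedOperationExecutor classes (the helper name `runWithTimeout` and the timeout value are hypothetical, not part of YARN): bound the wait on the child process so a wedged 'docker inspect' is killed instead of blocking a DeletionService thread forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedExec {
    /**
     * Hypothetical helper: run a command, but give up after timeoutSec seconds.
     * Process.waitFor(long, TimeUnit) returns false on timeout, at which point
     * we forcibly kill the stuck child so the calling thread is freed.
     */
    public static int runWithTimeout(long timeoutSec, String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        if (!p.waitFor(timeoutSec, TimeUnit.SECONDS)) {
            p.destroyForcibly();  // kill the hung process (e.g. 'docker inspect')
            throw new TimeoutException("command timed out: " + String.join(" ", cmd));
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // A fast command completes normally and returns its exit code...
        System.out.println("exit=" + runWithTimeout(5, "echo", "ok"));
        // ...while a hung one is killed instead of pinning the thread.
        try {
            runWithTimeout(1, "sleep", "60");
        } catch (TimeoutException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With this pattern, a deletion task whose 'docker inspect' never returns fails fast (and could be retried or logged) instead of occupying a DeletionService worker indefinitely while further deletion tasks pile up on the heap.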