Prabhu Joseph created YARN-7426: ----------------------------------- Summary: Add a finite shell command timeout to ContainerLocalizer Key: YARN-7426 URL: https://issues.apache.org/jira/browse/YARN-7426 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.3 Reporter: Prabhu Joseph Priority: Critical
When the NodeManager is overloaded and ContainerLocalizer processes are hanging, the containers will timeout and cleaned up. The LocalizerRunner thread will be interrupted during cleanup but the interrupt does not work when it is reading from FileInputStream. LocalizerRunner threads and ContainerLocalizer process keeps on accumulating which makes the node completely unresponsive. We can have a timeout for Shell Command to avoid this similar to HADOOP-13817. The timeout value can be set by AM same as container timeout. ContainerLocalizer JVM stacktrace: {code} "main" #1 prio=5 os_prio=0 tid=0x00007fd8ec019000 nid=0xc295 runnable [0x00007fd8f3956000] java.lang.Thread.State: RUNNABLE at java.util.zip.ZipFile.open(Native Method) at java.util.zip.ZipFile.<init>(ZipFile.java:219) at java.util.zip.ZipFile.<init>(ZipFile.java:149) at java.util.jar.JarFile.<init>(JarFile.java:166) at java.util.jar.JarFile.<init>(JarFile.java:103) at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893) at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756) at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838) at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831) at java.security.AccessController.doPrivileged(Native Method) at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830) at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803) at sun.misc.URLClassPath$3.run(URLClassPath.java:530) at sun.misc.URLClassPath$3.run(URLClassPath.java:520) at java.security.AccessController.doPrivileged(Native Method) at sun.misc.URLClassPath.getLoader(URLClassPath.java:519) at sun.misc.URLClassPath.getLoader(URLClassPath.java:492) - locked <0x000000076ac75058> (a sun.misc.URLClassPath) at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457) - locked <0x000000076ac75058> (a sun.misc.URLClassPath) at sun.misc.URLClassPath.getResource(URLClassPath.java:211) at java.net.URLClassLoader$1.run(URLClassLoader.java:365) at java.net.URLClassLoader$1.run(URLClassLoader.java:362) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:361) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) - locked <0x000000076ac7f960> (a java.lang.Object) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495) {code} NodeManager LocalizerRunner thread which is not interrupted: {code} "LocalizerRunner for container_e746_1508665985104_601806_01_000005" #3932753 prio=5 os_prio=0 tid=0x00007fb258d5f800 nid=0x11091 runnable [0x00007fb153946000] java.lang.Thread.State: RUNNABLE at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) - locked <0x0000000718502b80> (a java.lang.UNIXProcess$ProcessPipeInputStream) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) - locked <0x0000000718502bd8> (a java.io.InputStreamReader) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:161) at java.io.BufferedReader.read1(BufferedReader.java:212) at java.io.BufferedReader.read(BufferedReader.java:286) - locked <0x0000000718502bd8> (a java.io.InputStreamReader) at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155) at org.apache.hadoop.util.Shell.runCommand(Shell.java:930) at org.apache.hadoop.util.Shell.run(Shell.java:848) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142) at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114) NM log shows the LocalizerRunner is suppose to {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org