[ https://issues.apache.org/jira/browse/YARN-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prabhu Joseph updated YARN-7426:
--------------------------------
    Summary: Interrupt does not work when LocalizerRunner is reading from InputStream  (was: Add a finite shell command timeout to ContainerLocalizer)

> Interrupt does not work when LocalizerRunner is reading from InputStream
> ------------------------------------------------------------------------
>
>                 Key: YARN-7426
>                 URL: https://issues.apache.org/jira/browse/YARN-7426
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.3
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> When the NodeManager is overloaded and ContainerLocalizer processes are
> hanging, the containers time out and are cleaned up. The LocalizerRunner
> thread is interrupted during cleanup, but the interrupt has no effect
> while the thread is reading from a FileInputStream. LocalizerRunner threads
> and ContainerLocalizer processes keep accumulating, which makes the node
> completely unresponsive. A timeout on the shell command, similar to
> HADOOP-13817, would avoid this.
> The AM could set the timeout value to match the container timeout.
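The description above comes down to this: a thread blocked in a stream read on a child process's pipe does not react to `Thread.interrupt()`, so the child has to be bounded by a timeout and killed, which closes the pipe and ends the read. The following is a minimal sketch of that idea only. It is not Hadoop's `Shell` code and not the fix proposed for this issue. `TimedCommand`, `Result`, and `run` are hypothetical names, and the example assumes a JDK 8 or later runtime.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command and kills it after a timeout. */
final class TimedCommand {

    /** Outcome of one run (hypothetical type, for illustration only). */
    static final class Result {
        final boolean timedOut;
        final String output;

        Result(boolean timedOut, String output) {
            this.timedOut = timedOut;
            this.output = output;
        }
    }

    static Result run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        final Process process =
            new ProcessBuilder(command).redirectErrorStream(true).start();
        final StringBuilder output = new StringBuilder();

        // Read the child's output on a separate daemon thread. This read is
        // the call that cannot be interrupted; it ends only when the pipe
        // reaches end-of-file or is closed.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                int c;
                while ((c = in.read()) != -1) {
                    synchronized (output) {
                        output.append((char) c);
                    }
                }
            } catch (IOException e) {
                // The stream was closed underneath the reader; stop reading.
            }
        });
        reader.setDaemon(true);
        reader.start();

        boolean finished = false;
        try {
            // Unlike the blocking read, this wait is bounded and interruptible.
            finished = process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            if (!finished) {
                // Timeout or interrupt: kill the child so its pipe closes.
                process.destroyForcibly();
            }
        }
        if (!finished) {
            process.waitFor();
        }
        // Bounded join: a grandchild could still hold the pipe open.
        reader.join(1000);
        synchronized (output) {
            return new Result(!finished, output.toString());
        }
    }
}
```

The caller's thread never sits in the pipe read, so it can always be interrupted or timed out. The reader thread is a daemon and is joined with a bound, so a pipe that stays open cannot pin the caller.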
> ContainerLocalizer JVM stacktrace:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007fd8ec019000 nid=0xc295 runnable [0x00007fd8f3956000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.zip.ZipFile.open(Native Method)
>         at java.util.zip.ZipFile.<init>(ZipFile.java:219)
>         at java.util.zip.ZipFile.<init>(ZipFile.java:149)
>         at java.util.jar.JarFile.<init>(JarFile.java:166)
>         at java.util.jar.JarFile.<init>(JarFile.java:103)
>         at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
>         at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
>         at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
>         at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
>         at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803)
>         at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
>         at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
>         at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
>         - locked <0x000000076ac75058> (a sun.misc.URLClassPath)
>         at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
>         - locked <0x000000076ac75058> (a sun.misc.URLClassPath)
>         at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         - locked <0x000000076ac7f960> (a java.lang.Object)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
> {code}
> NodeManager LocalizerRunner thread which is not interrupted:
> {code}
> "LocalizerRunner for container_e746_1508665985104_601806_01_000005" #3932753 prio=5 os_prio=0 tid=0x00007fb258d5f800 nid=0x11091 runnable [0x00007fb153946000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.FileInputStream.readBytes(Native Method)
>         at java.io.FileInputStream.read(FileInputStream.java:255)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>         - locked <0x0000000718502b80> (a java.lang.UNIXProcess$ProcessPipeInputStream)
>         at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>         at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>         at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>         - locked <0x0000000718502bd8> (a java.io.InputStreamReader)
>         at java.io.InputStreamReader.read(InputStreamReader.java:184)
>         at java.io.BufferedReader.fill(BufferedReader.java:161)
>         at java.io.BufferedReader.read1(BufferedReader.java:212)
>         at java.io.BufferedReader.read(BufferedReader.java:286)
>         - locked <0x0000000718502bd8> (a java.io.InputStreamReader)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
>         at org.apache.hadoop.util.Shell.run(Shell.java:848)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> NM log shows the LocalizerRunner is supposed to
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)