Prabhu Joseph created YARN-7426:
-----------------------------------

             Summary: Add a finite shell command timeout to ContainerLocalizer
                 Key: YARN-7426
                 URL: https://issues.apache.org/jira/browse/YARN-7426
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn
    Affects Versions: 2.7.3
            Reporter: Prabhu Joseph
            Priority: Critical


When the NodeManager is overloaded and ContainerLocalizer processes hang, the
containers time out and are cleaned up. The LocalizerRunner thread is
interrupted during cleanup, but the interrupt has no effect while the thread is
blocked reading from a FileInputStream. LocalizerRunner threads and
ContainerLocalizer processes keep accumulating, which makes the node completely
unresponsive. We can add a timeout to the shell command to avoid this, similar
to HADOOP-13817. The timeout value can be set by the AM, the same as the
container timeout.
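
To illustrate the idea, here is a minimal sketch of a bounded command run that
uses only the JDK. It is not the Hadoop Shell implementation and not the
proposed patch; the names TimedCommand, Result and run are hypothetical. The
child's output is drained on a helper thread so the caller never blocks in
read(), and the caller waits with a deadline and kills the child when the
deadline passes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only (not part of Hadoop).
final class TimedCommand {

    /** Outcome of a bounded command run. */
    static final class Result {
        final boolean timedOut;
        final int exitCode;   // -1 when the command timed out
        final String output;

        Result(boolean timedOut, int exitCode, String output) {
            this.timedOut = timedOut;
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    static Result run(List<String> command, long timeoutMs)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)   // merge stderr into stdout
                .start();

        // Drain output on a separate thread: a blocking read() on the process
        // pipe does not react to Thread.interrupt(), so the caller must not
        // be the one doing the read.
        StringBuffer output = new StringBuffer();
        Thread drainer = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    process.getInputStream(), StandardCharsets.UTF_8))) {
                char[] buf = new char[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    output.append(buf, 0, n);
                }
            } catch (IOException ignored) {
                // The stream was closed, e.g. because the child was killed.
            }
        }, "output-drainer");
        drainer.setDaemon(true);
        drainer.start();

        boolean finished;
        try {
            // Bounded wait; this call does respond to interrupts.
            finished = process.waitFor(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            process.destroyForcibly();       // do not leak the child
            throw e;
        }
        if (!finished) {
            process.destroyForcibly();
            process.waitFor(1, TimeUnit.SECONDS);
        }
        drainer.join(1000);                  // bounded wait for the drainer
        int exitCode = finished ? process.exitValue() : -1;
        return new Result(!finished, exitCode, output.toString());
    }
}
```

For example, TimedCommand.run(Arrays.asList("sleep", "5"), 200) should return
with timedOut set after roughly 200 ms on a system that has a sleep binary,
instead of blocking for the full five seconds. Killing the direct child does
not help if a grandchild keeps the pipe open, so a real fix would also need to
handle the process tree.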

ContainerLocalizer JVM stacktrace:

{code}
"main" #1 prio=5 os_prio=0 tid=0x00007fd8ec019000 nid=0xc295 runnable [0x00007fd8f3956000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:219)
        at java.util.zip.ZipFile.<init>(ZipFile.java:149)
        at java.util.jar.JarFile.<init>(JarFile.java:166)
        at java.util.jar.JarFile.<init>(JarFile.java:103)
        at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
        at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
        at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
        at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
        at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:803)
        at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
        at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
        at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
        - locked <0x000000076ac75058> (a sun.misc.URLClassPath)
        at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
        - locked <0x000000076ac75058> (a sun.misc.URLClassPath)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        - locked <0x000000076ac7f960> (a java.lang.Object)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
{code}

NodeManager LocalizerRunner thread which is not interrupted:

{code}
"LocalizerRunner for container_e746_1508665985104_601806_01_000005" #3932753 prio=5 os_prio=0 tid=0x00007fb258d5f800 nid=0x11091 runnable [0x00007fb153946000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x0000000718502b80> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <0x0000000718502bd8> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        - locked <0x0000000718502bd8> (a java.io.InputStreamReader)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
        at org.apache.hadoop.util.Shell.run(Shell.java:848)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
{code}

NM log shows the LocalizerRunner is supposed to


