[ https://issues.apache.org/jira/browse/YARN-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496398#comment-15496398 ]
Eric Badger commented on YARN-5641:
-----------------------------------

[~jlowe] and I worked on this for some time yesterday, and killing the spawned untar shell process is proving to be very difficult. The localizer spawns the untar thread, which runs the untar command through a shell exec. Once the container is killed, the localizer is instructed to die the next time it heartbeats to the NM. Inside the 'die' codepath, the localizer interrupts all of its spawned threads using the cancel() method. However, the untar thread is stuck in file I/O, waiting to parse the result of the shell execution, and is uninterruptible. The untar thread won't get the InterruptedException until it is finished, so we cannot kill it or the untar shell exec before it completes.

We can have the localizer process wait for the untar thread to end via awaitTermination() (currently it only uses shutdownNow()), but that call won't return until untar finishes on its own, since shutdown() cannot interrupt the untar thread.

I tested this by replacing the untar shell command with a sleep command, so that there would be no chance of the untar actually finishing. The container was killed and the localizer was instructed to die after the subsequent NM heartbeat. It then attempted to shut down all of its threads, but the untar thread sat in readBytes instead of getting the InterruptedException.

Below is the stack trace of the untar thread just after the localizer calls shutdown(). The thread never gets the InterruptedException and sits in this stack trace until awaitTermination() hits its timeout and the localizer kills the JVM. Since we never catch the InterruptedException, we are unable to destroy the untar shell process, and it continues to run after the localizer and the untar thread are killed (it became owned by init).
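The blocking behavior described above can be reproduced outside of Hadoop with a small standalone sketch. This is not the localizer code: the class name PipeReadDemo, the timeouts, and the use of "sleep 30" as a stand-in for the untar command are all assumptions for illustration, and it presumes a Unix-like host with sleep on the PATH. It starts a child process, blocks a pool thread reading the child's output pipe, calls shutdownNow(), and then checks whether the thread stops. It then calls Process.destroy() from the controlling thread as one possible way to unblock the read.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of Hadoop.
final class PipeReadDemo {

    /** Returns {stoppedAfterInterrupt, stoppedAfterDestroy}. */
    static boolean[] run() throws Exception {
        AtomicReference<Process> child = new AtomicReference<>();
        CountDownLatch started = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        pool.submit(() -> {
            // Stand-in for the untar shell command (assumption: "sleep" exists).
            Process p = new ProcessBuilder("sleep", "30").start();
            child.set(p);
            started.countDown();
            // Blocks in a pipe read until the child closes its stdout,
            // similar to the read loop in the stack trace.
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                while (out.read() != -1) {
                    // discard output
                }
            }
            return null;
        });

        if (!started.await(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("child process did not start");
        }

        // Interrupt the worker; the blocking pipe read does not react to it.
        pool.shutdownNow();
        boolean stoppedAfterInterrupt =
            pool.awaitTermination(1, TimeUnit.SECONDS);

        // Killing the child closes the write end of the pipe, so the read
        // sees end-of-stream (or a closed stream) and the worker can exit.
        child.get().destroy();
        boolean stoppedAfterDestroy =
            pool.awaitTermination(5, TimeUnit.SECONDS);

        return new boolean[] {stoppedAfterInterrupt, stoppedAfterDestroy};
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println("worker stopped after interrupt: " + r[0]);
        System.out.println("worker stopped after destroy(): " + r[1]);
    }
}
```

On a Linux JVM the first value is expected to be false and the second true. The design point is that the thread requesting the shutdown needs its own reference to the child Process, because the worker never reaches a catch block where it could destroy the child itself.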
{noformat}
"ContainerLocalizer Downloader" #19 prio=5 os_prio=0 tid=0x00007f4315169800 nid=0x1530 runnable [0x00007f42f5217000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x000000076f4fca28> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <0x000000076f506cf8> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        - locked <0x000000076f506cf8> (a java.io.InputStreamReader)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.fs.FileUtil.unTarUsingTar(FileUtil.java:682)
        at org.apache.hadoop.fs.FileUtil.unTar(FileUtil.java:651)
        at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:283)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

> Localizer leaves behind tarballs after container is complete
> ------------------------------------------------------------
>
>                 Key: YARN-5641
>                 URL: https://issues.apache.org/jira/browse/YARN-5641
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>
> The localizer sometimes fails to clean up extracted tarballs leaving large
> footprints that persist on the nodes indefinitely.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)