[ https://issues.apache.org/jira/browse/YARN-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496398#comment-15496398 ]

Eric Badger commented on YARN-5641:
-----------------------------------

[~jlowe] and I worked on this for some time yesterday, and killing the spawned 
untar shell process is proving to be very difficult. The localizer spawns the 
untar thread, which runs the untar command through a shell exec. Once the 
container is killed, the localizer is instructed to die the next time it 
heartbeats to the NM. Inside the 'die' codepath, the localizer interrupts all 
of its spawned threads using the cancel() method. However, the untar thread is 
blocked in file I/O, waiting to parse the result of the shell execution, and 
that read is uninterruptible. The untar thread won't get the 
InterruptedException until the read finishes, so we cannot kill it or the 
untar shell exec before untar completes. We can have the localizer process 
wait for the untar thread to end via awaitTermination() (currently it only 
uses shutdownNow()), but that call won't return until untar finishes on its 
own, since interrupting the untar thread has no effect while it is blocked. 
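
The blocking behavior described above can be reproduced outside of YARN. The 
following is a minimal sketch, not the localizer code: it assumes a POSIX 
`sleep` binary as a stand-in for a long-running child, and the class and 
method names (BlockedReadDemo, readerStopsAfterInterrupt) are made up for 
illustration. A pool thread reads the child's stdout, the pool is interrupted 
with shutdownNow(), and the method reports whether the thread stopped within 
the timeout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class BlockedReadDemo {

    /**
     * Starts "sleep childSeconds", reads its stdout on a pool thread, then
     * interrupts the pool with shutdownNow(). Returns true if the reader
     * thread terminated within waitMillis of the interrupt.
     */
    static boolean readerStopsAfterInterrupt(int childSeconds, long waitMillis)
            throws Exception {
        // Hypothetical stand-in for the untar command: prints nothing, exits late.
        Process child =
            new ProcessBuilder("sleep", String.valueOf(childSeconds)).start();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(child.getInputStream()))) {
                // Blocks in FileInputStream.readBytes until the pipe reaches EOF.
                while (reader.readLine() != null) {
                    // discard output
                }
            } catch (IOException ignored) {
                // stream closed underneath us; nothing left to read
            }
        });
        Thread.sleep(200);      // give the reader time to enter the blocking read
        pool.shutdownNow();     // interrupts the worker thread
        boolean stopped = pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);

        // Clean up so the demo does not leave a child process or thread behind.
        child.destroy();
        child.waitFor();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return stopped;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader stopped after interrupt: "
            + readerStopsAfterInterrupt(5, 1000));
    }
}
```

On a stock JDK the interrupt only sets the thread's interrupt flag; the native 
read does not check it, so the wait is expected to time out while the child is 
still running.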

I tested this by replacing the untar shell command with a sleep command so that 
there would be no worry about the untar actually finishing. The container was 
killed and the localizer was instructed to die after the subsequent NM 
heartbeat. It then attempted to shut down all of its threads, but the untar 
thread sat in readBytes instead of getting the InterruptedException. Below is 
the stack trace of the untar thread just after the localizer calls shutdown(). 
It never gets the InterruptedException and sits in this stack trace until 
awaitTermination hits its timeout and the localizer kills the JVM. Since we 
never catch the InterruptedException, we are unable to destroy the untar shell 
process, and it continues to run after the localizer and untar thread are 
killed (it is reparented to init). 

{noformat}
"ContainerLocalizer Downloader" #19 prio=5 os_prio=0 tid=0x00007f4315169800 nid=0x1530 runnable [0x00007f42f5217000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:255)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x000000076f4fca28> (a java.lang.UNIXProcess$ProcessPipeInputStream)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        - locked <0x000000076f506cf8> (a java.io.InputStreamReader)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.read1(BufferedReader.java:212)
        at java.io.BufferedReader.read(BufferedReader.java:286)
        - locked <0x000000076f506cf8> (a java.io.InputStreamReader)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.fs.FileUtil.unTarUsingTar(FileUtil.java:682)
        at org.apache.hadoop.fs.FileUtil.unTar(FileUtil.java:651)
        at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:283)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
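
The trace shows the thread parked in a read on the child's stdout pipe. One 
way such a read can be unblocked, shown here only as a sketch and not as the 
fix for this issue, is for the cancelling thread to keep a reference to the 
Process and destroy it directly, so the pipe reaches EOF. This again assumes a 
POSIX `sleep` binary as the child; DestroyOnCancelDemo and 
readerStopsAfterDestroy are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class DestroyOnCancelDemo {

    /**
     * Starts "sleep childSeconds", reads its stdout on a pool thread, then
     * cancels by killing the child rather than relying on the interrupt.
     * Returns true if the reader thread terminated within waitMillis.
     */
    static boolean readerStopsAfterDestroy(int childSeconds, long waitMillis)
            throws Exception {
        // Hypothetical stand-in for the untar command.
        Process child =
            new ProcessBuilder("sleep", String.valueOf(childSeconds)).start();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(child.getInputStream()))) {
                // Returns null once the child's end of the pipe is closed.
                while (reader.readLine() != null) {
                    // discard output
                }
            } catch (IOException ignored) {
                // stream closed while reading; treat as end of output
            }
        });
        Thread.sleep(200);          // give the reader time to enter the blocking read
        pool.shutdownNow();         // still sent, but not relied upon
        child.destroyForcibly();    // killing the child closes its end of the pipe
        boolean stopped = pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        child.waitFor();            // reap the child before returning
        return stopped;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader stopped after destroy: "
            + readerStopsAfterDestroy(30, 5000));
    }
}
```

A caveat with this approach: if the command is launched through a shell 
wrapper, a grandchild that inherited the pipe can keep it open after the 
wrapper is killed, and the read will stay blocked until that grandchild exits.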

> Localizer leaves behind tarballs after container is complete
> ------------------------------------------------------------
>
>                 Key: YARN-5641
>                 URL: https://issues.apache.org/jira/browse/YARN-5641
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>
> The localizer sometimes fails to clean up extracted tarballs leaving large 
> footprints that persist on the nodes indefinitely. 



