[ https://issues.apache.org/jira/browse/TEZ-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204355#comment-14204355 ]

Rohini Palaniswamy commented on TEZ-1766:
-----------------------------------------

Steps to reproduce:
 ant clean test-tez -Dtest.output=true -logfile /tmp/pig-tez-full

Run "ps -ef | grep DAGAppMaster | less" or "ps -ef | grep container | less" at 
the end of the run.

Only the java process continues to run; the /bin/bash parent process that 
launched the java process has already terminated.

Tez 5.1
{code}
"Thread-1" prio=5 tid=0x00007fe9bd819000 nid=0xb3ab in Object.wait() 
[0x00000001190fd000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007c009f1c8> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at java.lang.Object.wait(Object.java:503)
        at 
org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1759)
        - locked <0x00000007c009f1c8> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}

Tez 5.2
{code}
"Thread-1" prio=5 tid=0x00007fa417843800 nid=0xe133 in Object.wait() 
[0x0000000155cb9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x000000010e0cf438> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at java.lang.Object.wait(Object.java:503)
        at 
org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1844)
        - locked <0x000000010e0cf438> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code} 
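For reference, the wait in those traces is the standard pattern of a shutdown hook blocking on a monitor until the main shutdown path signals completion. A minimal, self-contained sketch of that pattern (class, method, and message names here are mine, not Tez's):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownHookWaitDemo {

    /** Mirrors the DAGAppMasterShutdownHook pattern: a hook thread does a
     *  plain wait() on an AtomicBoolean monitor until shutdown completes. */
    static String runHookPattern() throws InterruptedException {
        final AtomicBoolean shutdownComplete = new AtomicBoolean(false);
        final StringBuilder result = new StringBuilder();

        Thread hook = new Thread(() -> {
            synchronized (shutdownComplete) {
                while (!shutdownComplete.get()) {
                    try {
                        // No timeout: if nothing ever calls notifyAll(),
                        // this blocks forever and SIGTERM cannot finish
                        // the JVM -- the hang seen in the stack traces.
                        shutdownComplete.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                result.append("hook released");
            }
        });
        hook.start();

        // Stand-in for the thread that finishes serviceStop(); when that
        // thread is stuck retrying unregister, this notify never happens.
        synchronized (shutdownComplete) {
            shutdownComplete.set(true);
            shutdownComplete.notifyAll();
        }
        hook.join();
        return result.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runHookPattern());
    }
}
```

In this demo the notify does arrive, so the hook completes; in the hang above, the notifying thread never finishes, so the hook waits forever.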

Before it went into the wait in the shutdown hook, it had been retrying to 
unregister itself for more than half an hour, but that MiniCluster had already 
been shut down.

{code}
"AMShutdownThread" daemon prio=5 tid=0x00007fa412b1c000 nid=0xdb03 waiting on 
condition [0x000000015ae7d000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at 
org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:154)
        at com.sun.proxy.$Proxy17.finishApplicationMaster(Unknown Source)
        at 
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:316)
        at 
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:157)
        - locked <0x000000010e382178> (a java.lang.Object)
        at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.serviceStop(YarnTaskSchedulerService.java:385)
        - locked <0x000000010e30a8c0> (a 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService)
        at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x000000010e30ab88> (a java.lang.Object)
        at 
org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskSchedulerEventHandler.java:386)
        at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x000000010e30a8b0> (a java.lang.Object)
        at 
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at 
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at 
org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1504)
        at 
org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1643)
        - locked <0x000000010e0cef78> (a org.apache.tez.dag.app.DAGAppMaster)
        at 
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x000000010e0cf208> (a java.lang.Object)
        at 
org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:698)
        at java.lang.Thread.run(Thread.java:722)
{code}
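One way to keep an unreachable RM from pinning the shutdown path indefinitely would be to bound the unregister call with a timeout. This is only a sketch of that idea, not Tez's actual behavior; the unregister stub below is a hypothetical stand-in, simulating an RM that never answers:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedUnregisterDemo {

    // Hypothetical stand-in for AMRMClient.unregisterApplicationMaster()
    // against an RM that never answers (the MiniCluster is already gone).
    static void unregisterApplicationMaster() throws InterruptedException {
        Thread.sleep(Long.MAX_VALUE);
    }

    static String unregisterWithTimeout(long timeoutMs)
            throws InterruptedException {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "unregister");
            t.setDaemon(true); // daemon: a timed-out call cannot pin the JVM
            return t;
        });
        Future<?> f = exec.submit(() -> {
            try {
                unregisterApplicationMaster();
            } catch (InterruptedException ignored) {
                // interrupted by cancel(true) below
            }
        });
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS); // cap the total wait
            return "unregistered";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        } catch (TimeoutException e) {
            f.cancel(true); // interrupt the stuck call and move on
            return "gave up after timeout";
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(unregisterWithTimeout(500));
    }
}
```

With a bound like this, the AMShutdownThread could give up on the dead MiniCluster, let serviceStop() finish, and release the shutdown hook instead of retrying indefinitely.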

kill -15 (SIGTERM) does not work and kill -9 (SIGKILL) is required, as the 
process is hung in the shutdown hook. Not sure if this can become an issue in a 
real cluster, as the NM will try to kill the container even if there was a 
maintenance window, and I hope it does SIGKILL when SIGTERM does not work. But 
in this case the parent process had already terminated, and I am not sure if 
that would be a problem.

Found this problem because, after running the unit tests a couple of times, my 
unit tests started failing with weird errors (YARN ClassNotFound errors, 
connection refused, unknown host, etc.) before giving a too-many-open-file-handles 
error. Then I noticed that a lot of Tez DAG AMs were lying around, each 
consuming 1G of memory, and the problems went away after I killed them.

> Running pig unit tests leaks few DAGAppMaster jvms
> --------------------------------------------------
>
>                 Key: TEZ-1766
>                 URL: https://issues.apache.org/jira/browse/TEZ-1766
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>
> I see around 3 to 4 org.apache.tez.dag.app.DAGAppMaster processes being 
> leaked at the end of each test-tez run in both 5.1 and 5.2 for different 
> tests in each run. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)