[
https://issues.apache.org/jira/browse/TEZ-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204355#comment-14204355
]
Rohini Palaniswamy commented on TEZ-1766:
-----------------------------------------
Steps to reproduce:
ant clean test-tez -Dtest.output=true -logfile /tmp/pig-tez-full
Run "ps -ef | grep DAGAppMaster | less" or "ps -ef | grep container | less" at
the end of the run.
Only the java process continues to run; the /bin/bash parent process that launched it has already terminated.
Tez 5.1
{code}
"Thread-1" prio=5 tid=0x00007fe9bd819000 nid=0xb3ab in Object.wait() [0x00000001190fd000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000007c009f1c8> (a java.util.concurrent.atomic.AtomicBoolean)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1759)
    - locked <0x00000007c009f1c8> (a java.util.concurrent.atomic.AtomicBoolean)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
Tez 5.2
{code}
"Thread-1" prio=5 tid=0x00007fa417843800 nid=0xe133 in Object.wait() [0x0000000155cb9000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x000000010e0cf438> (a java.util.concurrent.atomic.AtomicBoolean)
    at java.lang.Object.wait(Object.java:503)
    at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:1844)
    - locked <0x000000010e0cf438> (a java.util.concurrent.atomic.AtomicBoolean)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}
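For illustration only, here is a minimal hypothetical sketch (not the actual Tez source) of the pattern visible in both traces above: a JVM shutdown hook that waits on an object monitor with no timeout never returns if nothing ever notifies that monitor, so the JVM can never finish exiting after SIGTERM.
{code}
// Hypothetical sketch of the hang pattern seen in "Thread-1" above, not Tez code.
// The hook blocks in Object.wait() on an AtomicBoolean monitor; if notify() is
// never called, the hook never returns and the JVM stays alive until SIGKILL.
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownHookHang {
    private static final AtomicBoolean shutdownComplete = new AtomicBoolean(false);

    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            synchronized (shutdownComplete) {
                while (!shutdownComplete.get()) {
                    try {
                        // No timeout: this is where "Thread-1" sits in the stack traces.
                        shutdownComplete.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }, "Thread-1"));
        // main exits without ever setting the flag or notifying the monitor,
        // so the JVM hangs in the hook on a normal exit or SIGTERM.
    }
}
{code}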
Before it went into the wait in the shutdown hook, it had been retrying to unregister itself for more than half an hour, but that MiniCluster had already been shut down.
{code}
"AMShutdownThread" daemon prio=5 tid=0x00007fa412b1c000 nid=0xdb03 waiting on condition [0x000000015ae7d000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.util.ThreadUtil.sleepAtLeastIgnoreInterrupts(ThreadUtil.java:43)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:154)
    at com.sun.proxy.$Proxy17.finishApplicationMaster(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:316)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:157)
    - locked <0x000000010e382178> (a java.lang.Object)
    at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.serviceStop(YarnTaskSchedulerService.java:385)
    - locked <0x000000010e30a8c0> (a org.apache.tez.dag.app.rm.YarnTaskSchedulerService)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0x000000010e30ab88> (a java.lang.Object)
    at org.apache.tez.dag.app.rm.TaskSchedulerEventHandler.serviceStop(TaskSchedulerEventHandler.java:386)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0x000000010e30a8b0> (a java.lang.Object)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1504)
    at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:1643)
    - locked <0x000000010e0cef78> (a org.apache.tez.dag.app.DAGAppMaster)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0x000000010e0cf208> (a java.lang.Object)
    at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:698)
    at java.lang.Thread.run(Thread.java:722)
{code}
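For context, a minimal hypothetical sketch of the retry behaviour shown in the trace above: each unregister attempt fails because the MiniCluster's RM is gone, and the retry handler just sleeps and tries again, so serviceStop() (and the whole shutdown sequence behind it) never returns. The RmCall interface and the 30-second sleep are illustrative stand-ins, not Hadoop's actual retry policy.
{code}
// Hypothetical sketch, not Hadoop/Tez code: an unregister path with no bound on
// attempts or total time. If the RM never comes back, this loop never exits and
// the caller (the AM shutdown thread) stays blocked.
public class RetryForeverSketch {
    interface RmCall { void finishApplicationMaster() throws java.io.IOException; }

    static void unregisterWithRetries(RmCall rm) throws InterruptedException {
        while (true) {                       // no limit on attempts or elapsed time
            try {
                rm.finishApplicationMaster();
                return;                      // success: shutdown can proceed
            } catch (java.io.IOException e) {
                Thread.sleep(30_000);        // connection refused -> sleep and retry,
            }                                // indefinitely if the RM never returns
        }
    }
}
{code}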
kill -15 (SIGTERM) does not work and kill -9 (SIGKILL) is required, as the process is hung in the shutdown hook. Not sure if this can become an issue in a real cluster, since the NM will try to kill the container even if there was a maintenance window, and I hope it does a SIGKILL when SIGTERM does not work. But in this case the parent process had already terminated, and I am not sure if that would be a problem.
Found this problem because, after running the unit tests a couple of times, they started failing with weird errors (YARN ClassNotFound errors, connection refused, unknown host, etc.) before eventually hitting a "too many open file handles" error. Then I noticed that a lot of Tez DAG AMs were lying around, each consuming 1G of memory, and the problems went away after I killed them.
> Running pig unit tests leaks few DAGAppMaster jvms
> --------------------------------------------------
>
> Key: TEZ-1766
> URL: https://issues.apache.org/jira/browse/TEZ-1766
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rohini Palaniswamy
>
> I see around 3 to 4 org.apache.tez.dag.app.DAGAppMaster processes being
> leaked at the end of each test-tez run in both 5.1 and 5.2 for different
> tests in each run.