I posted about this issue on the Pig user mailing list as well, but thought I'd
try here too.
I have recently been testing a conversion of an existing Pig M/R application to
run on Tez. I've had to work around a few issues, but the performance
improvement is significant (roughly 25 minutes on M/R versus 5 minutes on Tez).
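To make the setup concrete, here is a heavily simplified, made-up sketch of the kind of Pig script involved (the real one is much larger; relation names and paths are invented), along with how I launch it on each engine:

-- Simplified, illustrative sketch only; not the actual application.
-- M/R run:  pig -x mapreduce -f my_app.pig
-- Tez run:  pig -x tez -f my_app.pig
raw = LOAD 'input_path' USING PigStorage('\t') AS (key:chararray, val:long);
agg = FOREACH (GROUP raw BY key) GENERATE group AS key, SUM(raw.val) AS total;
STORE agg INTO 'output_path';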
The problem I'm currently running into is that the application occasionally
hangs while processing a DAG. When that happens, I find the following in the
syslog for that DAG:
2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager]
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay
expired or is new. Releasing container,
containerId=container_e11_1437886552023_169758_01_000822,
containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0,
heldContainers=112, delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager]
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay
expired or is new. Releasing container,
containerId=container_e11_1437886552023_169758_01_000824,
containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0,
heldContainers=111, delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|:
Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw
exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager]
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay
expired or is new. Releasing container,
containerId=container_e11_1437886552023_169758_01_000811,
containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0,
heldContainers=110, delayedContainers=25, isNew=false
2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager]
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay
expired or is new. Releasing container,
containerId=container_e11_1437886552023_169758_01_000963,
containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0,
heldContainers=109, delayedContainers=24, isNew=false
2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager]
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay
expired or is new. Releasing container,
containerId=container_e11_1437886552023_169758_01_000881,
containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0,
heldContainers=108, delayedContainers=23, isNew=false
It continues logging some number of additional ‘Releasing container’ messages,
then soon stops logging entirely and stops submitting tasks. I also do not see
any errors or exceptions in the container logs on the host identified in the
IOException. Is there somewhere else I should look on that host to find an
indication of what's going wrong?
Any thoughts on what’s going on here? Is this a state from which an
application should be able to recover? We do not see the application hang when
running on M/R.
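In case it helps frame the question: as far as I can tell, the ‘Releasing container’ messages come from Tez's container-reuse / idle-release logic, and the idleTimeout=5000 in the log matches what I believe is the default for tez.am.container.idle.release-timeout-min.millis. These are the knobs I understand to control that behavior, settable from the Pig script; the values shown are just my understanding of the defaults, for illustration, not a tuning suggestion:

-- Tez container-reuse / idle-release settings, as set from a Pig script.
-- Values are my understanding of the defaults, shown for illustration only.
SET tez.am.container.reuse.enabled 'true';
SET tez.am.container.idle.release-timeout-min.millis '5000';
SET tez.am.container.idle.release-timeout-max.millis '10000';
SET tez.am.session.min.held-containers '0';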
One thing I tried as a workaround for the hang was enabling speculation, on the
theory that a task had failed to send a state-change event to the AM and that
speculation might allow that task to be retried. Unfortunately, when I do that
I intermittently run into TEZ-3148
<https://issues.apache.org/jira/browse/TEZ-3148>.
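For completeness, this is roughly how I enabled it (property name per the Tez configuration docs, set from the Pig script):

-- Enabling Tez speculative execution from the Pig script.
SET tez.am.speculation.enabled 'true';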
Any insights or workaround suggestions most appreciated!
-Kurt