I posted about this issue in the Pig user mailing list as well, but thought I’d 
try here too.

I have recently been testing the conversion of an existing Pig M/R application 
to run on Tez.  I’ve had to work around a few issues, but the performance 
improvement is significant (~25 minutes on M/R vs. ~5 minutes on Tez).
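
For context, the conversion itself is minimal; we run the same script and just 
switch the execution engine to Tez (the script name below is just a 
placeholder):

    pig -x tez myscript.pig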

The problem I’m currently running into is that, occasionally, the application 
hangs while processing a DAG.  When this happens, I find the following in the 
syslog for that DAG:

2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000822, 
containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=112, delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000824, 
containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=111, delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: 
Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw 
exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000811, 
containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=110, delayedContainers=25, isNew=false
2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000963, 
containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=109, delayedContainers=24, isNew=false
2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000881, 
containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=108, delayedContainers=23, isNew=false


It continues logging some more ‘Releasing container’ messages, then soon stops 
logging entirely and stops submitting tasks.  I also do not see any errors or 
exceptions in the container logs for the host identified in the IOException.  
Is there somewhere else I should look on that host for an indication of what’s 
going wrong?
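
In case it helps, the way I’ve been checking those container logs is to pull 
the full aggregated logs for the application (the id below is reconstructed 
from the container ids in the syslog above) and then search them for the 
containers that ran on that host:

    yarn logs -applicationId application_1437886552023_169758 > app_169758.log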

Any thoughts on what’s going on here?  Is this a state from which an 
application should be able to recover?  We do not see the application hang when 
running on M/R.
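
In case it matters: container reuse is on and we have not tuned the idle 
timeouts.  As far as I can tell, the ‘Releasing container’ messages above are 
governed by these Tez settings (the values below are what I believe the 
defaults to be, which at least matches the idleTimeout=5000 in the log):

    tez.am.container.reuse.enabled=true
    tez.am.container.idle.release-timeout-min.millis=5000
    tez.am.container.idle.release-timeout-max.millis=10000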

One thing I tried in order to work around the hang was enabling speculation, on 
the theory that some task failed to send a state-change event to the AM, and 
that speculation might allow that task to be retried.  Unfortunately, when I do 
that, I intermittently run into TEZ-3148 
<https://issues.apache.org/jira/browse/TEZ-3148>.
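
For that experiment I turned speculation on from the Pig script with the 
setting below; I believe this is the relevant property and that Pig passes it 
through to the Tez AM configuration, but please correct me if there is a better 
way:

    set tez.am.speculation.enabled 'true';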

Any insights or workaround suggestions most appreciated!

-Kurt



