Hi Hitesh,

Thanks for the quick response.

We are using Pig 0.15.0 and Tez 0.8.2.

I will certainly file that jira.  This application is a batch process which 
runs hourly, and only a fraction of its runs (~10%) hang.  If the pattern 
holds, it shouldn’t be too long before I’m able to get the stack 
dump.
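
For what it’s worth, I plan to grab the dump with jstack against the 
DAGAppMaster pid (falling back to kill -3 if needed).  Purely as an 
illustration of the kind of per-thread state I expect to attach, the same 
information can be produced from inside a JVM; the snippet below is nothing 
Tez-specific, and the class name is my own throwaway:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Throwaway illustration only: prints per-thread state and stacks for the
// *current* JVM, i.e. the same kind of information a jstack of the hung AM
// process would show.
public class ThreadDumpIllustration {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // true, true => include locked monitors and ownable synchronizers, so
        // BLOCKED/WAITING threads also show what they are waiting on.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info);  // ThreadInfo.toString() includes state and stack frames
        }
    }
}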

That host is a node manager.  The log messages I originally posted are from one 
of the syslog_dag files.  From the point of that excerpt until we kill the 
application, the only messages we see in that log are more ‘Releasing 
container’ messages, the last one being:

2016-03-21 16:39:03,900 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000974, 
containerExpiryTime=1458603543753, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=85, delayedContainers=0, isNew=false


After that, nothing more is written to that file.
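
As an aside, I’m assuming the idleTimeout=5000 in those messages is just the 
normal idle-container release governed by the tez.am.container.idle.release-timeout 
settings, so the releases themselves look expected to me.  If I wanted to hold 
containers longer, I’d expect to set something like the following (property 
names taken from the docs as I read them, so please correct me if I have them 
wrong):

import org.apache.tez.dag.api.TezConfiguration;

// My assumption, not verified against the source: these properties are what
// produce the idleTimeout=5000 in the YarnTaskSchedulerService log lines.
public class HoldContainersLonger {
    public static void main(String[] args) {
        TezConfiguration conf = new TezConfiguration();
        conf.setLong("tez.am.container.idle.release-timeout-min.millis", 30000L);
        conf.setLong("tez.am.container.idle.release-timeout-max.millis", 60000L);
        System.out.println(conf.get("tez.am.container.idle.release-timeout-min.millis"));
    }
}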

Unfortunately, I no longer have the stdout/stderr from a script invocation 
that resulted in a hang, but what we saw there was a never-ending loop of DAG 
progress messages: some number of completed tasks, some number of in-progress 
tasks, and neither number ever changing.  I’m unsure whether there were still 
pending tasks.  I’ll look at that the next time it happens.

Looking carefully through the node manager logs, it seems that particular host 
finished every task it started.  Stdout shows properly paired messages of the 
form...

2016-03-21 16:38:53 Starting to run new task attempt: 
attempt_1437886552023_169758_3_08_000011_0
2016-03-21 16:38:55 Completed running task attempt: 
attempt_1437886552023_169758_3_08_000011_0


. . . for every task.  Syslog looks like this until we kill the application:

Attempting to fetch new task for container 
container_e11_1437886552023_169758_01_000960
2016-03-21 16:38:53,586 [INFO] [TezChild] |task.ContainerReporter|: Got 
TaskUpdate for containerId= container_e11_1437886552023_169758_01_000960: 5794 
ms after starting to poll. TaskInfo: shouldDie: false, currentTaskAttemptId: 
attempt_1437886552023_169758_3_08_000011_0
2016-03-21 16:38:53,587 [INFO] [main] |common.TezUtilsInternal|: Redirecting 
log file based on addend: attempt_1437886552023_169758_3_08_000011_0
2016-03-21 16:38:55,078 [INFO] [TezChild] |task.ContainerReporter|: Attempting 
to fetch new task for container container_e11_1437886552023_169758_01_000960
2016-03-21 16:39:24,929 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
for 200ms before retrying getTask again. Got null now. Next getTask sleep 
message after 30000ms
2016-03-21 16:39:55,069 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
for 200ms before retrying getTask again. Got null now. Next getTask sleep 
message after 30000ms
2016-03-21 16:40:25,192 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
for 200ms before retrying getTask again. Got null now. Next getTask sleep 
message after 30000ms
2016-03-21 16:40:55,314 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
for 200ms before retrying getTask again. Got null now. Next getTask sleep 
message after 30000ms
2016-03-21 16:41:25,437 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
for 200ms before retrying getTask again. Got null now. Next getTask sleep 
message after 30000ms
. . . etc.
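
Just to spell out my (possibly wrong) reading of that syslog: the idle 
TezChild keeps polling the AM for a new task, keeps getting null back, sleeps 
200 ms between attempts, and only logs the ‘Sleeping for 200ms’ line every 
30 s, so from the node manager’s side the container looks healthy, just 
starved of work.  The sketch below is only a paraphrase of that behavior as I 
infer it from the log, not the actual Tez source; Umbilical and getTaskFromAM 
are names I made up:

// Paraphrase of what the idle container appears to be doing, inferred purely
// from the syslog above.  Not the real org.apache.tez.runtime code; the
// Umbilical interface and getTaskFromAM method are hypothetical stand-ins.
public class IdleContainerPollSketch {

    interface Umbilical {
        Object getTaskFromAM();   // stand-in for the container-to-AM task request
    }

    static Object pollForTask(Umbilical umbilical) throws InterruptedException {
        long lastLog = System.currentTimeMillis();
        while (true) {
            Object task = umbilical.getTaskFromAM();   // "Attempting to fetch new task for container ..."
            if (task != null) {
                return task;                           // a real container would run the task attempt here
            }
            if (System.currentTimeMillis() - lastLog >= 30_000L) {
                System.out.println("Sleeping for 200ms before retrying getTask again. Got null now.");
                lastLog = System.currentTimeMillis();
            }
            Thread.sleep(200L);                        // matches the 200 ms retry interval in the log
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy driver: return null a few times, then hand back a dummy "task" so the loop ends.
        final int[] calls = {0};
        Object task = pollForTask(() -> calls[0]++ < 5 ? null : new Object());
        System.out.println("Got a task after " + calls[0] + " polls: " + task);
    }
}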


Is there anything else I can provide for now?

Thanks,
Kurt



On 3/23/16, 11:43 AM, "Hitesh Shah" <[email protected]> wrote:

>Hello Kurt, 
>
>Can you file a jira with a stack dump for the ApplicationMaster process when 
>it is in this hung state and also include all the application master logs. 
>Also please mention what version of Pig and Tez you are running. 
>
>The main question would be whether the AM is really hung or does it look stuck 
>if it is waiting on a particular task ( or set of tasks ) to complete. “No 
>taskRequests” implies that the dag has no pending tasks to run which means all 
>the tasks it needs to run are either already running on assigned containers or 
>completed. Is "10.102.173.86 “ a node manager? If yes, are there are any tasks 
>running on it which seem to be stuck? We can probably figure some of this out 
>from the syslog_dag_* file. 
>
>thanks
>— Hitesh 
>
>On Mar 23, 2016, at 11:18 AM, Kurt Muehlner <[email protected]> wrote:
>
>> I posted about this issue in the Pig user mailing list as well, but thought 
>> I’d try here too.
>> 
>> I have recently been testing converting an existing Pig M/R application to 
>> run on Tez.  I’ve had to work around a few issues, but the performance 
>> improvement is significant (~ 25 minutes on M/R, 5 minutes on Tez).
>> 
>> Currently the problem I’m running into is that occasionally when processing 
>> a DAG the application hangs.  When this happens, I find the following in the 
>> syslog for that dag:
>> 
>> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>> delay expired or is new. Releasing container, 
>> containerId=container_e11_1437886552023_169758_01_000822, 
>> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
>> heldContainers=112, delayedContainers=27, isNew=false
>> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>> delay expired or is new. Releasing container, 
>> containerId=container_e11_1437886552023_169758_01_000824, 
>> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
>> heldContainers=111, delayedContainers=26, isNew=false
>> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] 
>> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 
>> 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
>> java.io.IOException: Connection reset by peer
>>        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>>        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>>        at 
>> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>>        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>>        at 
>> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>>        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
>> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>> delay expired or is new. Releasing container, 
>> containerId=container_e11_1437886552023_169758_01_000811, 
>> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
>> heldContainers=110, delayedContainers=25, isNew=false
>> 2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager] 
>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>> delay expired or is new. Releasing container, 
>> containerId=container_e11_1437886552023_169758_01_000963, 
>> containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0, 
>> heldContainers=109, delayedContainers=24, isNew=false
>> 2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager] 
>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>> delay expired or is new. Releasing container, 
>> containerId=container_e11_1437886552023_169758_01_000881, 
>> containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0, 
>> heldContainers=108, delayedContainers=23, isNew=false
>> 
>> 
>> It will continue logging some number more ‘Releasing container’ messages, 
>> and then soon stop all logging, and stop submitting tasks. I also do not see 
>> any errors or exceptions in the container logs for the host identified in 
>> the IOException.  Is there some other place I should look on that host to 
>> find an indication of what’s going wrong?
>> 
>> Any thoughts on what’s going on here?  Is this a state from which an 
>> application should be able to recover?  We do not see the application hang 
>> when running on M/R.
>> 
>> One thing I tried to work around the hang was to enable speculation, on the 
>> theory that some task failed to send some state change event to the AM, and 
>> that speculation might allow that task to be retried.  Unfortunately when I 
>> do that, I intermittently run into TEZ-3148 
>> <https://issues.apache.org/jira/browse/TEZ-3148>
>> 
>> Any insights or workaround suggestions most appreciated!
>> 
>> -Kurt
>> 
>> 
>> 
>> 
>
