Hi Hitesh.  I should be able to get that soon and put it into the JIRA.  There’s
some work I need to do before I can provide those logs.

I did find these interesting lines in the log on host 10.102.173.86:

2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] 
|retry.RetryInvocationHandler|: A failover has occurred since the start of this 
method invocation attempt.
2016-03-21 16:36:13,658 [WARN] [TezChild] |data.SchemaTupleBackend|: 
SchemaTupleBackend has already been initialized
2016-03-21 16:36:17,884 [WARN] [I/O Setup 2 Initialize: {scope-178}] 
|retry.RetryInvocationHandler|: A failover has occurred since the start of this 
method invocation attempt.
2016-03-21 16:36:17,901 [WARN] [TezChild] |data.SchemaTupleBackend|: 
SchemaTupleBackend has already been initialized


I hadn’t spotted those before.  I’ll see whether I can correlate instances of
the application hanging with these failovers.

-Kurt


On 3/23/16, 2:40 PM, "Hitesh Shah" <[email protected]> wrote:

>For any of the earlier hangs where the app was killed, would you mind attaching
>the full app logs obtained via “bin/yarn logs …” to the jira too? That could
>be something to look at until you get the stack dump.
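>
>For reference, the usual form is something like the following, assuming YARN
>log aggregation is enabled; <application_id> and app_logs.txt below are just
>placeholders:
>
>  bin/yarn logs -applicationId <application_id> > app_logs.txt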
>
>thanks
>— Hitesh
>
>On Mar 23, 2016, at 1:57 PM, Kurt Muehlner <[email protected]> wrote:
>
>> Hi Hitesh,
>> 
>> Thanks for the quick response.
>> 
>> We are using Pig 0.15.0 and Tez 0.8.2.
>> 
>> I will certainly file that jira.  This application is a batch process that
>> runs hourly, and it hangs on only a fraction of runs (~10%).  If previous
>> patterns continue, it shouldn’t be too long before I’m able to get the stack
>> dump.
>> 
>> That host is a node manager.  The log messages I originally posted are from
>> one of the syslog_dag files.  From the point where I posted until we kill the
>> application, the only messages we see in that log are more ‘Releasing
>> container’ messages, the last one being:
>> 
>> 2016-03-21 16:39:03,900 [INFO] [DelayedContainerManager] 
>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>> delay expired or is new. Releasing container, 
>> containerId=container_e11_1437886552023_169758_01_000974, 
>> containerExpiryTime=1458603543753, idleTimeout=5000, taskRequestsCount=0, 
>> heldContainers=85, delayedContainers=0, isNew=false
>> 
>> 
>> After that, that file is not appended to.
>> 
>> Unfortunately, I no longer have stdout/stderr from one of the script
>> invocations that resulted in a hang, but what we saw there was a never-ending
>> loop of DAG progress messages, with some number of completed tasks, some
>> number of in-progress tasks, and those numbers never changing.  I’m unsure
>> whether there are still pending tasks.  I’ll look at that the next time it
>> happens.
>> 
>> Looking carefully through the node manager logs, it seems that particular
>> host has finished every task it started.  Stdout shows properly paired
>> messages of the form...
>> 
>> 2016-03-21 16:38:53 Starting to run new task attempt: 
>> attempt_1437886552023_169758_3_08_000011_0
>> 2016-03-21 16:38:55 Completed running task attempt: 
>> attempt_1437886552023_169758_3_08_000011_0
>> 
>> 
>> . . . for every task.  Syslog looks like this until we kill the application:
>> 
>> Attempting to fetch new task for container 
>> container_e11_1437886552023_169758_01_000960
>> 2016-03-21 16:38:53,586 [INFO] [TezChild] |task.ContainerReporter|: Got 
>> TaskUpdate for containerId= container_e11_1437886552023_169758_01_000960: 
>> 5794 ms after starting to poll. TaskInfo: shouldDie: false, 
>> currentTaskAttemptId: attempt_1437886552023_169758_3_08_000011_0
>> 2016-03-21 16:38:53,587 [INFO] [main] |common.TezUtilsInternal|: Redirecting 
>> log file based on addend: attempt_1437886552023_169758_3_08_000011_0
>> 2016-03-21 16:38:55,078 [INFO] [TezChild] |task.ContainerReporter|: 
>> Attempting to fetch new task for container 
>> container_e11_1437886552023_169758_01_000960
>> 2016-03-21 16:39:24,929 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
>> for 200ms before retrying getTask again. Got null now. Next getTask sleep 
>> message after 30000ms
>> 2016-03-21 16:39:55,069 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
>> for 200ms before retrying getTask again. Got null now. Next getTask sleep 
>> message after 30000ms
>> 2016-03-21 16:40:25,192 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
>> for 200ms before retrying getTask again. Got null now. Next getTask sleep 
>> message after 30000ms
>> 2016-03-21 16:40:55,314 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
>> for 200ms before retrying getTask again. Got null now. Next getTask sleep 
>> message after 30000ms
>> 2016-03-21 16:41:25,437 [INFO] [TezChild] |task.ContainerReporter|: Sleeping 
>> for 200ms before retrying getTask again. Got null now. Next getTask sleep 
>> message after 30000ms
>> . . . etc.
>> 
>> 
>> Is there anything else I can provide for now?
>> 
>> Thanks,
>> Kurt
>> 
>> 
>> 
>> On 3/23/16, 11:43 AM, "Hitesh Shah" <[email protected]> wrote:
>> 
>>> Hello Kurt, 
>>> 
>>> Can you file a jira with a stack dump of the ApplicationMaster process
>>> when it is in this hung state, and also include all of the application
>>> master logs? Also, please mention which versions of Pig and Tez you are
>>> running.
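>>> 
>>> One common way to get that stack dump, assuming you have shell access to the
>>> node running the AM, is to run jstack against the AM’s pid (the pid and
>>> output filename below are placeholders):
>>> 
>>>   jstack <AM pid> > am_stack.txt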
>>> 
>>> The main question would be whether the AM is really hung, or whether it only
>>> looks stuck because it is waiting on a particular task (or set of tasks) to
>>> complete. “No taskRequests” implies that the dag has no pending tasks to
>>> run, which means all the tasks it needs to run are either already running on
>>> assigned containers or completed. Is “10.102.173.86” a node manager? If
>>> yes, are there any tasks running on it which seem to be stuck? We can
>>> probably figure some of this out from the syslog_dag_* file.
>>> 
>>> thanks
>>> — Hitesh 
>>> 
>>> On Mar 23, 2016, at 11:18 AM, Kurt Muehlner <[email protected]> wrote:
>>> 
>>>> I posted about this issue in the Pig user mailing list as well, but 
>>>> thought I’d try here too.
>>>> 
>>>> I have recently been testing the conversion of an existing Pig M/R
>>>> application to run on Tez.  I’ve had to work around a few issues, but the
>>>> performance improvement is significant (~25 minutes on M/R vs. ~5 minutes
>>>> on Tez).
>>>> 
>>>> The problem I’m currently running into is that the application occasionally
>>>> hangs while processing a DAG.  When this happens, I find the following in
>>>> the syslog for that dag:
>>>> 
>>>> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
>>>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>>>> delay expired or is new. Releasing container, 
>>>> containerId=container_e11_1437886552023_169758_01_000822, 
>>>> containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
>>>> heldContainers=112, delayedContainers=27, isNew=false
>>>> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
>>>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>>>> delay expired or is new. Releasing container, 
>>>> containerId=container_e11_1437886552023_169758_01_000824, 
>>>> containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
>>>> heldContainers=111, delayedContainers=26, isNew=false
>>>> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] 
>>>> |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 
>>>> 10.102.173.86 threw exception [java.io.IOException: Connection reset by 
>>>> peer]
>>>> java.io.IOException: Connection reset by peer
>>>>       at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>       at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>       at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>>>>       at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>>>>       at 
>>>> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>>>>       at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>>>>       at 
>>>> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>>>>       at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
>>>> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
>>>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>>>> delay expired or is new. Releasing container, 
>>>> containerId=container_e11_1437886552023_169758_01_000811, 
>>>> containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
>>>> heldContainers=110, delayedContainers=25, isNew=false
>>>> 2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager] 
>>>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>>>> delay expired or is new. Releasing container, 
>>>> containerId=container_e11_1437886552023_169758_01_000963, 
>>>> containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0, 
>>>> heldContainers=109, delayedContainers=24, isNew=false
>>>> 2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager] 
>>>> |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout 
>>>> delay expired or is new. Releasing container, 
>>>> containerId=container_e11_1437886552023_169758_01_000881, 
>>>> containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0, 
>>>> heldContainers=108, delayedContainers=23, isNew=false
>>>> 
>>>> 
>>>> It will continue logging a number of additional ‘Releasing container’
>>>> messages, then soon stop logging altogether and stop submitting tasks.  I
>>>> also do not see any errors or exceptions in the container logs for the host
>>>> identified in the IOException.  Is there some other place I should look on
>>>> that host to find an indication of what’s going wrong?
>>>> 
>>>> Any thoughts on what’s going on here?  Is this a state from which an 
>>>> application should be able to recover?  We do not see the application hang 
>>>> when running on M/R.
>>>> 
>>>> One thing I tried to work around the hang was to enable speculation, on the
>>>> theory that some task failed to send a state change event to the AM, and
>>>> that speculation might allow that task to be retried.  Unfortunately, when
>>>> I do that, I intermittently run into TEZ-3148 
>>>> <https://issues.apache.org/jira/browse/TEZ-3148>
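>>>> 
>>>> In case it matters, by speculation I mean the Tez AM speculation setting
>>>> (tez.am.speculation.enabled, assuming the standard property name), enabled
>>>> e.g. with a set directive in the Pig script:
>>>> 
>>>>   set tez.am.speculation.enabled 'true';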
>>>> 
>>>> Any insights or workaround suggestions most appreciated!
>>>> 
>>>> -Kurt
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>
