[ 
https://issues.apache.org/jira/browse/TEZ-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560312#comment-14560312
 ] 

Siddharth Seth commented on TEZ-2475:
-------------------------------------

My best guess here is a RuntimeException in the 
LocalContainerLauncher-SubTaskRunner thread while creating a TezChild instance. 
These exceptions aren't caught or logged anywhere. I'm assuming the trace and 
the logs on this jira are unrelated.
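
To illustrate how such a failure can disappear without a trace, here is a 
minimal sketch. It is not Tez code. It assumes the runner is handed to an 
ExecutorService via submit(), and createChild() is a hypothetical stand-in for 
the failing TezChild construction. With submit(), the exception is stored in 
the returned Future and is never printed unless someone calls get().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SwallowedExceptionDemo {

    // Hypothetical stand-in for a constructor call that fails at runtime.
    static Object createChild() {
        throw new IllegalStateException("simulated failure during child creation");
    }

    // Submits the failing task to a single-thread pool and returns the
    // exception captured by the Future, or null if the task succeeded.
    static Throwable runAndInspect() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Object> task = SwallowedExceptionDemo::createChild;
            Future<Object> future = pool.submit(task);
            try {
                future.get();
                return null;
            } catch (ExecutionException e) {
                // The original exception only surfaces here, wrapped by the Future.
                return e.getCause();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Only visible via Future.get(): " + runAndInspect());
    }
}
```

If nothing ever calls get() on the Future, the worker thread simply moves on 
and nothing is written to the log.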

This is the last message logged during TezChild creation:
{code}2015-05-26 13:10:23,128 WARN  [LocalContainerLauncher-SubTaskRunner] 
token.Token (Token.java:getClassForIdentifier(121)) - Cannot find class for 
token kind tez.job{code}

After this, the LocalTaskExecutionThread doesn't show up at all, which leads 
me to believe the failure happened during TezChild construction itself. If the 
previous container were still holding on to the thread (it is a single-thread 
pool), it would have generated log messages when it tried to fetch new work.

A patch to at least log exceptions when the sub-task-runner is about to die 
should be simple. That should help diagnose this further.
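
A rough sketch of what such logging could look like, assuming the runner is a 
Runnable submitted to a single-thread pool. LoggingRunnable and its 
failureLog callback are hypothetical names used for illustration only. The 
callback stands in for whatever logger the launcher actually uses, and this is 
not the actual patch.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class LoggingRunnable implements Runnable {
    private final Runnable delegate;
    // Stand-in for a real logger call such as LOG.error(...).
    private final Consumer<Throwable> failureLog;

    LoggingRunnable(Runnable delegate, Consumer<Throwable> failureLog) {
        this.delegate = delegate;
        this.failureLog = failureLog;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (Throwable t) {
            // Record the failure before the runner dies, then rethrow so the
            // behaviour seen by the executor is unchanged.
            failureLog.accept(t);
            throw t;
        }
    }

    // Runs the task on a single-thread pool and returns what was logged.
    static List<String> runAndCollect(Runnable task) {
        List<String> messages = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(new LoggingRunnable(task, t -> messages.add(t.toString())));
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return messages;
    }

    public static void main(String[] args) {
        System.out.println(runAndCollect(() -> {
            throw new IllegalStateException("simulated failure");
        }));
    }
}
```

Catching Throwable rather than Exception means Errors (for example a 
NoClassDefFoundError) would be logged as well, which seems worthwhile for a 
thread that otherwise dies silently.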

[~fs111] - is it possible to get instructions on how to reproduce this? A set 
of logs and a stack trace from the next time this happens would also help.

> Tez local mode hanging in big testsuite
> ---------------------------------------
>
>                 Key: TEZ-2475
>                 URL: https://issues.apache.org/jira/browse/TEZ-2475
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.6.1
>            Reporter: André Kelpe
>         Attachments: 2015-05-21_15-55-20_buildLog.log.gz
>
>
> we have a big test suite for lingual, our SQL layer for cascading. We are 
> trying very hard to make it work correctly on Tez, but I am stuck:
> The setup is a huge suite of SQL based tests (6000+), which are being 
> executed in order in local mode. At certain moments the whole process just 
> stops. Nothing gets executed any longer. This is not all the time, but quite 
> often. Note that it is not happening at the same line of code, more at 
> random, which makes it quite complex to debug.
> What I am seeing, is these kind of stacktraces in the middle of the run:
> 2015-05-21 16:07:42,413 ERROR [TaskHeartbeatThread] task.TezTaskRunner 
> (TezTaskRunner.java:reportError(333)) - TaskReporter reported error
>     java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2188)
>         at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:187)
>         at 
> org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> This looks like it could be related to the hang, but the hang is not 
> happening immediately afterwards, but some time later.
> I have gone through quite a few JIRAs and saw that there were problems with 
> locks and hanging threads before, which should be fixed, but it still happens.
> I have tried 0.6.1 and 0.7.0. Both show the same behaviour.
> This gist contains a thread dump of a hanging build: 
> https://gist.github.com/fs111/1ee44469bf5cc31e5a52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
