Dear Tez community,

a customer is running into following problem with a Tez-0.5 job.

On the client side it looks as follow:

- progress reporting shows that all tasks are complete:
        (DagRunner.java:148) - Progress on 'Import job (26234): Raw History 
ImportJob#import(Group by operation)#import(Group by operation) 
(b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING: 
- then there seems to be connectivity problems to the Application Master (AM):
        INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying connect to 
server: master/172.30.31.27:34275. Already tried 0 time(s); retry policy is 
        …
        INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying connect to 
server: master/172.30.31.27:34275. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- then the job appears to be failed:
        INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG ‘XXX' 
(application=application_1414525508751_5385) with status=FAILED


On the AM side it looks as follow:
- all tasks are completed:
        2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler] 
org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for vertex: 
vertex_1414525508751_5385_1_01
- our implementation of org.apache.tez.runtime.api.OutputCommitter starts its 
work
- the AM suddenly gets shut down:
        2014-11-14 14:31:42,169 INFO [DelayedContainerManager] 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests. 
Container's idle timeout delay expired or is new. Releasing container, 
containerId=container_1414525508751_5385_01_000056, 
containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=1, delayedContainers=0, isNew=false
        2014-11-14 14:31:42,170 INFO [DelayedContainerManager] 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused container: 
container_1414525508751_5385_01_000056
        2014-11-14 14:32:01,848 INFO [
        Thread
        -1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook 
invoked
        2014-11-14 14:32:01,848 INFO [
        Thread
        -1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a 
signal. Signaling TaskScheduler
        2014-11-14 14:32:01,848 INFO [
        Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: 
TaskScheduler notified that iSignalled was : true
        2014-11-14 14:32:01,849 INFO [Thread-1] 
org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
2014-11-14 14:32:01,849 INFO [
Thread
-1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping 
RecoveryService
2014-11-14 14:32:01,849 INFO [
Thread
-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream
2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread] 
org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take 
interrupted. Returning
2014-11-14 14:32:01,897 INFO [
Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Output 
Stream for
DAG dag_1414525508751_5385_1
2014-11-14 14:32:01,905 INFO [
Thread
-1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: 
Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread] 
org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: EventQueue 
take interrupted. Returning
2014-11-14 14:32:01,906 INFO [
Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop 
request to the NM for
ContainerId: container_1414525508751_5385_01_000056
2014-11-14 14:32:01,913 INFO [DelayedContainerManager] 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService: AllocatedContainerManager 
Thread
interrupted
2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler 
Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AMRM 
Callback Handler Thread,5,main] threw an Throwable, but we are shutting down, 
so ignoring this

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
        at 
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
Caused by: org.apache.tez.dag.api.TezUncheckedException: 
java.lang.InterruptedException
        at 
org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
        at 
org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
        at 
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
        at java.util.concurrent.FutureTask.get(FutureTask.java:187)
        at 
org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)

Any idea on what the problem could be ?
Is there a timeout for the OutputCommitter hits and the AM is killed by 
resource-manager ?
Does the client kills the application via resource-manager when it can’t reach 
the application master anymore ?
How big the chance is that the AM i unresponsive while its executing the 
OutputCommitter ?

Any help appreciated!
Johannes

Reply via email to