Dear Tez community, a customer is running into following problem with a Tez-0.5 job.
On the client side it looks as follow: - progress reporting shows that all tasks are complete: (DagRunner.java:148) - Progress on 'Import job (26234): Raw History ImportJob#import(Group by operation)#import(Group by operation) (b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING: - then there seems to be connectivity problems to the Application Master (AM): INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 0 time(s); retry policy is … INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) - then the job appears to be failed: INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG ‘XXX' (application=application_1414525508751_5385) with status=FAILED On the AM side it looks as follow: - all tasks are completed: 2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for vertex: vertex_1414525508751_5385_1_01 - our implementation of org.apache.tez.runtime.api.OutputCommitter starts its work - the AM suddenly gets shut down: 2014-11-14 14:31:42,169 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_1414525508751_5385_01_000056, containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0, heldContainers=1, delayedContainers=0, isNew=false 2014-11-14 14:31:42,170 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused container: container_1414525508751_5385_01_000056 2014-11-14 14:32:01,848 INFO [ Thread -1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook invoked 2014-11-14 14:32:01,848 INFO [ Thread -1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler 2014-11-14 14:32:01,848 INFO [ Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true 2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler 2014-11-14 14:32:01,849 INFO [ Thread -1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping RecoveryService 2014-11-14 14:32:01,849 INFO [ Thread -1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream 2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread] org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take interrupted. Returning 2014-11-14 14:32:01,897 INFO [ Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Output Stream for DAG dag_1414525508751_5385_1 2014-11-14 14:32:01,905 INFO [ Thread -1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: Stopping SimpleHistoryLoggingService, eventQueueBacklog=0 2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: EventQueue take interrupted. Returning 2014-11-14 14:32:01,906 INFO [ Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop request to the NM for ContainerId: container_1414525508751_5385_01_000056 2014-11-14 14:32:01,913 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted 2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AMRM Callback Handler Thread,5,main] threw an Throwable, but we are shutting down, so ignoring this org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313) Caused by: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106) at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852) at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309) Caused by: java.lang.InterruptedException at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400) at java.util.concurrent.FutureTask.get(FutureTask.java:187) at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104) Any idea on what the problem could be ? Is there a timeout for the OutputCommitter hits and the AM is killed by resource-manager ? Does the client kills the application via resource-manager when it can’t reach the application master anymore ? How big the chance is that the AM i unresponsive while its executing the OutputCommitter ? Any help appreciated! Johannes