Hey guys, have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!
App master log: ————————————— 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git <http://git-wip-us.apache.org/repos/asf/tez.git>, buildTime=2015-03-03T18:01:25Z ] 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317, … user specific code 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf) 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1 ————————————— DAG-log: -------------------------- 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf) 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file … user specific code 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration. Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby -------------------------- Latest line of the (single) task: -------------------------- ... 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput } 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output … user specific code 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby ————————————— The client log contains a shutdown exception: ————————————— INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852 INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0 INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ <http://xxx:8088/proxy/application_1455209830330_35852/>, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0 at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839) at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669) at org.apache.tez.client.TezClient.stop(TezClient.java:476) at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69) at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173) at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167) at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162) at datameer.com.google.common.io.Closer.close(Closer.java:214) at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147) at datameer.com.google.common.io.Closer.close(Closer.java:214) at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330) at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65) at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81) at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96) at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173) at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22) INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852 ————————————— The counter and the event shows that: - the correct amount of input records has been ingested - the expected amount of output records is missing (we have zero records produced which is reflected in both, the counters and the output itself) - for Tez the task and the DAG are both handled as successful The customers says that the symptoms of the job are always the same: - there is that standby warning - there is the problem with the shutdown - (and the data is missing) To me its still unclear if there are strange things going on with Tez or with our application logic. The same job is successful when executed with pure MapReduce. Is the standby thing something to worry about ? best Johannes
