So yes, I think HA is enabled. But this whole behaviour is not a one-shot thing, it happens repeatedly for this job. So my guess is that it is rather an unhappy configuration than the NameNode switching back and forth again and again!?
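If it is a client-side configuration issue rather than actual failovers, one usual suspect is a client that addresses one NameNode host directly instead of going through the HA nameservice. As a minimal sketch, the client-side settings involved would look roughly like this (the nameservice `ns1`, the NameNode IDs `nn1`/`nn2`, and the hostnames are placeholders, not taken from the logs above):

```xml
<!-- hdfs-site.xml on the client side; all names below are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.ns1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

With that in place, fs.defaultFS (core-site.xml) should point at hdfs://ns1 rather than at a concrete NameNode host; if it (or the deprecated fs.default.name that shows up in the logs) names a single host, the client will keep talking to that host even while it is in standby. `hdfs haadmin -getServiceState nn1` shows which node is currently active.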
Johannes

> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected]> wrote:
>
> The standby warning is because it tries to connect to the standby node. (HA should be enabled, I think.) Maybe this causes the missing output, could you attach the whole app log?
>
> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <[email protected]> wrote:
>
> Hey guys,
>
> I have a Tez job on a customer deployment which has very strange symptoms. Maybe you have some pointers for me what to look into!
>
> App master log:
> —————————————
> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>
> … user specific code
>
> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
> —————————————
>
> DAG-log:
> --------------------------
> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>
> … user specific code
>
> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration. Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> --------------------------
>
> Latest lines of the (single) task:
> --------------------------
> ...
> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>
> … user specific code
>
> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
> —————————————
>
> The client log contains a shutdown exception:
> —————————————
> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>
> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
> at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
> at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
> at org.apache.tez.client.TezClient.stop(TezClient.java:476)
> at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
> at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
> at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
> at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
> at datameer.com.google.common.io.Closer.close(Closer.java:214)
> at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
> at datameer.com.google.common.io.Closer.close(Closer.java:214)
> at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
> at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
> at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
> at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
> at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
> at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
> INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
> —————————————
>
> The counters and the events show that:
> - the correct amount of input records has been ingested
> - the expected amount of output records is missing (zero records were produced, which is reflected in both the counters and the output itself)
> - for Tez, the task and the DAG are both handled as successful
>
> The customer says that the symptoms of the job are always the same:
> - there is that standby warning
> - there is the problem with the shutdown
> - (and the data is missing)
>
> To me it's still unclear whether the strange things are going on in Tez or in our application logic.
> The same job is successful when executed with pure MapReduce.
> Is the standby thing something to worry about?
>
> best
> Johannes
>
>
> --
> Best Regards
>
> Jeff Zhang
