It looks like the issue is in MapChainVertex's InputInitializer. It time out after 3 minutes due to the HA issue. Do they implement a custom InputInitializer or using a custom InputFormat ? If they use a custom InputInitializer then it might be the user code issue.
2016-02-16 12:05:21,435 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] graphv2.ClusterTaskRuntime: Initializing cluster task configuration... 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN: 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration. Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000 *2016-02-16 12:05:21,732* WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby *2016-02-16 12:08:12*,515 WARN [AsyncDispatcher event handler] filesystem.FileSystemNirvana: No file-system provider configuration setup! 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby On Mon, Feb 29, 2016 at 8:14 PM, Johannes Zillmann <[email protected] > wrote: > Here you go: > > > Johannes > > On 26 Feb 2016, at 02:56, Jeff Zhang <[email protected]> wrote: > > Is it possible for you to get the whole app log from your customer ? > > On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann < > [email protected]> wrote: > >> Looks like there is the same warning in MR as well: >> >> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: >> Exception encountered while connecting to the server : >> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >> Operation category READ is not supported in state standby >> >> But in this case it doesn’t seem to have a bad effect… >> Johannes >> >> On 25 Feb 2016, at 10:46, Jeff Zhang <[email protected]> wrote: >> >> Might be, does only tez job affected ? What about other jobs like MR ? >> >> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann < >> [email protected]> wrote: >> >>> So yes, i think HA is enabled. >>> But this whole behaviour is not a one shot thing, its happen repeatedly >>> for this job. So my guess is its rather a unhappy configuration then the >>> name node switching again and again !? >>> >>> Johannes >>> >>> >>> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected]> wrote: >>> >>> The standby warning is because it try to connect to the standby node. >>> (HA should be enabled I think). Maybe this cause the output missing, could >>> you attach the whole app log ? >>> >>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann < >>> [email protected]> wrote: >>> >>>> Hey guys, >>>> >>>> have a Tez job on a customer deployment which has very strange >>>> symptoms. Maybe you have some pointers for me what to look into! >>>> >>>> App master log: >>>> ————————————— >>>> >>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster >>>> for application appattempt_1455209830330_35852_000001, versionInfo=[ >>>> component=tez-dag, version=0.6.0, >>>> revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, >>>> SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, >>>> buildTime=2015-03-03T18:01:25Z ] >>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: >>>> fs.default.name is deprecated. Instead, use fs.defaultFS >>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: >>>> dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode >>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load >>>> native-hadoop library for your platform... using builtin-java classes >>>> where applicable >>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client >>>> version with AM version, clientVersion=0.6.0, AMVersion=0.6.0 >>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered >>>> while connecting to the server : >>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >>>> Operation category READ is not supported in state standby >>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token >>>> to jobTokenSecretManager for application >>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.rm.container.AMContainerEventType for class >>>> org.apache.tez.dag.app.rm.container.AMContainerMap >>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.rm.node.AMNodeEventType for class >>>> org.apache.tez.dag.app.rm.node.AMNodeTracker >>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class >>>> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler >>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.DAGEventType for class >>>> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher >>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.VertexEventType for class >>>> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher >>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.TaskEventType for class >>>> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher >>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class >>>> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher >>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class >>>> org.apache.tez.dag.app.DAGAppMaster$4 >>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class >>>> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler >>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.rm.AMSchedulerEventType for class >>>> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler >>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering >>>> class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class >>>> org.apache.tez.dag.app.launcher.ContainerLauncherImpl >>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: >>>> blacklistDisablePercent is 33, blacklistingEnabled: true, >>>> maxTaskFailuresPerNode: 10 >>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper >>>> limit on the thread pool size is 500 >>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: >>>> Initializing HistoryEventHandler >>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log >>>> file location for SimpleHistoryLoggingService not specified, defaulting to >>>> containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001 >>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: >>>> Initializing SimpleHistoryLoggingService, >>>> logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, >>>> maxErrors=10 >>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing >>>> RecoveryService >>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: >>>> [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: >>>> applicationId=application_1455209830330_35852, >>>> appSubmitTime=1455642293317, launchTime=1455642316658 >>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: >>>> [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: >>>> appAttemptId=appattempt_1455209830330_35852_000001, >>>> appSubmitTime=1455642293317, >>>> >>>> >>>> … user specific code >>>> >>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] >>>> app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): >>>> ClaimLineExportToDB2#export(Disconnected record >>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf) >>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.output.compress is deprecated. Instead, >>>> use mapreduce.output.fileoutputformat.compress >>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: fs.default.name is deprecated. Instead, use >>>> fs.defaultFS >>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.map.tasks.speculative.execution is >>>> deprecated. Instead, use mapreduce.map.speculative >>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use >>>> mapreduce.job.reduces >>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.min.split.size is deprecated. Instead, >>>> use mapreduce.input.fileinputformat.split.minsize >>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.reduce.tasks.speculative.execution is >>>> deprecated. Instead, use mapreduce.reduce.speculative >>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.output.compression.type is deprecated. >>>> Instead, use mapreduce.output.fileoutputformat.compress.type >>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.max.split.size is deprecated. Instead, >>>> use mapreduce.input.fileinputformat.split.maxsize >>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] >>>> Configuration.deprecation: mapred.compress.map.output is deprecated. >>>> Instead, use mapreduce.map.output.compress >>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] >>>> app.DAGAppMaster: Generating DAG graphviz file, >>>> dagId=dag_1455209830330_35852_1, >>>> filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot >>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] >>>> common.TezUtilsInternal: Redirecting log file based on addend: >>>> dag_1455209830330_35852_1 >>>> >>>> ————————————— >>>> >>>> DAG-log: >>>> -------------------------- >>>> >>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] >>>> app.DAGAppMaster: Running DAG: Export job (349): >>>> ClaimLineExportToDB2#export(Disconnected record >>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf) >>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] >>>> impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history >>>> file >>>> >>>> … user specific code >>>> >>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job >>>> (349): ClaimLineExportToDB2#export(Disconnected record stream) >>>> (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root >>>> logger level for hadoop task to WARN: >>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] >>>> hdfs.ClientContext: Existing client context 'default' does not match >>>> requested configuration. Existing: shortCircuitStreamsCacheSize = 256, >>>> shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = >>>> 256, shortCircuitMmapCacheExpiryMs = 3600000, >>>> shortCircuitMmapCacheRetryTimeout = 300000, >>>> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, >>>> socketCacheExpiry = 3000, shortCircuitLocalReads = false, >>>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, >>>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: >>>> shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = >>>> 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = >>>> 3600000, shortCircuitMmapCacheRetryTimeout = 300000, >>>> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, >>>> socketCacheExpiry = 3000, shortCircuitLocalReads = true, >>>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, >>>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000 >>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: >>>> Exception encountered while connecting to the server : >>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >>>> Operation category READ is not supported in state standby >>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: >>>> Exception encountered while connecting to the server : >>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >>>> Operation category READ is not supported in state standby >>>> >>>> -------------------------- >>>> >>>> >>>> Latest line of the (single) task: >>>> -------------------------- >>>> >>>> ... >>>> >>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] >>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output using >>>> OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, >>>> outputClassName=datameer.plugin.tez.output.DoNothingOutput } >>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] >>>> runtime.LogicalIOProcessorRuntimeTask: Creating Input >>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] >>>> runtime.LogicalIOProcessorRuntimeTask: Creating Output >>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] >>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: >>>> output >>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] >>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: >>>> Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record >>>> stream) >>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] >>>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: >>>> INPUT, componentVertexName: Input:Export job (349): >>>> ClaimLineExportToDB2#export(Disconnected record stream) >>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] >>>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: >>>> OUTPUT, componentVertexName: output >>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] >>>> runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: >>>> output >>>> >>>> >>>> … user specific code >>>> >>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root >>>> logger level for hadoop task to WARN: >>>> >>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered >>>> while connecting to the server : >>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): >>>> Operation category READ is not supported in state standby >>>> >>>> ————————————— >>>> >>>> The client log contains a shutdown exception: >>>> ————————————— >>>> >>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - >>>> Shutting down Tez Session, sessionName=Export job (349): >>>> ClaimLineExportToDB2#export(Disconnected record >>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), >>>> applicationId=application_1455209830330_35852 >>>> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) >>>> - Application not running, applicationId=application_1455209830330_35852, >>>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, >>>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, >>>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, >>>> killedDAGs=0 >>>> >>>> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - >>>> Failed to shutdown Tez Session via proxy >>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, >>>> applicationId=application_1455209830330_35852, >>>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, >>>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, >>>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, >>>> killedDAGs=0 >>>> >>>> at >>>> org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839) >>>> at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669) >>>> at org.apache.tez.client.TezClient.stop(TezClient.java:476) >>>> at >>>> datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69) >>>> at >>>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173) >>>> at >>>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167) >>>> at >>>> datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162) >>>> at datameer.com.google.common.io.Closer.close(Closer.java:214) >>>> at >>>> datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147) >>>> at datameer.com.google.common.io.Closer.close(Closer.java:214) >>>> at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330) >>>> at >>>> datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65) >>>> at >>>> datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81) >>>> at >>>> datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97) >>>> at >>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) >>>> at >>>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96) >>>> at >>>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173) >>>> at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22) >>>> INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - >>>> Could not connect to AM, killing session via YARN, sessionName=Export job >>>> (349): ClaimLineExportToDB2#export(Disconnected record >>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), >>>> applicationId=application_1455209830330_35852 >>>> >>>> >>>> ————————————— >>>> >>>> >>>> The counter and the event shows that: >>>> - the correct amount of input records has been ingested >>>> - the expected amount of output records is missing (we have zero >>>> records produced which is reflected in both, the counters and the output >>>> itself) >>>> - for Tez the task and the DAG are both handled as successful >>>> >>>> The customers says that the symptoms of the job are always the same: >>>> - there is that standby warning >>>> - there is the problem with the shutdown >>>> - (and the data is missing) >>>> >>>> To me its still unclear if there are strange things going on with Tez >>>> or with our application logic. >>>> The same job is successful when executed with pure MapReduce. >>>> Is the standby thing something to worry about ? >>>> >>>> best >>>> Johannes >>>> >>> >>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >>> >>> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> >> >> > > > -- > Best Regards > > Jeff Zhang > > > > -- Best Regards Jeff Zhang
