Might be. Are only Tez jobs affected? What about other jobs like MR?
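Also worth ruling out a client-side HA misconfiguration: if the Configuration the job runs with points fs.defaultFS at a single NameNode host instead of the HA nameservice, requests go to that one node even when it is in standby. A rough sketch for dumping the HA-related settings the job actually sees (the class name is just illustrative and not from this thread; standard Hadoop 2.x HA keys assumed):

    // Rough sketch, not from this thread: print the HDFS HA client settings the job
    // actually runs with, to check that clients can fail over to the active NameNode.
    import org.apache.hadoop.conf.Configuration;

    public class PrintHaClientConfig {
        public static void main(String[] args) {
            // Loads core-site.xml / hdfs-site.xml from the classpath the job uses.
            Configuration conf = new Configuration();
            // Should name the HA nameservice (hdfs://<nameservice>), not a single host:port.
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            String nameservices = conf.get("dfs.nameservices");
            System.out.println("dfs.nameservices = " + nameservices);
            if (nameservices != null) {
                for (String ns : nameservices.split(",")) {
                    // With HA, a failover proxy provider is expected per nameservice; clients
                    // pointed at one fixed NameNode host can hit the standby and see StandbyException.
                    System.out.println("dfs.ha.namenodes." + ns + " = "
                            + conf.get("dfs.ha.namenodes." + ns));
                    System.out.println("dfs.client.failover.proxy.provider." + ns + " = "
                            + conf.get("dfs.client.failover.proxy.provider." + ns));
                }
            }
        }
    }

Note that the WARN by itself can also be benign: with failover configured, the client may try the standby first, log exactly this warning, and then retry against the active NameNode.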
On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <[email protected]> wrote:

> So yes, I think HA is enabled.
> But this whole behaviour is not a one-shot thing, it happens repeatedly for this job. So my guess is it's rather an unhappy configuration than the name node switching again and again!?
>
> Johannes
>
>
> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected]> wrote:
>
> The standby warning is because it tries to connect to the standby node (HA should be enabled, I think). Maybe this causes the missing output, could you attach the whole app log?
>
> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <[email protected]> wrote:
>
>> Hey guys,
>>
>> I have a Tez job on a customer deployment which shows very strange symptoms. Maybe you have some pointers for me on what to look into!
>>
>> App master log:
>> —————————————
>>
>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for application appattempt_1455209830330_35852_000001, versionInfo=[ component=tez-dag, version=0.6.0, revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2015-03-03T18:01:25Z ]
>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to jobTokenSecretManager for application
>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.container.AMContainerEventType for class org.apache.tez.dag.app.rm.container.AMContainerMap
>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.node.AMNodeEventType for class org.apache.tez.dag.app.rm.node.AMNodeTracker
>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.DAGEventType for class org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.VertexEventType for class org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.tez.dag.app.DAGAppMaster$4
>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.AMSchedulerEventType for class org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit on the thread pool size is 500
>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing HistoryEventHandler
>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file location for SimpleHistoryLoggingService not specified, defaulting to containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: Initializing SimpleHistoryLoggingService, logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001, maxErrors=10
>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing RecoveryService
>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, launchTime=1455642316658
>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: appAttemptId=appattempt_1455209830330_35852_000001, appSubmitTime=1455642293317,
>>
>>
>> … user specific code
>>
>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.output.compression.type is deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] common.TezUtilsInternal: Redirecting log file based on addend: dag_1455209830330_35852_1
>>
>> —————————————
>>
>> DAG-log:
>> --------------------------
>>
>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
>>
>> … user specific code
>>
>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream) (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] hdfs.ClientContext: Existing client context 'default' does not match requested configuration. Existing: shortCircuitStreamsCacheSize = 256, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>
>> --------------------------
>>
>>
>> Latest line of the (single) task:
>> --------------------------
>>
>> ...
>>
>> 2016-02-16 12:07:00,166 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>> 2016-02-16 12:07:00,168 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Creating Input
>> 2016-02-16 12:07:00,168 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Creating Output
>> 2016-02-16 12:07:00,175 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: output
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 0] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, componentVertexName: Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
>> 2016-02-16 12:07:00,176 INFO [Initializer 1] resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, componentVertexName: output
>> 2016-02-16 12:07:00,177 INFO [Initializer 1] runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output
>>
>>
>> … user specific code
>>
>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger level for hadoop task to WARN:
>>
>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
>>
>> —————————————
>>
>> The client log contains a shutdown exception:
>> —————————————
>>
>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting down Tez Session, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>
>> INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed to shutdown Tez Session via proxy
>> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
>>
>> at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>> at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>> at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>> at datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>> at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>> at datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>> at datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>> at datameer.com.google.common.io.Closer.close(Closer.java:214)
>> at datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>> at datameer.com.google.common.io.Closer.close(Closer.java:214)
>> at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>> at datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>> at datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>> at datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>> at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>> at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>> INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could not connect to AM, killing session via YARN, sessionName=Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), applicationId=application_1455209830330_35852
>>
>>
>> —————————————
>>
>>
>> The counters and the events show that:
>> - the correct number of input records has been ingested
>> - the expected output records are missing (zero records were produced, which is reflected in both the counters and the output itself)
>> - for Tez, the task and the DAG are both treated as successful
>>
>> The customer says that the symptoms of the job are always the same:
>> - there is that standby warning
>> - there is the problem with the shutdown
>> - (and the data is missing)
>>
>> To me it's still unclear whether there are strange things going on with Tez or with our application logic.
>> The same job is successful when executed with pure MapReduce.
>> Is the standby thing something to worry about?
>>
>> best
>> Johannes
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
--
Best Regards

Jeff Zhang
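For the zero-output symptom described above (inputs read, zero records written, DAG still SUCCEEDED), one defensive check on the submitting side is to read the DAG counters before committing the job rather than trusting the SUCCEEDED state alone. A minimal sketch against the Tez 0.6 client API (the class, the counter choice, and the failure policy are illustrative and not Datameer's actual code; which counters are meaningful depends on the Inputs/Outputs the DAG uses):

    // Minimal sketch: query DAG counters through the DAGClient returned by
    // TezClient#submitDAG and refuse to commit when a succeeded DAG wrote nothing.
    import java.util.EnumSet;

    import org.apache.tez.common.counters.TaskCounter;
    import org.apache.tez.common.counters.TezCounters;
    import org.apache.tez.dag.api.client.DAGClient;
    import org.apache.tez.dag.api.client.DAGStatus;
    import org.apache.tez.dag.api.client.StatusGetOpts;

    public class DagOutputSanityCheck {
        static void verify(DAGClient dagClient) throws Exception {
            // Ask for counters explicitly; they are not included by default.
            DAGStatus status = dagClient.getDAGStatus(EnumSet.of(StatusGetOpts.GET_COUNTERS));
            TezCounters counters = status.getDAGCounters();
            long recordsIn = counters.findCounter(TaskCounter.INPUT_RECORDS_PROCESSED).getValue();
            long recordsOut = counters.findCounter(TaskCounter.OUTPUT_RECORDS).getValue();
            if (recordsIn > 0 && recordsOut == 0) {
                // Matches the symptom above: data went in, nothing came out, DAG still SUCCEEDED.
                throw new IllegalStateException(
                        "DAG " + status.getState() + " but produced no output records");
            }
        }
    }

The SessionNotRunning on shutdown, on the other hand, looks harmless here: the AM had already finished (yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED), so TezClient.stop() cannot reach it through the proxy and, as the last client log line shows, falls back to killing the session via YARN.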
