Is it possible for you to get the whole app log from your customer ?

On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <[email protected]
> wrote:

> Looks like there is the same warning in MR as well:
>
> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception
> encountered while connecting to the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
>
> But in this case it doesn’t seem to have a bad effect…
> Johannes
>
> On 25 Feb 2016, at 10:46, Jeff Zhang <[email protected]> wrote:
>
> Might be, does only tez job affected ? What about other jobs like MR ?
>
> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann <
> [email protected]> wrote:
>
>> So yes, i think HA is enabled.
>> But this whole behaviour is not a one shot thing, its happen repeatedly
>> for this job. So my guess is its rather a unhappy configuration then the
>> name node switching again and again !?
>>
>> Johannes
>>
>>
>> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected]> wrote:
>>
>> The standby warning is because it try to connect to the standby node. (HA
>> should be enabled I think).  Maybe this cause the output missing, could you
>> attach the whole app log ?
>>
>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann <
>> [email protected]> wrote:
>>
>>> Hey guys,
>>>
>>> have a Tez job on a customer deployment which has very strange symptoms.
>>> Maybe you have some pointers for me what to look into!
>>>
>>> App master log:
>>> —————————————
>>>
>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster 
>>> for application appattempt_1455209830330_35852_000001, versionInfo=[ 
>>> component=tez-dag, version=0.6.0, 
>>> revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, 
>>> SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, 
>>> buildTime=2015-03-03T18:01:25Z ]
>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: 
>>> fs.default.name is deprecated. Instead, use fs.defaultFS
>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: 
>>> dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load 
>>> native-hadoop library for your platform... using builtin-java classes where 
>>> applicable
>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client 
>>> version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while 
>>> connecting to the server : 
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>  Operation category READ is not supported in state standby
>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token 
>>> to jobTokenSecretManager for application
>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.rm.container.AMContainerEventType for class 
>>> org.apache.tez.dag.app.rm.container.AMContainerMap
>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.rm.node.AMNodeEventType for class 
>>> org.apache.tez.dag.app.rm.node.AMNodeTracker
>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class 
>>> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.DAGEventType for class 
>>> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.VertexEventType for class 
>>> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.TaskEventType for class 
>>> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class 
>>> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
>>> org.apache.tez.dag.app.DAGAppMaster$4
>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
>>> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.rm.AMSchedulerEventType for class 
>>> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering 
>>> class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class 
>>> org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: 
>>> blacklistDisablePercent is 33, blacklistingEnabled: true, 
>>> maxTaskFailuresPerNode: 10
>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper 
>>> limit on the thread pool size is 500
>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: 
>>> Initializing HistoryEventHandler
>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log 
>>> file location for SimpleHistoryLoggingService not specified, defaulting to 
>>> containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: 
>>> Initializing SimpleHistoryLoggingService, 
>>> logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001,
>>>  maxErrors=10
>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing 
>>> RecoveryService
>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: 
>>> [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: 
>>> applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, 
>>> launchTime=1455642316658
>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: 
>>> [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: 
>>> appAttemptId=appattempt_1455209830330_35852_000001, 
>>> appSubmitTime=1455642293317,
>>>
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] 
>>> app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): 
>>> ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
>>> record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.output.compress is deprecated. Instead, 
>>> use mapreduce.output.fileoutputformat.compress
>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: fs.default.name is deprecated. Instead, use 
>>> fs.defaultFS
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.map.tasks.speculative.execution is 
>>> deprecated. Instead, use mapreduce.map.speculative
>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use 
>>> mapreduce.job.reduces
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.min.split.size is deprecated. Instead, 
>>> use mapreduce.input.fileinputformat.split.minsize
>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.reduce.tasks.speculative.execution is 
>>> deprecated. Instead, use mapreduce.reduce.speculative
>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.output.compression.type is deprecated. 
>>> Instead, use mapreduce.output.fileoutputformat.compress.type
>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.max.split.size is deprecated. Instead, 
>>> use mapreduce.input.fileinputformat.split.maxsize
>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] 
>>> Configuration.deprecation: mapred.compress.map.output is deprecated. 
>>> Instead, use mapreduce.map.output.compress
>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] 
>>> app.DAGAppMaster: Generating DAG graphviz file, 
>>> dagId=dag_1455209830330_35852_1, 
>>> filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
>>> common.TezUtilsInternal: Redirecting log file based on addend: 
>>> dag_1455209830330_35852_1
>>>
>>> —————————————
>>>
>>> DAG-log:
>>> --------------------------
>>>
>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
>>> app.DAGAppMaster: Running DAG: Export job (349): 
>>> ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
>>> record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] 
>>> impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history 
>>> file
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job 
>>> (349): ClaimLineExportToDB2#export(Disconnected record stream) 
>>> (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root 
>>> logger level for hadoop task to WARN:
>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] 
>>> hdfs.ClientContext: Existing client context 'default' does not match 
>>> requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, 
>>> shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, 
>>> shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout 
>>> = 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity 
>>> = 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, 
>>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
>>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: 
>>> shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 
>>> 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 
>>> 3600000, shortCircuitMmapCacheRetryTimeout = 300000, 
>>> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, 
>>> socketCacheExpiry = 3000, shortCircuitLocalReads = true, 
>>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
>>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: 
>>> Exception encountered while connecting to the server : 
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>  Operation category READ is not supported in state standby
>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: 
>>> Exception encountered while connecting to the server : 
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>  Operation category READ is not supported in state standby
>>>
>>> --------------------------
>>>
>>>
>>> Latest line of the (single) task:
>>> --------------------------
>>>
>>> ...
>>>
>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] 
>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output using 
>>> OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, 
>>> outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] 
>>> runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] 
>>> runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] 
>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: 
>>> output
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] 
>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: 
>>> Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record 
>>> stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] 
>>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: 
>>> INPUT, componentVertexName: Input:Export job (349): 
>>> ClaimLineExportToDB2#export(Disconnected record stream)
>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] 
>>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: 
>>> OUTPUT, componentVertexName: output
>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] 
>>> runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: 
>>> output
>>>
>>>
>>> … user specific code
>>>
>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root
>>> logger level for hadoop task to WARN:
>>>
>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered 
>>> while connecting to the server : 
>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>  Operation category READ is not supported in state standby
>>>
>>> —————————————
>>>
>>> The client log contains a shutdown exception:
>>> —————————————
>>>
>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - 
>>> Shutting down Tez Session, sessionName=Export job (349): 
>>> ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
>>> record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
>>> applicationId=application_1455209830330_35852
>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) 
>>> - Application not running, applicationId=application_1455209830330_35852, 
>>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
>>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, 
>>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, 
>>> killedDAGs=0
>>>
>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - 
>>> Failed to shutdown Tez Session via proxy
>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, 
>>> applicationId=application_1455209830330_35852, 
>>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
>>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, 
>>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, 
>>> killedDAGs=0
>>>
>>>     at 
>>> org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>>     at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>>     at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>>     at 
>>> datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>>     at 
>>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>>     at 
>>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>>     at 
>>> datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>>     at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>     at 
>>> datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>>     at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>     at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>>     at 
>>> datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>>     at 
>>> datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>>     at 
>>> datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>>     at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at 
>>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>>     at 
>>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>>     at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - 
>>> Could not connect to AM, killing session via YARN, sessionName=Export job 
>>> (349): ClaimLineExportToDB2#export(Disconnected record 
>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
>>> applicationId=application_1455209830330_35852
>>>
>>>
>>> —————————————
>>>
>>>
>>> The counter and the event shows that:
>>> - the correct amount of input records has been ingested
>>> - the expected amount of output records is missing (we have zero records
>>> produced which is reflected in both, the counters and the output itself)
>>> - for Tez the task and the DAG are both handled as successful
>>>
>>> The customers says that the symptoms of the job are always the same:
>>> - there is that standby warning
>>> - there is the problem with the shutdown
>>> - (and the data is missing)
>>>
>>> To me its still unclear if there are strange things going on with Tez or
>>> with our application logic.
>>> The same job is successful when executed with pure MapReduce.
>>> Is the standby thing something to worry about ?
>>>
>>> best
>>> Johannes
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>


-- 
Best Regards

Jeff Zhang

Reply via email to