Hey guys,

I have a Tez job on a customer deployment that shows very strange symptoms. Maybe 
you have some pointers on what to look into!

App master log:
—————————————
2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster for 
application appattempt_1455209830330_35852_000001, versionInfo=[ 
component=tez-dag, version=0.6.0, 
revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, 
SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, 
buildTime=2015-03-03T18:01:25Z ]
2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: fs.default.name 
is deprecated. Instead, use fs.defaultFS
2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: dfs.umaskmode is 
deprecated. Instead, use fs.permissions.umask-mode
2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client version 
with AM version, clientVersion=0.6.0, AMVersion=0.6.0
2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered while 
connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category READ is not supported in state standby
2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token to 
jobTokenSecretManager for application
2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.rm.container.AMContainerEventType for class 
org.apache.tez.dag.app.rm.container.AMContainerMap
2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.rm.node.AMNodeEventType for class 
org.apache.tez.dag.app.rm.node.AMNodeTracker
2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class 
org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.DAGEventType for class 
org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.VertexEventType for class 
org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.TaskEventType for class 
org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class 
org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
org.apache.tez.dag.app.DAGAppMaster$4
2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.rm.AMSchedulerEventType for class 
org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering class 
org.apache.tez.dag.app.rm.NMCommunicatorEventType for class 
org.apache.tez.dag.app.launcher.ContainerLauncherImpl
2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: blacklistDisablePercent 
is 33, blacklistingEnabled: true, maxTaskFailuresPerNode: 10
2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper limit 
on the thread pool size is 500
2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: Initializing 
HistoryEventHandler
2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log file 
location for SimpleHistoryLoggingService not specified, defaulting to 
containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: 
Initializing SimpleHistoryLoggingService, 
logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001,
 maxErrors=10
2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing 
RecoveryService
2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: 
[HISTORY][DAG:N/A][Event:APP_LAUNCHED]: 
applicationId=application_1455209830330_35852, appSubmitTime=1455642293317, 
launchTime=1455642316658
2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: 
[HISTORY][DAG:N/A][Event:AM_LAUNCHED]: 
appAttemptId=appattempt_1455209830330_35852_000001, 
appSubmitTime=1455642293317, 

… user specific code

2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: 
Starting DAG submitted via RPC: Export job (349): 
ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.output.compress is deprecated. Instead, use 
mapreduce.output.fileoutputformat.compress
2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: fs.default.name is deprecated. Instead, use 
fs.defaultFS
2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.map.tasks.speculative.execution is 
deprecated. Instead, use mapreduce.map.speculative
2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use 
mapreduce.job.reduces
2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.minsize
2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.reduce.tasks.speculative.execution is 
deprecated. Instead, use mapreduce.reduce.speculative
2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.output.compression.type is deprecated. 
Instead, use mapreduce.output.fileoutputformat.compress.type
2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use 
mapreduce.input.fileinputformat.split.maxsize
2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] 
Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, 
use mapreduce.map.output.compress
2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: 
Generating DAG graphviz file, dagId=dag_1455209830330_35852_1, 
filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
common.TezUtilsInternal: Redirecting log file based on addend: 
dag_1455209830330_35852_1
—————————————

DAG-log:
--------------------------
2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] app.DAGAppMaster: 
Running DAG: Export job (349): ClaimLineExportToDB2#export(Disconnected record 
stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] 
impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history file
… user specific code

2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job 
(349): ClaimLineExportToDB2#export(Disconnected record stream) 
(2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root 
logger level for hadoop task to WARN: 
2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] 
hdfs.ClientContext: Existing client context 'default' does not match requested 
configuration.  Existing: shortCircuitStreamsCacheSize = 256, 
shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, 
shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 
300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, 
socketCacheExpiry = 3000, shortCircuitLocalReads = false, 
useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: 
shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 300000, 
shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 3600000, 
shortCircuitMmapCacheRetryTimeout = 300000, shortCircuitCacheStaleThresholdMs = 
1800000, socketCacheCapacity = 16, socketCacheExpiry = 3000, 
shortCircuitLocalReads = true, useLegacyBlockReaderLocal = false, 
domainSocketDataTraffic = false, 
shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: 
Exception encountered while connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category READ is not supported in state standby
2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: 
Exception encountered while connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category READ is not supported in state standby
--------------------------


Last lines of the (single) task log:
--------------------------
...
2016-02-16 12:07:00,166 INFO [Initializer 1] 
runtime.LogicalIOProcessorRuntimeTask: Initializing Output using OutputSpec: { 
destinationVertexName=output, physicalEdgeCount=0, 
outputClassName=datameer.plugin.tez.output.DoNothingOutput }
2016-02-16 12:07:00,168 INFO [Initializer 0] 
runtime.LogicalIOProcessorRuntimeTask: Creating Input
2016-02-16 12:07:00,168 INFO [Initializer 1] 
runtime.LogicalIOProcessorRuntimeTask: Creating Output
2016-02-16 12:07:00,175 INFO [Initializer 1] 
runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: 
output
2016-02-16 12:07:00,176 INFO [Initializer 0] 
runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: 
Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record stream)
2016-02-16 12:07:00,176 INFO [Initializer 0] 
resources.MemoryDistributor$RequestorInfo: Received request: 0, type: INPUT, 
componentVertexName: Input:Export job (349): 
ClaimLineExportToDB2#export(Disconnected record stream)
2016-02-16 12:07:00,176 INFO [Initializer 1] 
resources.MemoryDistributor$RequestorInfo: Received request: 0, type: OUTPUT, 
componentVertexName: output
2016-02-16 12:07:00,177 INFO [Initializer 1] 
runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: output

… user specific code

2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root logger 
level for hadoop task to WARN: 
2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered while 
connecting to the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category READ is not supported in state standby
—————————————

The client log contains a shutdown exception:
—————————————
INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - Shutting 
down Tez Session, sessionName=Export job (349): 
ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
applicationId=application_1455209830330_35852
 INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) - 
Application not running, applicationId=application_1455209830330_35852, 
yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session 
stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0

 INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - Failed 
to shutdown Tez Session via proxy
org.apache.tez.dag.api.SessionNotRunning: Application not running, 
applicationId=application_1455209830330_35852, yarnApplicationState=FINISHED, 
finalApplicationStatus=SUCCEEDED, 
trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/, diagnostics=Session 
stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0

        at 
org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
        at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
        at org.apache.tez.client.TezClient.stop(TezClient.java:476)
        at 
datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
        at 
datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
        at 
datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
        at 
datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
        at datameer.com.google.common.io.Closer.close(Closer.java:214)
        at 
datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
        at datameer.com.google.common.io.Closer.close(Closer.java:214)
        at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
        at 
datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
        at 
datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
        at 
datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at 
datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
        at 
datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
        at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
 INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - Could 
not connect to AM, killing session via YARN, sessionName=Export job (349): 
ClaimLineExportToDB2#export(Disconnected record stream)#export(Disconnected 
record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
applicationId=application_1455209830330_35852

—————————————
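
For context, the close path in that stack trace boils down to a plain 
TezClient#stop() call. A simplified sketch of what TezSessionImpl#close does 
(leaving out the pooling and bookkeeping around it):

import java.io.IOException;

import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.TezException;

// Simplified sketch of the session close path (the real TezSessionImpl wraps
// more pooling and bookkeeping than shown here).
public class TezSessionCloseSketch {

    private final TezClient tezClient;

    public TezSessionCloseSketch(TezClient tezClient) {
        this.tezClient = tezClient;
    }

    public void close() throws TezException, IOException {
        // TezClient#stop first tries to reach the AM through its proxy; since the
        // AM is already FINISHED at this point (see the log above), it logs the
        // SessionNotRunning and then falls back to killing the application via YARN.
        tezClient.stop();
    }
}

Judging by the TezClient.java:484/489 lines above, the SessionNotRunning is 
caught and logged inside stop() itself before the YARN kill, so it never 
reaches our code.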


The counters and the events show that:
- the correct number of input records has been ingested
- the expected output records are missing (zero records were produced, which is 
reflected in both the counters and the output itself; see the counter sketch 
right after this list)
- Tez treats both the task and the DAG as successful
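
This is roughly how those numbers can be read through the standard Tez 
DAGClient API (a minimal sketch with illustrative names, not our exact code; 
dagClient would be the DAGClient returned by TezClient#submitDAG):

import java.io.IOException;
import java.util.EnumSet;

import org.apache.tez.common.counters.TaskCounter;
import org.apache.tez.common.counters.TezCounters;
import org.apache.tez.dag.api.TezException;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.dag.api.client.StatusGetOpts;

// Minimal sketch: fetch the final DAG status including counters and print the
// record counts. Names are illustrative, not our actual classes.
public class DumpDagCounters {

    public static void dump(DAGClient dagClient) throws IOException, TezException {
        DAGStatus status = dagClient.getDAGStatus(EnumSet.of(StatusGetOpts.GET_COUNTERS));
        TezCounters counters = status.getDAGCounters();

        long inputRecords = counters.findCounter(TaskCounter.INPUT_RECORDS_PROCESSED).getValue();
        long outputRecords = counters.findCounter(TaskCounter.OUTPUT_RECORDS).getValue();

        // What we observe for this job: state is SUCCEEDED, inputRecords matches the
        // source data, but outputRecords is 0.
        System.out.println(status.getState() + ", input=" + inputRecords + ", output=" + outputRecords);
    }
}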

The customer says that the symptoms of the job are always the same:
- there is that standby warning
- there is the problem with the shutdown
- (and the data is missing)

To me it is still unclear whether something strange is going on in Tez or in 
our application logic.
The same job succeeds when executed with pure MapReduce.
Is the standby warning something to worry about?

best
Johannes
