Re: Strange behaviour of session DAG

Johannes Zillmann Mon, 29 Feb 2016 06:09:44 -0800

@Jeff, yes its a custom input-initialiser with custom format getting the 
file-system via FileSystem.get(uri, con).


thanks for the insight! Looking into this!
Johannes

> On 29 Feb 2016, at 13:26, Jeff Zhang <[email protected]> wrote:
> 
> It looks like the issue is in MapChainVertex's InputInitializer.  It time out 
> after 3 minutes due to the HA issue.  Do they implement a custom 
> InputInitializer or using a custom InputFormat ? If they use a custom 
> InputInitializer then it might be the user code issue. 
> 
> 
> 2016-02-16 12:05:21,435 INFO [InputInitializer [MapChainVertex:Export job 
> (349): ClaimLineExportToDB2#export(Disconnected record stream) 
> (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] graphv2.ClusterTaskRuntime: 
> Initializing cluster task configuration...
> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job 
> (349): ClaimLineExportToDB2#export(Disconnected record stream) 
> (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root 
> logger level for hadoop task to WARN:
> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] 
> hdfs.ClientContext: Existing client context 'default' does not match 
> requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, 
> shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 256, 
> shortCircuitMmapCacheExpiryMs = 3600000, shortCircuitMmapCacheRetryTimeout = 
> 300000, shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 
> 16, socketCacheExpiry = 3000, shortCircuitLocalReads = false, 
> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: 
> shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 
> 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 
> 3600000, shortCircuitMmapCacheRetryTimeout = 300000, 
> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, 
> socketCacheExpiry = 3000, shortCircuitLocalReads = true, 
> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: 
> Exception encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
> 2016-02-16 12:08:12,515 WARN [AsyncDispatcher event handler] 
> filesystem.FileSystemNirvana: No file-system provider configuration setup!
> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: 
> Exception encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
> 
> On Mon, Feb 29, 2016 at 8:14 PM, Johannes Zillmann <[email protected] 
> <mailto:[email protected]>> wrote:
> Here you go:
> 
> 
> Johannes 
> 
>> On 26 Feb 2016, at 02:56, Jeff Zhang <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Is it possible for you to get the whole app log from your customer ?
>> 
>> On Thu, Feb 25, 2016 at 6:26 PM, Johannes Zillmann <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Looks like there is the same warning in MR as well:
>> 
>> 016-02-23 11:59:24,568 WARN [main] org.apache.hadoop.ipc.Client: Exception 
>> encountered while connecting to the server : 
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>  Operation category READ is not supported in state standby
>> 
>> But in this case it doesn’t seem to have a bad effect…
>> Johannes
>> 
>>> On 25 Feb 2016, at 10:46, Jeff Zhang <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Might be, does only tez job affected ? What about other jobs like MR ?
>>> 
>>> On Thu, Feb 25, 2016 at 5:32 PM, Johannes Zillmann 
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> So yes, i think HA is enabled.
>>> But this whole behaviour is not a one shot thing, its happen repeatedly for 
>>> this job. So my guess is its rather a unhappy configuration then the name 
>>> node switching again and again !?
>>> 
>>> Johannes
>>> 
>>> 
>>>> On 25 Feb 2016, at 10:29, Jeff Zhang <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> The standby warning is because it try to connect to the standby node. (HA 
>>>> should be enabled I think).  Maybe this cause the output missing, could 
>>>> you attach the whole app log ?
>>>> 
>>>> On Thu, Feb 25, 2016 at 5:03 PM, Johannes Zillmann 
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>> Hey guys,
>>>> 
>>>> have a Tez job on a customer deployment which has very strange symptoms. 
>>>> Maybe you have some pointers for me what to look into!
>>>> 
>>>> App master log:
>>>> —————————————
>>>> 2016-02-16 12:05:16,666 INFO [main] app.DAGAppMaster: Created DAGAppMaster 
>>>> for application appattempt_1455209830330_35852_000001, versionInfo=[ 
>>>> component=tez-dag, version=0.6.0, 
>>>> revision=25e41bc0f29f5e25adbfc17258b00ad702a17888, 
>>>> SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git 
>>>> <http://git-wip-us.apache.org/repos/asf/tez.git>, 
>>>> buildTime=2015-03-03T18:01:25Z ]
>>>> 2016-02-16 12:05:16,807 INFO [main] Configuration.deprecation: 
>>>> fs.default.name <http://fs.default.name/> is deprecated. Instead, use 
>>>> fs.defaultFS
>>>> 2016-02-16 12:05:16,816 INFO [main] Configuration.deprecation: 
>>>> dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
>>>> 2016-02-16 12:05:16,987 WARN [main] util.NativeCodeLoader: Unable to load 
>>>> native-hadoop library for your platform... using builtin-java classes 
>>>> where applicable
>>>> 2016-02-16 12:05:17,134 INFO [main] app.DAGAppMaster: Comparing client 
>>>> version with AM version, clientVersion=0.6.0, AMVersion=0.6.0
>>>> 2016-02-16 12:05:18,158 WARN [main] ipc.Client: Exception encountered 
>>>> while connecting to the server : 
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>>  Operation category READ is not supported in state standby
>>>> 2016-02-16 12:05:18,298 INFO [main] app.DAGAppMaster: Adding session token 
>>>> to jobTokenSecretManager for application
>>>> 2016-02-16 12:05:18,305 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.rm.container.AMContainerEventType for class 
>>>> org.apache.tez.dag.app.rm.container.AMContainerMap
>>>> 2016-02-16 12:05:18,306 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.rm.node.AMNodeEventType for class 
>>>> org.apache.tez.dag.app.rm.node.AMNodeTracker
>>>> 2016-02-16 12:05:18,308 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.DAGAppMasterEventType for class 
>>>> org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterEventHandler
>>>> 2016-02-16 12:05:18,309 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.DAGEventType for class 
>>>> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher
>>>> 2016-02-16 12:05:18,310 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.VertexEventType for class 
>>>> org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher
>>>> 2016-02-16 12:05:18,311 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.TaskEventType for class 
>>>> org.apache.tez.dag.app.DAGAppMaster$TaskEventDispatcher
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.TaskAttemptEventType for class 
>>>> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
>>>> org.apache.tez.dag.app.DAGAppMaster$4
>>>> 2016-02-16 12:05:18,312 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.dag.event.SpeculatorEventType for class 
>>>> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler
>>>> 2016-02-16 12:05:18,357 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.rm.AMSchedulerEventType for class 
>>>> org.apache.tez.dag.app.rm.TaskSchedulerEventHandler
>>>> 2016-02-16 12:05:18,359 INFO [main] event.AsyncDispatcher: Registering 
>>>> class org.apache.tez.dag.app.rm.NMCommunicatorEventType for class 
>>>> org.apache.tez.dag.app.launcher.ContainerLauncherImpl
>>>> 2016-02-16 12:05:18,483 INFO [main] node.AMNodeTracker: 
>>>> blacklistDisablePercent is 33, blacklistingEnabled: true, 
>>>> maxTaskFailuresPerNode: 10
>>>> 2016-02-16 12:05:18,485 INFO [main] launcher.ContainerLauncherImpl: Upper 
>>>> limit on the thread pool size is 500
>>>> 2016-02-16 12:05:18,485 INFO [main] history.HistoryEventHandler: 
>>>> Initializing HistoryEventHandler
>>>> 2016-02-16 12:05:18,489 INFO [main] impl.SimpleHistoryLoggingService: Log 
>>>> file location for SimpleHistoryLoggingService not specified, defaulting to 
>>>> containerLogDir=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001
>>>> 2016-02-16 12:05:18,490 INFO [main] impl.SimpleHistoryLoggingService: 
>>>> Initializing SimpleHistoryLoggingService, 
>>>> logFileLocation=/u10/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/history.txt.appattempt_1455209830330_35852_000001,
>>>>  maxErrors=10
>>>> 2016-02-16 12:05:18,490 INFO [main] recovery.RecoveryService: Initializing 
>>>> RecoveryService
>>>> 2016-02-16 12:05:18,491 INFO [main] history.HistoryEventHandler: 
>>>> [HISTORY][DAG:N/A][Event:APP_LAUNCHED]: 
>>>> applicationId=application_1455209830330_35852, 
>>>> appSubmitTime=1455642293317, launchTime=1455642316658
>>>> 2016-02-16 12:05:18,492 INFO [main] history.HistoryEventHandler: 
>>>> [HISTORY][DAG:N/A][Event:AM_LAUNCHED]: 
>>>> appAttemptId=appattempt_1455209830330_35852_000001, 
>>>> appSubmitTime=1455642293317, 
>>>> 
>>>> … user specific code
>>>> 
>>>> 2016-02-16 12:05:19,343 INFO [IPC Server handler 0 on 13239] 
>>>> app.DAGAppMaster: Starting DAG submitted via RPC: Export job (349): 
>>>> ClaimLineExportToDB2#export(Disconnected record 
>>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.output.compress is deprecated. Instead, 
>>>> use mapreduce.output.fileoutputformat.compress
>>>> 2016-02-16 12:05:19,347 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: fs.default.name <http://fs.default.name/> is 
>>>> deprecated. Instead, use fs.defaultFS
>>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.map.tasks.speculative.execution is 
>>>> deprecated. Instead, use mapreduce.map.speculative
>>>> 2016-02-16 12:05:19,348 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use 
>>>> mapreduce.job.reduces
>>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.min.split.size is deprecated. Instead, 
>>>> use mapreduce.input.fileinputformat.split.minsize
>>>> 2016-02-16 12:05:19,349 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.reduce.tasks.speculative.execution is 
>>>> deprecated. Instead, use mapreduce.reduce.speculative
>>>> 2016-02-16 12:05:19,351 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.output.compression.type is deprecated. 
>>>> Instead, use mapreduce.output.fileoutputformat.compress.type
>>>> 2016-02-16 12:05:19,352 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.max.split.size is deprecated. Instead, 
>>>> use mapreduce.input.fileinputformat.split.maxsize
>>>> 2016-02-16 12:05:19,354 INFO [IPC Server handler 0 on 13239] 
>>>> Configuration.deprecation: mapred.compress.map.output is deprecated. 
>>>> Instead, use mapreduce.map.output.compress
>>>> 2016-02-16 12:05:19,480 INFO [IPC Server handler 0 on 13239] 
>>>> app.DAGAppMaster: Generating DAG graphviz file, 
>>>> dagId=dag_1455209830330_35852_1, 
>>>> filePath=/u01/hadoop/yarn/log/application_1455209830330_35852/container_1455209830330_35852_01_000001/dag_1455209830330_35852_1.dot
>>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
>>>> common.TezUtilsInternal: Redirecting log file based on addend: 
>>>> dag_1455209830330_35852_1
>>>> —————————————
>>>> 
>>>> DAG-log:
>>>> --------------------------
>>>> 2016-02-16 12:05:19,482 INFO [IPC Server handler 0 on 13239] 
>>>> app.DAGAppMaster: Running DAG: Export job (349): 
>>>> ClaimLineExportToDB2#export(Disconnected record 
>>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf)
>>>> 2016-02-16 12:05:19,966 INFO [HistoryEventHandlingThread] 
>>>> impl.SimpleHistoryLoggingService: Writing event DAG_SUBMITTED to history 
>>>> file
>>>> … user specific code
>>>> 
>>>> 2016-02-16 12:05:21,437 INFO [InputInitializer [MapChainVertex:Export job 
>>>> (349): ClaimLineExportToDB2#export(Disconnected record stream) 
>>>> (2b86fca4-215f-47ba-9303-cbd2ad9d676e)] #0] util.LoggingUtil: Setting root 
>>>> logger level for hadoop task to WARN: 
>>>> 2016-02-16 12:05:21,722 WARN [AsyncDispatcher event handler] 
>>>> hdfs.ClientContext: Existing client context 'default' does not match 
>>>> requested configuration.  Existing: shortCircuitStreamsCacheSize = 256, 
>>>> shortCircuitStreamsCacheExpiryMs = 300000, shortCircuitMmapCacheSize = 
>>>> 256, shortCircuitMmapCacheExpiryMs = 3600000, 
>>>> shortCircuitMmapCacheRetryTimeout = 300000, 
>>>> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, 
>>>> socketCacheExpiry = 3000, shortCircuitLocalReads = false, 
>>>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
>>>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000, Requested: 
>>>> shortCircuitStreamsCacheSize = 4096, shortCircuitStreamsCacheExpiryMs = 
>>>> 300000, shortCircuitMmapCacheSize = 256, shortCircuitMmapCacheExpiryMs = 
>>>> 3600000, shortCircuitMmapCacheRetryTimeout = 300000, 
>>>> shortCircuitCacheStaleThresholdMs = 1800000, socketCacheCapacity = 16, 
>>>> socketCacheExpiry = 3000, shortCircuitLocalReads = true, 
>>>> useLegacyBlockReaderLocal = false, domainSocketDataTraffic = false, 
>>>> shortCircuitSharedMemoryWatcherInterruptCheckMs = 60000
>>>> 2016-02-16 12:05:21,732 WARN [AsyncDispatcher event handler] ipc.Client: 
>>>> Exception encountered while connecting to the server : 
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>>  Operation category READ is not supported in state standby
>>>> 2016-02-16 12:08:12,554 WARN [AsyncDispatcher event handler] ipc.Client: 
>>>> Exception encountered while connecting to the server : 
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>>  Operation category READ is not supported in state standby
>>>> --------------------------
>>>> 
>>>> 
>>>> Latest line of the (single) task:
>>>> --------------------------
>>>> ...
>>>> 2016-02-16 12:07:00,166 INFO [Initializer 1] 
>>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output using 
>>>> OutputSpec: { destinationVertexName=output, physicalEdgeCount=0, 
>>>> outputClassName=datameer.plugin.tez.output.DoNothingOutput }
>>>> 2016-02-16 12:07:00,168 INFO [Initializer 0] 
>>>> runtime.LogicalIOProcessorRuntimeTask: Creating Input
>>>> 2016-02-16 12:07:00,168 INFO [Initializer 1] 
>>>> runtime.LogicalIOProcessorRuntimeTask: Creating Output
>>>> 2016-02-16 12:07:00,175 INFO [Initializer 1] 
>>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Output with dest edge: 
>>>> output
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] 
>>>> runtime.LogicalIOProcessorRuntimeTask: Initializing Input with src edge: 
>>>> Input:Export job (349): ClaimLineExportToDB2#export(Disconnected record 
>>>> stream)
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 0] 
>>>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: 
>>>> INPUT, componentVertexName: Input:Export job (349): 
>>>> ClaimLineExportToDB2#export(Disconnected record stream)
>>>> 2016-02-16 12:07:00,176 INFO [Initializer 1] 
>>>> resources.MemoryDistributor$RequestorInfo: Received request: 0, type: 
>>>> OUTPUT, componentVertexName: output
>>>> 2016-02-16 12:07:00,177 INFO [Initializer 1] 
>>>> runtime.LogicalIOProcessorRuntimeTask: Initialized Output with dest edge: 
>>>> output
>>>> 
>>>> … user specific code
>>>> 
>>>> 2016-02-16 12:07:01,871 INFO [TezChild] util.LoggingUtil: Setting root 
>>>> logger level for hadoop task to WARN: 
>>>> 2016-02-16 12:07:03,990 WARN [TezChild] ipc.Client: Exception encountered 
>>>> while connecting to the server : 
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>>>>  Operation category READ is not supported in state standby
>>>> —————————————
>>>> 
>>>> The client log contains a shutdown exception:
>>>> —————————————
>>>> INFO [2016-02-16 12:13:30.412] [MrPlanRunnerV2] (TezClient.java:470) - 
>>>> Shutting down Tez Session, sessionName=Export job (349): 
>>>> ClaimLineExportToDB2#export(Disconnected record 
>>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
>>>> applicationId=application_1455209830330_35852
>>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClientUtils.java:838) 
>>>> - Application not running, applicationId=application_1455209830330_35852, 
>>>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
>>>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ 
>>>> <http://xxx:8088/proxy/application_1455209830330_35852/>, 
>>>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, 
>>>> killedDAGs=0
>>>> 
>>>>  INFO [2016-02-16 12:13:30.581] [MrPlanRunnerV2] (TezClient.java:484) - 
>>>> Failed to shutdown Tez Session via proxy
>>>> org.apache.tez.dag.api.SessionNotRunning: Application not running, 
>>>> applicationId=application_1455209830330_35852, 
>>>> yarnApplicationState=FINISHED, finalApplicationStatus=SUCCEEDED, 
>>>> trackingUrl=http://xxx:8088/proxy/application_1455209830330_35852/ 
>>>> <http://xxx:8088/proxy/application_1455209830330_35852/>, 
>>>> diagnostics=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, 
>>>> killedDAGs=0
>>>> 
>>>>    at 
>>>> org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:839)
>>>>    at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:669)
>>>>    at org.apache.tez.client.TezClient.stop(TezClient.java:476)
>>>>    at 
>>>> datameer.plugin.tez.session.TezSessionImpl.close(TezSessionImpl.java:69)
>>>>    at 
>>>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSession(PoolingTezSessionFactory.java:173)
>>>>    at 
>>>> datameer.plugin.tez.session.PoolingTezSessionFactory.shutdownSessions(PoolingTezSessionFactory.java:167)
>>>>    at 
>>>> datameer.plugin.tez.session.PoolingTezSessionFactory.close(PoolingTezSessionFactory.java:162)
>>>>    at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>>    at 
>>>> datameer.plugin.tez.smart.SmartClusterJobFlowResolver.close(SmartClusterJobFlowResolver.java:147)
>>>>    at datameer.com.google.common.io.Closer.close(Closer.java:214)
>>>>    at datameer.dap.sdk.util.IoUtil.close(IoUtil.java:330)
>>>>    at 
>>>> datameer.dap.common.graphv2.mixedframework.MixedClusterSession.commitJob(MixedClusterSession.java:65)
>>>>    at 
>>>> datameer.dap.common.graphv2.ClusterSession.commitJob(ClusterSession.java:81)
>>>>    at 
>>>> datameer.dap.common.graphv2.MrPlanRunnerV2.run(MrPlanRunnerV2.java:97)
>>>>    at 
>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>    at 
>>>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:96)
>>>>    at 
>>>> datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:173)
>>>>    at datameer.dap.common.security.RunAsThread.run(RunAsThread.java:22)
>>>>  INFO [2016-02-16 12:13:30.582] [MrPlanRunnerV2] (TezClient.java:489) - 
>>>> Could not connect to AM, killing session via YARN, sessionName=Export job 
>>>> (349): ClaimLineExportToDB2#export(Disconnected record 
>>>> stream)#export(Disconnected record (d4e9fda0-0f6c-4e63-8d5b-e846334affdf), 
>>>> applicationId=application_1455209830330_35852
>>>> 
>>>> —————————————
>>>> 
>>>> 
>>>> The counter and the event shows that:
>>>> - the correct amount of input records has been ingested
>>>> - the expected amount of output records is missing (we have zero records 
>>>> produced which is reflected in both, the counters and the output itself)
>>>> - for Tez the task and the DAG are both handled as successful
>>>> 
>>>> The customers says that the symptoms of the job are always the same:
>>>> - there is that standby warning
>>>> - there is the problem with the shutdown
>>>> - (and the data is missing)
>>>> 
>>>> To me its still unclear if there are strange things going on with Tez or 
>>>> with our application logic.
>>>> The same job is successful when executed with pure MapReduce.
>>>> Is the standby thing something to worry about ?
>>>> 
>>>> best
>>>> Johannes
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best Regards
>>>> 
>>>> Jeff Zhang
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards
>>> 
>>> Jeff Zhang
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards
>> 
>> Jeff Zhang
> 
> 
> 
> 
> 
> -- 
> Best Regards
> 
> Jeff Zhang

Re: Strange behaviour of session DAG

Reply via email to