[ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297577#comment-16297577
 ] 

Sergey Shelukhin commented on TEZ-3879:
---------------------------------------

[~sseth] [~ewohlstadter] can you take a look?

> potential abort propagation issue (race?)
> -----------------------------------------
>
>                 Key: TEZ-3879
>                 URL: https://issues.apache.org/jira/browse/TEZ-3879
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where the AM aborts some tasks for whatever 
> reason (AM preemption). 
> On the nodes, the abort is handled by TezTaskRunner2, and there appears to 
> be a race there in some cases.
> Most tasks receive the abort normally, like so (the first thing Hive 
> TezProcessor does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_000012_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However, on some tasks that are terminated shortly after init, TezProcessor 
> is never called. Moreover, when the AM tries to kill the task again (by which 
> point it is already running, having ignored the abort), Tez says the task is 
> already aborted and does not propagate the kill either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_000012_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_000012_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running, and the next kill is ignored:
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_000012_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_000012_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}
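
To make the suspected sequence concrete, here is a minimal sketch of how an abort could be recorded but never delivered. It is an assumption about the shape of the problem, not the actual TezTaskRunner2 code: the class names (AbortRaceSketch, Runner, Processor) and methods (requestAbort, startProcessing, startProcessingChecked, simulate) are hypothetical. The runner marks the task as ended on the first abort, forwards the abort only if a processor is already registered, and ignores any later kill because the task is already marked ended. The "checked" variant shows one possible mitigation, re-checking the flag after the processor is registered.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class AbortRaceSketch {
    /** Hypothetical stand-in for a processor such as Hive's TezProcessor. */
    static class Processor {
        final AtomicBoolean abortReceived = new AtomicBoolean(false);
        void abort() { abortReceived.set(true); } // idempotent
    }

    /** Hypothetical runner; this is NOT the real TezTaskRunner2. */
    static class Runner {
        private final AtomicBoolean ended = new AtomicBoolean(false);
        private final AtomicReference<Processor> running = new AtomicReference<>();

        /** Returns true if the request was accepted, false if it was ignored. */
        boolean requestAbort() {
            if (!ended.compareAndSet(false, true)) {
                return false; // already marked ended: later kills are dropped
            }
            Processor p = running.get();
            if (p != null) {
                p.abort();
            }
            // If p was null, the abort is recorded but never delivered.
            return true;
        }

        /** Naive registration: does not look at the ended flag. */
        void startProcessing(Processor p) {
            running.set(p);
        }

        /** Possible mitigation: re-check the flag after registering. */
        void startProcessingChecked(Processor p) {
            running.set(p);
            if (ended.get()) {
                p.abort();
            }
        }
    }

    /**
     * Replays the ordering from the logs: abort arrives after init but before
     * the processor is registered, then a second kill arrives.
     * Returns {firstAccepted, secondAccepted, processorSawAbort}.
     */
    static boolean[] simulate(boolean recheck) {
        Runner runner = new Runner();
        Processor processor = new Processor();
        boolean first = runner.requestAbort();
        if (recheck) {
            runner.startProcessingChecked(processor);
        } else {
            runner.startProcessing(processor);
        }
        boolean second = runner.requestAbort();
        return new boolean[] {first, second, processor.abortReceived.get()};
    }

    public static void main(String[] args) {
        boolean[] naive = simulate(false);
        boolean[] checked = simulate(true);
        System.out.println("naive:   second kill accepted=" + naive[1]
                + ", processor saw abort=" + naive[2]);
        System.out.println("checked: second kill accepted=" + checked[1]
                + ", processor saw abort=" + checked[2]);
    }
}
```

In the naive variant the processor never sees the abort and the second kill is rejected, which matches the symptoms in the logs above. Whether the real code path has this shape would need to be confirmed against the TezTaskRunner2 source.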



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
