[ https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297577#comment-16297577 ]
Sergey Shelukhin commented on TEZ-3879: --------------------------------------- [~sseth] [~ewohlstadter] can you take a look? > potential abort propagation issue (race?) > ----------------------------------------- > > Key: TEZ-3879 > URL: https://issues.apache.org/jira/browse/TEZ-3879 > Project: Apache Tez > Issue Type: Bug > Reporter: Sergey Shelukhin > > I'm looking at a Hive LLAP query where AM aborts some tasks for whatever > reason (AM preemption). > On the nodes, the abort is handled by TezTaskRunner2 and it looks like > there's some race there for some cases. > Most tasks receive abort normally, like so (the first thing Hive TezProcessor > does on any abort is log "Received abort"). > {noformat} > 2017-12-18T14:44:26,616 INFO [TaskHeartbeatThread ()] > org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort > attempt_1513367667720_3619_1_02_000012_0 due to an invocation of > shutdownRequested > 2017-12-18T14:44:26,621 INFO [TaskHeartbeatThread ()] > org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort > 2017-12-18T14:44:26,622 INFO [TaskHeartbeatThread ()] > org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to > RecordProcessor > 2017-12-18T14:44:26,622 INFO [TaskHeartbeatThread ()] > org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to > mapOp: {} MAP > {noformat} > However on some tasks that are terminated shortly after init, TezProcessor is > never called. Moreover, when AM tries to kill the task again (when it's > already running, having ignored the abort) Tez says the task is already > aborted and doesn't propagate this either. > {noformat} > 2017-12-18T14:47:22,995 INFO [TezTR-667720_3619_3_2_12_0 > (1513367667720_3619_3_02_000012_0)] > reducesink.VectorReduceSinkCommonOperator: Using tag = -1 > (this is the end of Hive init) > ... > 2017-12-18T14:47:23,133 INFO [TaskHeartbeatThread ()] > org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort > attempt_1513367667720_3619_3_02_000012_0 due to an invocation of > shutdownRequested > (no TezProcessor log statements) > {noformat} > The task keeps running and the next kill is ignored > {noformat} > 2017-12-18T14:47:23,575 INFO [IPC Server handler 2 on 40617 ()] > org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received > terminateFragment request for attempt_1513367667720_3619_3_02_000012_0 > ... > 2017-12-18T14:47:23,575 INFO [IPC Server handler 2 on 40617 ()] > org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since > the task with id attempt_1513367667720_3619_3_02_000012_0 has ended for > reason: CONTAINER_STOP_REQUESTED. IgnoredError: > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)