[jira] [Updated] (TEZ-3879) potential abort propagation issue (race?)

Sergey Shelukhin (JIRA) Tue, 19 Dec 2017 15:18:24 -0800

     [ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergey Shelukhin updated TEZ-3879:
----------------------------------
    Description: 
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_000012_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_000012_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_000012_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_000012_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_000012_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_000012_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}

  was:
I'm looking at a Hive query where AM aborts some tasks for whatever reason (AM 
preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_000012_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_000012_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_000012_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_000012_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_000012_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_000012_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}


> potential abort propagation issue (race?)
> -----------------------------------------
>
>                 Key: TEZ-3879
>                 URL: https://issues.apache.org/jira/browse/TEZ-3879
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where AM aborts some tasks for whatever 
> reason (AM preemption).
> On the nodes, most tasks receive abort normally, like so (note that 
> TezProcessor is part of Hive and the main class that Tez code calls; the 
> first thing it does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_000012_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However on some tasks that are terminated shortly after init, TezProcessor is 
> never called. Moreover, when AM tries to kill the task again (when it's 
> already running, having ignored the abort) Tez says the task is already 
> aborted and doesn't propagate this either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_000012_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via 
> task heartbeat: attempt_1513367667720_3619_3_02_000012_0
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_000012_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running and the next kill is ignored
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_000012_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_000012_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (TEZ-3879) potential abort propagation issue (race?)

Reply via email to