[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296958#comment-16296958
 ] 

Jason Lowe commented on TEZ-3770:
-

If this is failing on vanilla 0.9.1 then I think this warrants a separate JIRA 
since the problem occurs without the patch from this JIRA.

I suspect the issue doesn't happen when running with the dag-aware scheduler 
since it avoids preempting tasks that are not descendants of the pending 
requests in the DAG.  If the DAG has many independent trees (or even 
significant, parallel branches within a large tree) then the default YARN 
scheduler can preempt tasks unnecessarily.  Since it doesn't understand the DAG 
connections it makes the false assumption that any lower priority vertex must 
be a DAG descendant of a higher priority vertex.  It then preempts those lower 
priority tasks when the high priority requests are not being fulfilled.  The 
DAG-aware scheduler does not preempt active, low priority tasks if they are not 
descendants of the higher priority requests since those lower priority tasks 
can still complete and we don't want to lose completed work.

I'll try to get an updated patch addressing the javadoc comment request today.  
After that I think this should be ready to go in since it works on "real" jobs 
like TPC-DS Q64 10TB with a noticeable improvement over the current YARN 
scheduler, and it's "opt-in" so users must explicitly configure it.

> DAG-aware YARN task scheduler
> -
>
> Key: TEZ-3770
> URL: https://issues.apache.org/jira/browse/TEZ-3770
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: TEZ-3770.001.patch
>
>
> There are cases where priority alone does not convey the relationship between 
> tasks, and this can cause problems when scheduling or preempting tasks.  If 
> the YARN task scheduler was aware of the relationship between tasks then it 
> could make smarter decisions when trying to assign tasks to containers or 
> preempt running tasks to schedule pending tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3810) TezCounter for idle time in shuffle phase

2017-12-19 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated TEZ-3810:
-
Attachment: TEZ-3810.005.patch

Updating patch based on review comments from Jason. Will wait for the precommit 
before asking for more reviews.

> TezCounter for idle time in shuffle phase
> -
>
> Key: TEZ-3810
> URL: https://issues.apache.org/jira/browse/TEZ-3810
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ashwin Ramesh
> Attachments: TEZ-3810-001.patch, TEZ-3810.002.patch, 
> TEZ-3810.003.patch, TEZ-3810.004.patch, TEZ-3810.005.patch
>
>
>  A task attempt counter that tracks how much time was spent waiting for 
> inputs in the shuffle phase. We can use this to quickly identify jobs that 
> are wasting a lot of time on the grid with idle reducer tasks instead of 
> shuffling/merging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3810) TezCounter for idle time in shuffle phase

2017-12-19 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297208#comment-16297208
 ] 

TezQA commented on TEZ-3810:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12902897/TEZ-3810.005.patch
  against master revision 4c378b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2705//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2705//console

This message is automatically generated.

> TezCounter for idle time in shuffle phase
> -
>
> Key: TEZ-3810
> URL: https://issues.apache.org/jira/browse/TEZ-3810
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ashwin Ramesh
> Attachments: TEZ-3810-001.patch, TEZ-3810.002.patch, 
> TEZ-3810.003.patch, TEZ-3810.004.patch, TEZ-3810.005.patch
>
>
>  A task attempt counter that tracks how much time was spent waiting for 
> inputs in the shuffle phase. We can use this to quickly identify jobs that 
> are wasting a lot of time on the grid with idle reducer tasks instead of 
> shuffling/merging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Success: TEZ-3810 PreCommit Build #2705

2017-12-19 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3810
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2705/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 339.65 KB...]
[INFO] Tez  SUCCESS [  0.020 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 54:22 min
[INFO] Finished at: 2017-12-19T18:32:35Z
[INFO] Final Memory: 93M/1422M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12902897/TEZ-3810.005.patch
  against master revision 4c378b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2705//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2705//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
ef26ad023a5d75d49d3caf04d27fffd39b1e3744 logged out


==
==
Finished build.
==
==


Archiving artifacts
[description-setter] Description set: TEZ-3810
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Assigned] (TEZ-3810) TezCounter for idle time in shuffle phase

2017-12-19 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla reassigned TEZ-3810:


Assignee: Kuhu Shukla

> TezCounter for idle time in shuffle phase
> -
>
> Key: TEZ-3810
> URL: https://issues.apache.org/jira/browse/TEZ-3810
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ashwin Ramesh
>Assignee: Kuhu Shukla
> Attachments: TEZ-3810-001.patch, TEZ-3810.002.patch, 
> TEZ-3810.003.patch, TEZ-3810.004.patch, TEZ-3810.005.patch
>
>
>  A task attempt counter that tracks how much time was spent waiting for 
> inputs in the shuffle phase. We can use this to quickly identify jobs that 
> are wasting a lot of time on the grid with idle reducer tasks instead of 
> shuffling/merging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-19 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-3770:

Attachment: TEZ-3770.002.patch

Attaching a patch that adds comments elaborating on how vertex indices are used 
to build bitmasks of decendant vertices for a particular vertex and how those 
bitmasks are used in calculating which vertices are allowed or disallowed from 
having their requests allocated at a particular priority based on what 
ancestors at higher priorities are requesting.

> DAG-aware YARN task scheduler
> -
>
> Key: TEZ-3770
> URL: https://issues.apache.org/jira/browse/TEZ-3770
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: TEZ-3770.001.patch, TEZ-3770.002.patch
>
>
> There are cases where priority alone does not convey the relationship between 
> tasks, and this can cause problems when scheduling or preempting tasks.  If 
> the YARN task scheduler was aware of the relationship between tasks then it 
> could make smarter decisions when trying to assign tasks to containers or 
> preempt running tasks to schedule pending tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Success: TEZ-3770 PreCommit Build #2706

2017-12-19 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3770
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2706/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 340.79 KB...]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 01:00 h
[INFO] Finished at: 2017-12-19T20:59:27Z
[INFO] Final Memory: 80M/1411M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12902920/TEZ-3770.002.patch
  against master revision 4c378b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2706//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2706//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
f98c6ffdf447bc5bf8bb6b21d9b120360f907e68 logged out


==
==
Finished build.
==
==


Archiving artifacts
[Fast Archiver] Compressed 3.52 MB of artifacts by 25.7% relative to #2705
[description-setter] Description set: TEZ-3770
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-19 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297416#comment-16297416
 ] 

TezQA commented on TEZ-3770:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12902920/TEZ-3770.002.patch
  against master revision 4c378b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2706//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2706//console

This message is automatically generated.

> DAG-aware YARN task scheduler
> -
>
> Key: TEZ-3770
> URL: https://issues.apache.org/jira/browse/TEZ-3770
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: TEZ-3770.001.patch, TEZ-3770.002.patch
>
>
> There are cases where priority alone does not convey the relationship between 
> tasks, and this can cause problems when scheduling or preempting tasks.  If 
> the YARN task scheduler was aware of the relationship between tasks then it 
> could make smarter decisions when trying to assign tasks to containers or 
> preempt running tasks to schedule pending tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3879:
-

 Summary: potential abort propagation issue (race?)
 Key: TEZ-3879
 URL: https://issues.apache.org/jira/browse/TEZ-3879
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


I'm looking at a Hive query where AM aborts some tasks for whatever reason (AM 
preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3879:
--
Description: 
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}

  was:
I'm looking at a Hive query where AM aborts some tasks for whatever reason (AM 
preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredEr

[jira] [Updated] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3879:
--
Description: 
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption). 
On the nodes, the abort is handled by TezTaskRunner2 and it looks like there's 
some race there for some cases.
Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}

  was:
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}


> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
>   

[jira] [Commented] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297577#comment-16297577
 ] 

Sergey Shelukhin commented on TEZ-3879:
---

[~sseth] [~ewohlstadter] can you take a look?

> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
> URL: https://issues.apache.org/jira/browse/TEZ-3879
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where AM aborts some tasks for whatever 
> reason (AM preemption). 
> On the nodes, the abort is handled by TezTaskRunner2 and it looks like 
> there's some race there for some cases.
> Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
> does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However on some tasks that are terminated shortly after init, TezProcessor is 
> never called. Moreover, when AM tries to kill the task again (when it's 
> already running, having ignored the abort) Tez says the task is already 
> aborted and doesn't propagate this either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_12_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running and the next kill is ignored
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_12_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_12_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Eric Wohlstadter (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297640#comment-16297640
 ] 

Eric Wohlstadter commented on TEZ-3879:
---

[~sershe]

Can you attach the whole log?

How do you know that the task is still running?

> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
> URL: https://issues.apache.org/jira/browse/TEZ-3879
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where AM aborts some tasks for whatever 
> reason (AM preemption). 
> On the nodes, the abort is handled by TezTaskRunner2 and it looks like 
> there's some race there for some cases.
> Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
> does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However on some tasks that are terminated shortly after init, TezProcessor is 
> never called. Moreover, when AM tries to kill the task again (when it's 
> already running, having ignored the abort) Tez says the task is already 
> aborted and doesn't propagate this either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_12_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running and the next kill is ignored
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_12_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_12_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297650#comment-16297650
 ] 

Sergey Shelukhin commented on TEZ-3879:
---

The whole log is very large... do you have smth in particular of interest? 
There are not really any more log statement for these tasks at the time.
The task has finished successfully and the then found at the end that it has 
been killed.

> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
> URL: https://issues.apache.org/jira/browse/TEZ-3879
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where AM aborts some tasks for whatever 
> reason (AM preemption). 
> On the nodes, the abort is handled by TezTaskRunner2 and it looks like 
> there's some race there for some cases.
> Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
> does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However on some tasks that are terminated shortly after init, TezProcessor is 
> never called. Moreover, when AM tries to kill the task again (when it's 
> already running, having ignored the abort) Tez says the task is already 
> aborted and doesn't propagate this either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_12_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running and the next kill is ignored
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_12_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_12_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)