Hari Sekhon created TEZ-2322:
--------------------------------

             Summary: Succeeded count wrong for Pig on Tez job, decreased 380 => 181
                 Key: TEZ-2322
                 URL: https://issues.apache.org/jira/browse/TEZ-2322
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.5.2
         Environment: HDP 2.2
            Reporter: Hari Sekhon
            Priority: Minor


During a Pig on Tez job, the number of succeeded tasks dropped from 380 to 181, 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
This may be because some tasks failed (some certainly did, due to space 
exceptions), but surely once a task has finished successfully and been marked 
as succeeded, it cannot then be removed from the succeeded count? Perhaps the 
succeeded counter is incremented too early, before the task results are really 
saved?

KilledTaskAttempts jumped from 16 to 89 at the same time, but even this 
increase of 73 doesn't account for the drop of 199 in succeeded tasks.

There was also a noticeable jump in Running tasks from 58 to 724 at the same 
time, which is suspicious: I'm pretty sure there was no contending job that 
finished and released that much resource to this Tez job.
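
The counter changes described above can be checked mechanically. Below is a minimal sketch in Python, assuming the wrapped log lines have first been rejoined into one line per status record. The names `parse_status` and `find_regressions` are illustrative only and are not part of Pig or Tez.

```python
import re

# Matches the counters in a "DAG Status" record as quoted in this report.
# Assumes each record has been rejoined onto a single line.
STATUS_RE = re.compile(
    r"TotalTasks: (?P<total>\d+) Succeeded: (?P<succeeded>\d+) "
    r"Running: (?P<running>\d+) Failed: (?P<failed>\d+) "
    r"Killed: (?P<killed>\d+) FailedTaskAttempts: (?P<failed_attempts>\d+) "
    r"KilledTaskAttempts: (?P<killed_attempts>\d+)"
)


def parse_status(line):
    """Return the counters of one status record as ints, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def find_regressions(lines):
    """Return (previous, current) pairs where Succeeded went down."""
    regressions = []
    previous = None
    for line in lines:
        current = parse_status(line)
        if current is None:
            continue  # not a complete status record
        if previous is not None and current["succeeded"] < previous["succeeded"]:
            regressions.append((previous, current))
        previous = current
    return regressions


# The 15:10:36 and 15:10:56 records from the log above, rejoined.
BEFORE = (
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, "
    "diagnostics="
)
AFTER = (
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, "
    "diagnostics="
)

for prev, cur in find_regressions([BEFORE, AFTER]):
    print("Succeeded dropped by", prev["succeeded"] - cur["succeeded"])  # 199
    print("KilledTaskAttempts rose by",
          cur["killed_attempts"] - prev["killed_attempts"])  # 73
    print("Running rose by", cur["running"] - prev["running"])  # 666
```

On the two records quoted, this reports a drop of 199 succeeded tasks against 73 additional killed attempts. It may also be worth noting that after the drop, Succeeded plus Running equals TotalTasks (181 + 724 = 905), whereas before it they summed to 438.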



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
