[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496555#comment-14496555
 ] 

Hari Sekhon commented on TEZ-2322:
----------------------------------

There was a point at which space ran out and kerberos also broke as a result, 
but I fixed it and the job continued and eventually succeeded.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --------------------------------------------------------------
>
>                 Key: TEZ-2322
>                 URL: https://issues.apache.org/jira/browse/TEZ-2322
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.2
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Priority: Minor
>         Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
> attempt2_syslog_dag_1427546104095_0146_1_post
>
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
> as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
> Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
> 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO  
> org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
> status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
> 0 
> {code}
> Now this may be because the tasks failed, some certainly did due to space 
> exceptions having checked the logs, but surely once a task has finished 
> successfully and is marked as succeeded it cannot then later be removed from 
> the succeeded count? Perhaps the succeeded counter is incremented too early 
> before the task results are really saved?
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
> doesn't account for the large drop in number of succeeded tasks.
> There was also a noticeable jump in Running tasks from 58 => 724 at the same 
> time which is suspicious, I'm pretty sure there was no contending job to 
> finish and release so much more resource to this Tez job, so it's also 
> unclear how the running count count have jumped up to significantly given the 
> cluster hardware resources have been the same throughout.
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to