[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496553#comment-14496553 ]
Hari Sekhon edited comment on TEZ-2322 at 4/15/15 5:25 PM:
-----------------------------------------------------------

IIRC Ambari still doesn't support the Job History server, so that command fails, but I've copied the logs out via the RM and attached them to this ticket for you.

  was (Author: harisekhon):
    Iirc Ambari still doesn't support Job History server so that command fails, but I've copied the logs out via RM.

> Succeeded count wrong for Pig on Tez job, decreased 380 => 181
> --------------------------------------------------------------
>
>                 Key: TEZ-2322
>                 URL: https://issues.apache.org/jira/browse/TEZ-2322
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.2
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Priority: Minor
>         Attachments: attempt1_syslog_dag_1427546104095_0146_1, attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, attempt2_syslog_dag_1427546104095_0146_1_post
>
>
> During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181, as shown below:
> {code}
> 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
> 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
> 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
> {code}
> This may be because tasks failed (having checked the logs, some certainly did, due to space exceptions), but surely once a task has finished successfully and is marked as succeeded, it cannot later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved?
>
> KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks.
>
> There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious. I'm pretty sure there was no contending job that finished and released so many more resources to this Tez job, so it's also unclear how the running count could have jumped up so significantly, given that the cluster hardware resources have been the same throughout.
>
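The counter comparison described above can be checked mechanically. The following is a minimal Python sketch, not part of Tez or Pig: the names `STATUS_RE`, `parse_status_lines` and `succeeded_regressions` are hypothetical, and the sample input is three of the DAG Status lines quoted in the report. It only does arithmetic on the logged numbers and says nothing about why the counters changed.

```python
import re

# Three of the DAG Status lines quoted in the report (one per poll).
LOG = """\
2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
"""

# Hypothetical pattern matching the line layout shown in the report.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*?"
    r"TotalTasks: (?P<total>\d+) Succeeded: (?P<succeeded>\d+) "
    r"Running: (?P<running>\d+) Failed: (?P<failed>\d+) "
    r"Killed: (?P<killed>\d+) "
    r"FailedTaskAttempts: (?P<failed_attempts>\d+) "
    r"KilledTaskAttempts: (?P<killed_attempts>\d+)"
)


def parse_status_lines(text):
    """Return one dict of integer counters (plus 'ts') per matching line."""
    rows = []
    for line in text.splitlines():
        m = STATUS_RE.search(line)
        if m is None:
            continue  # skip lines that are not complete DAG Status lines
        row = {k: int(v) for k, v in m.groupdict().items() if k != "ts"}
        row["ts"] = m.group("ts")
        # Tasks not reported as succeeded, running, failed or killed.
        row["unaccounted"] = (
            row["total"] - row["succeeded"] - row["running"]
            - row["failed"] - row["killed"]
        )
        rows.append(row)
    return rows


def succeeded_regressions(rows):
    """Return (previous, current) pairs where Succeeded went down."""
    return [(a, b) for a, b in zip(rows, rows[1:])
            if b["succeeded"] < a["succeeded"]]


rows = parse_status_lines(LOG)
for prev, cur in succeeded_regressions(rows):
    print(
        f"{cur['ts']}: Succeeded {prev['succeeded']} -> {cur['succeeded']} "
        f"(drop of {prev['succeeded'] - cur['succeeded']}), "
        f"KilledTaskAttempts +{cur['killed_attempts'] - prev['killed_attempts']}, "
        f"Running +{cur['running'] - prev['running']}, "
        f"unaccounted {prev['unaccounted']} -> {cur['unaccounted']}"
    )
```

On these lines the sketch flags a single regression at 15:10:56: Succeeded falls by 199 while KilledTaskAttempts rises by only 73, which is the mismatch described above. It also shows that Succeeded + Running equals TotalTasks (905) after the drop, whereas 467 tasks were in neither state before it.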
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)