[jira] [Updated] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
[ https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated TEZ-2484:
Attachment: Tez_RM_misreporting_succeeded.png

Attaching screenshot of Yarn Resource Manager line showing this Tez job being incorrectly reported as succeeded despite failure output in user session.

Tez vertex for Hive fails but Resource Manager reports job succeeded

Key: TEZ-2484
URL: https://issues.apache.org/jira/browse/TEZ-2484
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.2
Environment: HDP 2.2.4.2
Reporter: Hari Sekhon
Attachments: Tez_RM_misreporting_succeeded.png

When running a Hive on Tez job via Hive CLI the job fails and I get the error shown below but in the Resource Manager the job is shown as Succeeded, even though it's clearly failed:

{code}
Status: Running (Executing on YARN cluster with App id application_1432310690008_0103)
VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1         FAILED   1478          0        0     1478       1    1477
VERTICES: 00/01 [--] 0% ELAPSED TIME: 1589.41 s
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[ Container container_e122_1432310690008_0103_01_94 received a STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded
Hari Sekhon created TEZ-2484:

Summary: Tez vertex for Hive fails but Resource Manager reports job succeeded
Key: TEZ-2484
URL: https://issues.apache.org/jira/browse/TEZ-2484
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.2
Environment: HDP 2.2.4.2
Reporter: Hari Sekhon

When running a Hive on Tez job via Hive CLI the job fails and I get the error shown below but in the Resource Manager the job is shown as Succeeded, even though it's clearly failed:

{code}
Status: Running (Executing on YARN cluster with App id application_1432310690008_0103)
VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1         FAILED   1478          0        0     1478       1    1477
VERTICES: 00/01 [--] 0% ELAPSED TIME: 1589.41 s
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, diagnostics=[TaskAttempt 0 failed, info=[ Container container_e122_1432310690008_0103_01_94 received a STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
{code}
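A wrapper script could catch the mismatch reported here by cross-checking the client's exit status against what YARN recorded. The following is only a minimal sketch under stated assumptions: the application report is assumed to have already been fetched as JSON from the ResourceManager REST API (`/ws/v1/cluster/apps/{appid}`, whose `app` object carries `state` and `finalStatus`), and the Hive CLI is assumed to exit non-zero on failure. The function name `detect_status_mismatch` and the sample payload are hypothetical and are not taken from the ticket.

```python
import json


def detect_status_mismatch(app_report_json: str, client_exit_code: int) -> bool:
    """Return True when YARN reports success but the client saw a failure.

    app_report_json: body of a ResourceManager application response, assumed
        to have the shape {"app": {"state": ..., "finalStatus": ...}}.
    client_exit_code: exit status of the Hive CLI process (non-zero = failure,
        which is an assumption of this sketch).
    """
    app = json.loads(app_report_json)["app"]
    rm_says_succeeded = (
        app.get("state") == "FINISHED" and app.get("finalStatus") == "SUCCEEDED"
    )
    client_failed = client_exit_code != 0
    return rm_says_succeeded and client_failed


# Hypothetical payload mirroring the situation described in this report.
sample = json.dumps({
    "app": {
        "id": "application_1432310690008_0103",
        "state": "FINISHED",
        "finalStatus": "SUCCEEDED",
    }
})

if detect_status_mismatch(sample, client_exit_code=2):
    print("RM reports SUCCEEDED but the client exited with a failure")
```

Keeping the check as a pure function over already-fetched JSON means it can be exercised without a running cluster; fetching the report is left to whatever HTTP client the site already uses.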
[jira] [Resolved] (TEZ-2370) Add stages information to RM UI for debugging / visibility on job progress
[ https://issues.apache.org/jira/browse/TEZ-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon resolved TEZ-2370.
Resolution: Fixed
Fix Version/s: 0.6.0

Ok great thanks, I'll look forward to upgrading to that... I also saw Hortonworks' recent announcement of a Tez job view for Ambari, which I'm looking forward to trying once it's GA.

Add stages information to RM UI for debugging / visibility on job progress

Key: TEZ-2370
URL: https://issues.apache.org/jira/browse/TEZ-2370
Project: Apache Tez
Issue Type: Improvement
Components: UI
Affects Versions: 0.5.2
Environment: HDP 2.2.0
Reporter: Hari Sekhon
Priority: Minor
Fix For: 0.6.0

Something that has been bugging me since last year is the difficulty of debugging Tez jobs compared to MapReduce jobs. This is because Resource Manager / Application Master does not display the job stats and stages that we are used to seeing in MapReduce, e.g. Map and Reduce task counts and progress.

I appreciate that Tez is a more flexible framework with a DAG, but it would be nice if it could surface the information on the different stages (number of tasks running, completed, failed, killed, successful etc.), similar to how Spark does. The stage breakdown would be useful in understanding what the job is doing at different times, which stage is getting stuck or failing, etc.

At the moment the only options are to trawl the logs or hope to have console output where some of that information is available, both of which are non-ideal when debugging other people's jobs after the fact.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Created] (TEZ-2457) Improve Documentation to explicitly list all valid Tez configuration variables
Hari Sekhon created TEZ-2457:

Summary: Improve Documentation to explicitly list all valid Tez configuration variables
Key: TEZ-2457
URL: https://issues.apache.org/jira/browse/TEZ-2457
Project: Apache Tez
Issue Type: Improvement
Affects Versions: 0.5.2
Environment: HDP 2.2
Reporter: Hari Sekhon

Request to improve Tez documentation by adding a page showing all valid Tez configuration variables with their defaults and description as well as which MapReduce variables Tez respects.
[jira] [Created] (TEZ-2370) Add stages information to RM UI for debugging / visibility on job progress
Hari Sekhon created TEZ-2370:

Summary: Add stages information to RM UI for debugging / visibility on job progress
Key: TEZ-2370
URL: https://issues.apache.org/jira/browse/TEZ-2370
Project: Apache Tez
Issue Type: Improvement
Components: UI
Affects Versions: 0.5.2
Environment: HDP 2.2.0
Reporter: Hari Sekhon
Priority: Minor

Something that has been bugging me since last year is the difficulty of debugging Tez jobs compared to MapReduce jobs. This is because Resource Manager / Application Master does not display the job stats and stages that we are used to seeing in MapReduce, e.g. Map and Reduce task counts and progress.

I appreciate that Tez is a more flexible framework with a DAG, but it would be nice if it could surface the information on the different stages (number of tasks running, completed, failed, killed, successful etc.), similar to how Spark does. The stage breakdown would be useful in understanding what the job is doing at different times, which stage is getting stuck or failing, etc.

At the moment the only options are to trawl the logs or hope to have console output where some of that information is available, both of which are non-ideal when debugging other people's jobs after the fact.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504694#comment-14504694 ] Hari Sekhon commented on TEZ-2322: -- Hitesh Shah, the yarn logs command failed originally otherwise I would have supplied that output. Jeff Zhang I did note the job did succeed in the end - this is just a jira to mark that the counts were wrong, hence I've labelled this as minor priority to fix. Succeeded count wrong for Pig on Tez job, decreased 380 = 181 -- Key: TEZ-2322 URL: https://issues.apache.org/jira/browse/TEZ-2322 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.2 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Attachments: attempt1_syslog_dag_1427546104095_0146_1, attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, attempt2_syslog_dag_1427546104095_0146_1_post During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, 
diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 {code} Now this may be because the tasks failed (some certainly did due to space exceptions, having checked the logs), but surely once a task has finished successfully and is marked as succeeded it cannot then later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious: I'm pretty sure there was no contending job to finish and release so much more resource to this Tez job, so it's also unclear how the running count could have jumped up so significantly given the cluster hardware resources have been the same throughout. Hari Sekhon http://www.linkedin.com/in/harisekhon
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322: - Description: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 {code} Now this may be because the tasks failed, some certainly did due to space exceptions having checked the logs, but surely once a task has finished successfully and is marked as succeeded it cannot then later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early before the task results are really saved? KilledTaskAttempts jumped from 16 = 89 at the same time, but even this doesn't account for the large drop in number of succeeded tasks. There was also a noticeable jump in Running tasks from 58 = 724 at the same time which is suspicious, I'm pretty sure there was no contending job to finish and release so much more resource to this Tez job. was: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 
Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322: - Description: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 {code} Now this may be because the tasks failed (some certainly did due to space exceptions, having checked the logs), but surely once a task has finished successfully and is marked as succeeded it cannot then later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious: I'm pretty sure there was no contending job to finish and release so much more resource to this Tez job, so it's also unclear how the running count could have jumped up so significantly.
was: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322: - Description: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 {code} Now this may be because the tasks failed (some certainly did due to space exceptions, having checked the logs), but surely once a task has finished successfully and is marked as succeeded it cannot then later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious: I'm pretty sure there was no contending job to finish and release so much more resource to this Tez job, so it's also unclear how the running count could have jumped up so significantly given the cluster hardware resources have been the same throughout.
was: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated TEZ-2322: - Description: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 {code} Now this may be because the tasks failed (some certainly did due to space exceptions, having checked the logs), but surely once a task has finished successfully and is marked as succeeded it cannot then later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved? KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks. There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious: I'm pretty sure there was no contending job to finish and release so much more resource to this Tez job, so it's also unclear how the running count could have jumped up so significantly given the cluster hardware resources have been the same throughout.
Hari Sekhon http://www.linkedin.com/in/harisekhon was: During a Pig on Tez job the number of succeeded tasks dropped from 380 = 181 as shown below: {code} 2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics= 2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics= 2015-04-15 15:11:56,993 [Timer-0] INFO
[jira] [Created] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
Hari Sekhon created TEZ-2322:

Summary: Succeeded count wrong for Pig on Tez job, decreased 380 => 181
Key: TEZ-2322
URL: https://issues.apache.org/jira/browse/TEZ-2322
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.2
Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 as shown below:

{code}
2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}

Now this may be because tasks failed (some certainly did, due to space exceptions), but surely once a task has finished successfully and is marked as succeeded it cannot then be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious. I'm pretty sure there was no contending job that finished and released so much more resource to this Tez job.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
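The arithmetic behind the report can be checked mechanically. The following is a minimal sketch, not part of Pig or Tez: the helper names `parse_status` and `find_regressions` are hypothetical, and the only assumption about the log format is that counters appear as `Name: <number>` pairs, as in the lines quoted above. It scans consecutive "DAG Status" lines and reports any interval in which the Succeeded counter went down, together with the change in the other counters.

```python
import re

# Counters are assumed to appear as "Name: <digits>", as in the quoted log.
FIELD_RE = re.compile(r"(\w+): (\d+)")
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def parse_status(line):
    """Return a dict of counters for a 'DAG Status' line, else None."""
    ts = TS_RE.match(line)
    if ts is None or "DAG Status" not in line:
        return None
    counters = {name: int(value) for name, value in FIELD_RE.findall(line)}
    counters["time"] = ts.group(1)
    return counters


def find_regressions(lines, field="Succeeded"):
    """Yield (previous, current) snapshots where `field` decreased."""
    previous = None
    for line in lines:
        current = parse_status(line)
        if current is None or field not in current:
            continue
        if previous is not None and current[field] < previous[field]:
            yield previous, current
        previous = current


PREFIX = ("[Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez."
          "TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 ")

# Three of the status lines quoted in the report (counters copied verbatim).
LOG = [
    "2015-04-15 15:10:36,992 " + PREFIX + "Succeeded: 380 Running: 58 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=",
    "2015-04-15 15:10:56,992 " + PREFIX + "Succeeded: 181 Running: 724 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=",
    "2015-04-15 15:11:36,992 " + PREFIX + "Succeeded: 182 Running: 723 "
    "Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=",
]

for before, after in find_regressions(LOG):
    deltas = {key: after[key] - before[key]
              for key in ("Succeeded", "Running", "KilledTaskAttempts")}
    print(before["time"], "->", after["time"], deltas)
```

On these lines the sketch flags the single interval between 15:10:36 and 15:10:56: Succeeded falls by 199 while KilledTaskAttempts rises by only 73 and Running rises by 666. In the snapshot after the drop, Succeeded plus Running (181 + 724) equals TotalTasks (905).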
[jira] [Comment Edited] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496553#comment-14496553 ]

Hari Sekhon edited comment on TEZ-2322 at 4/15/15 5:25 PM:
---

IIRC Ambari still doesn't support Job History server so that command fails, but I've copied the logs out via RM and attached them to this ticket for you.

was (Author: harisekhon):
IIRC Ambari still doesn't support Job History server so that command fails, but I've copied the logs out via RM.

Succeeded count wrong for Pig on Tez job, decreased 380 => 181
--

Key: TEZ-2322
URL: https://issues.apache.org/jira/browse/TEZ-2322
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.2
Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor
Attachments: attempt1_syslog_dag_1427546104095_0146_1, attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, attempt2_syslog_dag_1427546104095_0146_1_post

During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 as shown below:

{code}
2015-04-15 15:09:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0
{code}

Now this may be because tasks failed (having checked the logs, some certainly did, due to space exceptions), but surely once a task has finished successfully and is marked as succeeded it cannot then later be removed from the succeeded count? Perhaps the succeeded counter is incremented too early, before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same time, which is suspicious. I'm pretty sure there was no contending job that finished and released so much more resource to this Tez job, so it's also unclear how the running count could have jumped up so significantly given that the cluster hardware resources have been the same throughout.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated TEZ-2322:
-

Attachment: attempt2_syslog_dag_1427546104095_0146_1_post attempt2_syslog_dag_1427546104095_0146_1 attempt2_syslog attempt1_syslog_dag_1427546104095_0146_1

Iirc Ambari still doesn't support Job History server so that command fails, but I've copied the logs out via RM.
[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181
[ https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496555#comment-14496555 ]

Hari Sekhon commented on TEZ-2322:
--

There was a point at which space ran out and Kerberos also broke as a result, but I fixed it and the job continued and eventually succeeded.