[ https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587023#comment-15587023 ]
Hitesh Shah edited comment on TEZ-3479 at 10/18/16 11:31 PM:
-------------------------------------------------------------

At least for this scenario, I think we did not recover task_1476667862449_0031_1_07_000004 properly to a failed state, which leads to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE
{code}

The task failure tracked is for task_1476667862449_0031_1_07_000000 and not for task_1476667862449_0031_1_07_000004.


was (Author: hitesh):
At least for this scenario, I think we did not recover task_1476667862449_0031_1_07_000004 properly to a failed state, which leads to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE
{code}

> DAG AM does not schedule any more containers in corner cases
> ------------------------------------------------------------
>
>                 Key: TEZ-3479
>                 URL: https://issues.apache.org/jira/browse/TEZ-3479
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.7.1
>            Reporter: Rajesh Balamohan
>         Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing
> "No space available" on local disks (e.g. Q29 in TPC-DS). The DAG should fail
> after enough retries, which happens most of the time. Once in a while (about
> once in 20-30 runs), the DAG AM gets into a hung state and does not schedule
> any more containers for the failed task attempts. Will attach the logs shortly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
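
The counters in the Task Completion log line quoted in the comment can be checked mechanically. The following is a minimal sketch, not part of Tez: the helper names `parse_counters` and `outstanding_tasks` are hypothetical, and it assumes that a vertex can only finish once `completed` equals `tasks`. Under that assumption, the line shows one task that never reached a terminal state, which is consistent with the hang described above.

```python
import re

# Log line copied verbatim from the comment.
LOG_LINE = (
    "2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] "
    "|impl.VertexImpl|: Task Completion: vertex_1476667862449_0031_1_07 [Map 3], "
    "tasks=29, failed=1, killed=24, success=3, completed=28, commits=0, "
    "err=OWN_TASK_FAILURE"
)


def parse_counters(line):
    """Return the integer key=value counters found in a Task Completion line."""
    # Non-numeric fields such as err=OWN_TASK_FAILURE are skipped by the pattern.
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)\b", line)}


def outstanding_tasks(counters):
    """Tasks not yet in a terminal state (assumption: completed must equal tasks)."""
    return counters["tasks"] - counters["completed"]


counters = parse_counters(LOG_LINE)
# failed + killed + success accounts for every completed task in this line.
terminal = counters["failed"] + counters["killed"] + counters["success"]
print("terminal:", terminal, "outstanding:", outstanding_tasks(counters))
```

Here `failed + killed + success` equals `completed` (28), while `tasks` is 29, so exactly one task is unaccounted for.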