[ https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530259#comment-14530259 ]
Jeff Zhang edited comment on TEZ-2404 at 5/6/15 10:11 AM:
----------------------------------------------------------

bq. This does still give us most of the benefits of TEZ-2325, since TaskComplete events are received once per task - but TASK_STATUS_UPDATES are received every 100ms / heartbeat-interval - which can amount to a large number of events for even short running tasks.

+1 on this. Recovery depends only on TaskAttemptFinishedEvent and DataMovementEvent, and requires that DataMovementEvent be logged before TaskAttemptFinishedEvent. The patch should be able to guarantee that DataMovementEvent is logged before TaskAttemptFinishedEvent and that TaskAttemptFinishedEvent is routed to TaskAttempt after the TaskStatusUpdate. Do you have any other ordering issues in mind, [~bikassaha]?

was (Author: zjffdu):
bq. This does still give us most of the benefits of TEZ-2325, since TaskComplete events are received once per task - but TASK_STATUS_UPDATES are received every 100ms / heartbeat-interval - which can amount to a large number of events for even short running tasks.

+1 on this. Recovery depends only on TaskAttemptFinishedEvent and DataMovementEvent, and requires that DataMovementEvent be logged before TaskAttemptFinishedEvent. The patch should be able to guarantee that TaskAttemptFinishedEvent is routed to TaskAttempt after the TaskStatusUpdate. Do you have any other ordering issues in mind, [~bikassaha]?

> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
>                 Key: TEZ-2404
>                 URL: https://issues.apache.org/jira/browse/TEZ-2404
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this
> causes a recovery issue. Recovery needs the DataMovementEvent to be handled
> before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent may be
> lost during recovery, causing its dependent tasks to hang.
> There are two ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in
> TezTaskAttemptListener
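
The ordering constraint behind option 2 can be illustrated with a minimal sketch. This is not the TEZ-2404 patch and does not use the real Tez classes: the `Kind` enum, the `Event` class, and `orderForRouting` are hypothetical names introduced only for illustration. It assumes that all events from one heartbeat arrive as a single list, and it moves completion events to the end so that every DataMovementEvent is routed (and could be logged) first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EventOrderingSketch {
    // Hypothetical event kinds; these are NOT the actual Tez event classes.
    enum Kind { TASK_STATUS_UPDATE, DATA_MOVEMENT, TASK_ATTEMPT_COMPLETED }

    // Hypothetical, simplified event carrying only a kind and an attempt id.
    static final class Event {
        final Kind kind;
        final String attemptId;

        Event(Kind kind, String attemptId) {
            this.kind = kind;
            this.attemptId = attemptId;
        }

        @Override
        public String toString() {
            return kind + "(" + attemptId + ")";
        }
    }

    /**
     * Returns the events of one heartbeat reordered so that completion
     * events come last. The relative order of all other events, and of
     * the completion events among themselves, is preserved.
     */
    static List<Event> orderForRouting(List<Event> heartbeatEvents) {
        List<Event> ordered = new ArrayList<>();
        List<Event> completions = new ArrayList<>();
        for (Event e : heartbeatEvents) {
            if (e.kind == Kind.TASK_ATTEMPT_COMPLETED) {
                completions.add(e);   // defer until everything else is routed
            } else {
                ordered.add(e);       // status updates and data movement go first
            }
        }
        ordered.addAll(completions);
        return ordered;
    }

    public static void main(String[] args) {
        List<Event> heartbeat = Arrays.asList(
            new Event(Kind.TASK_ATTEMPT_COMPLETED, "attempt_1"),
            new Event(Kind.TASK_STATUS_UPDATE, "attempt_1"),
            new Event(Kind.DATA_MOVEMENT, "attempt_1"));
        // Route (here: print) the events in the recovery-safe order.
        for (Event e : orderForRouting(heartbeat)) {
            System.out.println(e);
        }
    }
}
```

Deferring the completion event, rather than sorting by type, keeps the original order of everything else, which matters if later handlers assume status updates arrive in the sequence they were sent.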