[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530259#comment-14530259
 ] 

Jeff Zhang edited comment on TEZ-2404 at 5/6/15 10:11 AM:
----------------------------------------------------------

bq. This does still give us most of the benefits of TEZ-2325, since 
TaskComplete events are received once per task - but TASK_STATUS_UPDATES are 
received every 100ms / heartbeat-interval - which can amount to a large number 
of events for even short running tasks.
+1 on this.  Recovery only depends on TaskAttemptFinishedEvent & 
DataMovementEvent, and requires that DataMovementEvent be logged before 
TaskAttemptFinishedEvent. The patch should be able to guarantee that 
DataMovementEvent is logged before TaskAttemptFinishedEvent and that 
TaskAttemptFinishedEvent is routed to TaskAttempt after the TaskStatusUpdate. 
Do you have any other ordering issues in mind? [~bikassaha]


was (Author: zjffdu):
bq. This does still give us most of the benefits of TEZ-2325, since 
TaskComplete events are received once per task - but TASK_STATUS_UPDATES are 
received every 100ms / heartbeat-interval - which can amount to a large number 
of events for even short running tasks.
+1 on this.  Recovery only depends on TaskAttemptFinishedEvent & 
DataMovementEvent, and requires that DataMovementEvent be logged before 
TaskAttemptFinishedEvent. The patch should be able to guarantee that 
TaskAttemptFinishedEvent is routed to TaskAttempt after the TaskStatusUpdate. 
Do you have any other ordering issues in mind? [~bikassaha]

> Handle DataMovementEvent before its TaskAttemptCompletedEvent
> -------------------------------------------------------------
>
>                 Key: TEZ-2404
>                 URL: https://issues.apache.org/jira/browse/TEZ-2404
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch
>
>
> TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this 
> causes a recovery issue. Recovery needs the DataMovementEvent to be handled 
> before the TaskAttemptCompletedEvent; otherwise the DataMovementEvent may be 
> lost during recovery and cause its dependent tasks to hang.
> There are 2 ways to fix this issue:
> 1. Still route TaskAttemptCompletedEvent in Vertex
> 2. Route DataMovementEvent before TaskAttemptCompletedEvent in 
> TezTaskAttemptListener
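
As a rough illustration of option 2 above, the sketch below reorders the events from a single heartbeat so that any completion event is dispatched last. It is not the actual patch: the class EventOrderingSketch, its Kind enum, its Event type, and the method orderForRouting are all hypothetical names, and the real TezTaskAttemptListener may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these types are real Tez classes.
final class EventOrderingSketch {

    // Assumed event kinds, loosely mirroring the events named in the issue.
    enum Kind { TASK_STATUS_UPDATE, DATA_MOVEMENT, TASK_ATTEMPT_COMPLETED }

    static final class Event {
        final Kind kind;
        final String payload;

        Event(Kind kind, String payload) {
            this.kind = kind;
            this.payload = payload;
        }
    }

    // Returns the events of one heartbeat in routing order: every
    // non-completion event keeps its relative order and comes first,
    // and completion events are moved to the end.
    static List<Event> orderForRouting(List<Event> heartbeatEvents) {
        List<Event> ordered = new ArrayList<>(heartbeatEvents.size());
        List<Event> completions = new ArrayList<>();
        for (Event e : heartbeatEvents) {
            if (e.kind == Kind.TASK_ATTEMPT_COMPLETED) {
                completions.add(e);   // hold back until everything else is queued
            } else {
                ordered.add(e);
            }
        }
        ordered.addAll(completions);
        return ordered;
    }
}
```

A caller would then dispatch the returned list in order, so that a DataMovementEvent is always handled (and can be logged for recovery) before the completion event from the same heartbeat.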



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
