[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-07 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532076#comment-14532076
 ] 

TezQA commented on TEZ-2404:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12731076/TEZ-2404-3.patch
  against master revision 02870f0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.test.TestFaultTolerance

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/648//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/648//console

This message is automatically generated.

 Handle DataMovementEvent before its TaskAttemptCompletedEvent
 -

 Key: TEZ-2404
 URL: https://issues.apache.org/jira/browse/TEZ-2404
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Critical
 Attachments: TEZ-2404-1.patch, TEZ-2404-2.patch, TEZ-2404-3.patch


 TEZ-2325 routes TASK_ATTEMPT_COMPLETED_EVENT directly to the attempt, but this 
 causes a recovery issue. Recovery needs the DataMovement event to be handled 
 before the TaskAttemptCompletedEvent; otherwise the DataMovement event may be 
 lost during recovery and cause its dependent tasks to hang.
 Two ways to fix this issue:
 1. Still route TaskAttemptCompletedEvent in Vertex
 2. Route DataMovementEvent before TaskAttemptCompletedEvent in 
 TezTaskAttemptListener



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530062#comment-14530062
 ] 

Siddharth Seth commented on TEZ-2404:
-

I'm not completely aware of how the recovery code works. I'm assuming the 
TASK_FINISHED_EVENT triggers some kind of sync point, which is hit in the 
Vertex to ensure all source events are serialized.
Won't special-casing TASK_COMPLETE (DONE, FAILED, etc.) to go to VertexImpl and 
TASK_STATUS_UPDATE to go to TaskImpl work, as long as the 
TASK_STATUS_UPDATE goes before the TASK_COMPLETE event?
Both would go out on the main dispatcher, so ordering is maintained.
This still gives us most of the benefits of TEZ-2325, since TaskComplete 
events are received once per task, but TASK_STATUS_UPDATES are received every 
100ms / heartbeat-interval, which can amount to a large number of events even 
for short-running tasks.


[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531647#comment-14531647
 ] 

Siddharth Seth commented on TEZ-2404:
-

Just tried the .2 patch out on several large jobs (3 X 20Lx20K, 2X 50Kx50K 
OneToOne edge). Each task had a 1-second sleep, which should allow for 9-10 
status update events. No issues running any of them.
I'm +1 on the .2 patch going in.



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531146#comment-14531146
 ] 

Bikas Saha commented on TEZ-2404:
-

No they were not.



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531163#comment-14531163
 ] 

Siddharth Seth commented on TEZ-2404:
-

Sounds like changing
{code}
if (eventType == EventType.TASK_STATUS_UPDATE_EVENT ||
  eventType == EventType.TASK_ATTEMPT_COMPLETED_EVENT) {
{code}

to 
{code}
if (eventType == EventType.TASK_STATUS_UPDATE_EVENT) {
{code}
would fix both issues ?
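
A minimal sketch of the routing decision this change implies, assuming a simplified model rather than the actual Tez classes. The `EventType` enum here only mirrors the two constants quoted above plus an assumed `DATA_MOVEMENT_EVENT`; `Target` and `RoutingSketch.route` are hypothetical names used for illustration.

```java
// Hypothetical, simplified model of the routing decision; not the actual Tez code.
final class RoutingSketch {
    enum EventType { TASK_STATUS_UPDATE_EVENT, TASK_ATTEMPT_COMPLETED_EVENT, DATA_MOVEMENT_EVENT }

    // Where the listener sends an incoming event first.
    enum Target { TASK_ATTEMPT, VERTEX }

    // With the narrowed condition, only status updates bypass the vertex.
    // Completion events go through the vertex again, behind any data movement events.
    static Target route(EventType eventType) {
        if (eventType == EventType.TASK_STATUS_UPDATE_EVENT) {
            return Target.TASK_ATTEMPT;
        }
        return Target.VERTEX;
    }
}
```

Under this sketch the frequent status updates keep the direct path, while the once-per-attempt completion event shares the vertex path with the data movement events.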



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531133#comment-14531133
 ] 

Siddharth Seth commented on TEZ-2404:
-

Were the jobs running with parallel dispatchers ? That will likely cause issues.



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531054#comment-14531054
 ] 

Bikas Saha commented on TEZ-2404:
-

You will have to find out by trying large jobs in a cluster, which is where I 
noticed the race conditions that resulted in creating TEZ-2418.



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531953#comment-14531953
 ] 

Jeff Zhang commented on TEZ-2404:
-

Need to add some comment, and will commit soon. 



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531932#comment-14531932
 ] 

Siddharth Seth commented on TEZ-2404:
-

[~zjffdu] - are any more changes required here ? or can this be committed ?



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531791#comment-14531791
 ] 

Bikas Saha commented on TEZ-2404:
-

The failures that I was seeing are unrelated to this or TEZ-2325. I looked at 
them further and opened TEZ-2426.

We can commit the current patch, keep TEZ-2418 open, and expand its scope to 
move both completed and failed events back to being sent directly to the task 
attempt. I will update TEZ-2418.

We should open a jira to make recovery resilient to the ordering of these events 
and make TEZ-2418 blocked on that jira. The change in this patch creates a 
nuanced routing where some events are single routed and some are double routed, 
with the implicit assumption that ordering is maintained because the 
double routed event was initially after the single routed event and all the 
routing happens on the same thread. Double routing delays it further on the 
same thread, so we are safe. If the double routed event were actually ahead, 
this would break immediately. IMO this kind of nuanced event routing is not 
something we should keep around for long.
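
The ordering assumption described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `DoubleRoutingSketch`, `Event`, and `attemptOrder` are hypothetical names, and the single queue stands in for a single-threaded dispatcher; it is not the Tez dispatcher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a single-threaded event queue; not the Tez dispatcher.
final class DoubleRoutingSketch {
    // An event name plus a flag saying whether it must pass through the vertex first.
    static final class Event {
        final String name;
        final boolean viaVertex;

        Event(String name, boolean viaVertex) {
            this.name = name;
            this.viaVertex = viaVertex;
        }
    }

    // Returns the order in which the task attempt handles the submitted events.
    static List<String> attemptOrder(List<Event> submitted) {
        Deque<Event> queue = new ArrayDeque<>(submitted);
        List<String> seenByAttempt = new ArrayList<>();
        while (!queue.isEmpty()) {
            Event e = queue.pollFirst();
            if (e.viaVertex) {
                // First hop: the vertex handles it, then re-enqueues it at the tail.
                queue.addLast(new Event(e.name, false));
            } else {
                // Final hop: the task attempt handles it.
                seenByAttempt.add(e.name);
            }
        }
        return seenByAttempt;
    }
}
```

In this model, submitting a single routed status update followed by a double routed completion event keeps their order at the attempt. Submitting them the other way round still delivers the status update first, so the submission order is not preserved once the double routed event is ahead.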




[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-06 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532007#comment-14532007
 ] 

Jeff Zhang commented on TEZ-2404:
-

Uploaded a new patch (added some comments to highlight the event ordering and 
routing).

Thanks for your review [~bikassaha] [~sseth] [~hitesh]. Committed to master and 
branch-0.7.




[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529528#comment-14529528
 ] 

Bikas Saha commented on TEZ-2404:
-

Mixing some events going directly and some events going indirectly may cause 
ordering issues, so the current patch may introduce those issues.

So we either fix recovery, or punt it for later and revert TEZ-2325. If we 
revert TEZ-2325, then let's please create a jira for the recovery fix (unless one 
exists already) and mark it as blocking TEZ-2325 and TEZ-2418.



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-05 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529509#comment-14529509
 ] 

Hitesh Shah commented on TEZ-2404:
--

Given this jira and TEZ-2418, it might be better to revert TEZ-2325 for 
now, until a full solution with recovery handled correctly is in place? \cc 
[~zjffdu] [~pramachandran]



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-05 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529432#comment-14529432
 ] 

Bikas Saha commented on TEZ-2404:
-

I am afraid the approach in the patch would be a regression because it 
makes task_completed_event double routed again. That is what was fixed 
in TEZ-2325. Double routing causes such an event to first get added to the end 
of the event queue, then get handled by the vertex, then get put back at the 
end of the event queue, then get handled by the task attempt.
We should look at fixing this in some other manner. One idea would be the 
following. When a task succeeds, it chooses a successful attempt and terminates 
all other attempts. It could send all these attempt ids to the vertex in the 
TaskCompletedEvent, and the vertex could write the end marker for the events 
from this list. Recovery would then need to change to use this end marker, and 
not the attempt completed marker, to determine that it has seen all events. Any 
other ideas?



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-05 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529440#comment-14529440
 ] 

Hitesh Shah commented on TEZ-2404:
--

The basic assumption in recovery is that all events are written to the recovery 
log before a task is marked as completed. I do not believe recovery even cares 
about distinguishing events generated by different attempts as it holds on to 
all of them and routes them on recovery. TEZ-2325 seems to have changed the 
above assumption. 
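
The assumption stated above can be expressed as a check over a recovery log. This is only a sketch, assuming a simplified log made of per-task entries; `RecoveryLogCheck`, `Entry`, `Kind`, and `eventsPrecedeCompletion` are hypothetical names and do not reflect the actual Tez recovery log format.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified recovery log model; not the actual Tez recovery format.
final class RecoveryLogCheck {
    enum Kind { DATA_MOVEMENT, TASK_COMPLETED }

    static final class Entry {
        final String taskId;
        final Kind kind;

        Entry(String taskId, Kind kind) {
            this.taskId = taskId;
            this.kind = kind;
        }
    }

    // Returns true if no DATA_MOVEMENT entry appears after its task's TASK_COMPLETED entry.
    static boolean eventsPrecedeCompletion(List<Entry> log) {
        Set<String> completedTasks = new HashSet<>();
        for (Entry e : log) {
            if (e.kind == Kind.TASK_COMPLETED) {
                completedTasks.add(e.taskId);
            } else if (completedTasks.contains(e.taskId)) {
                // A data movement event logged after completion would be missed on recovery.
                return false;
            }
        }
        return true;
    }
}
```

A recovery-related test along these lines would fail on a log where a task's data movement entry follows its completion entry, which is the situation this jira describes.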



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-05 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529348#comment-14529348
 ] 

Hitesh Shah commented on TEZ-2404:
--

Patch looks good to me though a recovery related test to catch data movements 
being seen after a task completion would help. 

[~bikassaha] Mind doing another review pass to see if anything was missed? 



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526916#comment-14526916
 ] 

Bikas Saha commented on TEZ-2404:
-

TEZ-1897 is not enabled yet, so we don't have to fix this immediately. We can 
use the time to explore other solutions that don't involve routing the same 
event twice. E.g. when the task completes, it sends an event to its vertex 
so that the vertex can increment its completed task count. Can that be used to 
mark the successful attempt as done in the history logs by the vertex? 
Logically, from what I see, the vertex is using the task attempt completed 
event as a marker for the successful attempt's history event completion, right? 
This approach may mean that an unsuccessful attempt will not have a completion 
marker. Will that be a problem? Maybe not, since we don't care about those 
attempts anyway. For work-preserving AM restart we can discard these events if 
the running task has not reconnected with the AM. In the non-work-preserving AM 
restart case we can always discard these events.



[jira] [Commented] (TEZ-2404) Handle DataMovementEvent before its TaskAttemptCompletedEvent

2015-05-04 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526931#comment-14526931
 ] 

Hitesh Shah commented on TEZ-2404:
--

Bumping up priority as this means recovery is potentially broken. 

[~zjffdu] It looks like we need a recovery-related test to ensure that data 
movement events are always stored before a task completion event.
