[jira] [Commented] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt

2019-05-08 Thread Ying Han (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836066#comment-16836066
 ] 

Ying Han commented on TEZ-4068:
---

Indeed, in most cases a speculative attempt scheduled once canCommit has been 
issued would be cancelled before completion. I would like to mention, though, 
that there is a slight chance that an attempt can still fail after canCommit: 
between the invocation of TaskImpl#canCommit and the sending of 
TaskAttemptCompletedEvent. 

That being said, I do agree that a speculative attempt scheduled after the commit 
has been initiated would most likely be wasted, and it is a reasonable 
optimization to prevent that from happening. I would like to take on this JIRA 
and have assigned it to myself, [~jeagles].
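The window described above can be sketched as a small state model. The following Java is a hypothetical illustration, not Tez code: the class CommitWindow and its methods are invented names. It also assumes, which the thread does not state, that when the attempt holding commit permission fails before its completion event arrives, the task releases commitAttempt so that another attempt can later be granted "go". Under that assumption, blocking speculation while a commit is pending does not leave the task stuck.

```java
import java.util.Objects;

// Hypothetical model of the window between canCommit and TaskAttemptCompletedEvent.
// Not the real Tez TaskImpl; all names are illustrative.
class CommitWindow {
    private String commitAttempt = null; // attempt granted "go", if any
    private boolean succeeded = false;   // true once the committer's completion event arrives

    // First caller gets "go"; assumption: a repeat call from that attempt also gets "go".
    synchronized boolean canCommit(String attemptId) {
        if (commitAttempt == null) {
            commitAttempt = attemptId;
            return true;
        }
        return Objects.equals(commitAttempt, attemptId);
    }

    // The committing attempt completed normally: the commit is now final.
    synchronized void onAttemptSucceeded(String attemptId) {
        if (Objects.equals(commitAttempt, attemptId)) {
            succeeded = true;
        }
    }

    // Assumption: if the committer fails inside the window, the commit slot is released.
    synchronized void onAttemptFailed(String attemptId) {
        if (!succeeded && Objects.equals(commitAttempt, attemptId)) {
            commitAttempt = null;
        }
    }

    // Speculation is only useful while nobody holds the commit slot.
    synchronized boolean mayLaunchSpeculativeAttempt() {
        return commitAttempt == null && !succeeded;
    }
}
```

In this model the rare failure case simply reopens the slot: speculation resumes and the next attempt to call canCommit is told "go".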

> Prevent new speculative attempt after task has issued canCommit to an attempt
> -
>
> Key: TEZ-4068
> URL: https://issues.apache.org/jira/browse/TEZ-4068
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Jonathan Eagles
>Priority: Major
>
> When a running attempt calls TaskImpl#canCommit through the taskUmbilical, 
> the TaskImpl will issue a "go" if it is the first attempt to do so. Otherwise 
> it will issue a "no-go". After commitAttempt is assigned in TaskImpl, no 
> other attempt is allowed to succeed. So a speculative attempt 
> that is launched after commitAttempt is assigned can never finish before 
> the original, since it will always be given a "no-go" in the canCommit 
> response. In this jira, I propose to discuss disabling speculative attempts 
> after commitAttempt has been assigned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt

2019-05-08 Thread Ying Han (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying Han reassigned TEZ-4068:
-

Assignee: Ying Han



[jira] [Comment Edited] (TEZ-2249) Wait for all task attempt finished before moving Task to finished state

2019-05-08 Thread Jonathan Eagles (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835901#comment-16835901
 ] 

Jonathan Eagles edited comment on TEZ-2249 at 5/8/19 9:38 PM:
--

TEZ-4068 may be a way to reduce the likelihood of temporary directories being 
created after the task has committed and before the vertex has committed. This 
JIRA, on the other hand, would prevent that case entirely.



> Wait for all task attempt finished before moving Task to finished state
> ---
>
> Key: TEZ-2249
> URL: https://issues.apache.org/jira/browse/TEZ-2249
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
> Attachments: TEZ-2249-1.patch
>
>
> 2 cases:
> * If the Task needs to move to SUCCEEDED, committing may happen while 
> a task attempt is still running.
> * If the Task needs to move to FAILED/KILLED/ERROR, aborting may happen 
> while a task attempt is still running.
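One way to express this proposal is to hold the Task in an intermediate state until no attempt is running. This is a minimal Java sketch under stated assumptions: the class TaskFinishGate, the SUCCEEDING/FAILING states and the method names are hypothetical and correspond neither to the attached TEZ-2249-1.patch nor to the real TaskImpl state machine. The point is only that commit (or abort) happens once the set of running attempts is empty.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: defer the Task's terminal state until no attempt is still running.
class TaskFinishGate {
    enum State { RUNNING, SUCCEEDING, SUCCEEDED, FAILING, FAILED }

    private final Set<String> running = new HashSet<>();
    private State state = State.RUNNING;

    void attemptLaunched(String attemptId) {
        running.add(attemptId);
    }

    // One attempt succeeded: the task will succeed, but only after all other attempts stop.
    void attemptSucceeded(String attemptId) {
        running.remove(attemptId);
        if (state == State.RUNNING) {
            state = State.SUCCEEDING;
        }
        tryFinish();
    }

    // An attempt was killed or failed and is no longer running.
    void attemptTerminated(String attemptId) {
        running.remove(attemptId);
        tryFinish();
    }

    // The task has been declared failed; abort only after in-flight attempts stop.
    void failTask() {
        if (state == State.RUNNING) {
            state = State.FAILING;
        }
        tryFinish();
    }

    private void tryFinish() {
        if (!running.isEmpty()) {
            return; // some attempt may still be writing output
        }
        if (state == State.SUCCEEDING) {
            state = State.SUCCEEDED; // safe point to commit
        } else if (state == State.FAILING) {
            state = State.FAILED;    // safe point to abort
        }
    }

    State state() {
        return state;
    }
}
```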





[jira] [Commented] (TEZ-2249) Wait for all task attempt finished before moving Task to finished state

2019-05-08 Thread Jonathan Eagles (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835901#comment-16835901
 ] 

Jonathan Eagles commented on TEZ-2249:
--

TEZ-4068 may be a way to reduce the likelihood of temporary directories being 
created after the task has committed and before the vertex has committed.



[jira] [Commented] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt

2019-05-08 Thread Jonathan Eagles (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835897#comment-16835897
 ] 

Jonathan Eagles commented on TEZ-4068:
--

[~Chyler], this change in behavior is similar to the TaskImpl state machine 
change made in TEZ-4062. I would like to hear your thoughts on this JIRA and 
whether it is a good change or not. 



[jira] [Created] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt

2019-05-08 Thread Jonathan Eagles (JIRA)
Jonathan Eagles created TEZ-4068:


 Summary: Prevent new speculative attempt after task has issued 
canCommit to an attempt
 Key: TEZ-4068
 URL: https://issues.apache.org/jira/browse/TEZ-4068
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Jonathan Eagles


When a running attempt calls TaskImpl#canCommit through the taskUmbilical, the 
TaskImpl will issue a "go" if it is the first attempt to do so. Otherwise it 
will issue a "no-go". After commitAttempt is assigned in TaskImpl, no other 
attempt is allowed to succeed. So a speculative attempt that is 
launched after commitAttempt is assigned can never finish before the original, 
since it will always be given a "no-go" in the canCommit response. In this 
jira, I propose to discuss disabling speculative attempts after commitAttempt 
has been assigned.
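The go/no-go rule above, and the proposed check, can be modelled in a few lines. This Java sketch is hypothetical: CommitGate, the String attempt IDs and mayLaunchSpeculativeAttempt() are illustrative names, not the real TaskImpl API, and the handling of a repeated canCommit call from the winning attempt is an assumption.

```java
import java.util.Objects;

// Hypothetical model of the canCommit "go"/"no-go" rule described in TEZ-4068.
// Not the real Tez TaskImpl; all names are illustrative.
class CommitGate {
    // Attempt that was first granted permission to commit; null until someone asks.
    private String commitAttempt = null;

    // Returns true ("go") for the first attempt that asks. Assumption: a repeated call
    // from that same attempt is answered "go" again; every other attempt gets "no-go".
    synchronized boolean canCommit(String attemptId) {
        if (commitAttempt == null) {
            commitAttempt = attemptId; // first caller wins the commit slot
            return true;
        }
        return Objects.equals(commitAttempt, attemptId);
    }

    // The proposed check: once commitAttempt is assigned, a newly launched speculative
    // attempt could only ever be told "no-go", so launching it would waste resources.
    synchronized boolean mayLaunchSpeculativeAttempt() {
        return commitAttempt == null;
    }
}
```

With this model, once attempt_0 has been told "go", a speculative attempt_1 launched afterwards can only ever receive false from canCommit, which is why mayLaunchSpeculativeAttempt() refuses to launch it.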





[jira] [Created] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher

2019-05-08 Thread Ahmed Hussein (JIRA)
Ahmed Hussein created TEZ-4067:
--

 Summary: Tez Speculation decision is calculated on each update by 
the dispatcher
 Key: TEZ-4067
 URL: https://issues.apache.org/jira/browse/TEZ-4067
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Ahmed Hussein


LegacySpeculator is an object field in VertexImpl. Therefore, all events are 
handled synchronously by the caller (the dispatcher). This implies the following:
 # The dispatcher spends a long time executing updateStatus, as it needs to 
check the runtime estimates of the tezAttempts within the vertex.
 # The speculator is per stage, so launching a speculative attempt may not be 
the optimal decision. Ideally, based on resources, the speculated tasks should 
be the ones with the slowest progress.
 # The time between speculations is skewed because there is a big delay for the 
dispatcher to complete a full cycle. Also, speculation will be more aggressive 
than in MR, because MR waits for "soonest.retry.after.speculate" whenever a 
task is speculated, whereas Tez speculates more tasks as it processes stages 
in parallel.
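A possible direction is to make the dispatcher's work on each status update O(1) and move the speculation decision onto its own periodic thread. The Java sketch below is an assumption-laden illustration, not LegacySpeculator: AsyncSpeculatorSketch, the use of raw progress instead of runtime estimates, and the single cool-down parameter (modelled loosely on MR's "soonest.retry.after.speculate") are all hypothetical. Because tasks are keyed by ID, one instance could in principle be shared across vertices, so the slowest task is picked globally rather than per stage.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: decouple speculation decisions from the event dispatcher.
class AsyncSpeculatorSketch {
    // taskId -> latest progress in [0, 1]; written by the dispatcher, read by the speculator.
    private final Map<String, Double> progress = new ConcurrentHashMap<>();
    // Tasks already speculated once; never picked again.
    private final Set<String> speculated = ConcurrentHashMap.newKeySet();
    // Assumed cool-down between two speculations, in milliseconds.
    private final long retryAfterSpeculateMillis;
    private long nextAllowedMillis = 0; // guarded by "this"

    AsyncSpeculatorSketch(long retryAfterSpeculateMillis) {
        this.retryAfterSpeculateMillis = retryAfterSpeculateMillis;
    }

    // Called on the dispatcher thread: a single map write, no estimation work.
    void updateStatus(String taskId, double p) {
        progress.put(taskId, p);
    }

    void taskFinished(String taskId) {
        progress.remove(taskId);
    }

    // Called periodically off the dispatcher: pick the slowest not-yet-speculated task,
    // respecting the cool-down. Returns empty if nothing should be speculated now.
    synchronized Optional<String> evaluate(long nowMillis) {
        if (nowMillis < nextAllowedMillis) {
            return Optional.empty();
        }
        Optional<String> slowest = progress.entrySet().stream()
                .filter(e -> !speculated.contains(e.getKey()))
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
        slowest.ifPresent(id -> {
            speculated.add(id);
            nextAllowedMillis = nowMillis + retryAfterSpeculateMillis;
        });
        return slowest;
    }

    // Runs evaluate() on its own thread; launcher would request the speculative attempt.
    ScheduledExecutorService start(long periodMillis, Consumer<String> launcher) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(
                () -> evaluate(System.currentTimeMillis()).ifPresent(launcher),
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return exec;
    }
}
```

Passing the timestamp into evaluate() keeps the policy deterministic and easy to test; start() only wires it to a ScheduledExecutorService.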

 





[jira] [Commented] (TEZ-2249) Wait for all task attempt finished before moving Task to finished state

2019-05-08 Thread Jonathan Eagles (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835691#comment-16835691
 ] 

Jonathan Eagles commented on TEZ-2249:
--

Looked at MapReduce for a similar feature, but there is none; it is susceptible 
to the same race condition. I have seen this occur recently, and the outcome can 
be bad, since a temporary directory (and presumably files) can show up after the 
vertex commits. If subsequent stages are triggered based on a SUCCESS file being 
written, this can cause issues, because the contents change after the SUCCESS 
marker (a '_SUCCESS' file) is created.

If there is still interest, I could help work on this patch (giving [~zjffdu] 
proper credit), as the assignee isn't able to work on it.
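The race described above can be reproduced in miniature with plain java.nio.file calls. This is a hypothetical single-process simulation with no Tez or Hadoop APIs involved: a "vertex commit" writes the _SUCCESS marker, a downstream consumer lists the directory, and then an attempt that was never stopped creates its scratch directory. The _temporary/attempt_1 layout and the file names are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

// Hypothetical simulation of output changing after the _SUCCESS marker is written.
class SuccessMarkerRace {
    // Sorted names of the direct children of dir.
    static Set<String> snapshot(Path dir) throws IOException {
        Set<String> names = new TreeSet<>();
        try (Stream<Path> children = Files.list(dir)) {
            children.forEach(p -> names.add(p.getFileName().toString()));
        }
        return names;
    }

    // Returns [listing seen when _SUCCESS appears, listing seen after the straggler runs].
    static List<Set<String>> simulate(Path out) throws IOException {
        Files.createFile(out.resolve("part-00000")); // output of the committed attempt
        Files.createFile(out.resolve("_SUCCESS"));   // vertex commit writes the marker
        Set<String> seenAtSuccess = snapshot(out);   // a downstream stage triggers here
        // A straggling attempt that was never stopped now creates its scratch space
        // (assumed layout, for illustration only).
        Files.createDirectories(out.resolve("_temporary").resolve("attempt_1"));
        Set<String> seenLater = snapshot(out);
        return Arrays.asList(seenAtSuccess, seenLater);
    }
}
```

A consumer that acted on the first listing would see the directory change underneath it, which is the case that waiting for all task attempts to finish is meant to rule out.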
