[jira] [Commented] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt
[ https://issues.apache.org/jira/browse/TEZ-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836066#comment-16836066 ]

Ying Han commented on TEZ-4068:
-------------------------------

Indeed, in most cases a speculative attempt scheduled once canCommit has been issued would be cancelled before completion. I would like to mention, though, that there is a slight chance that an attempt can still fail after canCommit, in the window between the invocation of TaskImpl#canCommit and the sending of TaskAttemptCompletedEvent.

That being said, I do agree that a speculative attempt scheduled after the commit has been initiated would most likely be wasted, and preventing it is a reasonable optimization. I would like to take on this JIRA and have assigned it to myself, [~jeagles].

> Prevent new speculative attempt after task has issued canCommit to an attempt
> -----------------------------------------------------------------------------
>
>                 Key: TEZ-4068
>                 URL: https://issues.apache.org/jira/browse/TEZ-4068
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jonathan Eagles
>            Priority: Major
>
> When a running attempt calls TaskImpl#canCommit through the taskUmbilical,
> the TaskImpl will issue a "go" if it is the first attempt to do so. Otherwise
> it will issue a "no-go". After commitAttempt is assigned in TaskImpl, no
> other attempt is allowed to succeed at that point. So a speculative attempt
> that is launched after commitAttempt is assigned can never finish before
> the original, since it will always be given a "no-go" in the canCommit
> response. In this jira, I propose to discuss disabling speculative attempts
> after commitAttempt has been assigned.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt
[ https://issues.apache.org/jira/browse/TEZ-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ying Han reassigned TEZ-4068:
----------------------------

    Assignee: Ying Han

> Prevent new speculative attempt after task has issued canCommit to an attempt
> -----------------------------------------------------------------------------
>
>                 Key: TEZ-4068
>                 URL: https://issues.apache.org/jira/browse/TEZ-4068
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jonathan Eagles
>            Assignee: Ying Han
>            Priority: Major
>
> When a running attempt calls TaskImpl#canCommit through the taskUmbilical,
> the TaskImpl will issue a "go" if it is the first attempt to do so. Otherwise
> it will issue a "no-go". After commitAttempt is assigned in TaskImpl, no
> other attempt is allowed to succeed at that point. So a speculative attempt
> that is launched after commitAttempt is assigned can never finish before
> the original, since it will always be given a "no-go" in the canCommit
> response. In this jira, I propose to discuss disabling speculative attempts
> after commitAttempt has been assigned.
[jira] [Comment Edited] (TEZ-2249) Wait for all task attempt finished before moving Task to finished state
[ https://issues.apache.org/jira/browse/TEZ-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835901#comment-16835901 ]

Jonathan Eagles edited comment on TEZ-2249 at 5/8/19 9:38 PM:
--------------------------------------------------------------

TEZ-4068 may be a way to reduce the likelihood of temporary directories being created after the task has committed and before the vertex has committed. This JIRA, on the other hand, would prevent that case entirely.

was (Author: jeagles):
TEZ-4068 may be a way to prevent the likelihood temporary directories being created after task has committed and before vertex has committed.

> Wait for all task attempt finished before moving Task to finished state
> -----------------------------------------------------------------------
>
>                 Key: TEZ-2249
>                 URL: https://issues.apache.org/jira/browse/TEZ-2249
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Major
>         Attachments: TEZ-2249-1.patch
>
> 2 cases:
> * If a Task needs to move to SUCCEEDED, then committing may happen while
> there is still a task attempt running.
> * If a Task needs to move to FAILED/KILLED/ERROD, then aborting may happen
> while there is still a task attempt running.
[jira] [Commented] (TEZ-2249) Wait for all task attempt finished before moving Task to finished state
[ https://issues.apache.org/jira/browse/TEZ-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835901#comment-16835901 ]

Jonathan Eagles commented on TEZ-2249:
--------------------------------------

TEZ-4068 may be a way to reduce the likelihood of temporary directories being created after the task has committed and before the vertex has committed.

> Wait for all task attempt finished before moving Task to finished state
> -----------------------------------------------------------------------
>
>                 Key: TEZ-2249
>                 URL: https://issues.apache.org/jira/browse/TEZ-2249
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Major
>         Attachments: TEZ-2249-1.patch
>
> 2 cases:
> * If a Task needs to move to SUCCEEDED, then committing may happen while
> there is still a task attempt running.
> * If a Task needs to move to FAILED/KILLED/ERROD, then aborting may happen
> while there is still a task attempt running.
[jira] [Commented] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt
[ https://issues.apache.org/jira/browse/TEZ-4068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835897#comment-16835897 ]

Jonathan Eagles commented on TEZ-4068:
--------------------------------------

[~Chyler], this change in behavior is similar to the TaskImpl state machine change made in TEZ-4062. I would like to hear your thoughts on this jira and whether it is a good change or not.

> Prevent new speculative attempt after task has issued canCommit to an attempt
> -----------------------------------------------------------------------------
>
>                 Key: TEZ-4068
>                 URL: https://issues.apache.org/jira/browse/TEZ-4068
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jonathan Eagles
>            Priority: Major
>
> When a running attempt calls TaskImpl#canCommit through the taskUmbilical,
> the TaskImpl will issue a "go" if it is the first attempt to do so. Otherwise
> it will issue a "no-go". After commitAttempt is assigned in TaskImpl, no
> other attempt is allowed to succeed at that point. So a speculative attempt
> that is launched after commitAttempt is assigned can never finish before
> the original, since it will always be given a "no-go" in the canCommit
> response. In this jira, I propose to discuss disabling speculative attempts
> after commitAttempt has been assigned.
[jira] [Created] (TEZ-4068) Prevent new speculative attempt after task has issued canCommit to an attempt
Jonathan Eagles created TEZ-4068:
---------------------------------

             Summary: Prevent new speculative attempt after task has issued canCommit to an attempt
                 Key: TEZ-4068
                 URL: https://issues.apache.org/jira/browse/TEZ-4068
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Jonathan Eagles

When a running attempt calls TaskImpl#canCommit through the taskUmbilical, the TaskImpl will issue a "go" if it is the first attempt to do so. Otherwise it will issue a "no-go". After commitAttempt is assigned in TaskImpl, no other attempt is allowed to succeed at that point. So a speculative attempt that is launched after commitAttempt is assigned can never finish before the original, since it will always be given a "no-go" in the canCommit response. In this jira, I propose to discuss disabling speculative attempts after commitAttempt has been assigned.
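
For illustration only, the go/no-go rule and the proposed guard might be sketched as below. This is a simplified stand-in, not the real TaskImpl: the class TaskCommitSketch, its String attempt IDs, and the methods onAttemptFailed and shouldLaunchSpeculativeAttempt are hypothetical names. Treating a repeated call by the committing attempt as a "go", and clearing commitAttempt when that attempt fails, are both assumptions.

```java
import java.util.Objects;

/** Hypothetical, simplified model of the commit bookkeeping; not the real TaskImpl. */
final class TaskCommitSketch {
    // Attempt that received the first "go"; null until some attempt asks to commit.
    private String commitAttempt;

    /** Returns true ("go") for the first attempt to ask, false ("no-go") for any other attempt. */
    synchronized boolean canCommit(String attemptId) {
        Objects.requireNonNull(attemptId, "attemptId");
        if (commitAttempt == null) {
            commitAttempt = attemptId;
            return true;
        }
        // Assumption: the attempt that already holds the "go" keeps it on repeated calls.
        return commitAttempt.equals(attemptId);
    }

    /** Assumption: if the committing attempt fails after its "go", the slot is freed again. */
    synchronized void onAttemptFailed(String attemptId) {
        if (attemptId.equals(commitAttempt)) {
            commitAttempt = null;
        }
    }

    /** Proposed guard: skip speculation while a commit attempt is assigned. */
    synchronized boolean shouldLaunchSpeculativeAttempt() {
        return commitAttempt == null;
    }
}
```

With this shape, a speculator would consult shouldLaunchSpeculativeAttempt() before scheduling a new attempt. If the committing attempt fails after its "go", speculation becomes possible again once the slot is cleared.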
[jira] [Created] (TEZ-4067) Tez Speculation decision is calculated on each update by the dispatcher
Ahmed Hussein created TEZ-4067:
-------------------------------

             Summary: Tez Speculation decision is calculated on each update by the dispatcher
                 Key: TEZ-4067
                 URL: https://issues.apache.org/jira/browse/TEZ-4067
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Ahmed Hussein

LegacySpeculator is an object field in VertexImpl. Therefore, all events are handled synchronously by the caller (dispatcher). This implies the following:
# The dispatcher spends a long time executing updateStatus, as it needs to check the runtime estimation of the tezAttempts within the vertex.
# The speculator is per stage: launching a speculation may not be the optimal decision. Ideally, based on resources, speculated tasks should be the ones with the slowest progress.
# The time between speculations is skewed because there is a big delay for the dispatcher to complete a full cycle.

Also, speculation will be more aggressive compared to MR because MR waits for "soonest.retry.after.speculate" whenever a task is speculated. On the other hand, Tez speculates more tasks as it processes stages in parallel.
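
As a rough illustration of the "wait after speculating" behaviour attributed to MR above, a minimal throttle could look like the following. The class SpeculationThrottleSketch, its constructor parameter, and tryAcquire are hypothetical names and are not part of LegacySpeculator or of MapReduce; the sketch only shows the idea of enforcing a minimum gap between speculation launches.

```java
/** Hypothetical throttle enforcing a minimum delay between speculations; not a Tez or MR API. */
final class SpeculationThrottleSketch {
    private final long retryAfterSpeculateMs;
    private long lastSpeculationMs;
    private boolean hasSpeculated;

    SpeculationThrottleSketch(long retryAfterSpeculateMs) {
        if (retryAfterSpeculateMs < 0) {
            throw new IllegalArgumentException("delay must be non-negative");
        }
        this.retryAfterSpeculateMs = retryAfterSpeculateMs;
    }

    /**
     * Returns true and records the launch time if a speculation may start at nowMs;
     * returns false if the previous speculation was launched too recently.
     */
    synchronized boolean tryAcquire(long nowMs) {
        if (hasSpeculated && nowMs - lastSpeculationMs < retryAfterSpeculateMs) {
            return false;
        }
        hasSpeculated = true;
        lastSpeculationMs = nowMs;
        return true;
    }
}
```

A single throttle shared across vertices would approximate one job-wide wait, whereas one instance per vertex would still let stages speculate in parallel. Which of the two is appropriate is left open here.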
[jira] [Commented] (TEZ-2249) Wait for all task attempt finished before moving Task to finished state
[ https://issues.apache.org/jira/browse/TEZ-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835691#comment-16835691 ]

Jonathan Eagles commented on TEZ-2249:
--------------------------------------

I looked at MapReduce for a similar feature, but there is none; it is susceptible to the same race condition. I have seen this occur recently, and the outcome can be bad since a temporary directory (and presumably files) can show up after the vertex stage commits. If subsequent stages are triggered by the SUCCESS marker (a '_SUCCESS' file) being written, this can cause issues because the contents change after the marker is created.

If there is still interest, I could help work on this patch (giving [~zjffdu] proper credit), as the assignee isn't able to work on this.

> Wait for all task attempt finished before moving Task to finished state
> -----------------------------------------------------------------------
>
>                 Key: TEZ-2249
>                 URL: https://issues.apache.org/jira/browse/TEZ-2249
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Major
>         Attachments: TEZ-2249-1.patch
>
> 2 cases:
> * If a Task needs to move to SUCCEEDED, then committing may happen while
> there is still a task attempt running.
> * If a Task needs to move to FAILED/KILLED/ERROD, then aborting may happen
> while there is still a task attempt running.
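
The behaviour requested in this issue could be sketched, under heavy simplification, as a task that defers its terminal state until no attempt is still running. TaskCompletionSketch and all of its methods are hypothetical names for illustration; they do not reflect the attached patch or the real TaskImpl state machine.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model: a task may only finish once every attempt has stopped running. */
final class TaskCompletionSketch {
    private final Set<String> runningAttempts = new HashSet<>();
    // Set once the task knows its outcome (for example, one attempt succeeded).
    private boolean outcomeDecided;

    synchronized void attemptStarted(String attemptId) {
        runningAttempts.add(attemptId);
    }

    synchronized void attemptFinished(String attemptId) {
        runningAttempts.remove(attemptId);
    }

    synchronized void outcomeDecided() {
        outcomeDecided = true;
    }

    /** True only when the outcome is known and no attempt can still write temporary output. */
    synchronized boolean canMoveToFinishedState() {
        return outcomeDecided && runningAttempts.isEmpty();
    }
}
```

Under this model, a commit or abort would be triggered only after canMoveToFinishedState() returns true, so a straggling attempt could not create temporary directories after the '_SUCCESS' marker is written.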