[ 
https://issues.apache.org/jira/browse/TEZ-4067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847860#comment-16847860
 ] 

Ahmed Hussein commented on TEZ-4067:
------------------------------------

An older issue, [TEZ-3934|https://issues.apache.org/jira/browse/TEZ-3934], reported a 
race condition in the speculator code: when two task attempts update 
their progress simultaneously, the speculator may create two speculative 
attempts for the same task.

That issue was closed after adding two more checks on the hashes to verify that 
no attempt was speculated while the current thread was busy with the calculation.

This does not solve the root problem, which is caused by calling maybeSpeculate() 
after updating the progress. A proper fix would be:
 * The event handler returns immediately after updating the taskAttempt status.
 * A separate "speculator" thread runs periodically, scanning the tasks within a 
vertex to make the speculation decision.
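
The two points above can be sketched roughly as follows. This is only an illustration, not the Tez code: the class PeriodicSpeculator, its methods, and the fixed progress threshold are hypothetical stand-ins (the real speculator uses runtime estimates), and only the JDK concurrency classes are real APIs.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: the handler only records progress; a separate thread scans. */
public class PeriodicSpeculator {
    // Latest reported progress per task (assumed model, not the Tez data structures).
    private final Map<String, Float> progressByTask = new ConcurrentHashMap<>();
    // Tasks that already have a speculative attempt.
    private final Set<String> speculated = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scanner =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "speculator");
            t.setDaemon(true);
            return t;
        });
    // Assumption: a task below this progress counts as slow.
    private final float slowThreshold;

    public PeriodicSpeculator(float slowThreshold) {
        this.slowThreshold = slowThreshold;
    }

    /** Called by the event handler: records the update and returns immediately. */
    public void updateStatus(String taskId, float progress) {
        progressByTask.put(taskId, progress);
    }

    /** One scan over all tasks; returns how many new speculations were decided. */
    public int scanOnce() {
        int launched = 0;
        for (Map.Entry<String, Float> e : progressByTask.entrySet()) {
            // Set.add returns false if the task was already speculated,
            // so a task is never speculated twice.
            if (e.getValue() < slowThreshold && speculated.add(e.getKey())) {
                launched++;
            }
        }
        return launched;
    }

    /** Starts the periodic scan on the dedicated "speculator" thread. */
    public void start(long periodMillis) {
        scanner.scheduleWithFixedDelay(this::scanOnce, periodMillis, periodMillis,
            TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scanner.shutdownNow();
    }

    public boolean isSpeculated(String taskId) {
        return speculated.contains(taskId);
    }
}
```

Because only the scanner thread decides on speculation, concurrent progress updates can no longer race to speculate the same task.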

 

Re-implementing the speculator as a service requires the following changes:
 # Add each vertex's speculator to the list of services in the application 
master (i.e., DAGAppMaster).
 # api/DAG needs to support creating the vertex speculator as a service.
 # Test cases (TestSpeculation) may need to be rewritten because they were 
designed for a single-threaded implementation.
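
As a rough illustration of the first change, the application master could keep the per-vertex speculators in a list and start and stop them together. Everything in this sketch is hypothetical: LifecycleService, VertexSpeculatorService, and SpeculatorServiceRegistry are made-up names for illustration and are not the DAGAppMaster or Hadoop service APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an application master owning per-vertex speculator services. */
public class SpeculatorServiceRegistry {

    /** Made-up lifecycle interface, standing in for a real service abstraction. */
    public interface LifecycleService {
        void start();
        void stop();
        boolean isRunning();
    }

    /** Made-up per-vertex speculator; a real one would start its scanning thread here. */
    public static final class VertexSpeculatorService implements LifecycleService {
        private final String vertexName;
        private volatile boolean running;

        public VertexSpeculatorService(String vertexName) {
            this.vertexName = vertexName;
        }

        @Override public void start() { running = true; }
        @Override public void stop() { running = false; }
        @Override public boolean isRunning() { return running; }
        public String getVertexName() { return vertexName; }
    }

    private final List<LifecycleService> services = new ArrayList<>();

    /** Registers a speculator when its vertex is created. */
    public synchronized void addService(LifecycleService service) {
        services.add(service);
    }

    /** Starts every registered service in registration order. */
    public synchronized void startAll() {
        for (LifecycleService s : services) {
            s.start();
        }
    }

    /** Stops services in reverse order of registration. */
    public synchronized void stopAll() {
        for (int i = services.size() - 1; i >= 0; i--) {
            services.get(i).stop();
        }
    }

    public synchronized int size() {
        return services.size();
    }
}
```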

 

> Tez Speculation decision is calculated on each update by the dispatcher
> -----------------------------------------------------------------------
>
>                 Key: TEZ-4067
>                 URL: https://issues.apache.org/jira/browse/TEZ-4067
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ahmed Hussein
>            Assignee: Ahmed Hussein
>            Priority: Minor
>
> LegacySpeculator is an object field in VertexImpl. Therefore, all events are 
> handled synchronously by the caller (dispatcher). This implies the following:
>  # The dispatcher spends a long time executing updateStatus because it needs to 
> check the runtime estimation of the task attempts within the vertex.
>  # The speculator is per stage: launching a speculation may not be the optimal 
> decision. Ideally, based on resources, the speculated tasks should be the ones 
> with the slowest progress.
>  # The time between speculations is skewed because there is a big delay for 
> the dispatcher to complete a full cycle. Also, speculation will be more 
> aggressive compared to MR because MR waits for 
> "soonest.retry.after.speculate" whenever a task is speculated. On the other 
> hand, Tez speculates more tasks as it processes stages in parallel.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
