Improve speculative execution
-----------------------------

                 Key: MAPREDUCE-2039
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2039
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Dick King
            Assignee: Dick King


In speculation, the framework issues a second task attempt on a task where one 
attempt is already running.  This is useful if the running attempt is bogged 
down for reasons outside of the task's code, so that a second attempt can 
finish ahead of the existing attempt, even though the first attempt has a head 
start.

Early versions of speculation had the weakness that an attempt that starts out 
well but breaks down near the end would never get speculated.  That got fixed 
in HADOOP:2141, but with that fix speculation wouldn't engage until the 
performance of the old attempt, _even counting the early portion where it 
progressed normally_, was significantly worse than average.

I want to fix that by overweighting the more recent progress increments.  In 
particular, I would like to use exponential smoothing with a lambda of 
approximately 1/minute [which is the time scale of speculative execution] to 
measure progress per unit time.  This affects the speculation code in two 
places:

   * It affects the set of task attempts we consider to be underperforming.
   * It affects our estimates of when we expect tasks to finish.  This could be 
     hugely important; speculation's main benefit is that it gets a single 
     outlier task finished earlier than otherwise possible, and we need to know 
     which task is the outlier as accurately as possible.
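
To make the proposal concrete, here is a minimal sketch of an exponentially 
smoothed progress-rate tracker and the finish-time estimate derived from it.  
The class name SmoothedProgressRate, its methods, and the millisecond units are 
all hypothetical and not existing Hadoop code; the weight 1 - exp(-lambda * dt) 
is one common way to apply exponential smoothing to irregularly spaced reports, 
assumed here rather than taken from any patch.

```java
/** Hypothetical helper for illustration; not an existing Hadoop class. */
final class SmoothedProgressRate {
    private final double lambdaPerMs;  // decay constant; 1.0 / 60_000 means 1/minute
    private long lastTimeMs;           // time of the most recent accepted report
    private double lastProgress;       // progress fraction in [0, 1]
    private double smoothedRate;       // progress fraction per millisecond
    private boolean seeded;            // false until the first report arrives

    SmoothedProgressRate(double lambdaPerMs, long startTimeMs) {
        this.lambdaPerMs = lambdaPerMs;
        this.lastTimeMs = startTimeMs;
    }

    /** Folds in one progress report; reports that do not advance the clock are ignored. */
    void update(long nowMs, double progress) {
        long dt = nowMs - lastTimeMs;
        if (dt <= 0) {
            return;
        }
        double rate = (progress - lastProgress) / dt;
        if (!seeded) {
            smoothedRate = rate;       // the first interval seeds the estimate
            seeded = true;
        } else {
            // Longer gaps between reports give the new increment more weight.
            double alpha = 1.0 - Math.exp(-lambdaPerMs * dt);
            smoothedRate += alpha * (rate - smoothedRate);
        }
        lastTimeMs = nowMs;
        lastProgress = progress;
    }

    double ratePerMs() {
        return seeded ? smoothedRate : 0.0;
    }

    /** Estimated finish time, or Long.MAX_VALUE when no forward progress is visible. */
    long estimatedFinishMs() {
        if (!seeded || smoothedRate <= 0.0) {
            return Long.MAX_VALUE;
        }
        return lastTimeMs + (long) Math.ceil((1.0 - lastProgress) / smoothedRate);
    }

    public static void main(String[] args) {
        SmoothedProgressRate r = new SmoothedProgressRate(1.0 / 60_000, 0L);
        // Five healthy minutes at 10% per minute, then five minutes at 1% per minute.
        for (int m = 1; m <= 5; m++) r.update(m * 60_000L, 0.1 * m);
        for (int m = 1; m <= 5; m++) r.update((5 + m) * 60_000L, 0.5 + 0.01 * m);
        System.out.println("smoothed rate per minute: " + r.ratePerMs() * 60_000);
        System.out.println("estimated finish (ms): " + r.estimatedFinishMs());
    }
}
```

In this made-up trace the whole-run average is 5.5% per minute, while the 
smoothed rate ends up close to the recent 1% per minute, so the attempt looks 
both slower and later-finishing than the average alone would suggest.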

I would like a rich suite of configuration variables, minimally including 
lambda and possibly weighting factors.  We might have two exponentially 
smoothed tracking variables of the progress rate, to distinguish attempts that 
are bogged down and getting worse from those that are bogged down but 
improving.
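
One way the two tracking variables could work, sketched under my own 
assumptions: keep a fast and a slow exponentially smoothed rate and compare 
them.  The class ProgressTrend, the Trend enum, and the tolerance parameter are 
hypothetical names for illustration, and the comparison rule is just one 
plausible choice.

```java
/** Hypothetical sketch: two smoothed rates with different decay constants. */
final class ProgressTrend {
    enum Trend { WORSENING, IMPROVING, STEADY }

    private final double fastLambdaPerMs;  // short memory, e.g. 1/minute
    private final double slowLambdaPerMs;  // long memory, e.g. 1/(5 minutes)
    private double fastRate;
    private double slowRate;
    private long lastTimeMs;
    private double lastProgress;
    private boolean seeded;

    ProgressTrend(double fastLambdaPerMs, double slowLambdaPerMs, long startTimeMs) {
        this.fastLambdaPerMs = fastLambdaPerMs;
        this.slowLambdaPerMs = slowLambdaPerMs;
        this.lastTimeMs = startTimeMs;
    }

    /** Folds one progress report into both smoothed rates. */
    void update(long nowMs, double progress) {
        long dt = nowMs - lastTimeMs;
        if (dt <= 0) {
            return;
        }
        double rate = (progress - lastProgress) / dt;
        if (!seeded) {
            fastRate = rate;
            slowRate = rate;
            seeded = true;
        } else {
            fastRate += (1.0 - Math.exp(-fastLambdaPerMs * dt)) * (rate - fastRate);
            slowRate += (1.0 - Math.exp(-slowLambdaPerMs * dt)) * (rate - slowRate);
        }
        lastTimeMs = nowMs;
        lastProgress = progress;
    }

    /** Compares recent behavior (fast) against longer-term behavior (slow). */
    Trend trend(double tolerance) {
        if (!seeded) {
            return Trend.STEADY;
        }
        if (fastRate < slowRate * (1.0 - tolerance)) {
            return Trend.WORSENING;
        }
        if (fastRate > slowRate * (1.0 + tolerance)) {
            return Trend.IMPROVING;
        }
        return Trend.STEADY;
    }

    public static void main(String[] args) {
        ProgressTrend t = new ProgressTrend(1.0 / 60_000, 1.0 / 300_000, 0L);
        // Three healthy minutes, then three slow minutes.
        for (int m = 1; m <= 3; m++) t.update(m * 60_000L, 0.1 * m);
        for (int m = 1; m <= 3; m++) t.update((3 + m) * 60_000L, 0.3 + 0.01 * m);
        System.out.println(t.trend(0.1));
    }
}
```

Both decay constants and the tolerance would be natural candidates for the 
configuration variables mentioned above.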


Perhaps we should be especially eager to speculate a second attempt.  If a task 
is deterministically failing after bogging down [think "rare infinite loop 
bug"] we would rather run a couple of attempts in parallel to discover the 
problem sooner.


As part of this patch we would like to add benchmarks that simulate rare tasks 
that behave poorly, so we can discover whether this change in the code is a 
good idea and what the proper configuration is.  Early versions of this will be 
driven by our assumptions.  Later versions will be driven by the fruits of 
MAPREDUCE:2037.
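
As a rough idea of what an assumption-driven early benchmark might look like, 
the sketch below simulates tasks in one-minute steps, lets a rare task slow 
down partway through, and measures how many minutes pass before a rate estimate 
falls below a fraction of the nominal rate.  Everything here is hypothetical: 
the class StragglerBenchmarkSketch, the nominal rate, the 1% straggler 
frequency, the 10x slowdown, and the threshold rule, which is a simplified 
stand-in and not the test the real speculation code applies.

```java
import java.util.Random;

/** Hypothetical toy benchmark; all constants are assumptions, not measurements. */
final class StragglerBenchmarkSketch {
    static final double NOMINAL_RATE = 0.01;  // progress per simulated minute
    static final double LAMBDA = 1.0;         // smoothing constant, per minute
    static final int HORIZON = 1000;          // give up after this many minutes

    /**
     * Minutes after the slowdown until the rate estimate drops below
     * threshold * NOMINAL_RATE, or -1 if that never happens before the
     * task completes or the horizon is reached.
     */
    static int detectionDelay(boolean smoothed, int slowdownStart,
                              double slowFactor, double threshold) {
        double alpha = 1.0 - Math.exp(-LAMBDA);  // weight for one-minute steps
        double progress = 0.0;
        double estimate = NOMINAL_RATE;
        for (int minute = 1; minute <= HORIZON && progress < 1.0; minute++) {
            double rate = minute <= slowdownStart
                    ? NOMINAL_RATE : NOMINAL_RATE * slowFactor;
            progress += rate;
            estimate = smoothed
                    ? estimate + alpha * (rate - estimate)  // recent-weighted rate
                    : progress / minute;                    // whole-run average
            if (minute > slowdownStart && estimate < threshold * NOMINAL_RATE) {
                return minute - slowdownStart;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);  // fixed seed so runs are repeatable
        int stragglers = 0;
        long smoothedTotal = 0;
        long averageTotal = 0;
        for (int task = 0; task < 1000; task++) {
            if (rng.nextDouble() >= 0.01) {
                continue;  // healthy task, nothing to detect
            }
            int start = 10 + rng.nextInt(70);  // slowdown begins at minute 10..79
            int s = detectionDelay(true, start, 0.1, 0.5);
            int a = detectionDelay(false, start, 0.1, 0.5);
            if (s < 0 || a < 0) {
                continue;
            }
            stragglers++;
            smoothedTotal += s;
            averageTotal += a;
        }
        if (stragglers > 0) {
            System.out.println("stragglers simulated: " + stragglers);
            System.out.println("mean delay, smoothed: " + (double) smoothedTotal / stragglers);
            System.out.println("mean delay, whole-run average: " + (double) averageTotal / stragglers);
        }
    }
}
```

A real benchmark would need to run actual jobs and measure job completion time, 
not just detection delay; this only illustrates the kind of comparison 
intended.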

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
