[ https://issues.apache.org/jira/browse/MAPREDUCE-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved MAPREDUCE-2039.
-----------------------------------------
    Resolution: Fixed

Spec exec did get several improvements. So closing this.

> Improve speculative execution
> -----------------------------
>
>                 Key: MAPREDUCE-2039
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2039
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Dick King
>            Assignee: Dick King
>
> In speculation, the framework issues a second task attempt on a task where
> one attempt is already running. This is useful if the running attempt is
> bogged down for reasons outside of the task's code, so a second attempt
> finishes ahead of the existing attempt, even though the first attempt has a
> head start.
> Early versions of speculation had the weakness that an attempt that starts
> out well but breaks down near the end would never get speculated. That got
> fixed in HADOOP:2141, but in the fix the speculation wouldn't engage until
> the performance of the old attempt, _even counting the early portion where it
> progressed normally_, was significantly worse than average.
> I want to fix that by overweighting the more recent progress increments. In
> particular, I would like to use exponential smoothing with a lambda of
> approximately 1/minute [which is the time scale of speculative execution] to
> measure progress per unit time. This affects the speculation code in two
> places:
> * It affects the set of task attempts we consider to be underperforming
> * It affects our estimates of when we expect tasks to finish. This could
> be hugely important; speculation's main benefit is that it gets a single
> outlier task finished earlier than otherwise possible, and we need to know
> which task is the outlier as accurately as possible.
> I would like a rich suite of configuration variables, minimally including
> lambda and possibly weighting factors.
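To make the smoothing proposal above concrete, here is a minimal sketch of an exponentially smoothed progress-rate tracker. It is not the Hadoop implementation: the class name, method names, and the choice of weight 1 - exp(-lambda * dt) for unevenly spaced progress reports are all assumptions made for illustration. Progress is taken to be a fraction in [0, 1] and time is in seconds.

```java
// Hypothetical sketch, not Hadoop code: exponentially smoothed progress rate.
final class SmoothedProgressRate {
    private final double lambdaPerSecond; // e.g. 1.0 / 60.0 for "1/minute"
    private double lastTimeSec;
    private double lastProgress;          // fraction complete, 0.0 .. 1.0
    private double smoothedRate;          // progress per second
    private boolean seeded;

    SmoothedProgressRate(double lambdaPerSecond, double startTimeSec) {
        if (lambdaPerSecond <= 0.0) {
            throw new IllegalArgumentException("lambda must be positive");
        }
        this.lambdaPerSecond = lambdaPerSecond;
        this.lastTimeSec = startTimeSec;
    }

    // Record a progress report. Reports with non-increasing time are ignored.
    void update(double timeSec, double progress) {
        double dt = timeSec - lastTimeSec;
        if (dt <= 0.0) {
            return;
        }
        double instantRate = (progress - lastProgress) / dt;
        if (!seeded) {
            smoothedRate = instantRate;   // first increment seeds the average
            seeded = true;
        } else {
            // Assumed weighting: older increments decay as exp(-lambda * age).
            double alpha = 1.0 - Math.exp(-lambdaPerSecond * dt);
            smoothedRate = alpha * instantRate + (1.0 - alpha) * smoothedRate;
        }
        lastTimeSec = timeSec;
        lastProgress = progress;
    }

    double rate() {
        return smoothedRate;
    }

    // Finish-time estimate based on the smoothed rate, not the lifetime average.
    double estimatedSecondsRemaining() {
        if (!seeded || smoothedRate <= 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        return (1.0 - lastProgress) / smoothedRate;
    }
}
```

With this weighting, an attempt that runs well for several minutes and then stalls sees its smoothed rate drop within a couple of minutes, while its lifetime average rate stays close to the healthy value for much longer. Both places mentioned in the description (who is underperforming, and when each attempt is expected to finish) could read from the same tracker.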
> We might have two exponentially
> smoothed tracking variables of the progress rate, to diagnose attempts that
> are bogged down and getting worse vs. bogged down but improving.
> Perhaps we should be especially eager to speculate a second attempt. If a
> task is deterministically failing after bogging down [think "rare infinite
> loop bug"] we would rather run a couple of our attempts in parallel to
> discover the problem sooner.
> As part of this patch we would like to add benchmarks that simulate rare
> tasks that behave poorly, so we can discover whether this change in the code
> is a good idea and what the proper configuration is. Early versions of this
> will be driven by our assumptions. Later versions will be driven by the
> fruits of MAPREDUCE:2037

--
This message was sent by Atlassian JIRA (v6.2#6252)
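One way to read the "two exponentially smoothed tracking variables" idea in the quoted description is to smooth the same progress rate on a fast and a slow time scale and compare them. The sketch below is only an illustration under that reading: the class name, the two lambdas, and the relative tolerance are hypothetical, and nothing here is taken from the Hadoop code base. Progress is a fraction in [0, 1], time is in seconds, and the attempt is assumed to start at time 0 with progress 0.

```java
// Hypothetical sketch, not Hadoop code: compare a fast and a slow smoothed rate.
final class ProgressTrend {
    enum Trend { WORSENING, IMPROVING, STEADY }

    private final double fastLambda;  // per second, reacts quickly
    private final double slowLambda;  // per second, remembers longer
    private final double tolerance;   // relative gap treated as "no change"
    private double fastRate;
    private double slowRate;
    private double lastTimeSec;       // assumed start: time 0, progress 0
    private double lastProgress;
    private boolean seeded;

    ProgressTrend(double fastLambda, double slowLambda, double tolerance) {
        if (fastLambda <= slowLambda || slowLambda <= 0.0) {
            throw new IllegalArgumentException("need fastLambda > slowLambda > 0");
        }
        this.fastLambda = fastLambda;
        this.slowLambda = slowLambda;
        this.tolerance = tolerance;
    }

    void update(double timeSec, double progress) {
        double dt = timeSec - lastTimeSec;
        if (dt <= 0.0) {
            return;                   // ignore out-of-order reports
        }
        double instantRate = (progress - lastProgress) / dt;
        if (!seeded) {
            fastRate = instantRate;
            slowRate = instantRate;
            seeded = true;
        } else {
            fastRate = blend(fastRate, instantRate, fastLambda, dt);
            slowRate = blend(slowRate, instantRate, slowLambda, dt);
        }
        lastTimeSec = timeSec;
        lastProgress = progress;
    }

    private static double blend(double old, double instant, double lambda, double dt) {
        double alpha = 1.0 - Math.exp(-lambda * dt);
        return alpha * instant + (1.0 - alpha) * old;
    }

    // Fast rate below slow rate: recent progress is worse than the longer view.
    Trend trend() {
        if (!seeded) {
            return Trend.STEADY;
        }
        double gap = fastRate - slowRate;
        double scale = Math.max(Math.abs(slowRate), 1e-12);
        if (gap < -tolerance * scale) {
            return Trend.WORSENING;
        }
        if (gap > tolerance * scale) {
            return Trend.IMPROVING;
        }
        return Trend.STEADY;
    }
}
```

A benchmark of the kind the description asks for could feed synthetic progress traces (healthy, stalling, recovering) into such a tracker and check how quickly each configuration flags the stalling attempt.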