> On March 7, 2018, 7:48 p.m., David McLaughlin wrote:
> > So what happens if there are two bad hosts? :)
>
> Jordan Ly wrote:
>     This does not scale past n=1.
>
>     We can make this more generic by getting the list of hosts the task has
>     previously failed on and looking through offers for a host the task did
>     not fail on, up to some operator-defined value (something like
>     `-failure_avoidance_factor`).
>
> Santhosh Kumar Shanmugham wrote:
>     Note that making this more generic still depends on the amount of task
>     history we have on the scheduler.
>
> Jordan Ly wrote:
>     Discussed offline:
>
>     Going to go a different route. This method is very domain-specific and
>     does not allow preemption to kick in: if there is only one matching host
>     and it is bad, you can still be repeatedly scheduled on it. Instead,
>     going to go with a more generic solution that temporarily bans scheduling
>     on a host, via `SchedulingFilter`, if the task fails on that host. This
>     would be enabled through an operator-defined option.

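To make the comparison concrete, here is a rough sketch of the temporary ban described above. It is only an illustration: `HostBanList`, its methods, the ban duration parameter, and the caller-supplied clock value are all made up for this example and are not Aurora's `SchedulingFilter` API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Aurora: remembers, per task, which hosts
// are temporarily off-limits after a failure.
final class HostBanList {
  // Assumed operator-defined ban duration, in milliseconds.
  private final long banMillis;
  // task key -> (host -> timestamp at which the ban expires)
  private final Map<String, Map<String, Long>> expiryByTaskAndHost = new HashMap<>();

  HostBanList(long banMillis) {
    this.banMillis = banMillis;
  }

  // Called when a task fails on a host; starts (or restarts) the ban.
  void recordFailure(String taskKey, String host, long nowMillis) {
    expiryByTaskAndHost
        .computeIfAbsent(taskKey, k -> new HashMap<>())
        .put(host, nowMillis + banMillis);
  }

  // A filter would veto offers from this host while this returns true.
  boolean isBanned(String taskKey, String host, long nowMillis) {
    Map<String, Long> byHost = expiryByTaskAndHost.get(taskKey);
    if (byHost == null) {
      return false;
    }
    Long expiry = byHost.get(host);
    if (expiry == null) {
      return false;
    }
    if (nowMillis >= expiry) {
      byHost.remove(host); // Ban has expired; forget it.
      return false;
    }
    return true;
  }
}
```

Note that this keeps extra state per task and host, which is the cost the next suggestion tries to avoid.
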
Different idea: If the ancestor was LOST or FAILED, use a coin flip to decide
whether to use a matching offer. This does not require additional state and
gives the task a sufficient chance to come up in one of the future scheduling
rounds. Since it would only be used for rescheduled tasks, it has no
performance impact in the normal case.


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65941/#review198803
-----------------------------------------------------------


On March 7, 2018, 6:50 a.m., Jordan Ly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65941/
> -----------------------------------------------------------
>
> (Updated March 7, 2018, 6:50 a.m.)
>
>
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and
> Stephan Erb.
>
>
> Repository: aurora
>
>
> Description
> -------
>
> If a task fails on a host, we should try to avoid rescheduling the task on
> the same host if possible. This is done in order to avoid a potentially bad
> host. This issue generally comes up when you are bin-packing hosts (i.e.
> using the `-offer_order` option).
>
> If there are no other offers to schedule the task on, we will still use the
> offer.
>
>
> Diffs
> -----
>
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java
>     fcafecf63040f9c410458dedfd3d87b0d669d205
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java
>     864538b6730d7318385494818276ba370124b8e9
>
>
> Diff: https://reviews.apache.org/r/65941/diff/1/
>
>
> Testing
> -------
>
> `./gradlew test`
>
> Benchmarks and live-cluster testing coming soon.
>
>
> Thanks,
>
> Jordan Ly
>
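
A minimal sketch of the coin-flip idea from the reply above, assuming offers are already filtered down to the ones that match the task. `AncestorStatus`, `CoinFlipOfferSelector`, and the use of plain host-name strings as offers are simplified stand-ins invented for this example; they are not the types used in `TaskAssignerImpl`.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical, simplified stand-in for the ancestor task's final status.
enum AncestorStatus { NONE, FINISHED, KILLED, LOST, FAILED }

// Hypothetical selector: stateless apart from the injected coin.
final class CoinFlipOfferSelector {
  // Returns true for "heads" (use the offer), false for "tails" (skip it).
  private final BooleanSupplier coin;

  CoinFlipOfferSelector(BooleanSupplier coin) {
    this.coin = coin;
  }

  // Returns the offer to use, or empty to leave the task pending until a
  // later scheduling round.
  Optional<String> select(AncestorStatus ancestor, List<String> matchingOffers) {
    boolean rescheduledAfterFailure =
        ancestor == AncestorStatus.LOST || ancestor == AncestorStatus.FAILED;
    for (String offer : matchingOffers) {
      // The coin is only consulted for tasks whose ancestor was LOST or
      // FAILED, so the normal scheduling path is unchanged.
      if (rescheduledAfterFailure && !coin.getAsBoolean()) {
        continue;
      }
      return Optional.of(offer);
    }
    return Optional.empty();
  }
}
```

In real use the coin could be something like `() -> ThreadLocalRandom.current().nextBoolean()`; injecting it as a `BooleanSupplier` keeps the behaviour deterministic in tests.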