It's 2 -- and it's pretty hard to point to a line of code, a method, or even a class since the scheduling of Tasks involves a pretty complex interaction of several Spark components -- mostly the DAGScheduler, TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as well as the SchedulerBackend (CoarseGrainedSchedulerBackend in this case.) The key thing to understand, though, is the comment at the top of SchedulerBackend.scala: "A backend interface for scheduling systems that allows plugging in different ones under TaskSchedulerImpl. We assume a Mesos-like model where the application gets resource offers as machines become available and can launch tasks on them." In other words, the whole scheduling system is built on a model that starts with offers made by workers when resources are available to run Tasks. Other than the big hammer of canceling a Job while interruptOnCancel is true, there isn't really any facility for stopping or rescheduling Tasks that are already started, so that rules out your option 1. Similarly, option 3 is out because the scheduler doesn't know when Tasks will complete; it just knows when a new offer comes in and it is time to send more Tasks to be run on the machine making the offer.
What actually happens is that the Pool with which a Job is associated maintains a queue of TaskSets needing to be scheduled. When in resourceOffers the TaskSchedulerImpl needs sortedTaskSets, the Pool supplies those from its scheduling queue after first sorting it according to the Pool's taskSetSchedulingAlgorithm. In other words, what Spark's fair scheduling does in essence is, in response to worker resource offers, to send new Tasks to be run; those Tasks are taken in sets from the queue of waiting TaskSets, sorted according to a scheduling algorithm. There is no pre-emption or rescheduling of Tasks that the scheduler has already sent to the workers, nor is there any attempt to anticipate when already running Tasks will complete. On Sat, Feb 20, 2016 at 4:14 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: > Hi, > > I'm trying to understand how this thing works underneath. Let's say I have > two types of jobs - high important, that might use small amount of cores > and has to be run pretty fast. And less important, but greedy - uses as > many cores as available. So, the idea is to use two corresponding pools. > > Then thing I'm trying to understand is the following. > I use standalone spark deployment (no YARN, no Mesos). > Let's say that less important took all the cores and then someone runs > high important job. Then I see three possibilities: > 1. Spark kill some executors that currently runs less important partitions > to assign them to a high performant job. > 2. Spark will wait until some partitions of less important job will be > completely processed and then first executors that become free will be > assigned to process high important job. > 3. Spark will figure out specific time, when particular stages of > partitions of less important jobs is done, and instead of continue with > this job, these executors will be reassigned to high important one. > > Which one it is? Could you please point me to a class / method / line of > code? > -- > Be well! > Jean Morozov >