If I'm understanding you correctly, then you are correct that the fair scheduler doesn't currently do everything that you want to achieve. Fair scheduler pools can be configured with a minimum number of cores that they need before accepting Tasks, but there isn't a way to restrict a pool to no more than a certain number of cores. That means that a lower-priority pool can grab all of the cores as long as there is no demand on the higher-priority pools, and then the higher-priority pools have to wait for the lower-priority pool to complete Tasks before they are able to run Tasks of their own. As a result, fair scheduling pools really aren't a sufficient means to satisfy multi-tenancy requirements or other scenarios where you want a guarantee that some cores will always be available to run a high-priority job. There is a JIRA issue and a PR out there to address some of this, and I've been starting to come around to the notion that we should support a max-cores configuration for fair scheduler pools, but nothing like that is available right now. Neither is there a way at the application level in a standalone-mode cluster for one application to preempt another in order to acquire its cores or other resources. YARN does provide some support for that, and Mesos may as well, so that is the closest option that I think currently exists to satisfy your requirement.
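For reference, this is roughly what pool configuration looks like today in the allocations file pointed to by spark.scheduler.allocation.file (the pool names and numbers below are just examples). Note that minShare is a floor, not a ceiling; there is no maxShare-style element, which is exactly the gap described above:

```xml
<?xml version="1.0"?>
<!-- Example fair scheduler allocations file; pool names are illustrative. -->
<!-- minShare guarantees a pool gets cores before others are satisfied,    -->
<!-- but nothing here caps what another pool can grab while it is idle.    -->
<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
  <pool name="batch">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```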
On Wed, Mar 2, 2016 at 6:20 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:

> Mark,
>
> I'm trying to configure a Spark cluster to share resources between two pools.
>
> I can do that by assigning minimal shares (it works fine), but that means a
> specific number of cores is going to be wasted by just being ready to run
> anything. While that's better than nothing, I'd like to specify a percentage
> of cores instead of a specific number of cores, as the cluster might be
> resized either up or down. Is there such an option?
>
> Also, I haven't found anything about a sort of preemptive scheduler for
> standalone deployment (it is briefly mentioned in SPARK-9882, but it seems
> to have been abandoned). Do you know if there is such an activity?
>
> --
> Be well!
> Jean Morozov
>
> On Sun, Feb 21, 2016 at 4:32 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> It's 2 -- and it's pretty hard to point to a line of code, a method, or
>> even a class, since the scheduling of Tasks involves a pretty complex
>> interaction of several Spark components -- mostly the DAGScheduler,
>> TaskScheduler/TaskSchedulerImpl, TaskSetManager, Schedulable and Pool, as
>> well as the SchedulerBackend (CoarseGrainedSchedulerBackend in this case).
>> The key thing to understand, though, is the comment at the top of
>> SchedulerBackend.scala: "A backend interface for scheduling systems that
>> allows plugging in different ones under TaskSchedulerImpl. We assume a
>> Mesos-like model where the application gets resource offers as machines
>> become available and can launch tasks on them." In other words, the whole
>> scheduling system is built on a model that starts with offers made by
>> workers when resources are available to run Tasks. Other than the big
>> hammer of canceling a Job while interruptOnCancel is true, there isn't
>> really any facility for stopping or rescheduling Tasks that have already
>> started, so that rules out your option 1.
>>
>> Similarly, option 3 is out because the scheduler doesn't know when Tasks
>> will complete; it just knows when a new offer comes in and it is time to
>> send more Tasks to be run on the machine making the offer.
>>
>> What actually happens is that the Pool with which a Job is associated
>> maintains a queue of TaskSets needing to be scheduled. When, in
>> resourceOffers, the TaskSchedulerImpl needs sortedTaskSets, the Pool
>> supplies those from its scheduling queue after first sorting it according
>> to the Pool's taskSetSchedulingAlgorithm. In other words, what Spark's
>> fair scheduling does, in essence, is to send new Tasks to be run in
>> response to worker resource offers; those Tasks are taken in sets from the
>> queue of waiting TaskSets, sorted according to a scheduling algorithm.
>> There is no preemption or rescheduling of Tasks that the scheduler has
>> already sent to the workers, nor is there any attempt to anticipate when
>> already running Tasks will complete.
>>
>> On Sat, Feb 20, 2016 at 4:14 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to understand how this thing works underneath. Let's say I
>>> have two types of jobs: high-importance jobs, which might use a small
>>> number of cores and have to run pretty fast, and less important but
>>> greedy jobs, which use as many cores as are available. So the idea is to
>>> use two corresponding pools.
>>>
>>> The thing I'm trying to understand is the following.
>>> I use a standalone Spark deployment (no YARN, no Mesos).
>>> Let's say the less important job took all the cores and then someone runs
>>> a high-importance job. Then I see three possibilities:
>>> 1. Spark kills some executors that currently run less important
>>> partitions in order to assign them to the high-importance job.
>>> 2. Spark waits until some partitions of the less important job have been
>>> completely processed, and then the first executors that become free are
>>> assigned to process the high-importance job.
>>> 3. Spark figures out the specific time when particular stages of
>>> partitions of the less important job will be done and, instead of
>>> continuing with that job, reassigns those executors to the
>>> high-importance one.
>>>
>>> Which one is it? Could you please point me to a class / method / line of
>>> code?
>>> --
>>> Be well!
>>> Jean Morozov
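The sorting step described above -- the Pool ordering its waiting TaskSets before handing them to resourceOffers -- can be sketched as a small Python model. This is a simplified, illustrative rendering of the comparator in Spark's FairSchedulingAlgorithm (needy pools below their min share come first, then pools are ordered by weighted load); the class and pool names here are made up for the example and are not Spark API:

```python
# Simplified model of fair-share ordering among schedulable pools/TaskSets.
# Illustrative only; not Spark code. Tie-breaking is simplified relative to
# the real FairSchedulingAlgorithm.

from dataclasses import dataclass

@dataclass
class Schedulable:
    name: str
    running_tasks: int
    min_share: int
    weight: int

def fair_sort_key(s: Schedulable):
    # A pool is "needy" if it is running fewer tasks than its min share.
    needy = s.running_tasks < s.min_share
    min_share_ratio = s.running_tasks / max(s.min_share, 1)
    task_to_weight_ratio = s.running_tasks / s.weight
    # Needy pools sort first (False < True), then by how far under the
    # min share they are, then by load relative to weight.
    return (not needy, min_share_ratio, task_to_weight_ratio, s.name)

pools = [
    Schedulable("greedy", running_tasks=40, min_share=0, weight=1),
    Schedulable("urgent", running_tasks=0,  min_share=4, weight=2),
]
ordered = sorted(pools, key=fair_sort_key)
# "urgent" sorts ahead of "greedy", so its TaskSets get the next offers --
# but nothing here touches the 40 tasks "greedy" is already running.
```

Note that the model also shows the limitation discussed in this thread: ordering only affects which queued TaskSets receive the *next* resource offers; running tasks are never preempted.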