Thanks Sharma and Bill! This is exactly the input I was looking for. We will start by using an existing service scheduler and see where this leads us.
Best Regards,
Stephan

On Tue, 2014-09-02 at 10:14 -0700, Bill Farner wrote:
> I'll echo Sharma's points. While it seems simple enough to see which
> moving parts you need to implement here, the long-term effort is
> large. I've been working on Aurora for 4.5 years, and still know of a
> lot of work we need to do. If your use case can fit into an existing
> framework (perhaps mod a feature request/contribution here and there),
> you'll free up a lot of time to focus on the problem you're actually
> trying to solve.
>
> -=Bill
>
>
> On Mon, Sep 1, 2014 at 10:45 AM, Sharma Podila <spod...@netflix.com>
> wrote:
>         I am tempted to say that the short answer is, if your option B
>         works, why bother writing your own scheduler/framework?
>
>         Writing a Mesos framework can be easy. However, writing a
>         fault tolerant Mesos framework that has good scalability, is
>         performant, and is highly available can be relatively hard.
>         Here are a few things, off the top of my head, that helped us
>         make the decision to write our own:
>               * There must be a good long term reason to write your
>                 own framework. The scheduling/preemption/allocation
>                 model you spoke of may be a good reason. For us, it
>                 was specific scheduling optimizations that are not
>                 generic and are absent in other frameworks.
>               * Fault tolerance is a combination of a few things.
>                 Here are a few to consider:
>                       * Task reconciliation with the Mesos master
>                         currently involves more than just using
>                         the reconcile feature. We augment it with
>                         heartbeats from tasks, Aurora does GC task,
>                         etc. I believe it will take another Mesos
>                         release (or two?) before we can rely solely on
>                         Mesos task reconciliation.
>                       * The framework itself must be highly available,
>                         for example, by using ZooKeeper leader election
>                         among multiple framework instances.
>                       * Fault tolerant persistence of task states. For
>                         example, when Mesos calls your framework with
>                         a status update of a task, that state must be
>                         reliably persisted.
>               * It sounds like achieving fair share allocation via
>                 preemptions is important to you. That "external
>                 entity" you refer to may be non-trivial in the long
>                 run. If you were to embark on writing your own
>                 framework, another model to consider is to just have
>                 one framework scheduler instance for all users. Then,
>                 put the preemptions and fair share logic inside it.
>                 There could be complexities: for example, with a
>                 heterogeneous mix of task and slave resource sizes,
>                 scaling down an arbitrary number of tasks from user A
>                 doesn't imply that the freed resources will benefit
>                 user B. The scheduler can perform this better than an
>                 external entity, by only preempting the right ones,
>                 etc.
>                       * That said, for simpler use cases, it may work
>                         just fine to have an external entity.
>               * Scheduling itself is a hard problem, and it can slow
>                 down quickly when doing anything more than first-fit
>                 style, for example when adding a few constraints and
>                 SLAs. Preemptions, for example, can slow down the
>                 scheduler in figuring out the right tasks to preempt
>                 to honor the fair share SLAs. That is, assuming you
>                 have more than a few hundred tasks.
>               * There were a few talks at MesosCon, ten days ago, on
>                 this topic including one from us. The video/slides
>                 from the conference should be available from MesosCon
>                 sometime soon.
>
>
>         On Sun, Aug 31, 2014 at 7:51 AM, Stephan Erb
>         <step...@dev.static-void.de> wrote:
>                 Hi everybody,
>
>                 I would like to assess the effort required to write a
>                 custom framework.
>
>                 Background: We have an application where we can start
>                 a flexible number of long-running worker processes
>                 performing number-crunching. The more processes the
>                 better. However, we have multiple users, each running
>                 an instance of the application and therefore competing
>                 for resources (as each tries to run as many worker
>                 processes as possible).
>
>                 For various reasons, we would like to run our
>                 application instances on top of Mesos. There seem to
>                 be two ways to achieve this:
>
>                      A. Write a custom framework for our application
>                         that spawns the worker processes on demand.
>                         Each user gets to run one framework instance.
>                         We also need preemption of workers to achieve
>                         equality among frameworks. We could achieve
>                         this using an external entity monitoring all
>                         frameworks and telling the worst offenders to
>                         scale down a little.
>                      B. Instead of writing a framework, use a
>                         Service-Scheduler like Marathon, Aurora or
>                         Singularity to spawn the worker processes.
>                         Instead of just performing the scale-down, the
>                         external entity would dictate the number of
>                         worker processes for each application
>                         depending on its demand.
>
>                 The first choice seems to be the natural fit for
>                 Mesos. However, existing frameworks like Aurora seem
>                 to be battle-tested in regard to high availability,
>                 race conditions and issues like state reconciliation
>                 where the world views of scheduler and slaves are
>                 drifting apart.
>
>                 So this question boils down to: When considering
>                 writing a custom framework, which pitfalls do I have
>                 to be aware of? Can I get away with blindly
>                 implementing the scheduler API? Or do I always have to
>                 implement stuff like custom state-reconciliation in
>                 order to prevent orphaned tasks on slaves (for
>                 example, when my framework scheduler crashes or is
>                 temporarily unavailable)?
>
>                 Thanks for your input!
>
>                 Best Regards,
>                 Stephan
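
For reference, the "external entity" from option B, which dictates the number of worker processes per application depending on its demand, could be sketched roughly as below. This is only an illustration under stated assumptions: the thread does not name a fairness policy, so max-min fairness over equally sized workers is assumed here, and the function name fair_worker_targets, the application names and the numbers are all hypothetical. The sketch only computes target counts; handing them to Marathon, Aurora or Singularity is left out, and it ignores the heterogeneous resource sizes Sharma mentions.

```python
def fair_worker_targets(demand, capacity):
    """Max-min fair split of `capacity` worker slots among applications.

    demand:   dict mapping application name -> number of workers it wants
    capacity: total number of equally sized worker slots in the cluster
    Returns a dict mapping application name -> number of workers to run.
    """
    targets = {app: 0 for app in demand}
    unsatisfied = {app for app, want in demand.items() if want > 0}
    remaining = capacity

    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer slots than applications left: hand out single slots
            # in name order so the result is deterministic.
            for app in sorted(unsatisfied)[:remaining]:
                targets[app] += 1
            break
        # sorted() returns a copy, so removing from the set is safe here.
        for app in sorted(unsatisfied):
            grant = min(share, demand[app] - targets[app])
            targets[app] += grant
            remaining -= grant
            if targets[app] == demand[app]:
                unsatisfied.discard(app)
    return targets


if __name__ == "__main__":
    # Hypothetical example: three users, twelve worker slots in total.
    demand = {"a": 2, "b": 10, "c": 10}
    print(fair_worker_targets(demand, 12))  # {'a': 2, 'b': 5, 'c': 5}
```

An application that wants less than its equal share keeps its full demand, and the slots it leaves unused are split among the others. The entity would recompute the targets whenever demand or capacity changes and scale each application up or down to match.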