Although I am not as familiar with Marathon specifics, in general,

1. a rogue job can potentially render slaves useless, and
2. a rogue slave (or rather, a rogue executor) can black-hole jobs via
false-positive completions.

A strategy that helps with #1 is to limit the number of re-launches of an
individual job/task upon failure; better still, base that limit on the
failure rate, since simple rate limiting may only delay the problem for a
while.
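
A minimal sketch of what that first strategy could look like (illustrative
Python, not a Mesos or Marathon API; the class name and thresholds are
assumptions):

    import time
    from collections import deque

    class RelaunchGate:
        """Permit relaunching a task only while its failure rate stays
        under a threshold (e.g. 5 failures per 10 minutes)."""

        def __init__(self, max_failures=5, window_secs=600):
            self.max_failures = max_failures
            self.window_secs = window_secs
            self.failures = {}  # task_id -> deque of failure timestamps

        def record_failure(self, task_id):
            self.failures.setdefault(task_id, deque()).append(time.time())

        def may_relaunch(self, task_id):
            window = self.failures.get(task_id, deque())
            cutoff = time.time() - self.window_secs
            while window and window[0] < cutoff:
                window.popleft()  # forget failures outside the window
            return len(window) < self.max_failures

Unlike a fixed cap on relaunches, this recovers automatically once the
failure rate drops, and unlike simple rate limiting it refuses outright
while the task keeps failing.
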
A strategy that helps with #2 is to "disable" a slave from further
launches when too many failures are reported from it in a given time
period. This can leave many slaves disabled and reduce cluster throughput,
which should alert the operator; that is still better than falsely marking
all jobs as completed.
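
A similar sketch for the second strategy (again illustrative; the
thresholds are made up):

    import time
    from collections import defaultdict, deque

    class SlaveCircuitBreaker:
        """Mark a slave "disabled" for new launches once it reports too
        many task failures within a time window."""

        def __init__(self, max_failures=20, window_secs=300):
            self.max_failures = max_failures
            self.window_secs = window_secs
            self.failures = defaultdict(deque)  # slave_id -> timestamps
            self.disabled = set()

        def record_failure(self, slave_id):
            now = time.time()
            q = self.failures[slave_id]
            q.append(now)
            while q and q[0] < now - self.window_secs:
                q.popleft()  # keep only failures inside the window
            if len(q) >= self.max_failures:
                self.disabled.add(slave_id)

        def is_launchable(self, slave_id):
            return slave_id not in self.disabled

    # The size of the disabled set is the signal to alert the operator on.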

An out-of-band monitor that watches job/task lifecycle events can drive
both strategies, for example by applying a stream-processing technique to
the continuous event stream.
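
A toy version of such a monitor, assuming some feed of status updates
serialized as JSON lines (the field names and state strings here are
illustrative, not a real Mesos event schema):

    import json
    import sys
    from collections import Counter

    TERMINAL_FAILURES = {"TASK_FAILED", "TASK_LOST"}

    def monitor(lines):
        """Tally failures per task and per slave; a real monitor would
        feed these counts into the relaunch/disable logic sketched
        above instead of just printing."""
        by_task, by_slave = Counter(), Counter()
        for line in lines:
            ev = json.loads(line)
            if ev.get("state") in TERMINAL_FAILURES:
                by_task[ev["task_id"]] += 1
                by_slave[ev["slave_id"]] += 1
                print("failure #%d for task %s on slave %s"
                      % (by_task[ev["task_id"]], ev["task_id"],
                         ev["slave_id"]))

    if __name__ == "__main__":
        monitor(sys.stdin)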

Sharma



On Fri, May 2, 2014 at 10:35 AM, Dick Davies <d...@hellooperator.net> wrote:

> Not quite - it looks to me like the mesos slave disks filled with failed
> jobs (because marathon kept throwing a broken .zip at them), and with
> /tmp on the root fs the servers became unresponsive. Tobi mentions
> there's a way to set that at deploy time, but in this case the guy who
> can't type 'hello world' correctly would have been responsible for
> setting the rate limits too (that's me, by the way!), so in itself
> that's not protection from pilot error.
>
> I'm not sure if GC was able to clear /var any better (I doubt it very
> much; my impression was that's on the order of days). I think it's more
> that the deploy could have been cancelled while the system was still
> functioning (speculation - I'm still in the early stages of learning
> the internals of this).
>
> On 30 April 2014 22:08, Vinod Kone <vinodk...@gmail.com> wrote:
> > Dick, I've also briefly skimmed your original email to the marathon
> > mailing list, and it sounded like executor sandboxes were not getting
> > garbage collected (a mesos feature) when the slave work directory was
> > rooted in /tmp vs /var? Did I understand that right? If yes, I would
> > love to see some logs.
> >
> >
> > On Wed, Apr 30, 2014 at 1:51 PM, Tobias Knaup <t...@knaup.me> wrote:
> >>
> >> In Marathon you can specify taskRateLimit (max number of tasks to start
> >> per second) as part of your app definition.
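> >>
> >> A hypothetical example (the endpoint, id, and cmd values are
> >> placeholders; only the taskRateLimit field comes from this thread):
> >>
> >>     import requests
> >>
> >>     app = {
> >>         "id": "my-app",          # placeholder
> >>         "cmd": "./start.sh",     # placeholder
> >>         "instances": 4,
> >>         "taskRateLimit": 1.0,    # max task starts per second
> >>     }
> >>     # POST the app definition to Marathon's REST API.
> >>     requests.post("http://marathon.example.com:8080/v2/apps",
> >>                   json=app)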
> >>
> >>
> >> On Wed, Apr 30, 2014 at 11:30 AM, Dick Davies <d...@hellooperator.net>
> >> wrote:
> >>>
> >>> I managed to take out a mesos slave today with a typo while
> >>> launching a marathon app, and wondered if there are throttles/limits
> >>> that can be applied to repeated launches to limit the risk of such
> >>> mistakes in the future.
> >>>
> >>> I started a thread on the marathon list
> >>> (https://groups.google.com/forum/?hl=en#!topic/marathon-framework/4iWLqTYTvgM)
> >>>
> >>> [ TL;DR: marathon throws an app that will never deploy correctly at
> >>> slaves until the disk fills with debris and the slave dies ]
> >>>
> >>> but I suppose this could be something available in mesos itself.
> >>>
> >>> I can't find a lot of advice about the operational aspects of Mesos
> >>> administration; could others here share their experience in
> >>> preventing failed task deploys from causing trouble on their
> >>> clusters?
> >>>
> >>> Thanks!
> >>
> >>
> >
>
