Re: Understanding Mesos Maintenance

Zameer Manji Fri, 03 Mar 2017 18:35:05 -0800

Thanks for clearing that up.

I was accidentally setting a long refuse time.
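
For anyone else who runs into this: the refuse time is sent as a filter alongside the accept or decline call. Below is a minimal sketch in Python of how such a call body might be built. It assumes the v1 scheduler HTTP API with `ACCEPT_INVERSE_OFFERS` / `DECLINE_INVERSE_OFFERS` calls and a `filters.refuse_seconds` field; the helper name `inverse_offer_call` and the IDs are made up for illustration, so check the field names against `scheduler.proto` for your Mesos version.

```python
import json

# Default refuse timeout, per the "default 5 seconds" note quoted below.
DEFAULT_REFUSE_SECONDS = 5.0


def inverse_offer_call(framework_id, inverse_offer_ids, accept,
                       refuse_seconds=DEFAULT_REFUSE_SECONDS):
    """Build the JSON body of an accept/decline call for inverse offers.

    Hypothetical helper: the field names are assumptions based on the
    v1 scheduler API and are not verified against a running master.
    """
    if refuse_seconds < 0:
        raise ValueError("refuse_seconds must be non-negative")
    call_type = "ACCEPT_INVERSE_OFFERS" if accept else "DECLINE_INVERSE_OFFERS"
    return {
        "framework_id": {"value": framework_id},
        "type": call_type,
        # The payload key is assumed to be the lower-cased call type.
        call_type.lower(): {
            "inverse_offer_ids": [{"value": i} for i in inverse_offer_ids],
            "filters": {"refuse_seconds": float(refuse_seconds)},
        },
    }


if __name__ == "__main__":
    call = inverse_offer_call("framework-1", ["inverse-offer-1"], accept=True)
    print(json.dumps(call, indent=2))
```

Passing a large `refuse_seconds` here is the kind of setting that makes it look as if inverse offers are never resent.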


On Fri, Mar 3, 2017 at 6:08 PM, Joseph Wu <jos...@mesosphere.io> wrote:

> Inverse offers have the same offer cycle as normal offers.  They can
> be accepted or declined with a refuse timeout (default 5 seconds).
>
> On Fri, Mar 3, 2017 at 5:29 PM, Zameer Manji <zma...@apache.org> wrote:
> > Ben,
> >
> > Thanks for responding to my questions. I have a follow-up on #3.
> >
> > I have a framework which accepts inverse offers but does not do anything to
> > the associated tasks. I noticed that the framework **does not** receive
> > another inverse offer within the allocation period. At what interval will
> > an inverse offer be resent to the framework if it was accepted? I took a
> > glance at `src/tests/master_maintenance_tests.cpp` and did not notice any
> > tests testing for this.
> >
> > Are you sure that inverse offers are resent after they have been accepted
> > but before the tasks are removed from the host?
> >
> >
> > On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> >>
> >> Hey Zameer, great questions. Let us know if there's anything you think
> >> could be improved or documented better.
> >>
> >> Re 1:
> >>
> >> The 'Viewing maintenance status' section of the documentation should
> >> clarify this:
> >> http://mesos.apache.org/documentation/latest/maintenance/
> >>
> >> Re 2:
> >>
> >> Both of these sound reasonable, but the scheduler should not accept the
> >> maintenance if it's not yet safe for the machine to be downed. Otherwise a
> >> task failure may be mistakenly interpreted as a go-ahead to down the
> >> machine, despite the scheduler needing to get the task back running. If
> >> expensive or long-running work needs to finish (e.g. migrate data, replace
> >> instances in a manner that doesn't violate SLA, etc.) then I would suggest
> >> waiting until the work completes safely before accepting.
> >>
> >> We likely need a third state like TENTATIVELY_ACCEPT to signal to
> >> operators / mesos that the framework intends to comply, but hasn't finished
> >> whatever it needs to do yet for it to be safe to down the machine.
> >>
> >> Also, one of the challenges here is when to take the action. Should the
> >> scheduler prepare itself for maintenance as soon as it safely can? Or as
> >> late (but not too late!) as it safely can? If the scheduler runs
> >> long-running services, as soon as safely possible makes sense. If the
> >> scheduler runs short-running batch jobs, as late as safely possible provides
> >> work-conservation.
> >>
> >> Re 3:
> >>
> >> The framework will receive another inverse offer if the framework still
> >> has resources allocated on that agent. If receiving a regular offer for
> >> available resources on the agent, an 'Unavailability' [1] will be included
> >> if the machine is scheduled for maintenance, so that the scheduler can be
> >> aware of the maintenance when placing new work.
> >>
> >> Re 4:
> >>
> >> It's not possible currently, and it's the operator's responsibility (the
> >> intention was for "operator" to be maintenance tooling). Ideally we can add
> >> automation of this decision into mesos, if decision criteria that are widely
> >> applicable can be established (e.g. if nothing is running and all relevant
> >> frameworks have accepted). Feel free to file a ticket for this or any other
> >> improvements!
> >>
> >> Ben
> >>
> >> [1]
> >> https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
> >>
> >> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji <zma...@apache.org> wrote:
> >>>
> >>> Hey,
> >>>
> >>> I'm trying to understand some nuances of the maintenance API. Here are my
> >>> questions:
> >>>
> >>> 1. The documentation mentions that accepting or declining an inverse
> >>> offer is a "hint" to the operator. How do operators see whether a framework
> >>> has declined, accepted or ignored an inverse offer?
> >>>
> >>> 2. Should a framework accept an inverse offer and then start removing
> >>> tasks from an agent or should the framework only accept the inverse offer
> >>> after the removal of tasks is complete? I think the former makes sense, but
> >>> it implies that operators need to poll the state of the agent to ensure
> >>> there are no active tasks whereas the latter implies operators only need to
> >>> check if all inverse offers were accepted.
> >>>
> >>> 3. After accepting the inverse offer, will a framework get another
> >>> inverse offer for the same agent? Currently I'm trying to determine if
> >>> inverse offer information needs to be persisted so a framework can continue
> >>> its draining work between failovers or if it can just wait for an inverse
> >>> offer after starting up.
> >>>
> >>> 4. Is it possible for the agent to automatically transition from DRAIN to
> >>> DOWN if at the start of the unavailability period the agent is free of tasks
> >>> or is that still the operator's responsibility?
> >>>
> >>> --
> >>> Zameer Manji
> >>>
>
> --
> Zameer Manji
>
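
Following up on the answer to question 4 in the quoted thread: the example criterion given there ("if nothing is running and all relevant frameworks have accepted") is easy to state as a check that maintenance tooling could run before moving an agent from DRAIN to DOWN. This is only a sketch of that one criterion. The function name `ready_to_mark_down` and the input shapes are hypothetical, and how the task count and per-framework responses are collected is left out.

```python
def ready_to_mark_down(active_task_count, framework_responses):
    """Return True if an agent looks safe to move from DRAIN to DOWN.

    active_task_count: number of tasks still running on the agent.
    framework_responses: dict mapping a framework ID to its latest
        inverse offer response: "ACCEPT", "DECLINE", or None if the
        framework has not responded. (Hypothetical shape.)
    """
    if active_task_count < 0:
        raise ValueError("active_task_count must be non-negative")
    if active_task_count > 0:
        # Something is still running, so do not down the machine.
        return False
    # Every relevant framework must have explicitly accepted.
    return all(r == "ACCEPT" for r in framework_responses.values())
```

Note that the check treats a framework that has not responded the same as one that declined, which errs on the side of not downing the machine.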
