Thanks for clearing that up. I was accidentally setting a long refuse time.
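
For anyone who hits the same thing, here is a minimal sketch of where the refuse time is set when accepting an inverse offer. It assumes the v1 scheduler HTTP API, where an `ACCEPT_INVERSE_OFFERS` call carries a `filters.refuse_seconds` value; the helper name and the IDs are placeholders, and the field names are worth checking against `scheduler.proto` for your Mesos version.

```python
import json

# Default refuse timeout mentioned in the thread (5 seconds).
DEFAULT_REFUSE_SECONDS = 5.0


def accept_inverse_offers_call(framework_id, inverse_offer_ids,
                               refuse_seconds=DEFAULT_REFUSE_SECONDS):
    """Build the JSON body for an ACCEPT_INVERSE_OFFERS call.

    Hypothetical helper: the field names follow the v1 scheduler API as
    an assumption. A large refuse_seconds asks the master to hold back
    further inverse offers for that long, which can look like the
    inverse offer is never resent.
    """
    if refuse_seconds < 0:
        raise ValueError("refuse_seconds must be non-negative")
    return {
        "framework_id": {"value": framework_id},
        "type": "ACCEPT_INVERSE_OFFERS",
        "accept_inverse_offers": {
            "inverse_offer_ids": [{"value": oid} for oid in inverse_offer_ids],
            "filters": {"refuse_seconds": float(refuse_seconds)},
        },
    }


if __name__ == "__main__":
    # Placeholder IDs; a real scheduler takes these from the subscription
    # and from the INVERSE_OFFERS event.
    call = accept_inverse_offers_call("framework-1", ["inverse-offer-1"])
    print(json.dumps(call, indent=2))
```

Leaving `refuse_seconds` at the default keeps the re-offer interval short, which matters if the framework relies on a fresh inverse offer after a failover instead of persisting the earlier one.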

On Fri, Mar 3, 2017 at 6:08 PM, Joseph Wu <jos...@mesosphere.io> wrote:

> Inverse offers have the same offer cycle as normal offers. They can
> be Accepted/Declined with a timeout (default 5 seconds).
>
> On Fri, Mar 3, 2017 at 5:29 PM, Zameer Manji <zma...@apache.org> wrote:
> > Ben,
> >
> > Thanks for responding to my questions. I have a follow up on #3.
> >
> > I have a framework which accepts inverse offers but does not do anything to
> > the associated tasks. I noticed that the framework **does not** receive
> > another inverse offer within the allocation period. At what interval will
> > an inverse offer be resent to the framework if it was accepted? I took a
> > glance at `src/tests/master_maintenance_tests.cpp` and did not notice any
> > tests testing for this.
> >
> > Are you sure that inverse offers are resent after they have been accepted
> > but before the tasks are removed from the host?
> >
> > On Thu, Mar 2, 2017 at 4:14 PM, Benjamin Mahler <bmah...@apache.org> wrote:
> >>
> >> Hey Zameer, great questions. Let us know if there's anything you think
> >> could be improved or documented better.
> >>
> >> Re 1:
> >>
> >> The 'Viewing maintenance status' section of the documentation should
> >> clarify this:
> >> http://mesos.apache.org/documentation/latest/maintenance/
> >>
> >> Re 2:
> >>
> >> Both of these sound reasonable but the scheduler should not accept the
> >> maintenance if it's not yet safe for the machine to be downed. Otherwise a
> >> task failure may be mistakenly interpreted as a go ahead to down the
> >> machine, despite the scheduler needing to get the task back running. If
> >> expensive or long running work needs to finish (e.g. migrate data, replace
> >> instances in a manner that doesn't violate SLA, etc.) then I would suggest
> >> waiting until the work completes safely before accepting.
> >>
> >> We likely need a third state like, TENTATIVELY_ACCEPT to signal to
> >> operators / mesos that the framework intends to comply, but hasn't finished
> >> whatever it needs to do yet for it to be safe to down the machine.
> >>
> >> Also, one of the challenges here is when to take the action. Should the
> >> scheduler prepare itself for maintenance as soon as it safely can? Or as
> >> late (but not too late!) as it safely can? If the scheduler runs
> >> long-running services, as soon as safely possible makes sense. If the
> >> scheduler runs short running batch jobs, as late as safely possible provides
> >> work-conservation.
> >>
> >> Re 3:
> >>
> >> The framework will receive another inverse offer if the framework still
> >> has resources allocated on that agent. If receiving a regular offer for
> >> available resources on the agent, an 'Unavailability' [1] will be included
> >> if the machine is scheduled for maintenance, so that the scheduler can be
> >> aware of the maintenance when placing new work.
> >>
> >> Re 4:
> >>
> >> It's not possible currently, and it's the operator's responsibility (the
> >> intention was for "operator" to be maintenance tooling). Ideally we can add
> >> automation of this decision into mesos, if decision criteria that are widely
> >> applicable can be established (e.g. if nothing is running and all relevant
> >> frameworks have accepted). Feel free to file a ticket for this or any other
> >> improvements!
> >>
> >> Ben
> >>
> >> [1]
> >> https://github.com/apache/mesos/blob/8f487beb9f8aaed8f27b0404279b1a2f97672ba1/include/mesos/v1/mesos.proto#L1416-L1426
> >>
> >> On Wed, Mar 1, 2017 at 5:41 PM, Zameer Manji <zma...@apache.org> wrote:
> >>>
> >>> Hey,
> >>>
> >>> I'm trying to understand some nuances of the maintenance API. Here are my
> >>> questions:
> >>>
> >>> 1. The documentation mentions that accepting or declining an inverse
> >>> offer is a "hint" to the operator. How do operators view if a framework has
> >>> declined, accepted or ignored an inverse offer?
> >>>
> >>> 2. Should a framework accept an inverse offer and then start removing
> >>> tasks from an agent or should the framework only accept the inverse offer
> >>> after the removal of tasks is complete? I think the former makes sense, but
> >>> it implies that operators need to poll the state of the agent to ensure
> >>> there are no active tasks whereas the latter implies operators only need to
> >>> check if all inverse offers were accepted.
> >>>
> >>> 3. After accepting the inverse offer, will a framework get another
> >>> inverse offer for the same agent? Currently I'm trying to determine if
> >>> inverse offer information needs to be persisted so a framework can continue
> >>> its draining work between failovers or if it can just wait for an inverse
> >>> offer after starting up.
> >>>
> >>> 4. Is it possible for the agent to automatically transition from DRAIN to
> >>> DOWN if at the start of the unavailability period the agent is free of tasks
> >>> or is that still the operator's responsibility?
> >>>
> >>> --
> >>> Zameer Manji
> >
> > --
> > Zameer Manji