+1

Thanks,
Vinod

> On Jun 14, 2019, at 9:18 AM, Greg Mann <g...@mesosphere.io> wrote:
> 
> Hi all,
> A few other committers and I spent some time revisiting the possibility 
> of implementing agent draining using maintenance windows, as well as 
> discussing the coexistence of the existing maintenance primitives with the 
> agent draining feature as it is currently designed. Ultimately, the use case 
> of an operator putting an agent into a draining state immediately and 
> indefinitely, with no concept of a maintenance window, seems to be valid. 
> That use case is a bit awkward to represent in terms of our existing 
> maintenance windows. So, our thought is that we can add the agent draining 
> feature as it is currently designed, in order to provide an automatic agent 
> draining primitive. We can then later on extend the maintenance schedules to 
> allow operators to specify that they would like to automatically drain agents 
> leading up to the maintenance window. At that point, we could make use of the 
> agent draining primitive to accomplish this.
> 
> For the time being, we would like to disallow any single agent from both 
> being present in the maintenance schedule and being put into an automatic 
> draining state. This gives us some time to figure out precisely how these two 
> features will interact so that we avoid the need to make breaking changes 
> down the road.
> 
> Let me know what you all think of the above plan. I like it because it allows 
> operators who are currently using the maintenance primitives to continue 
> doing so, accommodates the simple case of immediate agent draining in the 
> near future, and allows us to incorporate automatic draining into the 
> maintenance schedule later.
> 
> Cheers,
> Greg
> 
>> On Fri, Jun 14, 2019 at 4:18 PM Greg Mann <g...@mesosphere.io> wrote:
>> Christoph,
>> Great to hear that you're using the maintenance primitives! It seems unwise 
>> for us to deprecate this part of the API given that you and Maxime 
>> have both expressed a desire for it to stick around. I'll adjust the agent 
>> draining design doc to remove the deprecation of that feature. Many thanks 
>> for your feedback.
>> 
>> Greg
>> 
>>> On Fri, Jun 7, 2019 at 9:24 PM Heer, Christoph <christoph.h...@sap.com> 
>>> wrote:
>>> Hi everyone,
>>> 
>>> my team and I implemented our own Mesos framework for task execution on our 
>>> bare-metal on-prem cluster.
>>> Especially for task-processing workloads with known or estimated task 
>>> durations, the available Mesos maintenance primitives are super powerful 
>>> for schedulers and operators. While developing the scheduler, I never felt 
>>> it was complex to support and respect maintenance windows. Even the small 
>>> check "Should I launch task X with an estimated runtime of 3h on node Y 
>>> with scheduled maintenance in 40 min?" saved us tons of aborted tasks. Our 
>>> hardware operations team also really likes being able to plan and express 
>>> maintenance windows upfront. Days before the actual maintenance they can 
>>> add the information, and the node will be ready at that point in time. 
>>> They can also reboot the machines without fear that any production 
>>> workload will be scheduled before they confirm the end of the maintenance, 
>>> though it looks like the new design would also ensure this.
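>>> 
>>> To illustrate, a minimal sketch of that check (the unavailability fields 
>>> follow the JSON form of the Mesos Offer message in the v1 scheduler API; 
>>> the helper name and safety margin are just illustrative):
>>> 
>>>     import time
>>> 
>>>     NS_PER_SEC = 1_000_000_000
>>> 
>>>     def safe_to_launch(offer, estimated_runtime_secs, safety_margin_secs=300):
>>>         """Return True if the task should finish before the offered agent's
>>>         next scheduled maintenance window starts (plus a safety margin)."""
>>>         unavailability = offer.get("unavailability")
>>>         if unavailability is None:
>>>             return True  # no maintenance scheduled for this agent
>>> 
>>>         start_ns = unavailability["start"]["nanoseconds"]
>>>         seconds_until_maintenance = start_ns / NS_PER_SEC - time.time()
>>> 
>>>         # "Task X with estimated runtime 3h on node Y with maintenance
>>>         # in 40 min?" -> returns False.
>>>         return estimated_runtime_secs + safety_margin_secs < seconds_until_maintenance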
>>> 
>>> In the past we used another job orchestration system with a draining 
>>> approach similar to the design proposal. In nearly all cases the operations 
>>> team didn't manage to start the draining mode at the right time. Either it 
>>> was too early and we wasted available hardware resources, or it was too 
>>> late and it unnecessarily interrupted production workload. Especially for 
>>> long-running tasks that are expensive to restart, it wasn't a good way to 
>>> manage scheduled downtime.
>>> 
>>> I don't know the implementation within Mesos, so I can't judge the 
>>> complexity, but I think the main problem is that Mesos doesn't provide an 
>>> intuitive interface for managing maintenance windows. The HTTP API isn't 
>>> that complicated, but you definitely need your own or external tooling. 
>>> Most people are probably already deterred by the JSON syntax with 
>>> nanosecond timestamps. Also, the lack of synchronisation of modifications 
>>> can be a problem and makes it harder to implement tooling around the API. 
>>> A new, more fine-grained HTTP API would be a big improvement and would 
>>> make it possible to build a nice-looking interface into the Mesos UI.
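>>> 
>>> For reference, scheduling a single window through the master's 
>>> /maintenance/schedule endpoint looks roughly like this (a sketch; the 
>>> master address, hostname/IP, and times are made up, and the nanosecond 
>>> fields are the part people tend to find off-putting):
>>> 
>>>     import time
>>>     import requests  # third-party HTTP client
>>> 
>>>     MASTER = "http://mesos-master.example.com:5050"  # hypothetical address
>>>     NS_PER_SEC = 1_000_000_000
>>> 
>>>     # One 4-hour window starting in 24 hours for a single machine.
>>>     start_ns = int((time.time() + 24 * 3600) * NS_PER_SEC)
>>>     schedule = {
>>>         "windows": [{
>>>             "machine_ids": [{"hostname": "node-y.example.com", "ip": "10.0.0.42"}],
>>>             "unavailability": {
>>>                 "start": {"nanoseconds": start_ns},
>>>                 "duration": {"nanoseconds": 4 * 3600 * NS_PER_SEC},
>>>             },
>>>         }]
>>>     }
>>> 
>>>     resp = requests.post(MASTER + "/maintenance/schedule", json=schedule)
>>>     resp.raise_for_status()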
>>> 
>>> It would be sad to see this great feature disappearing.
>>> 
>>> Best regards,
>>> Christoph
>>> 
>>> 
>>> Christoph Heer
>>> SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany
>>> 
>>> 
>>> 
>>> > On 7. Jun 2019, at 09:56, Maxime Brugidou <maxime.brugi...@gmail.com> 
>>> > wrote:
>>> > 
>>> > I think you are both correct that most users don't and won't use the 
>>> > schedules to plan maintenance in advance. The main reason is that 
>>> > frameworks just don't use this schedule and don't take inverse offers 
>>> > into account, but also that most use cases are OK with simply draining 
>>> > nodes one after the other without more logic.
>>> > 
>>> > In the end, Benjamin is right: we are always hitting the same problem 
>>> > with Mesos; there is no good reference implementation of a scheduler 
>>> > with all the features baked in. On our side we mostly use open source 
>>> > schedulers (Marathon, Aurora, Flink, etc.) for various use cases, and 
>>> > they mostly don't leverage maintenance primitives. We started to use the 
>>> > primitives for one custom use case where we are indeed building our own 
>>> > framework and want to provide some sort of "task duration" SLA, which 
>>> > would clearly benefit from maintenance schedules. Honestly, if we are 
>>> > the only users doing that, we can easily maintain the schedules in a 
>>> > separate service. I haven't seen any framework actually using inverse 
>>> > offers, though.
>>> > 
>>> > I also agree that adding offer ordering in the allocator is probably not 
>>> > the best design, since what we want in the end is probably some 
>>> > affinity/anti-affinity at the scheduler level based on the "time to 
>>> > reboot", for example. But again, this needs cooperation from frameworks. 
>>> > My idea was more of a hack/prototype, since I see that slaves are 
>>> > randomly sorted in the allocator and we could easily patch it to use a 
>>> > custom sort. But I completely agree that optimistic offers or similar 
>>> > techniques are the way to go.
>>> > 
>>> > I don't think we will ever get to the point of having a reference 
>>> > scheduler: the Mesos community would need to agree on one implementation 
>>> > and make sure that every new feature of Mesos gets implemented in that 
>>> > scheduler, which is a huge amount of work and coordination/design. The 
>>> > Mesosphere dcos-commons library is one example of the complexity of such 
>>> > a project: it is dedicated to stateful services, is clearly coupled with 
>>> > DC/OS (although we are able to use it on bare Mesos too), and is still 
>>> > difficult to use. However, an open source scheduler exposing a 
>>> > higher-level, friendly API via RPC (like Kubernetes, for example) is 
>>> > probably the only way to make Mesos more accessible for most users.
>>> > 
>>> > On Fri, Jun 7, 2019 at 6:24 AM Benjamin Mahler <bmah...@apache.org> wrote:
>>> > > With the new proposal, it's going to be as difficult as before to have 
>>> > > SLA-aware maintenances because it will need cooperation from the 
>>> > > frameworks anyway and we know this is rarely a priority for them. We 
>>> > > will also lose the ability to signal future maintenance in order to 
>>> > > optimize allocations.
>>> > 
>>> > Personally, I think right now we should solve the basic need of draining 
>>> > a node. The plan for adding SLA-awareness to draining was to introduce a 
>>> > capability that schedulers opt into, enabling them to (1) take control 
>>> > over the killing of tasks when an agent is put into the draining state 
>>> > and (2) still get offers while an agent is in the draining state, in 
>>> > case the scheduler needs to restart a task that *must* run. This allows 
>>> > an SLA-aware scheduler to avoid killing during a drain if killing would 
>>> > violate its tasks' SLAs.
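>>> > 
>>> > To make that concrete, a hypothetical scheduler-side sketch (this 
>>> > capability is not an agreed API yet, so every name here is a 
>>> > placeholder, not something Mesos provides):
>>> > 
>>> >     def on_agent_draining(agent_id, tasks_by_agent, sla_tracker, driver):
>>> >         # The scheduler, not Mesos, decides when each task on a draining
>>> >         # agent actually gets killed.
>>> >         for task in tasks_by_agent.get(agent_id, []):
>>> >             if sla_tracker.can_tolerate_restart(task):
>>> >                 # Safe to kill now; without the opt-in capability, Mesos
>>> >                 # would do this on its own.
>>> >                 driver.kill_task(task.task_id)
>>> >             else:
>>> >                 # Defer the kill until the SLA allows it. Because the
>>> >                 # scheduler opted in, it keeps receiving offers from the
>>> >                 # draining agent and can restart a must-run task there.
>>> >                 sla_tracker.schedule_kill_when_safe(task)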
>>> > 
>>> > Perhaps this functionality can live alongside the maintenance schedule 
>>> > information we currently support, without being coupled together. As far 
>>> > as I'm aware that's something we hadn't considered (we considered 
>>> > integrating into the maintenance schedules or replacing them).
>>> > 
>>> > > For example, I had this idea to improve the allocator (or write a 
>>> > > custom one) so that it would offer resources from agents with no 
>>> > > planned maintenance first, and then sort agents by maintenance date in 
>>> > > decreasing order.
>>> > 
>>> > Right now there is no meaning to the order of offers. Adding some meaning 
>>> > to the ordering of offers quickly becomes an issue for us as soon as 
>>> > there are multiple criteria that need to be evaluated. For example, if 
>>> > you want to incorporate maintenance, load spreading, fault domain 
>>> > spreading, etc across machines, it becomes less clear how offers should 
>>> > be ordered. One could try to build some scoring model in mesos for 
>>> > ordering, but it will be woefully inadequate since Mesos does not know 
>>> > anything about the pending workloads: it's ultimately the schedulers that 
>>> > are best positioned to make these decisions. This is why we are going to 
>>> > move towards an "optimistic concurrency" model where schedulers can 
>>> > choose what they want and Mesos enforces constraints (e.g. quota limits), 
>>> > thereby eliminating the multi-scheduler scalability issues of the current 
>>> > offer model.
>>> > 
>>> > And as somewhat of an aside, the lack of built-in scheduling has been bad 
>>> > for the Mesos ecosystem. The vast majority of users just need to 
>>> > schedule: services, jobs and cron jobs. These have a pretty standard look 
>>> > and feel (including the SLA aspect of them!). Many of the existing 
>>> > schedulers could be thinner "orchestrators" that know when to submit 
>>> > something to be scheduled by a common scheduler, rather than 
>>> > reimplementing all of the typical scheduling primitives (constraints, SLA 
>>> > awareness, dealing with the low-level Mesos scheduling API). My point 
>>> > here is that we ask too much of frameworks and it hurts users. I would 
>>> > love to see scheduling become more standardized and built into Mesos.
>>> > 
>>> > On Thu, Jun 6, 2019 at 10:52 AM Greg Mann <g...@mesosphere.io> wrote:
>>> > Maxime,
>>> > Thanks for the feedback, it's much appreciated. I agree that it would be 
>>> > possible to evolve the existing primitives to accomplish something 
>>> > similar to the proposal. That is one option that was considered before 
>>> > writing the design doc, but after some discussion, it seemed more 
>>> > appropriate to start over with a simpler model that 
>>> > accomplishes what we perceive to be the predominant use case: the 
>>> > automated draining of agent nodes, without the concept of a maintenance 
>>> > window or designated maintenance time in the future. However, perhaps 
>>> > this perception is incorrect?
>>> > 
>>> > Using maintenance metadata to alter the sorting order in the allocator is 
>>> > an interesting idea; currently, the allocator does not have access to 
>>> > information about maintenance, but it's conceivable that we could extend 
>>> > the allocator interface to accommodate this. While the currently 
>>> > proposed design would not allow such sorting, it would allow operators 
>>> > to deactivate nodes, which is an extreme version of the same idea, since 
>>> > deactivated agents would never have their resources offered to 
>>> > frameworks. This provides a 
>>> > blunt mechanism to prevent scheduling on nodes which have upcoming 
>>> > maintenance, although it sounds like you see some benefit to a more 
>>> > subtle notion of scheduling priority based on upcoming maintenance? Do 
>>> > you think that maintenance-aware sorting would provide much more benefit 
>>> > to you over agent deactivation? Do you make use of the existing 
>>> > maintenance primitives to signal upcoming maintenance on agents?
>>> > 
>>> > Thanks!
>>> > Greg
>>> > 
>>> > On Thu, Jun 6, 2019 at 9:37 AM Maxime Brugidou 
>>> > <maxime.brugi...@gmail.com> wrote:
>>> > Hi,
>>> > 
>>> > As a Mesos operator, I am really surprised by this proposal.
>>> > 
>>> > The main advantage of the proposed design is that we can finally take 
>>> > nodes down for maintenance with a configurable kill grace period and a 
>>> > proper task status (with the maintenance primitives it was TASK_LOST, I 
>>> > think) without any specific cooperation from the frameworks.
>>> > 
>>> > I think that this could be just an evolution of the current primitives.
>>> > 
>>> > With the new proposal, it's going to be as difficult as before to have 
>>> > SLA-aware maintenances because it will need cooperation from the 
>>> > frameworks anyway and we know this is rarely a priority for them. We will 
>>> > also lose the ability to signal future maintenance in order to optimize 
>>> > allocations.
>>> > 
>>> > For example, I had this idea to improve the allocator (or write a custom 
>>> > one) so that it would offer resources from agents with no planned 
>>> > maintenance first, and then sort agents by maintenance date in 
>>> > decreasing order. This would be a big improvement to prevent cluster 
>>> > reboots from triggering too many task restarts. This will not be 
>>> > possible with the new primitives. The same idea applies to frameworks too.
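>>> > 
>>> > A standalone sketch of that ordering (plain Python, not actual allocator 
>>> > code; the "maintenance_start" field is made up for the example):
>>> > 
>>> >     def allocation_order(agents):
>>> >         """Agents with no planned maintenance come first; the rest are
>>> >         ordered so that the agent whose maintenance is furthest away is
>>> >         offered first."""
>>> >         def key(agent):
>>> >             start = agent.get("maintenance_start")  # Unix timestamp or None
>>> >             # (0, _) sorts before (1, _): no-maintenance agents lead.
>>> >             return (0, 0) if start is None else (1, -start)
>>> >         return sorted(agents, key=key)
>>> > 
>>> >     agents = [
>>> >         {"id": "agent-1", "maintenance_start": 1_700_000_000},
>>> >         {"id": "agent-2"},  # no maintenance planned
>>> >         {"id": "agent-3", "maintenance_start": 1_700_500_000},
>>> >     ]
>>> >     print([a["id"] for a in allocation_order(agents)])
>>> >     # ['agent-2', 'agent-3', 'agent-1']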
>>> > 
>>> > Maxime
>>> > 
>>> > On Thu, May 30, 2019 at 10:16 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>>> > As far as I can tell, the document is public.
>>> > 
>>> > On Thu, May 30, 2019 at 12:22 AM Marc Roos <m.r...@f1-outsourcing.eu> 
>>> > wrote:
>>> >  
>>> > Is the doc not public?
>>> > 
>>> > 
>>> > -----Original Message-----
>>> > From: Joseph Wu [mailto:jos...@mesosphere.io] 
>>> > Sent: Thursday, May 30, 2019 2:07
>>> > To: dev; user
>>> > Subject: Design doc: Agent draining and deprecation of maintenance 
>>> > primitives
>>> > 
>>> > Hi all,
>>> > 
>>> > A few years back, we added some constructs called maintenance primitives 
>>> > to Mesos.  This feature was meant to allow operators and frameworks to 
>>> > cooperate in draining tasks off nodes scheduled for maintenance.  As far 
>>> > as we've observed since, this feature never achieved enough adoption to 
>>> > be useful for operators.
>>> > 
>>> > As such, we are proposing a more opinionated approach for draining 
>>> > tasks.  The goal is to have Mesos perform draining in lieu of 
>>> > frameworks, minimizing or eliminating the need to change frameworks to 
>>> > account for draining.  We will also be simplifying the operator 
>>> > workflow, which would only require a single call (holding an AgentID) to 
>>> > start draining; and a single call to bring an agent back into the 
>>> > cluster.
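>>> > 
>>> > To give a feel for the intended workflow, something along these lines (a 
>>> > sketch based on the design doc's direction; the call names and payloads 
>>> > may change, and the master address is made up):
>>> > 
>>> >     import requests  # third-party HTTP client
>>> > 
>>> >     MASTER_API = "http://mesos-master.example.com:5050/api/v1"
>>> > 
>>> >     def drain_agent(agent_id):
>>> >         # Single call, holding only the AgentID, to start draining.
>>> >         body = {"type": "DRAIN_AGENT",
>>> >                 "drain_agent": {"agent_id": {"value": agent_id}}}
>>> >         requests.post(MASTER_API, json=body).raise_for_status()
>>> > 
>>> >     def reactivate_agent(agent_id):
>>> >         # Single call to bring the agent back into the cluster.
>>> >         body = {"type": "REACTIVATE_AGENT",
>>> >                 "reactivate_agent": {"agent_id": {"value": agent_id}}}
>>> >         requests.post(MASTER_API, json=body).raise_for_status()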
>>> > 
>>> > 
>>> > Due to how closely this proposed feature overlaps with maintenance 
>>> > primitives, we will be deprecating maintenance primitives upon 
>>> > implementation of agent draining.
>>> > 
>>> > 
>>> > If interested, please take a look at the design document:
>>> > 
>>> > https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
>>> > 
>>> > 
>>> 
