Hello Benjamin,

Sure thing! I will file a ticket and write the patch.
Thank you, Clément.

On Thu, Oct 11, 2018 at 9:39 PM Benjamin Mahler <bmah...@apache.org> wrote:

> Thanks for the thorough explanation.
>
> Yes, it sounds acceptable and useful for assigning disk I/O and network
> I/O. The error case of there not being enough resources post-injection
> seems unfortunate, but I don't see a way around it.
>
> Can you file a ticket with this background?
>
> On Thu, Oct 11, 2018 at 1:30 AM Clément Michaud
> <clement.michau...@gmail.com> wrote:
>
> > Hello,
> >
> > TL;DR: we have added network bandwidth as a first-class resource in
> > our clusters with a custom isolator, and we have patched the Mesos
> > master to introduce implicit allocation of custom resources, making
> > network bandwidth mandatory for all tasks. I'd like to know what you
> > think of what we have implemented, and whether you would consider
> > introducing a new hook in the Mesos master for injecting mandatory
> > custom resources into tasks.
> >
> > At Criteo, we have implemented a custom solution in our Mesos
> > clusters to prevent network noisy neighbors and to let our users
> > define a reserved amount of network bandwidth per application. Note
> > that we run our clusters on a flat network and do not use any kind
> > of network overlay.
> >
> > To address these use cases, we enabled the `net_cls` isolator and
> > wrote an isolator based on tc, conntrack, and iptables; each
> > container gets a dedicated reserved amount of network bandwidth
> > declared by configuration in Marathon or Aurora.
> >
> > In the first implementation of our solution, these resources were
> > not declared on the agents and therefore not taken into account by
> > Mesos. Instead, the isolator allocated network bandwidth to each
> > task in proportion to its number of reserved CPUs relative to the
> > number of CPUs available on the server.
> > The per-container network bandwidth limit was applied, but Mesos was
> > not aware of it. Using CPU as a proxy for network bandwidth
> > protected us from situations where an agent could allocate more
> > network bandwidth than it actually had. However, this model reached
> > its limits when we introduced heavy consumers of network bandwidth
> > into our clusters: they had to request more CPUs just to get more
> > network bandwidth, which introduced scheduling issues.
> >
> > Hence, we decided to leverage Mesos custom resources to let our
> > users declare their requirements, and also to decouple network
> > bandwidth from CPU to avoid those scheduling issues. We first
> > declared the network bandwidth resource on every Mesos agent, even
> > though tasks were not yet declaring any. We then faced a first
> > issue: the lack of support for network bandwidth and/or custom
> > resources in Marathon and Aurora (and, it seems, in most
> > frameworks). This led to a second issue: we needed Mesos to account
> > for the network bandwidth of all tasks even while some frameworks
> > did not support it yet. Solving the second problem allowed us to run
> > a smooth migration by patching the frameworks independently in a
> > second phase.
> >
> > Along the way we found that the "custom resources" system did not
> > meet our needs, because it only allows for "optional" resources, not
> > "mandatory" resources: resources that should be accounted for on all
> > tasks in a cluster even when not requested explicitly, like CPU,
> > RAM, or disk space. Network bandwidth and disk I/O are good
> > candidates.
> >
> > To enforce the use of network bandwidth across all tasks, we wanted
> > to allocate an implicit amount of network bandwidth to tasks that do
> > not declare any in their configuration.
> > One possible implementation was to have the Mesos master
> > automatically compute the allocated network bandwidth for a task
> > when its offer is accepted, and subtract that amount from the
> > overall available resources in Mesos. We consider this implicit
> > allocation a fallback mechanism for frameworks that do not support
> > "mandatory" resources; in an ideal environment, every framework
> > would support them. Unfortunately, adding support for a new resource
> > (or for custom resources) to all frameworks may not be achievable in
> > a timely manner, especially in an ecosystem with multiple
> > frameworks.
> >
> > Consequently, we wrote a patch for the Mesos master that allocates
> > an implicit amount of network bandwidth when none is provided in the
> > TaskInfo. In our case, this implicit amount is computed with the
> > following Criteo-specific rule:
> > `task_used_cpu / slave_total_cpus * slave_total_bandwidth`.
> >
> > Here is what happened while our frameworks did not yet support
> > network bandwidth: offers were sent to frameworks, which accepted or
> > rejected them regardless of the network bandwidth available on the
> > slave. When an offer was accepted, the TaskInfo sent by the
> > framework naturally did not contain any network bandwidth, but the
> > Mesos master implicitly injected some and let the task follow its
> > way. Two cases could then occur: either the slave had enough
> > resources and the task was scheduled as expected, or it did not, the
> > task failed to deploy, and Mesos sent a TASK_ERROR back to the
> > framework. It was then the scheduler's responsibility to retry with
> > subsequent offers. This solution created a bit of extra work for the
> > master, but we tested it and ran it in production for a few weeks in
> > several clusters of around 250 servers each, and it worked well, at
> > least with Marathon and Aurora.
> > At this point the migration was expected to be smooth, because
> > introducing network bandwidth cluster-wide only required restarting
> > all the tasks. It ended up being as smooth as expected.
> >
> > In the meantime, we patched Marathon and Aurora to add full support
> > for network bandwidth and avoid the potential TASK_ERROR messages,
> > while keeping in mind that we will soon host other frameworks that
> > will probably not support network bandwidth from the start. So we
> > will likely keep our patch in the future, and we think it might be a
> > good idea to introduce a hook in the Mesos master for adding
> > implicit resources to tasks.
> >
> > What we propose is to introduce a method called
> > masterLaunchTaskResourceDecorator in the hook interface and call it
> > at the right location, letting the user add whatever implicit
> > resources they want.
> >
> > This would give the following signature:
> >
> > ```
> > Result<Resources> masterLaunchTaskResourceDecorator(
> >     const Resources& slaveResources,
> >     TaskInfo& task)
> > ```
> >
> > Can you please tell us whether such an integration point would be
> > acceptable to merge upstream?
> >
> > You can have a look at our current implementation here:
> > https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth
> > (just as a reference; it is not the base of a patch for upstream
> > Mesos).
> >
> > Thank you,
> >
> > Clément.