Thanks for the thorough explanation. Yes, it sounds acceptable and useful for assigning disk I/O and network I/O. The error case of there not being enough resources post-injection seems unfortunate, but I don't see a way around it.
Can you file a ticket with this background?

On Thu, Oct 11, 2018 at 1:30 AM Clément Michaud <clement.michau...@gmail.com> wrote:

> Hello,
>
> TL;DR: we have added network bandwidth as a first-class resource in our clusters with a custom isolator, and we have patched the Mesos master to introduce the concept of implicit allocation of custom resources in order to make network bandwidth mandatory for all tasks. I'd like to know what you think about what we have implemented, and whether we could introduce a new hook in the Mesos master for injecting mandatory custom resources into tasks.
>
> At Criteo we have implemented a custom solution in our Mesos clusters to prevent network noisy neighbors and to allow our users to define a custom amount of reserved network bandwidth per application. Please note that we run our clusters on a flat network and are not using any kind of network overlay.
>
> In order to address these use cases, we enabled the `net_cls` isolator and wrote an isolator using tc, conntrack and iptables, with each container getting a dedicated reserved amount of network bandwidth declared by configuration in Marathon or Aurora.
>
> In the first implementation of our solution, the resources were not declared on the agents and obviously not taken into account by Mesos; instead, the isolator allocated an amount of network bandwidth to each task proportional to its number of reserved CPUs relative to the number of available CPUs on the server. Basically, the per-container network bandwidth limitation was applied, but Mesos was not aware of it. Using CPU as a proxy for the amount of network bandwidth protected us from situations where an agent could allocate more network bandwidth than available on the agent. However, this model reached its limits when we introduced big consumers of network bandwidth into our clusters: they had to raise their number of CPUs just to get more network bandwidth, and therefore it introduced scheduling issues.
>
> Hence, we decided to leverage Mesos custom resources to let our users declare their requirements, but also to decouple network bandwidth from CPU to avoid scheduling issues. We first declared the network bandwidth resource on every Mesos agent, even though tasks were not declaring any. Then we faced a first issue: the lack of support for network bandwidth and/or custom resources in Marathon and Aurora (and, it seems, in most frameworks). This led to a second issue: we needed Mesos to account for the network bandwidth of all tasks even while some frameworks did not support it yet. Solving the second problem allowed us to run a smooth migration by patching frameworks independently in a second phase.
>
> Along the way we found out that the “custom resources” system wasn’t meeting our needs, because it only allows for “optional resources”, not “mandatory resources”: resources that should be accounted for on all tasks in a cluster, even if not explicitly required, like CPU, RAM, or disk space (network bandwidth and disk I/O are good candidates, for instance).
>
> To enforce the accounting of network bandwidth across all tasks, we wanted to allocate an implicit amount of network bandwidth to any task not declaring one in its configuration. One possible implementation was to make the Mesos master automatically compute the allocated network bandwidth for the task when the offer is accepted and subtract this amount from the overall available resources in Mesos.
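(To make the proportional rule described below concrete with made-up numbers: a task reserving 2 CPUs on a 32-CPU agent whose link provides 10 Gbps would be implicitly charged 2/32 * 10 Gbps = 625 Mbps of network bandwidth.)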
> We consider this implicit allocation a fallback mechanism for frameworks that do not support "mandatory" resources. Indeed, in a proper environment all frameworks would support these mandatory resources. Unfortunately, adding support for a new resource (or for custom resources) in all frameworks might not be manageable in a timely manner, especially in an ecosystem with multiple frameworks.
>
> Consequently, we wrote a patch for the Mesos master to allocate an implicit amount of network bandwidth when none is provided in the TaskInfo. In our case this implicit amount is computed using the following Criteo-specific rule: `task_used_cpu / slave_total_cpus * slave_total_bandwidth`.
>
> Here is what happened while our frameworks did not yet support network bandwidth: offers were sent to frameworks, and they accepted or rejected them regardless of the network bandwidth available on the slave. When an offer was accepted, the TaskInfo sent by the framework obviously did not contain any network bandwidth, but the Mesos master implicitly injected some and let the task follow its way. There were two cases then: either the slave had enough resources to run the task and it was scheduled as expected, or it did not have enough resources, the task failed to be deployed, and Mesos sent a TASK_ERROR back to the framework. It was then the responsibility of the scheduler to retry with subsequent offers. This solution created a bit of extra work for the master, but we tested it and ran it in production for a few weeks in several clusters of around 250 servers each, and it seemed to work well, at least with Marathon and Aurora. At this point the migration was expected to be smooth, because it only required a restart of all the tasks for network bandwidth to be introduced cluster-wide. It ended up being as smooth as expected.
>
> In the meantime, we obviously patched Marathon and Aurora to add full support for network bandwidth and avoid the potential TASK_ERROR messages, while keeping in mind that we'll soon host other frameworks that will probably not support network bandwidth from the beginning. So we'll likely keep our patch in the future, and we think it might be a good idea to introduce a hook in the Mesos master for adding implicit resources to tasks.
>
> What we propose is to introduce a method called masterLaunchTaskResourceDecorator in the hook interface and call it at the right location, letting the user add whatever implicit resources they want.
>
> This would give the following signature:
>
> ```
> Result<Resources> masterLaunchTaskResourceDecorator(
>     const Resources& slaveResources,
>     TaskInfo& task)
> ```
>
> Can you please tell us whether such an integration point would be acceptable for merging upstream?
>
> You can have a look at our current implementation here:
> https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth
> (just as a reference; it is not the base of a patch for upstream Mesos).
>
> Thank you,
>
> Clément.
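To make the proposal concrete, here is a rough sketch of what a module implementing the proposed method might look like. This is only an illustration: masterLaunchTaskResourceDecorator does not exist in today's Hook interface, the `NetworkBandwidthHook` class and `scalar` helper are hypothetical names, and error handling is elided.

```
#include <string>

#include <mesos/hook.hpp>
#include <mesos/mesos.hpp>
#include <mesos/resources.hpp>

#include <stout/none.hpp>
#include <stout/option.hpp>
#include <stout/result.hpp>

using mesos::Resource;
using mesos::Resources;
using mesos::TaskInfo;
using mesos::Value;

// Illustrative helper (not part of the Mesos API): returns the value of
// a named scalar resource, if present.
static Option<double> scalar(
    const Resources& resources, const std::string& name)
{
  for (const Resource& resource : resources) {
    if (resource.name() == name && resource.type() == Value::SCALAR) {
      return resource.scalar().value();
    }
  }
  return None();
}

class NetworkBandwidthHook : public mesos::Hook
{
public:
  // The proposed method, assuming it were added to the Hook interface
  // with the signature suggested above.
  Result<Resources> masterLaunchTaskResourceDecorator(
      const Resources& slaveResources,
      TaskInfo& task)
  {
    Resources taskResources(task.resources());

    // If the framework already declared network bandwidth, leave the
    // task untouched.
    if (scalar(taskResources, "network_bandwidth").isSome()) {
      return None();
    }

    Option<double> taskCpus = taskResources.cpus();
    Option<double> slaveCpus = slaveResources.cpus();
    Option<double> slaveBandwidth =
      scalar(slaveResources, "network_bandwidth");

    if (taskCpus.isNone() || slaveCpus.isNone() || slaveBandwidth.isNone()) {
      return None();
    }

    // Criteo's rule: charge network bandwidth proportionally to the
    // task's share of the agent's CPUs.
    double implicitBandwidth =
      taskCpus.get() / slaveCpus.get() * slaveBandwidth.get();

    Resource bandwidth;
    bandwidth.set_name("network_bandwidth");
    bandwidth.set_type(Value::SCALAR);
    bandwidth.mutable_scalar()->set_value(implicitBandwidth);

    return taskResources + bandwidth;
  }
};
```

Returning None() is meant to signal "no changes", mirroring how the existing decorator hooks (e.g. masterLaunchTaskLabelDecorator) behave.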