Hello,

TL;DR: we have added network bandwidth as a first-class resource in our clusters with a custom isolator, and we have patched the Mesos master to introduce the concept of implicit allocation of custom resources in order to make network bandwidth mandatory for all tasks. I'd like to know what you think about what we have implemented, and whether you think we could introduce a new hook in the Mesos master aimed at injecting mandatory custom resources into tasks.
At Criteo, we have implemented a custom solution in our Mesos clusters to prevent network noisy neighbors and to let our users define a custom amount of reserved network bandwidth per application. Please note that we run our clusters on a flat network and do not use any kind of network overlay. To address these use cases, we enabled the `net_cls` isolator and wrote an isolator based on tc, conntrack and iptables, so that each container gets a dedicated reserved amount of network bandwidth declared by configuration in Marathon or Aurora.

In the first implementation of our solution, the resources were not declared on the agents and therefore not taken into account by Mesos; instead, the isolator allocated to each task an amount of network bandwidth proportional to its number of reserved CPUs relative to the number of CPUs available on the server. Basically, the per-container network bandwidth limitation was applied, but Mesos was not aware of it. Using CPU as a proxy for network bandwidth protected us from situations where an agent could allocate more network bandwidth than it actually had. However, this model reached its limits when we introduced big consumers of network bandwidth in our clusters: they had to raise their number of CPUs just to get more network bandwidth, which introduced scheduling issues.

Hence, we decided to leverage Mesos custom resources to let our users declare their requirements, and also to decouple network bandwidth from CPU to avoid these scheduling issues. We first declared the network bandwidth resource on every Mesos agent, even though tasks were not declaring any. We then faced a first issue: the lack of support for network bandwidth and/or custom resources in Marathon and Aurora (and, it seems, in most frameworks). This led to a second issue: we needed Mesos to account for the network bandwidth of all tasks, even though some frameworks did not support it yet.
Solving the second problem allowed us to run a smooth migration by patching the frameworks independently in a second phase. Along the way we found out that the custom resources system did not meet our needs, because it only allows for "optional" resources, not "mandatory" resources, i.e. resources that should be accounted for on all tasks in a cluster even when not explicitly requested, like CPU, RAM or disk space (good candidates are network bandwidth or disk I/O, for instance).

To enforce the accounting of network bandwidth across all tasks, we wanted to allocate an implicit amount of network bandwidth to tasks not declaring any in their configuration. One possible implementation was to make the Mesos master automatically compute the allocated network bandwidth for the task when the offer is accepted, and subtract this amount from the overall available resources in Mesos. We consider this implicit allocation a fallback mechanism for frameworks that do not support "mandatory" resources: in an ideal environment, all frameworks would support them. Unfortunately, adding support for a new resource (or for custom resources) in all frameworks might not be manageable in a timely manner, especially in an ecosystem with multiple frameworks.

Consequently, we wrote a patch for the Mesos master to allocate an implicit amount of network bandwidth when it is not provided in the TaskInfo. In our case this implicit amount is computed with the following Criteo-specific rule: `task_used_cpu / slave_total_cpus * slave_total_bandwidth`. Here is what happened while our frameworks did not support network bandwidth yet: offers were sent to frameworks, which accepted or rejected them regardless of the network bandwidth available on the slave. When an offer was accepted, the TaskInfo sent by the framework obviously did not contain any network bandwidth, but the Mesos master implicitly injected some and let the task follow its way.
There were then two cases: either the slave had enough resources and the task was scheduled as expected, or it did not and the task failed to be deployed, with Mesos sending a TASK_ERROR back to the framework. It was then the responsibility of the scheduler to retry with subsequent offers. This solution created a bit of extra work for the master, but we tested it and ran it in production for a few weeks on several clusters of around 250 servers each, and it seemed to work well, at least with Marathon and Aurora. At this point the migration was expected to be smooth, because it only required restarting all tasks for network bandwidth to be accounted cluster-wide, and it ended up being as smooth as expected.

In the meantime, we patched Marathon and Aurora to add full support for network bandwidth and avoid the potential TASK_ERROR messages, keeping in mind that we will soon host other frameworks that will probably not support network bandwidth from the start. So we will likely keep our patch in the future, and we think it might be a good idea to introduce a hook in the Mesos master to add implicit resources to tasks. What we propose is to add a method called masterLaunchTaskResourceDecorator to the hook interface and call it at the right location, letting the user add whatever implicit resources they want. This would give the following signature:

```
Result<Resources> masterLaunchTaskResourceDecorator(
    const Resources& slaveResources,
    TaskInfo& task)
```

Could you please tell us whether such an integration point would be acceptable to merge upstream? You can have a look at our current implementation here: https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth (just as a reference; it is not the base of a patch for upstream Mesos).

Thank you,
Clément.