Thanks for the thorough explanation.

Yes, it sounds acceptable and useful for assigning disk I/O and network
I/O. The error case of there not being enough resources post-injection
seems unfortunate, but I don't see a way around it.

Can you file a ticket with this background?

On Thu, Oct 11, 2018 at 1:30 AM Clément Michaud <clement.michau...@gmail.com>
wrote:

> Hello,
>
> TL;DR: we have added network bandwidth as a first-class resource in our
> clusters with a custom isolator and we have patched Mesos master to
> introduce the concept of implicit allocation of custom resources to make
> network bandwidth mandatory for all tasks. I'd like to know what you think
> about what we have implemented and if you think we could introduce a new
> hook with the aim of injecting mandatory custom resources to tasks in Mesos
> master.
>
>
> At Criteo we have implemented a custom solution in our Mesos clusters to
> prevent network noisy neighbors and to allow our users to define a custom
> amount of reserved network bandwidth per application. Please note we run
> our clusters on a flat network and we are not using any kind of network
> overlay.
>
> In order to address these use cases, we enabled the `net_cls` isolator and
> wrote an isolator using tc, conntrack and iptables, each container having a
> dedicated custom reserved amount of network bandwidth declared by
> configuration in Marathon or Aurora.
>
> In the first implementation of our solution, the resources were not
> declared in the agents and obviously not taken into account by Mesos but
> the isolator allocated an amount of network bandwidth for each task
> relative to the number of reserved CPUs and the number of available CPUs on
> the server. Basically, the per container network bandwidth limitation was
> applied but Mesos was not aware of it. Using the CPU as a proxy for the
> amount of network bandwidth protected us from situations where an agent
> could allocate more network bandwidth than available on the agent. However,
> this model reached its limits when we introduced big consumers of network
> bandwidth in our clusters. They had to raise the number of CPUs to get more
> network bandwidth, which in turn introduced scheduling issues.
>
> Hence, we decided to leverage Mesos custom resources to let our users
> declare their requirements but also to decouple network bandwidth from CPU
> to avoid scheduling issues. We first declared the network bandwidth
> resource on every Mesos agent, even if tasks were not declaring any. Then
> we faced a first issue: the lack of support for network bandwidth and/or
> custom resources in Marathon and Aurora (and, it seems, in most frameworks
> actually). This led to a second issue: we needed Mesos to account for the
> network bandwidth of all tasks even if some frameworks were not supporting
> it yet. Solving the second problem allowed us to run a smooth migration by
> patching frameworks independently in a second phase.
>
> On the way we found out that the “custom resources” system wasn’t meeting
> our needs, because it only allows for “optional resources”, not
> “mandatory resources”: resources that should be accounted for on all tasks
> in a cluster, even when not required explicitly, like CPU, RAM or disk
> space. Network bandwidth and disk I/O, for instance, are good candidates.
>
> To enforce the usage of network bandwidth across all tasks we wanted to
> allocate an implicit amount of network bandwidth to tasks not declaring any
> in their configuration. One possible implementation was to make the Mesos
> master automatically compute the allocated network bandwidth for the task
> when the offer is accepted and subtract this amount from the overall
> available resources in Mesos. We consider this implicit allocation as a
> fallback mechanism for frameworks not supporting "mandatory" resources.
> Indeed, in a proper environment all frameworks would support these
> mandatory resources. Unfortunately, adding support for a new resource (or
> for custom resources) in all frameworks might not be manageable in a timely
> manner, especially in an ecosystem with multiple frameworks.
>
> Consequently, we wrote a patch in Mesos master to allocate an implicit
> amount of network bandwidth when it is not provided in the TaskInfo. In our
> case this implicit amount is computed based on the following Criteo
> specific rule: `task_used_cpu / slave_total_cpus * slave_total_bandwidth`.
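> To make the rule concrete, it can be sketched as follows (the function
> name and the 10 Gbps figure are ours, purely for illustration; any
> consistent bandwidth unit works):
>
> ```cpp
> #include <cassert>
>
> // Criteo-specific rule from above:
> // implicit = task_used_cpu / slave_total_cpus * slave_total_bandwidth.
> // Bandwidth is expressed in Mbps in this example.
> static double implicitNetworkBandwidth(
>     double taskUsedCpus,
>     double slaveTotalCpus,
>     double slaveTotalBandwidth) {
>   return taskUsedCpus / slaveTotalCpus * slaveTotalBandwidth;
> }
>
> int main() {
>   // A 2-CPU task on a 32-CPU agent with a 10 Gbps (10000 Mbps) NIC
>   // is implicitly allocated 625 Mbps.
>   assert(implicitNetworkBandwidth(2, 32, 10000) == 625.0);
>   return 0;
> }
> ```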
>
> Here is what happened while our frameworks did not yet support network
> bandwidth: offers were sent to frameworks and they accepted or rejected
> them regardless of the network bandwidth available on the slave. When an
> offer was accepted, the TaskInfo sent by the framework obviously did not
> contain any network bandwidth, but the Mesos master implicitly injected
> some and let the task proceed. There were then two cases: either the slave
> had enough resources and the task was scheduled as expected, or it did
> not, the task failed to be deployed, and Mesos sent a TASK_ERROR back to
> the framework. It was then the scheduler's responsibility to retry with
> subsequent offers. This solution created a bit of extra work for the
> master, but we tested it and ran it in production for a few weeks in
> several clusters of around 250 servers each, and it worked well, at least
> with Marathon and Aurora. At this point the migration was expected to be
> smooth, because it only required a restart of all the tasks for network
> bandwidth to be introduced cluster-wide. It ended up being as smooth as
> expected.
>
> In the meantime, we obviously patched Marathon and Aurora to add full
> support for network bandwidth and avoid the potential TASK_ERROR messages
> while keeping in mind that we'll soon host other frameworks that would
> probably not support network bandwidth from the beginning. So we'll likely
> keep our patch in the future and we think it might be a good idea to
> introduce a hook in Mesos master to add implicit resources to tasks.
>
> What we propose is to introduce a method called
> masterLaunchTaskResourceDecorator in the hook interface and to call it at
> the right location, letting the operator add whatever implicit resources
> they need.
>
> This would give the following signature:
>
> ```
> Result<Resources> masterLaunchTaskResourceDecorator(
>     const Resources& slaveResources,
>     TaskInfo& task)
> ```
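> To show the intended shape, here is a rough, self-contained sketch of
> what a decorator body could look like. `Resources` and `TaskInfo` are
> simplified stand-ins for the real Mesos protobuf types, and the resource
> name "network_bandwidth" is our own convention, not upstream API:
>
> ```cpp
> #include <cassert>
> #include <map>
> #include <string>
>
> // Simplified stand-ins for mesos::Resources and mesos::TaskInfo, only
> // to illustrate the decorator's shape; not the real Mesos types.
> using Resources = std::map<std::string, double>; // resource name -> amount
> struct TaskInfo { Resources resources; };
>
> // Hypothetical decorator body: inject an implicit amount of network
> // bandwidth when the framework did not declare any, using the
> // CPU-proportional rule described earlier in this thread.
> void masterLaunchTaskResourceDecorator(
>     const Resources& slaveResources,
>     TaskInfo& task) {
>   if (task.resources.count("network_bandwidth") > 0) {
>     return; // the framework already declared it; nothing to inject
>   }
>   const double implicit =
>       task.resources.at("cpus") /
>       slaveResources.at("cpus") *
>       slaveResources.at("network_bandwidth");
>   task.resources["network_bandwidth"] = implicit;
> }
>
> int main() {
>   Resources slave = {{"cpus", 32.0}, {"network_bandwidth", 10000.0}};
>   TaskInfo task;
>   task.resources = {{"cpus", 4.0}};
>   masterLaunchTaskResourceDecorator(slave, task);
>   // 4 / 32 * 10000 = 1250 Mbps injected implicitly.
>   assert(task.resources.at("network_bandwidth") == 1250.0);
>   return 0;
> }
> ```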
>
> Can you please tell us if such an integration point would be acceptable to
> be merged upstream?
>
>
> You can have a look at our current implementation here:
>
> https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth
>  (just as a reference; it is not the basis of a patch for upstream Mesos).
>
> Thank you,
>
> Clément.
>
