Hello,

TL;DR: we have added network bandwidth as a first-class resource in our clusters with a custom isolator, and we have patched the Mesos master to introduce the concept of implicit allocation of custom resources in order to make network bandwidth mandatory for all tasks. I'd like to know what you think about what we have implemented, and whether you think we could introduce a new hook in the Mesos master aimed at injecting mandatory custom resources into tasks.
At Criteo, we have implemented a custom solution in our Mesos clusters to prevent network noisy neighbors and to let our users define a custom amount of reserved network bandwidth per application. Please note that we run our clusters on a flat network and do not use any kind of network overlay. To address these use cases, we enabled the `net_cls` isolator and wrote an isolator based on tc, conntrack and iptables, so that each container gets a dedicated reserved amount of network bandwidth declared by configuration in Marathon or Aurora.

In the first implementation of our solution, the resources were not declared on the agents and therefore not taken into account by Mesos; instead, the isolator allocated to each task an amount of network bandwidth proportional to its number of reserved CPUs relative to the number of CPUs available on the server. Basically, the per-container network bandwidth limitation was applied, but Mesos was not aware of it. Using CPU as a proxy for network bandwidth protected us from situations where an agent could allocate more network bandwidth than it actually had. However, this model reached its limits when we introduced big consumers of network bandwidth in our clusters: they had to raise their number of CPUs just to get more network bandwidth, which introduced scheduling issues.

Hence, we decided to leverage Mesos custom resources to let our users declare their requirements, and also to decouple network bandwidth from CPU to avoid these scheduling issues. We first declared the network bandwidth resource on every Mesos agent, even though tasks were not declaring any. We then faced a first issue: the lack of support for network bandwidth and/or custom resources in Marathon and Aurora (and, it seems, in most frameworks). This led to a second issue: we needed Mesos to account for the network bandwidth of all tasks, even though some frameworks did not support it yet.
Solving the second problem allowed us to run a smooth migration by patching the frameworks independently in a second phase. Along the way we found out that the custom resources system did not meet our needs, because it only allows for "optional" resources, not "mandatory" resources, i.e. resources that should be accounted for on all tasks in a cluster even when not explicitly requested, like CPU, RAM or disk space (good candidates are network bandwidth or disk I/O, for instance).

To enforce the accounting of network bandwidth across all tasks, we wanted to allocate an implicit amount of network bandwidth to tasks not declaring any in their configuration. One possible implementation was to make the Mesos master automatically compute the allocated network bandwidth for the task when the offer is accepted, and subtract this amount from the overall available resources in Mesos. We consider this implicit allocation a fallback mechanism for frameworks that do not support "mandatory" resources: in an ideal environment, all frameworks would support them. Unfortunately, adding support for a new resource (or for custom resources) in all frameworks might not be manageable in a timely manner, especially in an ecosystem with multiple frameworks.

Consequently, we wrote a patch for the Mesos master to allocate an implicit amount of network bandwidth when it is not provided in the TaskInfo. In our case this implicit amount is computed with the following Criteo-specific rule: `task_used_cpu / slave_total_cpus * slave_total_bandwidth`. Here is what happened while our frameworks did not support network bandwidth yet: offers were sent to frameworks, which accepted or rejected them regardless of the network bandwidth available on the slave. When an offer was accepted, the TaskInfo sent by the framework obviously did not contain any network bandwidth, but the Mesos master implicitly injected some and let the task follow its way.
There were then two cases: either the slave had enough resources and the task was scheduled as expected, or it did not and the task failed to be deployed, with Mesos sending a TASK_ERROR back to the framework. It was then the responsibility of the scheduler to retry with subsequent offers. This solution created a bit of extra work for the master, but we tested it and ran it in production for a few weeks on several clusters of around 250 servers each, and it seemed to work well, at least with Marathon and Aurora. At this point the migration was expected to be smooth, because it only required restarting all tasks for network bandwidth to be accounted cluster-wide, and it ended up being as smooth as expected.

In the meantime, we patched Marathon and Aurora to add full support for network bandwidth and avoid the potential TASK_ERROR messages, keeping in mind that we will soon host other frameworks that will probably not support network bandwidth from the start. So we will likely keep our patch in the future, and we think it might be a good idea to introduce a hook in the Mesos master to add implicit resources to tasks. What we propose is to add a method called masterLaunchTaskResourceDecorator to the hook interface and call it at the right location, letting the user add whatever implicit resources they want. This would give the following signature:

```
Result<Resources> masterLaunchTaskResourceDecorator(
    const Resources& slaveResources,
    TaskInfo& task)
```

Could you please tell us whether such an integration point would be acceptable to merge upstream? You can have a look at our current implementation here: https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth (just as a reference; it is not the base of a patch for upstream Mesos).

Thank you,
Clément.