Hello Benjamin,

Sure thing! I will file a ticket and work on the patch.

Thank you,
Clément.

On Thu, Oct 11, 2018 at 9:39 PM Benjamin Mahler <bmah...@apache.org> wrote:

> Thanks for the thorough explanation.
>
> Yes, it sounds acceptable and useful for assigning disk i/o and network
> i/o. The error case of there not being enough resources post-injection
> seems unfortunate but I don't see a way around it.
>
> Can you file a ticket with this background?
>
> On Thu, Oct 11, 2018 at 1:30 AM Clément Michaud <
> clement.michau...@gmail.com>
> wrote:
>
> > Hello,
> >
> > TL;DR: we have added network bandwidth as a first-class resource in our
> > clusters with a custom isolator, and we have patched the Mesos master to
> > introduce the concept of implicitly allocating custom resources, making
> > network bandwidth mandatory for all tasks. I'd like to know what you
> > think about what we have implemented, and whether you think we could
> > introduce a new hook in the Mesos master for injecting mandatory custom
> > resources into tasks.
> >
> >
> > At Criteo, we have implemented a custom solution in our Mesos clusters
> > to prevent noisy neighbors on the network and to let our users define a
> > reserved amount of network bandwidth per application. Please note that
> > we run our clusters on a flat network and do not use any kind of network
> > overlay.
> >
> > In order to address these use cases, we enabled the `net_cls` isolator
> > and wrote an isolator using tc, conntrack, and iptables, with each
> > container getting a dedicated reserved amount of network bandwidth
> > declared through its configuration in Marathon or Aurora.
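> >
> > For intuition, here is a minimal sketch of the mechanism such an
> > isolator builds on (the cgroup path and handles below are hypothetical;
> > this is not our actual isolator code): `net_cls` tags every packet
> > leaving a container's cgroup with a class identifier, which tc filters
> > can then match in order to apply a per-container rate limit.
> >
> > ```
> > #include <cstdint>
> > #include <fstream>
> > #include <string>
> >
> > // Tag all traffic from a container's net_cls cgroup with a classid;
> > // a tc filter matching that classid then applies the rate limit.
> > void tagContainer(const std::string& cgroup, uint32_t classid)
> > {
> >   // classid format is 0xAAAABBBB, i.e. tc major:minor in hex.
> >   std::ofstream out(cgroup + "/net_cls.classid");
> >   out << classid;
> > }
> >
> > int main()
> > {
> >   // Hypothetical container cgroup; 0x00100001 maps to tc class 10:1.
> >   tagContainer("/sys/fs/cgroup/net_cls/mesos/<container-id>", 0x00100001);
> >   return 0;
> > }
> > ```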
> >
> > In the first implementation of our solution, the resources were not
> > declared on the agents, and obviously not taken into account by Mesos,
> > but the isolator allocated an amount of network bandwidth to each task
> > proportional to its number of reserved CPUs relative to the number of
> > CPUs available on the server. Basically, the per-container network
> > bandwidth limitation was applied, but Mesos was not aware of it. Using
> > CPU as a proxy for the amount of network bandwidth protected us from
> > situations where an agent could allocate more network bandwidth than was
> > available on it. However, this model reached its limits when we
> > introduced big consumers of network bandwidth into our clusters: they
> > had to request more CPUs just to get more network bandwidth, which
> > introduced scheduling issues.
> >
> > Hence, we decided to leverage Mesos custom resources to let our users
> > declare their requirements, and also to decouple network bandwidth from
> > CPU to avoid those scheduling issues. We first declared the network
> > bandwidth resource on every Mesos agent, even though tasks were not yet
> > declaring any. Then we faced a first issue: the lack of support for
> > network bandwidth and/or custom resources in Marathon and Aurora (and,
> > it seems, in most frameworks). This led to a second issue: we needed
> > Mesos to account for the network bandwidth of all tasks even while some
> > frameworks did not support it yet. Solving the second problem allowed us
> > to run a smooth migration, patching frameworks independently in a second
> > phase.
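> >
> > For reference, declaring such a custom scalar resource on an agent can
> > be done with the `--resources` flag (the values below are hypothetical,
> > and `network_bandwidth` is simply the name we chose, not a built-in
> > resource):
> >
> > ```
> > mesos-agent --resources="cpus:32;mem:262144;network_bandwidth:10000"
> > ```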
> >
> > Along the way, we found out that the “custom resources” system wasn't
> > meeting our needs, because it only allows for “optional resources” and
> > not “mandatory resources”: resources that should be accounted for on all
> > tasks in a cluster, even when not requested explicitly, like CPU, RAM,
> > or disk space. Network bandwidth and disk I/O are good candidates, for
> > instance.
> >
> > To enforce accounting of network bandwidth across all tasks, we wanted
> > to allocate an implicit amount of network bandwidth to tasks not
> > declaring any in their configuration. One possible implementation was to
> > make the Mesos master automatically compute the allocated network
> > bandwidth for the task when the offer is accepted and subtract this
> > amount from the overall available resources in Mesos. We consider this
> > implicit allocation a fallback mechanism for frameworks that do not
> > support "mandatory" resources. Indeed, in an ideal environment all
> > frameworks would support these mandatory resources. Unfortunately,
> > adding support for a new resource (or for custom resources) in all
> > frameworks might not be manageable in a timely manner, especially in an
> > ecosystem with multiple frameworks.
> >
> > Consequently, we wrote a patch for the Mesos master to allocate an
> > implicit amount of network bandwidth when it is not provided in the
> > TaskInfo. In our case this implicit amount is computed using the
> > following Criteo-specific rule: `task_used_cpu / slave_total_cpus *
> > slave_total_bandwidth`.
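> >
> > In code, the rule is just a proportional share; here is a minimal sketch
> > (the names are illustrative, not those of our actual patch):
> >
> > ```
> > // Implicit network bandwidth for a task declaring none: give it the
> > // same share of the agent's bandwidth as its share of the agent's CPUs.
> > double implicitNetworkBandwidth(
> >     double taskCpus,             // CPUs reserved by the task
> >     double slaveTotalCpus,       // total CPUs declared on the agent
> >     double slaveTotalBandwidth)  // total bandwidth declared on the agent
> > {
> >   return taskCpus / slaveTotalCpus * slaveTotalBandwidth;
> > }
> > ```
> >
> > For example, a task reserving 2 CPUs on a 32-CPU agent declaring 10 Gbps
> > of network bandwidth would be implicitly allocated 2 / 32 * 10 Gbps =
> > 625 Mbps (hypothetical numbers).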
> >
> > Here is what happened while our frameworks did not yet support network
> > bandwidth: offers were sent to frameworks, and they accepted or rejected
> > them regardless of the network bandwidth available on the slave. When an
> > offer was accepted, the TaskInfo sent by the framework obviously did not
> > contain any network bandwidth, but the Mesos master implicitly injected
> > some and let the task follow its way. There were then two cases: either
> > the slave had enough resources to run the task and it was scheduled as
> > expected, or it did not have enough resources, the task failed to be
> > deployed, and Mesos sent back a TASK_ERROR to the framework. It was then
> > the responsibility of the scheduler to retry with subsequent offers.
> > This solution created a bit of extra work for the master, but we tested
> > it and ran it in production for a few weeks in several clusters of
> > around 250 servers each, and it seemed to work well, at least with
> > Marathon and Aurora. At this point the migration was expected to be
> > smooth, because it only required a restart of all the tasks for network
> > bandwidth to be introduced cluster-wide. It ended up being as smooth as
> > expected.
> >
> > In the meantime, we obviously patched Marathon and Aurora to add full
> > support for network bandwidth and avoid the potential TASK_ERROR
> > messages, while keeping in mind that we will soon host other frameworks
> > that will probably not support network bandwidth from the beginning. So
> > we will likely keep our patch in the future, and we think it might be a
> > good idea to introduce a hook in the Mesos master for adding implicit
> > resources to tasks.
> >
> > What we propose is to introduce a method called
> > masterLaunchTaskResourceDecorator in the hook interface and to call it
> > at the right location, letting the user add whatever implicit resources
> > they want.
> >
> > This would give the following signature:
> >
> > ```
> > Result<Resources> masterLaunchTaskResourceDecorator(
> >     const Resources& slaveResources,
> >     TaskInfo& task)
> > ```
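> >
> > To make the proposal concrete, here is a sketch of what a hook module
> > could do with it, using network bandwidth as the injected resource. It
> > assumes the proposed (not yet existing) hook method above; the class
> > name and the injection rule are illustrative, and error handling is
> > simplified:
> >
> > ```
> > #include <mesos/hook.hpp>
> > #include <mesos/mesos.hpp>
> > #include <mesos/resources.hpp>
> >
> > #include <stout/option.hpp>
> > #include <stout/result.hpp>
> > #include <stout/stringify.hpp>
> > #include <stout/try.hpp>
> >
> > class NetworkBandwidthHook : public mesos::Hook
> > {
> > public:
> >   Result<mesos::Resources> masterLaunchTaskResourceDecorator(
> >       const mesos::Resources& slaveResources,
> >       mesos::TaskInfo& task)
> >   {
> >     mesos::Resources taskResources(task.resources());
> >
> >     // If the framework already declared network bandwidth, do nothing.
> >     for (const mesos::Resource& resource : taskResources) {
> >       if (resource.name() == "network_bandwidth") {
> >         return taskResources;
> >       }
> >     }
> >
> >     // Sum the network bandwidth declared on the agent.
> >     double slaveBandwidth = 0.0;
> >     for (const mesos::Resource& resource : slaveResources) {
> >       if (resource.name() == "network_bandwidth" &&
> >           resource.type() == mesos::Value::SCALAR) {
> >         slaveBandwidth += resource.scalar().value();
> >       }
> >     }
> >
> >     Option<double> taskCpus = taskResources.cpus();
> >     Option<double> slaveCpus = slaveResources.cpus();
> >
> >     if (taskCpus.isNone() || slaveCpus.isNone() || slaveBandwidth <= 0.0) {
> >       return taskResources; // Nothing we can inject.
> >     }
> >
> >     // Inject an implicit amount proportional to the task's CPU share.
> >     double implicit =
> >       taskCpus.get() / slaveCpus.get() * slaveBandwidth;
> >
> >     Try<mesos::Resource> bandwidth = mesos::Resources::parse(
> >         "network_bandwidth", stringify(implicit), "*");
> >
> >     if (bandwidth.isError()) {
> >       return Error(bandwidth.error());
> >     }
> >
> >     task.add_resources()->CopyFrom(bandwidth.get());
> >
> >     return mesos::Resources(task.resources());
> >   }
> > };
> > ```
> >
> > In this sketch, the error case of the slave not having enough resources
> > after injection is deliberately not handled; it would surface as the
> > TASK_ERROR behavior described above.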
> >
> > Can you please tell us whether such an integration point would be
> > acceptable for merging upstream?
> >
> >
> > You can have a look at our current implementation here:
> >
> > https://github.com/criteo-forks/mesos/compare/before-network-bandwidth...criteo-forks:network-bandwidth
> > (just as a reference; it is not the basis of a patch for upstream
> > Mesos).
> >
> > Thank you,
> >
> > Clément.
> >
>
