I've added JIRAs to:

1) Add master flag `--filter-gpu-resources={true|false}`:
https://issues.apache.org/jira/browse/MESOS-7576
2) Deprecate the GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`:
https://issues.apache.org/jira/browse/MESOS-7579
3) Remove the GPU_RESOURCES capability and master flag `--filter-gpu-resources={true|false}`:
https://issues.apache.org/jira/browse/MESOS-7577

Kevin

On Fri, May 26, 2017 at 1:49 PM Benjamin Mahler <bmah...@apache.org> wrote:

> I filed https://issues.apache.org/jira/browse/MESOS-7574 for reservations
> to multiple roles. We'll find one that captures the deprecation of the
> GPU_RESOURCES capability as well, with reservations to multiple roles as
> a blocker.
>
> On Fri, May 26, 2017 at 8:54 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Hi Benjamin,
> >
> > Thanks for getting back. Do you have an issue already filed for the
> > "reservations to multiple roles" story, or is it folded under another
> > JIRA story?
> >
> > On Fri, May 26, 2017 at 12:44 AM, Benjamin Mahler <bmah...@apache.org>
> > wrote:
> >
> > > Thanks for the feedback!
> > >
> > > There have been some discussions about allowing reservations to
> > > multiple roles (or, more generally, role expressions), which is
> > > essentially what you've suggested, Zhitao. (However, note that what
> > > is provided by the GPU capability filtering is not quite this; it's
> > > actually analogous to a reservation for multiple schedulers, not
> > > roles.) Reservations to multiple roles seem to be the right
> > > replacement for those who rely on the GPU filtering behavior.
> > >
> > > Since we don't have reservations to multiple roles at this point, we
> > > shouldn't deprecate the GPU_RESOURCES capability until this is in
> > > place.
> > >
> > > With hierarchical roles, it's possible (although potentially clumsy)
> > > to achieve roughly what is provided by the GPU filtering using
> > > sub-roles, since reservations made to a "gpu" role would be
> > > available to all of the descendant roles within the tree, e.g.
> > > "gpu/analytics", "gpu/forecasting/training", etc.
> > > This is equivalent to a restricted version of reservations to
> > > multiple roles, where the roles are restricted to the descendant
> > > roles. This can get clumsy because if "eng/backend/image-processing"
> > > wants to get in on the reserved GPUs, the user would have to place a
> > > related role underneath the "gpu" role, e.g.
> > > "gpu/eng/backend/image-processing".
> >
> > The exact reason you mentioned about the "clumsy" part would
> > effectively prevent me from implementing this in our org even if it
> > were already available.
> >
> > > For the addition of the filter, note that this flag would be a
> > > temporary measure that would be removed when the deprecation cycle
> > > of the capability is complete. It would be good to independently
> > > consider the generalized filtering idea you brought up.
> > >
> > > On Mon, May 22, 2017 at 9:15 AM, Zhitao Li <zhitaoli...@gmail.com>
> > > wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > Thanks for engaging with the community on this. My 2 cents:
> > > >
> > > > 1. I feel that this capability has a particularly useful semantic
> > > > which is lacking in the current reservation system: reserving some
> > > > scarce resource for a *dynamic list of multiple roles*.
> > > >
> > > > Right now, any reservation (static or dynamic) can only express
> > > > the semantic of "reserving this resource for the given role R".
> > > > However, in a complex cluster, it is possible that we have
> > > > [R1, R2, ..., RN] which want to share the scarce resource among
> > > > them, while there is another set of roles which should never see
> > > > the given resource.
> > > >
> > > > The new hierarchical role (and/or multi-role?) support might
> > > > provide a better solution, but until that's widely available and
> > > > adopted, the capabilities-based hack is the only thing I know of
> > > > that can solve the problem.
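The sub-role semantics discussed above, where a reservation to the "gpu" role is usable by any of its descendant roles, reduce to a simple path-prefix check. The sketch below is purely illustrative; `is_descendant` is a hypothetical helper, not part of Mesos.

```python
def is_descendant(role, ancestor):
    """A role can use resources reserved to `ancestor` if it is the
    ancestor itself or sits underneath it in the role hierarchy."""
    return role == ancestor or role.startswith(ancestor + "/")

# Roles placed under "gpu" can use resources reserved to "gpu":
print(is_descendant("gpu/analytics", "gpu"))                 # True
print(is_descendant("gpu/forecasting/training", "gpu"))      # True

# An unrelated subtree cannot, which is the "clumsy" part: the role
# would have to be duplicated as "gpu/eng/backend/image-processing".
print(is_descendant("eng/backend/image-processing", "gpu"))  # False
```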
> > > > In fact, if we are going to go with the `--filter-gpu-resources`
> > > > path, I think we should make the filter more powerful (i.e., able
> > > > to handle all known framework <-> resource/host constraints and
> > > > more types of scarce resources) instead of piecewise patches on a
> > > > specific use case.
> > > >
> > > > Happy to chat more on this topic.
> > > >
> > > > On Sat, May 20, 2017 at 6:45 PM, Kevin Klues <klue...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello GPU users,
> > > > >
> > > > > We are currently considering deprecating the requirement that
> > > > > frameworks register with the GPU_RESOURCES capability in order
> > > > > to receive offers that contain GPUs. Going forward, we will
> > > > > recommend that users rely on Mesos's built-in `reservation`
> > > > > mechanism to achieve similar results.
> > > > >
> > > > > Before deprecating it, we wanted to get a sense from the
> > > > > community of whether anyone is currently relying on this
> > > > > capability and would like to see it persist. If not, we will
> > > > > begin deprecating it in the next Mesos release and completely
> > > > > remove it in Mesos 2.0.
> > > > >
> > > > > As background, the original motivation for this capability was
> > > > > to keep "legacy" frameworks from inadvertently scheduling jobs
> > > > > that don't require GPUs on GPU-capable machines and thus
> > > > > starving out other frameworks that legitimately want to place
> > > > > GPU jobs on those machines. The assumption here was that most
> > > > > machines in a cluster won't have GPUs installed on them, so some
> > > > > mechanism was necessary to keep legacy frameworks from
> > > > > scheduling jobs on those machines. In essence, it provided an
> > > > > implicit reservation of GPU machines for "GPU aware" frameworks,
> > > > > bypassing the traditional `reservation` mechanism already built
> > > > > into Mesos.
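To make the capability-based filtering described above concrete, here is a toy model of the master's offer decision under the current semantics. This is a sketch only; the real logic lives in the master's C++ allocator, and `should_offer` is a hypothetical name.

```python
GPU_RESOURCES = "GPU_RESOURCES"  # the framework capability in question

def should_offer(agent_resources, framework_capabilities):
    """Toy model: offers from agents exposing GPUs are sent only to
    frameworks that opted into the GPU_RESOURCES capability."""
    if agent_resources.get("gpus", 0) > 0:
        return GPU_RESOURCES in framework_capabilities
    return True  # non-GPU agents are offered to every framework

legacy = set()               # framework without the capability
gpu_aware = {GPU_RESOURCES}  # framework with the capability

print(should_offer({"cpus": 8, "gpus": 4}, legacy))     # False
print(should_offer({"cpus": 8, "gpus": 4}, gpu_aware))  # True
print(should_offer({"cpus": 8}, legacy))                # True
```

Note that when every agent in the cluster has GPUs, the first case applies everywhere and the legacy framework receives no offers at all, which is exactly the starvation problem described later in this message.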
> > > > > In such a setup, legacy frameworks would be free to schedule
> > > > > jobs on non-GPU machines, and "GPU aware" frameworks would be
> > > > > free to schedule GPU jobs on GPU machines and other types of
> > > > > jobs on other machines (or mix and match them however they
> > > > > please).
> > > > >
> > > > > However, the problem comes when *all* machines in a cluster
> > > > > contain GPUs (or even if most of the machines in a cluster
> > > > > contain them). When this is the case, we have the opposite of
> > > > > the problem we were trying to solve by introducing the
> > > > > GPU_RESOURCES capability in the first place. We end up starving
> > > > > out jobs from legacy frameworks that *don't* require GPU
> > > > > resources, because there are not enough machines available that
> > > > > don't have GPUs on them to service those jobs. We've actually
> > > > > seen this problem manifest in the wild at least once.
> > > > >
> > > > > An alternative to completely deprecating the GPU_RESOURCES
> > > > > capability would be to add a new flag to the mesos master called
> > > > > `--filter-gpu-resources`. When set to `true`, this flag would
> > > > > cause the mesos master to continue to function as it does today.
> > > > > That is, it would filter offers containing GPU resources and
> > > > > only send them to frameworks that opt into the GPU_RESOURCES
> > > > > framework capability. When set to `false`, this flag would cause
> > > > > the master to *not* filter offers containing GPU resources, and
> > > > > indiscriminately send them to all frameworks whether they set
> > > > > the GPU_RESOURCES capability or not.
> > > > >
> > > > > We'd prefer to deprecate the capability completely, but would
> > > > > consider adding this flag if people are currently relying on the
> > > > > GPU_RESOURCES capability and would like to see it persist; this
> > > > > flag would allow them to keep relying on it without disruption.
> > > > >
> > > > > We welcome any feedback you have.
> > > > >
> > > > > Kevin + Ben
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > Zhitao Li
> >
> > --
> > Cheers,
> >
> > Zhitao Li
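As a rough illustration of the `reservation` mechanism recommended in the thread, a dynamic reservation of an agent's GPUs for a "gpu" role can be expressed as JSON and POSTed to the master's `/master/reserve` operator endpoint. This is a sketch: the agent ID and principal below are placeholders, and the exact resource schema and endpoint path should be verified against the documentation for the Mesos version in use.

```python
import json

# Sketch of a dynamic reservation request for the master's reserve
# operator endpoint. The agent ID and principal are placeholders.
reserve_request = {
    "slaveId": "<agent-id>",
    "resources": [
        {
            "name": "gpus",
            "type": "SCALAR",
            "scalar": {"value": 4.0},
            "role": "gpu",
            "reservation": {"principal": "<operator-principal>"},
        }
    ],
}

# The request would be sent roughly as:
#   curl -u <principal>:<secret> -d slaveId=<agent-id> \
#        -d resources='[...]' -X POST http://<master>:5050/master/reserve
print(json.dumps(reserve_request["resources"], indent=2))
```

With the resources reserved to the "gpu" role this way, only frameworks registered in that role (or, with hierarchical roles, its descendants) receive them in offers, without any framework capability filtering involved.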