Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-17 Thread Daniel Vetter
On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> On Fri 06-09-19 08:45:39, Tejun Heo wrote:
> > Hello, Daniel.
> > 
> > On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> > > > Hmm... what'd be the fundamental difference from slab or socket memory
> > > > which are handled through memcg?  Does system memory used by GPUs have
> > > > further global restrictions in addition to the amount of physical
> > > > memory used?
> > > 
> > > Sometimes, but that would be specific resources (kinda like vram),
> > > e.g. CMA regions used by a gpu. But probably not something you'll run
> > > in a datacenter and want cgroups for ...
> > > 
> > > I guess we could try to integrate with the memcg group controller. One
> > > trouble is that aside from i915 most gpu drivers do not really have a
> > > full shrinker, so not sure how that would all integrate.
> > 
> > So, while it'd be great to have shrinkers in the longer term, it's not a
> > strict requirement to be accounted in memcg.  It already accounts a
> > lot of memory which isn't reclaimable (a lot of slabs and socket
> > buffers).
> 
> Yeah, having a shrinker is preferred, but the memory had better be
> reclaimable in some form. If not by any other means then at least bound
> to a user process context so that it goes away with a task being killed
> by the OOM killer. If that is not the case then we cannot really charge
> it because then the memcg controller is of no use. We can tolerate it to
> some degree if the amount of memory charged like that is negligible to
> the overall size. But from the discussion it seems that these buffers
> are really large.

I think we can just make "must have a shrinker" a requirement for gpu
buffers to be accounted in the system memory cgroup. There's mild
locking-inversion fun to be had when typing one, but I think the problem
is well-understood enough that this isn't a huge hurdle to climb over.
And it should give admins an easier-to-manage system, since it works more
like what they already know.
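
For illustration, a minimal sketch of what such a shrinker hook looks
like; drv_count_reclaimable_pages() and drv_release_pages() are
hypothetical driver internals, not from any existing driver:

#include <linux/shrinker.h>

/* Report how many pages could be freed right now (idle, unpinned bos). */
static unsigned long drv_shrink_count(struct shrinker *s,
                                      struct shrink_control *sc)
{
        return drv_count_reclaimable_pages();
}

/* Drop backing store for idle bos; this runs in reclaim context, which
 * is exactly where the locking inversion fun lives. */
static unsigned long drv_shrink_scan(struct shrinker *s,
                                     struct shrink_control *sc)
{
        return drv_release_pages(sc->nr_to_scan);
}

static struct shrinker drv_shrinker = {
        .count_objects = drv_shrink_count,
        .scan_objects  = drv_shrink_scan,
        .seeks         = DEFAULT_SEEKS,
};

/* at driver init: register_shrinker(&drv_shrinker); */
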
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-10 Thread Michal Hocko
On Tue 10-09-19 09:03:29, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> > > So, while it'd be great to have shrinkers in the longer term, it's not a
> > > strict requirement to be accounted in memcg.  It already accounts a
> > > lot of memory which isn't reclaimable (a lot of slabs and socket
> > > buffers).
> > 
> > Yeah, having a shrinker is preferred, but the memory had better be
> > reclaimable in some form. If not by any other means then at least bound
> > to a user process context so that it goes away with a task being killed
> > by the OOM killer. If that is not the case then we cannot really charge
> > it because then the memcg controller is of no use. We can tolerate it to
> > some degree if the amount of memory charged like that is negligible to
> > the overall size. But from the discussion it seems that these buffers
> > are really large.
> 
> Yeah, oom kills should be able to reduce the usage; however, please
> note that tmpfs, among other things, can already escape this
> restriction and we can have cgroups which are over max and empty.
> It's obviously not ideal but the system doesn't melt down from it
> either.

Right, and that is a reason why access to tmpfs should be restricted
when containing a workload with memcg. My understanding of this particular
feature is that memcg should be the primary containment method and
that's why I brought this up.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-10 Thread Tejun Heo
Hello, Michal.

On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> > So, while it'd be great to have shrinkers in the longer term, it's not a
> > strict requirement to be accounted in memcg.  It already accounts a
> > lot of memory which isn't reclaimable (a lot of slabs and socket
> > buffers).
> 
> Yeah, having a shrinker is preferred, but the memory had better be
> reclaimable in some form. If not by any other means then at least bound
> to a user process context so that it goes away with a task being killed
> by the OOM killer. If that is not the case then we cannot really charge
> it because then the memcg controller is of no use. We can tolerate it to
> some degree if the amount of memory charged like that is negligible to
> the overall size. But from the discussion it seems that these buffers
> are really large.

Yeah, oom kills should be able to reduce the usage; however, please
note that tmpfs, among other things, can already escape this
restriction and we can have cgroups which are over max and empty.
It's obviously not ideal but the system doesn't melt down from it
either.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-10 Thread Michal Hocko
On Fri 06-09-19 08:45:39, Tejun Heo wrote:
> Hello, Daniel.
> 
> On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> > > Hmm... what'd be the fundamental difference from slab or socket memory
> > > which are handled through memcg?  Does system memory used by GPUs have
> > > further global restrictions in addition to the amount of physical
> > > memory used?
> > 
> > Sometimes, but that would be specific resources (kinda like vram),
> > e.g. CMA regions used by a gpu. But probably not something you'll run
> > in a datacenter and want cgroups for ...
> > 
> > I guess we could try to integrate with the memcg group controller. One
> > trouble is that aside from i915 most gpu drivers do not really have a
> > full shrinker, so not sure how that would all integrate.
> 
> So, while it'd be great to have shrinkers in the longer term, it's not a
> strict requirement to be accounted in memcg.  It already accounts a
> lot of memory which isn't reclaimable (a lot of slabs and socket
> buffers).

Yeah, having a shrinker is preferred, but the memory had better be
reclaimable in some form. If not by any other means then at least bound
to a user process context so that it goes away with a task being killed
by the OOM killer. If that is not the case then we cannot really charge
it because then the memcg controller is of no use. We can tolerate it to
some degree if the amount of memory charged like that is negligible to
the overall size. But from the discussion it seems that these buffers
are really large.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-06 Thread Tejun Heo
Hello, Daniel.

On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> > Hmm... what'd be the fundamental difference from slab or socket memory
> > which are handled through memcg?  Does system memory used by GPUs have
> > further global restrictions in addition to the amount of physical
> > memory used?
> 
> Sometimes, but that would be specific resources (kinda like vram),
> e.g. CMA regions used by a gpu. But probably not something you'll run
> in a datacenter and want cgroups for ...
> 
> I guess we could try to integrate with the memcg group controller. One
> trouble is that aside from i915 most gpu drivers do not really have a
> full shrinker, so not sure how that would all integrate.

So, while it'd be great to have shrinkers in the longer term, it's not a
strict requirement to be accounted in memcg.  It already accounts a
lot of memory which isn't reclaimable (a lot of slabs and socket
buffers).

> The overall gpu memory controller would still be outside of memcg I
> think, since that would include swapped-out gpu objects, and stuff in
> special memory regions like vram.

Yeah, for resources which are on the GPU itself or hard limitations
arising from it.  In general, we wanna make cgroup controllers control
something real and concrete, as in physical resources.

> > At the system level, it just gets folded into cpu time, which isn't
> > perfect but is usually a good enough approximation of compute related
> dynamic resources.  Can gpu do something similar or at least start with
> > that?
> 
> So generally there's a pile of engines, often of different type (e.g.
> amd hw has an entire pile of copy engines), with some ill-defined
> sharing characteristics for some (often compute/render engines use the
> same shader cores underneath), kinda like hyperthreading. So at that
> detail it's all extremely hw specific, and probably too hard to
> control in a useful way for users. And I'm not sure we can really do a
> reasonable knob for overall gpu usage, e.g. if we include all the copy
> engines, but the workloads are only running on compute engines, then
> you might only get 10% overall utilization by engine-time, while the
> shaders (which are most of the chip area/power consumption) are
> actually at 100%. On top, with many userspace apis those engines are
> an internal implementation detail of a more abstract gpu device (e.g.
> opengl), but with others, this is all fully exposed (like vulkan).
> 
> Plus the kernel needs to use at least copy engines for vram management
> itself, and you really can't take that away. Although Kenny here has
> some proposal for a separate cgroup resource just for that.
> 
> I just think it's all a bit too ill-defined, and we might be better
> off nailing the memory side first and get some real world experience
> on this stuff. For context, there's not even a cross-driver standard
> for how priorities are handled, that's all driver-specific interfaces.

I see.  Yeah, figuring it out as this develops makes sense to me.  One
thing I wanna raise is that in general we don't want to expose device
or implementation details in the cgroup interface.  What we want
expressed there are the intentions of the user.  The more internal
details we expose, the more we end up getting tied down to the specific
implementation which we should avoid especially given the early stage
of development.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-06 Thread Daniel Vetter
On Fri, Sep 6, 2019 at 5:23 PM Tejun Heo  wrote:
>
> Hello, Daniel.
>
> On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
> > I think system memory separate from vram makes sense. For one, vram is
> > like 10x+ faster than system memory, so we definitely want to have
> > good control on that. But maybe we only want one vram bucket overall
> > for the entire system?
> >
> > The trouble with system memory is that gpu tasks pin that memory to
> > prep execution. There are two solutions:
> > - i915 has a shrinker. Lots (and I really mean lots) of pain with
> > direct reclaim recursion, which often means we can't free memory, and
> > we're angering the oom killer a lot. Plus it introduces real bad
> > latency spikes everywhere (gpu workloads are occasionally really slow,
> > think "worse than pageout to spinning rust" to get memory freed).
> > - ttm just has a global limit, set to 50% of system memory.
> >
> > I do think a global system memory limit to tame the shrinker, without
> > the ttm approach of possibly just wasting half your memory, could be
> > useful.
>
> Hmm... what'd be the fundamental difference from slab or socket memory
> which are handled through memcg?  Does system memory used by GPUs have
> further global restrictions in addition to the amount of physical
> memory used?

Sometimes, but that would be specific resources (kinda like vram),
e.g. CMA regions used by a gpu. But probably not something you'll run
in a datacenter and want cgroups for ...

I guess we could try to integrate with the memcg group controller. One
trouble is that aside from i915 most gpu drivers do not really have a
full shrinker, so not sure how that would all integrate.

The overall gpu memory controller would still be outside of memcg I
think, since that would include swapped-out gpu objects, and stuff in
special memory regions like vram.

> > I'm also not sure of the bw limits, given all the fun we have on the
> > block io cgroups side. Aside from that, the current bw limit only
> > controls the bw the kernel uses, userspace can submit unlimited
> > amounts of copying commands that use the same pcie links directly to
> > the gpu, bypassing this cg knob. Also, controlling execution time for
> > gpus is very tricky, since they work a lot more like a block io device
> > or maybe a network controller with packet scheduling, than a cpu.
>
> At the system level, it just gets folded into cpu time, which isn't
> perfect but is usually a good enough approximation of compute related
> dynamic resources.  Can gpu do something similar or at least start with
> that?

So generally there's a pile of engines, often of different type (e.g.
amd hw has an entire pile of copy engines), with some ill-defined
sharing characteristics for some (often compute/render engines use the
same shader cores underneath), kinda like hyperthreading. So at that
detail it's all extremely hw specific, and probably too hard to
control in a useful way for users. And I'm not sure we can really do a
reasonable knob for overall gpu usage, e.g. if we include all the copy
engines, but the workloads are only running on compute engines, then
you might only get 10% overall utilization by engine-time, while the
shaders (which are most of the chip area/power consumption) are
actually at 100%. On top, with many userspace apis those engines are
an internal implementation detail of a more abstract gpu device (e.g.
opengl), but with others, this is all fully exposed (like vulkan).

Plus the kernel needs to use at least copy engines for vram management
itself, and you really can't take that away. Although Kenny here has
some proposal for a separate cgroup resource just for that.

I just think it's all a bit too ill-defined, and we might be better
off nailing the memory side first and get some real world experience
on this stuff. For context, there's not even a cross-driver standard
for how priorities are handled, that's all driver-specific interfaces.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-06 Thread Tejun Heo
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
> I think system memory separate from vram makes sense. For one, vram is
> like 10x+ faster than system memory, so we definitely want to have
> good control on that. But maybe we only want one vram bucket overall
> for the entire system?
> 
> The trouble with system memory is that gpu tasks pin that memory to
> prep execution. There are two solutions:
> - i915 has a shrinker. Lots (and I really mean lots) of pain with
> direct reclaim recursion, which often means we can't free memory, and
> we're angering the oom killer a lot. Plus it introduces real bad
> latency spikes everywhere (gpu workloads are occasionally really slow,
> think "worse than pageout to spinning rust" to get memory freed).
> - ttm just has a global limit, set to 50% of system memory.
> 
> I do think a global system memory limit to tame the shrinker, without
> the ttm approach of possibly just wasting half your memory, could be
> useful.

Hmm... what'd be the fundamental difference from slab or socket memory
which are handled through memcg?  Does system memory used by GPUs have
further global restrictions in addition to the amount of physical
memory used?

> I'm also not sure of the bw limits, given all the fun we have on the
> block io cgroups side. Aside from that, the current bw limit only
> controls the bw the kernel uses, userspace can submit unlimited
> amounts of copying commands that use the same pcie links directly to
> the gpu, bypassing this cg knob. Also, controlling execution time for
> gpus is very tricky, since they work a lot more like a block io device
> or maybe a network controller with packet scheduling, than a cpu.

At the system level, it just gets folded into cpu time, which isn't
perfect but is usually a good enough approximation of compute related
dynamic resources.  Can gpu do something similar or at least start with
that?

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Daniel Vetter
On Tue, Sep 3, 2019 at 8:50 PM Tejun Heo  wrote:
>
> Hello, Daniel.
>
> On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > > * While breaking up and applying control to different types of
> > >   internal objects may seem attractive to folks who work day in and
> > >   day out with the subsystem, they aren't all that useful to users and
> > >   the siloed controls are likely to make the whole mechanism a lot
> > >   less useful.  We had the same problem with cgroup1 memcg - putting
> > >   control of different uses of memory under separate knobs.  It made
> > >   the whole thing pretty useless.  e.g. if you constrain all knobs
> > >   tight enough to control the overall usage, overall utilization
> > >   suffers, but if you don't, you really don't have control over actual
> > >   usage.  For memcg, what has to be allocated and controlled is
> > >   physical memory, no matter how they're used.  It's not like you can
> > >   go buy more "socket" memory.  At least from the looks of it, I'm
> > >   afraid gpu controller is repeating the same mistakes.
> >
> > We do have quite a pile of different memories and ranges, so I don't
> > think we're making the same mistake here. But it is maybe a bit too
>
> I see.  One thing which caught my eye was the system memory control.
> Shouldn't that be controlled by memcg?  Is there something special
> about system memory used by gpus?

I think system memory separate from vram makes sense. For one, vram is
like 10x+ faster than system memory, so we definitely want to have
good control on that. But maybe we only want one vram bucket overall
for the entire system?

The trouble with system memory is that gpu tasks pin that memory to
prep execution. There are two solutions:
- i915 has a shrinker. Lots (and I really mean lots) of pain with
direct reclaim recursion, which often means we can't free memory, and
we're angering the oom killer a lot. Plus it introduces real bad
latency spikes everywhere (gpu workloads are occasionally really slow,
think "worse than pageout to spinning rust" to get memory freed).
- ttm just has a global limit, set to 50% of system memory.

I do think a global system memory limit to tame the shrinker, without
the ttm approach of possibly just wasting half your memory, could be
useful.
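
To sketch the idea (all names hypothetical, this is not from any
existing driver): a single global budget that buffer pinning charges
against, with the shrinker run when the charge fails.

#include <linux/atomic.h>
#include <linux/errno.h>

static atomic_long_t gpu_sysmem_pages;
static unsigned long gpu_sysmem_limit;  /* admin-tunable, in pages */

static int gpu_sysmem_try_charge(unsigned long npages)
{
        if (atomic_long_add_return(npages, &gpu_sysmem_pages) >
            gpu_sysmem_limit) {
                atomic_long_sub(npages, &gpu_sysmem_pages);
                return -ENOMEM; /* caller runs the shrinker and retries */
        }
        return 0;
}

static void gpu_sysmem_uncharge(unsigned long npages)
{
        atomic_long_sub(npages, &gpu_sysmem_pages);
}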

> > complicated, and exposes stuff that most users really don't care about.
>
> Could be from me not knowing much about gpus but definitely looks too
> complex to me.  I don't see how users would be able to allocate vram,
> system memory and GART with reasonable accuracy.  memcg on cgroup2
> deals with just a single number and that's already plenty challenging.

Yeah, especially wrt GART and some of the other more specialized
things I don't think there's any modern gpu where you can actually run
out of that stuff. At least not before you run out of every other kind
of memory (GART is just a remapping table to make system memory
visible to the gpu).
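
To make that concrete, a GART is conceptually little more than the toy
table below, plus valid bits and flushing (illustrative only, not real
hardware programming):

#include <linux/types.h>

struct toy_gart {
        /* entry[i] = dma address of the system page backing gpu-visible
         * page i; the hardware walks this table on access */
        u64 entry[4096];
};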

I'm also not sure of the bw limits, given all the fun we have on the
block io cgroups side. Aside from that, the current bw limit only
controls the bw the kernel uses, userspace can submit unlimited
amounts of copying commands that use the same pcie links directly to
the gpu, bypassing this cg knob. Also, controlling execution time for
gpus is very tricky, since they work a lot more like a block io device
or maybe a network controller with packet scheduling, than a cpu.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Kenny Ho
On Tue, Sep 3, 2019 at 5:20 AM Daniel Vetter  wrote:
>
> On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian
>  wrote:
> >
> > Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> > > On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
> > >> With this RFC v4, I am hoping to have some consensus on a merge plan.  I 
> > >> believe
> > >> the GEM related resources (drm.buffer.*) introduced in previous RFC and,
> > >> hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC 
> > >> are
> > >> uncontroversial and ready to move out of RFC and into a more formal 
> > >> review.  I
> > >> will continue to work on the memory backend resources (drm.memory.*).
> > >>
> > >> The cover letter from v1 is copied below for reference.
> > >>
> > >> [v1]: 
> > >> https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> > >> [v2]: https://www.spinics.net/lists/cgroups/msg22074.html
> > >> [v3]: 
> > >> https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html
> > > So looking at all this doesn't seem to have changed much, and the old
> > > discussion didn't really conclude anywhere (aside from some details).
> > >
> > > One more open issue that crossed my mind, having read a ton of ttm again
> > > recently: How does this all interact with ttm global limits? I'd say the
> > > ttm global limits are the ur-cgroup we have in drm, and not looking at
> > > that seems kinda bad.
> >
> > At least my hope was to completely replace ttm globals with those
> > limitations here when it is ready.
>
> You need more, at least some kind of shrinker to cut down bos placed in
> system memory when we're under memory pressure. Which drags in a
> pretty epic amount of locking lols (see i915's shrinker fun, where we
> attempt that). Probably another good idea to share at least some
> concepts, maybe even code.

I am still looking into your shrinker suggestion, so the memory.*
resources are untouched from RFC v3.  The main change for the buffer.*
resources is the removal of the buffer sharing restriction as you
suggested and additional documentation of that behaviour.  (I may have
neglected to mention it in the cover.)  The other key part of RFC v4
is the "logical GPU/lgpu" concept.  I am hoping to get it out there
early for feedback while I continue to work on the memory.* parts.

Kenny

> -Daniel
>
> >
> > Christian.
> >
> > > -Daniel
> > >
> > >> v4:
> > >> Unchanged (no review needed)
> > >> * drm.memory.*/ttm resources (Patch 9-13, I am still working on memory 
> > >> bandwidth
> > >> and shrinker)
> > >> Based on feedback on v3:
> > >> * update nomenclature to drmcg
> > >> * embed per device drmcg properties into drm_device
> > >> * split GEM buffer related commits into stats and limit
> > >> * rename function name to align with convention
> > >> * combined buffer accounting and check into a try_charge function
> > >> * support buffer stats without limit enforcement
> > >> * removed GEM buffer sharing limitation
> > >> * updated documentations
> > >> New features:
> > >> * introducing logical GPU concept
> > >> * example implementation with AMD KFD
> > >>
> > >> v3:
> > >> Based on feedback on v2:
> > >> * removed .help type file from v2
> > >> * conform to cgroup convention for default and max handling
> > >> * conform to cgroup convention for addressing device specific limits 
> > >> (with major:minor)
> > >> New function:
> > >> * adopted memparse for memory size related attributes
> > >> * added macro to marshall drmcgrp cftype private  (DRMCG_CTF_PRIV, etc.)
> > >> * added ttm buffer usage stats (per cgroup, for system, tt, vram.)
> > >> * added ttm buffer usage limit (per cgroup, for vram.)
> > >> * added per cgroup bandwidth stats and limiting (burst and average 
> > >> bandwidth)
> > >>
> > >> v2:
> > >> * Removed the vendoring concepts
> > >> * Add limit to total buffer allocation
> > >> * Add limit to the maximum size of a buffer allocation
> > >>
> > >> v1: cover letter

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Kenny Ho
Hi Tejun,

Thanks for looking into this.  I can definitely help where I can and I
am sure other experts will jump in if I start misrepresenting
reality :) (as Daniel has already done.)

Regarding your points, my understanding is that there isn't really a
TTM vs GEM situation anymore (there is an lwn.net article about that,
but it is more than a decade old.)  I believe GEM is the common
interface at this point and more and more features are being
refactored into it.  For example, AMD's driver uses TTM internally but
things are exposed via the GEM interface.

This GEM resource is actually the single number resource you just
referred to.  A GEM buffer (the drm.buffer.* resources) can be backed
by VRAM, system memory or other types of memory.  The finer-grained
control is the drm.memory.* resources, which still need more
discussion.  (Some of the functionalities in TTM are being
refactored into the GEM level.  I have seen some patches that make TTM
a subclass of GEM.)
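
To illustrate how the high level control would be consumed, here is a
sketch of a container manager setting a per-cgroup buffer limit.  The
file name drm.buffer.total.max and the "major:minor size" value syntax
follow the pattern in the changelog (major:minor addressing, memparse
sizes) and are illustrative rather than the exact interface:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Cap total GEM buffer allocation on DRM device 226:0 at 256 MiB for
 * one cgroup.  File name and value syntax are illustrative. */
static int set_gpu_buffer_limit(const char *cgroup_dir)
{
        char path[256];
        const char *val = "226:0 256m";
        int fd;
        ssize_t ret;

        snprintf(path, sizeof(path), "%s/drm.buffer.total.max", cgroup_dir);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        ret = write(fd, val, strlen(val));
        close(fd);
        return ret < 0 ? -1 : 0;
}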

This RFC can be grouped into 3 fairly independent areas, so they can
be reviewed separately: high level device memory control (buffer.*),
fine-grained memory control and bandwidth (memory.*), and compute
resources (lgpu.*).  I think the memory.* resources are the most
controversial part, but I think they are still needed.

Perhaps an analogy may help.  For a system, we have CPUs and memory.
And within memory, it can be backed by RAM or swap.  For GPU, each
device can have LGPUs and buffers.  And within the buffers, it can be
backed by VRAM, or system RAM or even swap.

As for setting the right amount, I think that's where the profiling
aspect of the *.stats comes in.  And while one can't necessarily buy
more VRAM, it is still a useful knob to adjust if the intention is to
pack more work into a GPU device with predictable performance.  This
research on various GPU workload may be of interest:

A Taxonomy of GPGPU Performance Scaling
http://www.computermachines.org/joe/posters/iiswc2015_taxonomy.pdf
http://www.computermachines.org/joe/publications/pdfs/iiswc2015_taxonomy.pdf

(summary: GPU workload can be memory bound or compute bound.  So it's
possible to pack different workload together to improve utilization.)

Regards,
Kenny

On Tue, Sep 3, 2019 at 2:50 PM Tejun Heo  wrote:
>
> Hello, Daniel.
>
> On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > > * While breaking up and applying control to different types of
> > >   internal objects may seem attractive to folks who work day in and
> > >   day out with the subsystem, they aren't all that useful to users and
> > >   the siloed controls are likely to make the whole mechanism a lot
> > >   less useful.  We had the same problem with cgroup1 memcg - putting
> > >   control of different uses of memory under separate knobs.  It made
> > >   the whole thing pretty useless.  e.g. if you constrain all knobs
> > >   tight enough to control the overall usage, overall utilization
> > >   suffers, but if you don't, you really don't have control over actual
> > >   usage.  For memcg, what has to be allocated and controlled is
> > >   physical memory, no matter how they're used.  It's not like you can
> > >   go buy more "socket" memory.  At least from the looks of it, I'm
> > >   afraid gpu controller is repeating the same mistakes.
> >
> > We do have quite a pile of different memories and ranges, so I don't
> > think we're making the same mistake here. But it is maybe a bit too
>
> I see.  One thing which caught my eye was the system memory control.
> Shouldn't that be controlled by memcg?  Is there something special
> about system memory used by gpus?
>
> > complicated, and exposes stuff that most users really don't care about.
>
> Could be from me not knowing much about gpus but definitely looks too
> complex to me.  I don't see how users would be able to allocate vram,
> system memory and GART with reasonable accuracy.  memcg on cgroup2
> deals with just a single number and that's already plenty challenging.
>
> Thanks.
>
> --
> tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Tejun Heo
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > * While breaking up and applying control to different types of
> >   internal objects may seem attractive to folks who work day in and
> >   day out with the subsystem, they aren't all that useful to users and
> >   the siloed controls are likely to make the whole mechanism a lot
> >   less useful.  We had the same problem with cgroup1 memcg - putting
> >   control of different uses of memory under separate knobs.  It made
> >   the whole thing pretty useless.  e.g. if you constrain all knobs
> >   tight enough to control the overall usage, overall utilization
> >   suffers, but if you don't, you really don't have control over actual
> >   usage.  For memcg, what has to be allocated and controlled is
> >   physical memory, no matter how they're used.  It's not like you can
> >   go buy more "socket" memory.  At least from the looks of it, I'm
> >   afraid gpu controller is repeating the same mistakes.
> 
> We do have quite a pile of different memories and ranges, so I don't
> think we're making the same mistake here. But it is maybe a bit too

I see.  One thing which caught my eye was the system memory control.
Shouldn't that be controlled by memcg?  Is there something special
about system memory used by gpus?

> complicated, and exposes stuff that most users really don't care about.

Could be from me not knowing much about gpus but definitely looks too
complex to me.  I don't see how users would be able to allocate vram,
system memory and GART with reasonable accuracy.  memcg on cgroup2
deals with just a single number and that's already plenty challenging.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Daniel Vetter
On Tue, Sep 3, 2019 at 10:24 AM Koenig, Christian
 wrote:
>
> Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> > On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
> >> This is a follow up to the RFC I made previously to introduce a cgroup
> >> controller for the GPU/DRM subsystem [v1,v2,v3].  The goal is to be able to
> >> provide resource management for GPU resources using things like containers.
> >>
> >> With this RFC v4, I am hoping to have some consensus on a merge plan.  I 
> >> believe
> >> the GEM related resources (drm.buffer.*) introduced in previous RFC and,
> >> hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are
> >> uncontroversial and ready to move out of RFC and into a more formal 
> >> review.  I
> >> will continue to work on the memory backend resources (drm.memory.*).
> >>
> >> The cover letter from v1 is copied below for reference.
> >>
> >> [v1]: 
> >> https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> >> [v2]: https://www.spinics.net/lists/cgroups/msg22074.html
> >> [v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html
> > So looking at all this doesn't seem to have changed much, and the old
> > discussion didn't really conclude anywhere (aside from some details).
> >
> > One more open issue that crossed my mind, having read a ton of ttm again
> > recently: How does this all interact with ttm global limits? I'd say the
> > ttm global limits are the ur-cgroup we have in drm, and not looking at
> > that seems kinda bad.
>
> At least my hope was to completely replace ttm globals with those
> limitations here when it is ready.

You need more, at least some kind of shrinker to cut down bos placed in
system memory when we're under memory pressure. Which drags in a
pretty epic amount of locking lols (see i915's shrinker fun, where we
attempt that). Probably another good idea to share at least some
concepts, maybe even code.
-Daniel

>
> Christian.
>
> > -Daniel
> >
> >> v4:
> >> Unchanged (no review needed)
> >> * drm.memory.*/ttm resources (Patch 9-13, I am still working on memory 
> >> bandwidth
> >> and shrinker)
> >> Based on feedback on v3:
> >> * update nomenclature to drmcg
> >> * embed per device drmcg properties into drm_device
> >> * split GEM buffer related commits into stats and limit
> >> * rename function name to align with convention
> >> * combined buffer accounting and check into a try_charge function
> >> * support buffer stats without limit enforcement
> >> * removed GEM buffer sharing limitation
> >> * updated documentations
> >> New features:
> >> * introducing logical GPU concept
> >> * example implementation with AMD KFD
> >>
> >> v3:
> >> Based on feedback on v2:
> >> * removed .help type file from v2
> >> * conform to cgroup convention for default and max handling
> >> * conform to cgroup convention for addressing device specific limits (with 
> >> major:minor)
> >> New function:
> >> * adopted memparse for memory size related attributes
> >> * added macro to marshall drmcgrp cftype private  (DRMCG_CTF_PRIV, etc.)
> >> * added ttm buffer usage stats (per cgroup, for system, tt, vram.)
> >> * added ttm buffer usage limit (per cgroup, for vram.)
> >> * added per cgroup bandwidth stats and limiting (burst and average 
> >> bandwidth)
> >>
> >> v2:
> >> * Removed the vendoring concepts
> >> * Add limit to total buffer allocation
> >> * Add limit to the maximum size of a buffer allocation
> >>
> >> v1: cover letter
> >>
> >> The purpose of this patch series is to start a discussion for a generic 
> >> cgroup
> >> controller for the drm subsystem.  The design proposed here is a very 
> >> early one.
> >> We are hoping to engage the community as we develop the idea.
> >>
> >>
> >> Backgrounds
> >> ==
> >> Control Groups/cgroup provide a mechanism for aggregating/partitioning 
> >> sets of
> >> tasks, and all their future children, into hierarchical groups with 
> >> specialized
> >> behaviour, such as accounting/limiting the resources which processes in a 
> >> cgroup
> >> can access[1].  Weights, limits, protections, allocations are the main 
> >> resource
> >> distribution models.  Existing cgroup controllers include cpu, memory, io,
> >> rdma, and more.  cgroup is one of the foundational technologies that 
> >> enables the
> >> popular container application deployment and management method.
> >>
> >> Direct Rendering Manager/drm contains code intended to support the needs of
> >> complex graphics devices. Graphics drivers in the kernel may make use of 
> >> DRM
> >> functions to make tasks like memory management, interrupt handling and DMA
> >> easier, and provide a uniform interface to applications.  The DRM has also
> >> developed beyond traditional graphics applications to support compute/GPGPU
> >> applications.
> >>
> >>
> >> Motivations
> >> =
> >> As GPUs grow beyond the realm of desktop/workstation graphics into areas 
> >> like
> >> data center clusters and IoT, there are 

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Koenig, Christian
Am 03.09.19 um 10:02 schrieb Daniel Vetter:
> On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
>> This is a follow up to the RFC I made previously to introduce a cgroup
>> controller for the GPU/DRM subsystem [v1,v2,v3].  The goal is to be able to
>> provide resource management for GPU resources using things like containers.
>>
>> With this RFC v4, I am hoping to have some consensus on a merge plan.  I 
>> believe
>> the GEM related resources (drm.buffer.*) introduced in previous RFC and,
>> hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are
>> uncontroversial and ready to move out of RFC and into a more formal review.  
>> I
>> will continue to work on the memory backend resources (drm.memory.*).
>>
>> The cover letter from v1 is copied below for reference.
>>
>> [v1]: 
>> https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
>> [v2]: https://www.spinics.net/lists/cgroups/msg22074.html
>> [v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html
> So looking at all this doesn't seem to have changed much, and the old
> discussion didn't really conclude anywhere (aside from some details).
>
> One more open issue that crossed my mind, having read a ton of ttm again
> recently: How does this all interact with ttm global limits? I'd say the
> ttm global limits are the ur-cgroup we have in drm, and not looking at
> that seems kinda bad.

At least my hope was to completely replace ttm globals with those 
limitations here when it is ready.

Christian.

> -Daniel
>
>> v4:
>> Unchanged (no review needed)
>> * drm.memory.*/ttm resources (Patch 9-13, I am still working on memory 
>> bandwidth
>> and shrinker)
>> Based on feedback on v3:
>> * update nomenclature to drmcg
>> * embed per device drmcg properties into drm_device
>> * split GEM buffer related commits into stats and limit
>> * rename function name to align with convention
>> * combined buffer accounting and check into a try_charge function
>> * support buffer stats without limit enforcement
>> * removed GEM buffer sharing limitation
>> * updated documentations
>> New features:
>> * introducing logical GPU concept
>> * example implementation with AMD KFD
>>
>> v3:
>> Based on feedback on v2:
>> * removed .help type file from v2
>> * conform to cgroup convention for default and max handling
>> * conform to cgroup convention for addressing device specific limits (with 
>> major:minor)
>> New function:
>> * adopted memparse for memory size related attributes
>> * added macro to marshall drmcgrp cftype private  (DRMCG_CTF_PRIV, etc.)
>> * added ttm buffer usage stats (per cgroup, for system, tt, vram.)
>> * added ttm buffer usage limit (per cgroup, for vram.)
>> * added per cgroup bandwidth stats and limiting (burst and average bandwidth)
>>
>> v2:
>> * Removed the vendoring concepts
>> * Add limit to total buffer allocation
>> * Add limit to the maximum size of a buffer allocation
>>
>> v1: cover letter
>>
>> The purpose of this patch series is to start a discussion for a generic 
>> cgroup
>> controller for the drm subsystem.  The design proposed here is a very early 
>> one.
>> We are hoping to engage the community as we develop the idea.
>>
>>
>> Backgrounds
>> ==
>> Control Groups/cgroup provide a mechanism for aggregating/partitioning sets 
>> of
>> tasks, and all their future children, into hierarchical groups with 
>> specialized
>> behaviour, such as accounting/limiting the resources which processes in a 
>> cgroup
>> can access[1].  Weights, limits, protections, allocations are the main 
>> resource
>> distribution models.  Existing cgroup controllers include cpu, memory, io,
>> rdma, and more.  cgroup is one of the foundational technologies that enables 
>> the
>> popular container application deployment and management method.
>>
>> Direct Rendering Manager/drm contains code intended to support the needs of
>> complex graphics devices. Graphics drivers in the kernel may make use of DRM
>> functions to make tasks like memory management, interrupt handling and DMA
>> easier, and provide a uniform interface to applications.  The DRM has also
>> developed beyond traditional graphics applications to support compute/GPGPU
>> applications.
>>
>>
>> Motivations
>> =
>> As GPUs grow beyond the realm of desktop/workstation graphics into areas like
>> data center clusters and IoT, there are increasing needs to monitor and 
>> regulate
>> GPU as a resource like cpu, memory and io.
>>
>> Matt Roper from Intel began working on similar idea in early 2018 [2] for the
>> purpose of managing GPU priority using the cgroup hierarchy.  While that
>> particular use case may not warrant a standalone drm cgroup controller, there
>> are other use cases where having one can be useful [3].  Monitoring GPU
>> resources such as VRAM and buffers, CU (compute unit [AMD's nomenclature])/EU
>> (execution unit [Intel's nomenclature]), GPU job scheduling [4] can help
>> sysadmins get a better 

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Daniel Vetter
On Thu, Aug 29, 2019 at 02:05:17AM -0400, Kenny Ho wrote:
> This is a follow up to the RFC I made previously to introduce a cgroup
> controller for the GPU/DRM subsystem [v1,v2,v3].  The goal is to be able to
> provide resource management for GPU resources using things like containers.
> 
> With this RFC v4, I am hoping to have some consensus on a merge plan.  I 
> believe
> the GEM related resources (drm.buffer.*) introduced in previous RFC and,
> hopefully, the logical GPU concept (drm.lgpu.*) introduced in this RFC are
> uncontroversial and ready to move out of RFC and into a more formal review.  I
> will continue to work on the memory backend resources (drm.memory.*).
> 
> The cover letter from v1 is copied below for reference.
> 
> [v1]: 
> https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> [v2]: https://www.spinics.net/lists/cgroups/msg22074.html
> [v3]: https://lists.freedesktop.org/archives/amd-gfx/2019-June/036026.html

So looking at all this doesn't seem to have changed much, and the old
discussion didn't really conclude anywhere (aside from some details).

One more open issue that crossed my mind, having read a ton of ttm again
recently: How does this all interact with ttm global limits? I'd say the
ttm global limits are the ur-cgroup we have in drm, and not looking at
that seems kinda bad.
-Daniel

> 
> v4:
> Unchanged (no review needed)
> * drm.memory.*/ttm resources (Patch 9-13, I am still working on memory 
> bandwidth
> and shrinker)
> Based on feedback on v3:
> * update nomenclature to drmcg
> * embed per device drmcg properties into drm_device
> * split GEM buffer related commits into stats and limit
> * rename function name to align with convention
> * combined buffer accounting and check into a try_charge function
> * support buffer stats without limit enforcement
> * removed GEM buffer sharing limitation
> * updated documentations
> New features:
> * introducing logical GPU concept
> * example implementation with AMD KFD
> 
> v3:
> Based on feedback on v2:
> * removed .help type file from v2
> * conform to cgroup convention for default and max handling
> * conform to cgroup convention for addressing device specific limits (with 
> major:minor)
> New function:
> * adopted memparse for memory size related attributes
> * added macro to marshall drmcgrp cftype private  (DRMCG_CTF_PRIV, etc.)
> * added ttm buffer usage stats (per cgroup, for system, tt, vram.)
> * added ttm buffer usage limit (per cgroup, for vram.)
> * added per cgroup bandwidth stats and limiting (burst and average bandwidth)
> 
> v2:
> * Removed the vendoring concepts
> * Add limit to total buffer allocation
> * Add limit to the maximum size of a buffer allocation
> 
> v1: cover letter
> 
> The purpose of this patch series is to start a discussion for a generic cgroup
> controller for the drm subsystem.  The design proposed here is a very early 
> one.
> We are hoping to engage the community as we develop the idea.
> 
> 
> Backgrounds
> ==
> Control Groups/cgroup provide a mechanism for aggregating/partitioning sets of
> tasks, and all their future children, into hierarchical groups with 
> specialized
> behaviour, such as accounting/limiting the resources which processes in a 
> cgroup
> can access[1].  Weights, limits, protections, allocations are the main 
> resource
> distribution models.  Existing cgroup controllers include cpu, memory, io,
> rdma, and more.  cgroup is one of the foundational technologies that enables 
> the
> popular container application deployment and management method.
> 
> Direct Rendering Manager/drm contains code intended to support the needs of
> complex graphics devices. Graphics drivers in the kernel may make use of DRM
> functions to make tasks like memory management, interrupt handling and DMA
> easier, and provide a uniform interface to applications.  The DRM has also
> developed beyond traditional graphics applications to support compute/GPGPU
> applications.
> 
> 
> Motivations
> =
> As GPUs grow beyond the realm of desktop/workstation graphics into areas like
> data center clusters and IoT, there are increasing needs to monitor and 
> regulate
> GPU as a resource like cpu, memory and io.
> 
> Matt Roper from Intel began working on similar idea in early 2018 [2] for the
> purpose of managing GPU priority using the cgroup hierarchy.  While that
> particular use case may not warrant a standalone drm cgroup controller, there
> are other use cases where having one can be useful [3].  Monitoring GPU
> resources such as VRAM and buffers, CU (compute unit [AMD's nomenclature])/EU
> (execution unit [Intel's nomenclature]), GPU job scheduling [4] can help
> sysadmins get a better understanding of the applications usage profile.  
> Further
> usage regulations of the aforementioned resources can also help sysadmins
> optimize workload deployment on limited GPU resources.
> 
> With the increased importance of machine learning, data science and other
> 

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Daniel Vetter
On Fri, Aug 30, 2019 at 09:28:57PM -0700, Tejun Heo wrote:
> Hello,
> 
> I just glanced through the interface and don't have enough context to
> give any kind of detailed review yet.  I'll try to read up and
> understand more and would greatly appreciate it if you can give me some
> pointers to read up on the resources being controlled and what the
> actual use cases would look like.  That said, I have some basic
> concerns.
> 
> * TTM vs. GEM distinction seems to be internal implementation detail
>   rather than anything relating to underlying physical resources.
>   Provided that's the case, I'm afraid these internal constructs being
>   used as primary resource control objects likely isn't the right
>   approach.  Whether a given driver uses one or the other internal
>   abstraction layer shouldn't determine how resources are represented
>   at the userland interface layer.

Yeah there's another RFC series from Brian Welty to abstract this away as
a memory region concept for gpus.

> * While breaking up and applying control to different types of
>   internal objects may seem attractive to folks who work day in and
>   day out with the subsystem, they aren't all that useful to users and
>   the siloed controls are likely to make the whole mechanism a lot
>   less useful.  We had the same problem with cgroup1 memcg - putting
>   control of different uses of memory under separate knobs.  It made
>   the whole thing pretty useless.  e.g. if you constrain all knobs
>   tight enough to control the overall usage, overall utilization
>   suffers, but if you don't, you really don't have control over actual
>   usage.  For memcg, what has to be allocated and controlled is
>   physical memory, no matter how they're used.  It's not like you can
>   go buy more "socket" memory.  At least from the looks of it, I'm
>   afraid gpu controller is repeating the same mistakes.

We do have quite a pile of different memories and ranges, so I don't
think we're making the same mistake here. But it is maybe a bit too
complicated, and exposes stuff that most users really don't care about.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-08-30 Thread Tejun Heo
Hello,

I just glanced through the interface and don't have enough context to
give any kind of detailed review yet.  I'll try to read up and
understand more and would greatly appreciate it if you can give me some
pointers to read up on the resources being controlled and what the
actual use cases would look like.  That said, I have some basic
concerns.

* TTM vs. GEM distinction seems to be internal implementation detail
  rather than anything relating to underlying physical resources.
  Provided that's the case, I'm afraid these internal constructs being
  used as primary resource control objects likely isn't the right
  approach.  Whether a given driver uses one or the other internal
  abstraction layer shouldn't determine how resources are represented
  at the userland interface layer.

* While breaking up and applying control to different types of
  internal objects may seem attractive to folks who work day in and
  day out with the subsystem, they aren't all that useful to users and
  the siloed controls are likely to make the whole mechanism a lot
  less useful.  We had the same problem with cgroup1 memcg - putting
  control of different uses of memory under separate knobs.  It made
  the whole thing pretty useless.  e.g. if you constrain all knobs
  tight enough to control the overall usage, overall utilization
  suffers, but if you don't, you really don't have control over actual
  usage.  For memcg, what has to be allocated and controlled is
  physical memory, no matter how they're used.  It's not like you can
  go buy more "socket" memory.  At least from the looks of it, I'm
  afraid gpu controller is repeating the same mistakes.

Thanks.

-- 
tejun