Re: [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

2023-05-10 Thread Tejun Heo
Hello,

On Wed, May 10, 2023 at 04:59:01PM +0200, Maarten Lankhorst wrote:
> The misc controller is not granular enough. A single computer may have any
> number of graphics cards, some of them with multiple regions of vram inside
> a single card.

Extending the misc controller to support dynamic keys shouldn't be that
difficult.
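
As it stands, the misc controller interface is flat "<key> <value>" lines
(the only in-tree keys so far are the SEV ASID pools), so dynamic
per-device or per-region keys would fit the same grammar. A sketch, with
the vram key purely hypothetical:

  $ cat /sys/fs/cgroup/misc.capacity
  sev 509
  sev_es 509
  $ echo "vram0 1073741824" > /sys/fs/cgroup/job/misc.max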

...
> In the next version, I will move all the code for handling the resource
> limit to TTM's eviction layer, because otherwise it cannot handle the
> resource limit correctly.
> 
> The effect of moving the code to TTM is that it will make the code even
> more generic for drivers that have vram and use TTM. When using TTM, you
> only have to describe your VRAM, update some fields in the TTM manager and
> (un)register your device with the cgroup handler on (un)load. It's quite
> trivial to add vram accounting to amdgpu and nouveau. [2]
> 
> If you want to add a knob for scheduling weight for a process, it makes
> sense to also add resource usage as a knob, otherwise the effect of that
> knob is very limited. So even for Tvrtko's original proposed usecase, it
> would make sense.

It does make sense but unlike Tvrtko's scheduling weights what's being
proposed doesn't seem to encapsulate GPU memory resource in a generic enough
manner, at least to my untrained eyes. ie. w/ drm.weight, I don't need any
specific knowledge of how a particular GPU operates to say "this guy should
get 2x processing power over that guy". This more or less holds for other
major resources including CPU, memory and IO. What you're proposing seems a
lot more tied to hardware details and users would have to know a lot more
about how memory is configured on that particular GPU.

Now, if this is inherent to how all, or at least most, GPUs operate, sure,
but otherwise let's start small in terms of interface and not take up space
which should be for something universal. If this turns out to be the way,
expanding to take up the generic interface space isn't difficult.

I don't know GPU space so please educate me where I'm wrong.

Thanks.

-- 
tejun


Re: [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

2023-05-05 Thread Tejun Heo
Hello,

On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:
> RFC as I'm looking for comments.
> 
> For long running compute, it can be beneficial to partition the GPU memory
> between cgroups, so each cgroup can use its maximum amount of memory without
> interfering with other scheduled jobs. Done properly, this can alleviate the
> need for eviction, which might result in a job being terminated if the GPU
> doesn't support mid-thread preemption or recoverable page faults.
> 
> This is done by adding a bunch of knobs to cgroup:
> drm.capacity: Shows maximum capacity of each resource region.
> drm.max: Display or limit max amount of memory.
> drm.current: Current amount of memory in use.
> 
> TTM has not been made cgroup aware yet, so instead of evicting from
> the current cgroup to stay within the cgroup limits, it simply returns
> the error -ENOSPC to userspace.
> 
> I've used Tvrtko's cgroup controller series as a base, but it implemented
> scheduling weight, not memory accounting, so I only ended up keeping the
> base patch.
> 
> Xe is not upstream yet, so the driver specific patch will only apply on
> https://gitlab.freedesktop.org/drm/xe/kernel
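
For concreteness, a hypothetical session against the knobs described above
(region naming, device id and numbers invented purely for illustration):

  $ cat /sys/fs/cgroup/drm.capacity
  card0.vram 8589934592
  $ echo "card0.vram 1073741824" > /sys/fs/cgroup/job/drm.max
  $ cat /sys/fs/cgroup/job/drm.current
  card0.vram 268435456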

Some high-level feedback.

* There have been multiple attempts at this but the track record is kinda
  poor. People don't seem to agree on what should constitute DRM memory and
  how it should be accounted / controlled.

* I like Tvrtko's scheduling patchset because it exposes a generic interface
  which makes sense regardless of hardware details and then each driver can
  implement the configured control in whatever way they can. However, even
  for that, there doesn't seem much buy-in from other drivers.

* This proposal seems narrowly scoped trying to solve a specific problem
  which may not translate to different hardware configurations. Please let
  me know if I got that wrong, but if that's the case, I think a better and
  easier approach might be just being a part of the misc controller. That
  doesn't require much extra code and should be able to provide everything
  necessary for statically limiting specific resources.

Thanks.

-- 
tejun


Re: Selecting CPUs for queuing work on

2022-08-12 Thread Tejun Heo
Hello,

On Fri, Aug 12, 2022 at 04:54:04PM -0400, Felix Kuehling wrote:
> In principle, I think IRQ routing to CPUs can change dynamically with
> irqbalance.

I wonder whether this is something which should be exposed to userland
rather than trying to do it dynamically in the kernel, letting irqbalance or
whatever deal with it. People use irq affinity to steer these handlers to
specific CPUs and the usual expectation is that the bottom half handling is
gonna take place on the same cpu, usually through softirq. It's kinda awkward
to have this secondary assignment happening implicitly.

> What we need is kind of the opposite of WQ_UNBOUND. As I understand it,
> WQ_UNBOUND can schedule anywhere to maximize concurrency. What we need is to
> schedule to very specific, predictable CPUs. We only have one work item per
> GPU that processes all the interrupts in order, so we don't need the
> concurrency of WQ_UNBOUND.

Each WQ_UNBOUND workqueue has a cpumask associated with it and the cpumask
can be changed dynamically, so it can be used for sth like this, but I'm not
yet convinced that's the right thing to do.
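
A minimal sketch of that mechanism (workqueue name invented): WQ_SYSFS
exposes a writable cpumask file under /sys/devices/virtual/workqueue/, so
irqbalance-like tooling could steer the bottom half with no new kernel
interface.

  #include <linux/workqueue.h>

  static struct workqueue_struct *ih_wq;

  static int ih_wq_init(void)
  {
          /* WQ_UNBOUND: not tied to the submitting CPU;
           * WQ_SYSFS: userland can rewrite the cpumask at runtime */
          ih_wq = alloc_workqueue("gpu_ih", WQ_UNBOUND | WQ_SYSFS, 1);
          return ih_wq ? 0 : -ENOMEM;
  }

  /* then, from userland:
   *   echo 0c > /sys/devices/virtual/workqueue/gpu_ih/cpumask
   */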

Thanks.

-- 
tejun


Re: Selecting CPUs for queuing work on

2022-08-12 Thread Tejun Heo
On Fri, Aug 12, 2022 at 04:26:47PM -0400, Felix Kuehling wrote:
> Hi workqueue maintainers,
> 
> In the KFD (amdgpu) driver we found a need to schedule bottom half interrupt
> handlers on CPU cores different from the one where the top-half interrupt
> handler runs to avoid the interrupt handler stalling the bottom half in
> extreme scenarios. See my latest patch that tries to use a different
> hyperthread on the same CPU core, or falls back to a different core in the
> same NUMA node if that fails:
> https://lore.kernel.org/all/20220811190433.1213179-1-felix.kuehl...@amd.com/
> 
> Dave pointed out that the driver may not be the best place to implement such
> logic and suggested that we should have an abstraction, maybe in the
> workqueue code. Do you feel this is something that could or should be
> provided by the core workqueue code? Or maybe some other place?

I'm not necessarily against it. I guess it can be a flag on an unbound wq.
Do the interrupts move across different CPUs tho? ie. why does this need to
be a dynamic decision?
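
For reference, the static version needs no new flag at all; a sketch built
on the existing queue_work_on() (the CPU-selection policy and function
names here are invented):

  #include <linux/cpumask.h>
  #include <linux/workqueue.h>

  static void queue_ih_bottom_half(struct work_struct *work, int irq_cpu)
  {
          /* hypothetical policy: any other online CPU, falling back to
           * the CPU that took the interrupt */
          int cpu = cpumask_any_but(cpu_online_mask, irq_cpu);

          if (cpu >= nr_cpu_ids)
                  cpu = irq_cpu;
          /* system_wq is per-cpu, so the work runs on the chosen CPU */
          queue_work_on(cpu, system_wq, work);
  }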

Thanks.

-- 
tejun


Re: [PATCH] Revert "workqueue: remove unused cancel_work()"

2022-06-07 Thread Tejun Heo
On Tue, Jun 07, 2022 at 01:39:01PM -0400, Alex Deucher wrote:
> On Tue, Jun 7, 2022 at 1:14 PM Tejun Heo  wrote:
> >
> > On Sat, May 21, 2022 at 12:04:00AM -0400, Andrey Grodzovsky wrote:
> > > From 78df30cc97f10c885f5159a293e6afe2348aa60c Mon Sep 17 00:00:00 2001
> > > From: Andrey Grodzovsky 
> > > Date: Thu, 19 May 2022 09:47:28 -0400
> > > Subject: Revert "workqueue: remove unused cancel_work()"
> > >
> > > This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f.
> > >
> > > amdgpu needs this function in order to prematurely stop pending
> > > reset works when another reset work is already in progress.
> > >
> > > Signed-off-by: Andrey Grodzovsky 
> >
> > Applied to wq/for-5.19-fixes.
> 
> Could we take it through the drm tree so we can include it with
> Andrey's patches that depend on it?

Oh sure, please go ahead. Imma revert from my tree.

 Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] Revert "workqueue: remove unused cancel_work()"

2022-06-07 Thread Tejun Heo
On Sat, May 21, 2022 at 12:04:00AM -0400, Andrey Grodzovsky wrote:
> From 78df30cc97f10c885f5159a293e6afe2348aa60c Mon Sep 17 00:00:00 2001
> From: Andrey Grodzovsky 
> Date: Thu, 19 May 2022 09:47:28 -0400
> Subject: Revert "workqueue: remove unused cancel_work()"
> 
> This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f.
> 
> amdgpu needs this function in order to prematurely stop pending
> reset works when another reset work is already in progress.
> 
> Signed-off-by: Andrey Grodzovsky 

Applied to wq/for-5.19-fixes.

Thanks.

-- 
tejun


Re: [PATCH] Revert "workqueue: remove unused cancel_work()"

2022-05-20 Thread Tejun Heo
On Fri, May 20, 2022 at 08:22:39AM +0200, Christian König wrote:
> Am 20.05.22 um 02:47 schrieb Lai Jiangshan:
> > On Thu, May 19, 2022 at 11:04 PM Andrey Grodzovsky
> >  wrote:
> > > See this patch-set https://www.spinics.net/lists/amd-gfx/msg78514.html,
> > > specifically patch 'drm/amdgpu: Switch to delayed work from work_struct'.
> > > 
> > > I will just reiterate here -
> > > 
> > > We need to be able to do a non-blocking cancel of pending reset works
> > > from within GPU reset. Currently the kernel API allows this only
> > > for delayed_work and not for work_struct.
> > > 
> > I'm OK with the change.
> > 
> > With an updated changelog:
> > 
> > Reviewed-by: Lai Jiangshan
> 
> Good morning guys,
> 
> for the patch itself Reviewed-by: Christian König 
> 
> And just for the record: We plan to push this upstream through the drm
> branches, if anybody has any objections to that please speak up.
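
For reference, the semantics being restored, in a sketch (driver structure
and function names invented): unlike cancel_work_sync(), cancel_work()
takes a pending item off the queue without waiting for a running instance,
so it's safe to call from inside the reset path itself.

  #include <linux/printk.h>
  #include <linux/workqueue.h>

  struct my_gpu {
          struct work_struct reset_work;
  };

  static void my_gpu_reset_begin(struct my_gpu *gpu)
  {
          /* a reset is underway; queued-but-not-started resets are now
           * redundant.  Returns true if one was dropped. */
          if (cancel_work(&gpu->reset_work))
                  pr_debug("dropped redundant pending reset\n");
  }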

Andrey, care to resend with updated description?

Thanks.

-- 
tejun


Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL

2021-05-07 Thread Tejun Heo
Hello,

On Fri, May 07, 2021 at 06:30:56PM -0400, Alex Deucher wrote:
> Maybe we are speaking past each other.  I'm not following.  We got
> here because a device specific cgroup didn't make sense.  With my
> Linux user hat on, that makes sense.  I don't want to write code to a
> bunch of device specific interfaces if I can avoid it.  But as for
> temporal vs spatial partitioning of the GPU, the argument seems to be
> a sort of hand-wavy one that both spatial and temporal partitioning
> make sense on CPUs, but only temporal partitioning makes sense on
> GPUs.  I'm trying to understand that assertion.  There are some GPUs

Spatial partitioning as implemented in cpuset isn't a desirable model. It's
there partly because it has historically been there. It doesn't really
require dynamic hierarchical distribution of anything and is more of a way
to batch-update per-task configuration, which is how it's actually
implemented. It's broken too in that it interferes with per-task affinity
settings. So, not exactly a good example to follow. In addition, this sort
of partitioning requires more hardware knowledge, and GPUs are worse than
CPUs in that the hardware differs more.

Features like this are trivial to implement from userland side by making
per-process settings inheritable and restricting who can update the
settings.

> that can more easily be temporally partitioned and some that can be
> more easily spatially partitioned.  It doesn't seem any different than
> CPUs.

Right, it doesn't really matter how the resource is distributed. What
matters is how granular and generic the distribution can be. If gpus can
implement work-conserving proportional distribution, that's something which
is widely useful and inherently requires dynamic scheduling from kernel
side. If it's about setting per-vendor affinities, this is way too much
cgroup interface for a feature which can be easily implemented outside
cgroup. Just do it per-process (or per whatever handle gpus use) and confine
the configurations from the cgroup side in whatever way fits.

While the specific theme changes a bit, we're basically having the same
discussion with the same conclusion over the past however many months.
Hopefully, the point is clear by now.

Thanks.

-- 
tejun


Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL

2021-05-07 Thread Tejun Heo
Hello,

On Fri, May 07, 2021 at 03:55:39PM -0400, Alex Deucher wrote:
> The problem is temporal partitioning on GPUs is much harder to enforce
> unless you have a special case like SR-IOV.  Spatial partitioning, on
> AMD GPUs at least, is widely available and easily enforced.  What is
> the point of implementing temporal style cgroups if no one can enforce
> it effectively?

So, if generic fine-grained partitioning can't be implemented, the right
thing to do is to stop pushing for a full-blown cgroup interface for it. The
hardware simply isn't capable of being managed in a way which allows generic
fine-grained hierarchical scheduling, and there's no point in bloating the
interface with half-baked hardware-dependent features.

This isn't to say that there's no way to support them, but what has been
proposed is way too generic and ambitious in terms of interface while
being poorly developed on the internal abstraction and mechanism front. If
the hardware can't do generic, either implement the barest minimum interface
(e.g. be a part of the misc controller) or go driver-specific - the feature is
hardware specific anyway. I've repeated this multiple times in these
discussions now but it'd be really helpful to try to minimize the interface
while concentrating more on internal abstractions and actual control
mechanisms.

Thanks.

-- 
tejun


Re: [RFC] Add BPF_PROG_TYPE_CGROUP_IOCTL

2021-05-07 Thread Tejun Heo
Hello,

On Fri, May 07, 2021 at 06:54:13PM +0200, Daniel Vetter wrote:
> All I meant is that for the container/cgroups world starting out with
> time-sharing feels like the best fit, not least because your SRIOV designers
> also seem to think that's the best first cut for cloud-y computing.
> Whether it's virtualized or containerized is a distinction that's getting
> ever more blurry, with virtualization becoming a lot more dynamic and
> container runtimes also possibly using hw virtualization underneath.

FWIW, I'm completely on the same boat. There are two fundamental issues with
hardware-mask based control - control granularity and work conservation.
Combined, they make it a significantly more difficult interface to use which
requires hardware-specific tuning rather than simply being able to say "I
wanna prioritize this job twice over that one".

My knowledge of gpus is really limited but my understanding is also that the
gpu cores and threads aren't as homogeneous as the CPU counterparts across
the vendors, product generations and possibly even within a single chip,
which makes the problem even worse.

Given that GPUs are time-shareable to begin with, the most universal
solution seems pretty clear.

Thanks.

-- 
tejun


Re: [PATCH v2 00/11] new cgroup controller for gpu/drm subsystem

2020-04-13 Thread Tejun Heo
Hello,

On Mon, Apr 13, 2020 at 05:40:32PM -0400, Kenny Ho wrote:
> By lack of consensus, do you mean Intel's assertion that a standard is
> not a standard until Intel implements it? (That was in the context of
> OpenCL language standard with the concept of SubDevice.)  I thought
> the discussion so far has established that the concept of a compute
> unit, while named differently (AMD's CUs, ARM's SCs, Intel's EUs,
> Nvidia's SMs, Qualcomm's SPs), is cross vendor.  While an AMD CU is
> not the same as an Intel EU or Nvidia SM, the same can be said for CPU
> cores.  If cpuset is acceptable for a diversity of CPU core designs
> and arrangements, I don't understand why an interface derived from GPU
> SubDevice is considered premature.

CPUs are a lot more uniform across vendors than GPUs and have way higher user
observability and awareness. And, even then, it's something which has limited
usefulness because the configuration is inherently more complex involving
topology details and the end result is not work-conserving.

cpuset is there partly due to historical reasons and its features can often be
trivially replicated with some scripting around taskset. If that's all you're
trying to add, I don't see why it needs to be in cgroup at all. Just implement
a tool similar to taskset and build sufficient tooling around it. Given how
hardware specific it can become, that is likely the better direction anyway.
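
For example (PID and CPU list invented):

  # confine an existing process, and whatever it forks, to cores 0-3
  $ taskset -pc 0-3 1234
  # or launch the job under the mask to begin with
  $ taskset -c 0-3 ./compute_job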

Thanks.

-- 
tejun


Re: [PATCH v2 00/11] new cgroup controller for gpu/drm subsystem

2020-04-13 Thread Tejun Heo
Hello,

On Mon, Apr 13, 2020 at 04:17:14PM -0400, Kenny Ho wrote:
> Perhaps we can even narrow things down to just
> gpu.weight/gpu.compute.weight as a start?  In this aspect, is the key

That sounds great to me.

> objection to the current implementation of gpu.compute.weight the
> work-conserving bit?  This work-conserving requirement is probably
> what I have missed for the last two years (and hence going in circles.)
> 
> If this is the case, can you clarify/confirm the followings?
> 
> 1) Is resource scheduling goal of cgroup purely for the purpose of
> throughput?  (at the expense of other scheduling goals such as
> latency.)

It's not; however, work-conserving mechanisms are the easiest to use (cuz you
don't lose anything) while usually challenging to implement. It tends to
clarify how control mechanisms should be structured - even what resources are.

> 2) If 1) is true, under what circumstances will the "Allocations"
> resource distribution model (as defined in the cgroup-v2) be
> acceptable?

Allocations definitely are acceptable and it's not a pre-requisite to have
work-conserving control first either. Here, given the lack of consensus in
terms of what even constitutes resource units, I don't think it'd be a good
idea to commit to the proposed interface and believe it'd be beneficial to
work on interface-wise simpler work conserving controls.

> 3) If 1) is true, are things like cpuset from cgroup v1 no longer
> acceptable going forward?

Again, they're acceptable.

> To be clear, while some have framed this (time sharing vs spatial
> sharing) as a partisan issue, it is in fact a technical one.  I have
> implemented the gpu cgroup support this way because we have a class of
> users that value low latency/low jitter/predictability/synchronicity.
> For example, they would like 4 tasks to share a GPU and they would
> like the tasks to start and finish at the same time.
> 
> What is the rationale behind picking the Weight model over Allocations
> as the first acceptable implementation?  Can't we have both
> work-conserving and non-work-conserving ways of distributing GPU
> resources?  If we can, why not allow non-work-conserving
> implementation first, especially when we have users asking for such
> functionality?

I hope the rationales are clear now. What I'm objecting to is the inclusion
of a premature interface, which is a lot easier and more tempting to do for
hardware-specific limits and the proposals up until now have been showing
ample signs of that. I don't think my position has changed much since the
beginning - do the difficult-to-implement but easy-to-use weights first and
then you and everyone would have a better idea of what hard-limit or
allocation interfaces and mechanisms should look like, or even whether they're
needed.

Thanks.

-- 
tejun


Re: [PATCH v2 00/11] new cgroup controller for gpu/drm subsystem

2020-04-13 Thread Tejun Heo
Hello, Kenny.

On Tue, Mar 24, 2020 at 02:49:27PM -0400, Kenny Ho wrote:
> Can you elaborate more on what are the missing pieces?

Sorry about the long delay, but I think we've been going in circles for quite
a while now. Let's try to make it really simple as the first step. How about
something like the following?

* gpu.weight (should it be gpu.compute.weight? idk) - A single number
  per-device weight similar to io.weight, which distributes computation
  resources in work-conserving way.

* gpu.memory.high - A single number per-device on-device memory limit.

The above two, if they work well, should already be plenty useful. And my
guess is that getting them working well will be plenty challenging even
though it already excludes work-conserving memory distribution. So, let's
please do that as the first step and see what more would be needed from there.
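
In concrete terms, usage might look as follows if the two knobs borrow the
io.weight and memory.high conventions (syntax and numbers are illustrative
only, nothing here is settled):

  $ echo "default 100" > /sys/fs/cgroup/batch/gpu.weight
  $ echo "226:0 200" > /sys/fs/cgroup/batch/gpu.weight   # 2x share on card0
  $ echo "226:0 4G" > /sys/fs/cgroup/batch/gpu.memory.high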

Thanks.

-- 
tejun


Re: [PATCH] device_cgroup: Cleanup cgroup eBPF device filter code

2020-04-13 Thread Tejun Heo
On Fri, Apr 03, 2020 at 07:55:28PM +0200, Odin Ugedal wrote:
> Original cgroup v2 eBPF code for filtering device access made it
> possible to compile with CONFIG_CGROUP_DEVICE=n and still use the eBPF
> filtering. Commit 4b7d4d453fc4 ("device_cgroup: Export
> devcgroup_check_permission") reverted this, making it mandatory to set
> it to y.
> 
> Since the device filtering (and all the docs) for cgroup v2 is no longer
> a "device controller" like it was in v1, someone might compile their
> kernel with CONFIG_CGROUP_DEVICE=n. Then (for linux 5.5+) the eBPF
> filter will not be invoked, and all processes will be allowed access
> to all devices, no matter what the eBPF filter says.

Applied to cgroup/for-5.7-fixes.

Thanks.

-- 
tejun


Re: [PATCH v2 00/11] new cgroup controller for gpu/drm subsystem

2020-03-24 Thread Tejun Heo
On Tue, Mar 17, 2020 at 12:03:20PM -0400, Kenny Ho wrote:
> What's your thoughts on this latest series?

My overall impression is that the feedback isn't being incorporated
thoroughly / sufficiently.

Thanks.

-- 
tejun


Re: [PATCH 09/11] drm, cgroup: Introduce lgpu as DRM cgroup resource

2020-02-14 Thread Tejun Heo
On Fri, Feb 14, 2020 at 03:28:40PM -0500, Kenny Ho wrote:
> Can you elaborate, per your understanding, how the lgpu weight
> attribute differ from the io.weight you suggested?  Is it merely a

Oh, it's the non-weight part which is problematic.

> formatting/naming issue or is it the implementation details that you
> find troubling?  From my perspective, the weight attribute implements
> as you suggested back in RFCv4 (proportional control on top of a unit
> - either physical or time unit.)
> 
> Perhaps more explicit questions would help me understand what you
> mean. If I remove the 'list' and 'count' attributes leaving just
> weight, is that satisfactory?  Are you saying the idea of affinity or

At least from interface pov, yes, although I think it should be clear
what the weight controls.

> named-resource is banned from cgroup entirely (even though it exists
> in the form of cpuset already and users are interested in having such
> options [i.e. userspace OpenCL] when needed?)
> 
> To be clear, I am not saying no proportional control.  I am saying
> give the user the options, which is what has been implemented.

We can get there if we *really* have to, but not from the get-go -
I'd rather avoid affinities if at all possible.

Thanks.

-- 
tejun


Re: [PATCH 09/11] drm, cgroup: Introduce lgpu as DRM cgroup resource

2020-02-14 Thread Tejun Heo
Hello, Kenny, Daniel.

(cc'ing Johannes)

On Fri, Feb 14, 2020 at 01:51:32PM -0500, Kenny Ho wrote:
> On Fri, Feb 14, 2020 at 1:34 PM Daniel Vetter  wrote:
> >
> > I think guidance from Tejun in previous discussions was pretty clear that
> > he expects cgroups to be both a) standardized and b) of sufficiently clear
> > meaning that end-users have a clear understanding of what happens when
> > they change the resource allocation.
> >
> > I'm not sure lgpu here, at least as specified, passes either.
> 
> I disagree (at least on the characterization of the feedback
> provided.)  I believe this series satisfied the spirit of Tejun's
> guidance so far (the weight knob for lgpu, for example, was
> specifically implemented base on his input.)  But, I will let Tejun
> speak for himself after he considered the implementation in detail.

I have to agree with Daniel here. My apologies if I wasn't clear
enough. Here's one interface I can think of:

 * compute weight: The same format as io.weight. Proportional control
   of gpu compute.

 * memory low: Please see how the system memory.low behaves. For gpus,
   it'll need per-device entries.

Note that for both, there's one number to configure and conceptually
it's pretty clear to everybody what that number means, which is not to
say that it's easy to implement, but it's much better to deal with
that on this side of the interface than the other.

cc'ing Johannes. Do you have anything in mind regarding how gpu memory
configuration should look? e.g. should it go w/ weights rather
than absolute units? (I don't think so given that it'll most likely
need limits at some point too, but still, there are benefits from
staying consistent with system memory.)

Also, a rather trivial high level question. Is drm a good controller
name given that other controller names are like cpu, memory, io?

Thanks.

-- 
tejun


Re: [PATCH RFC v4 02/16] cgroup: Introduce cgroup for drm subsystem

2019-12-02 Thread Tejun Heo
On Fri, Nov 29, 2019 at 01:00:36AM -0500, Kenny Ho wrote:
> On Tue, Oct 1, 2019 at 10:31 AM Michal Koutný  wrote:
> > On Thu, Aug 29, 2019 at 02:05:19AM -0400, Kenny Ho  wrote:
> > > +struct cgroup_subsys drm_cgrp_subsys = {
> > > + .css_alloc  = drmcg_css_alloc,
> > > + .css_free   = drmcg_css_free,
> > > + .early_init = false,
> > > + .legacy_cftypes = files,
> > Do you really want to expose the DRM controller on v1 hierarchies (where
> > threads of one process can be in different cgroups, or children cgroups
> > compete with their parents)?
> 
> (Sorry for the delay, I have been distracted by something else.)
> Yes, I am hoping to make the functionality as widely available as
> possible since the ecosystem is still transitioning to v2.  Do you see
> an inherent problem with this approach?

Integrating with memcg could be more challenging on cgroup1.  That's
one of the reasons why e.g. cgroup-aware pagecache writeback is only
on cgroup2.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-10 Thread Tejun Heo
Hello, Michal.

On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> > So, while it'd be great to have shrinkers in the longer term, it's not a
> > strict requirement to be accounted in memcg.  It already accounts a
> > lot of memory which isn't reclaimable (a lot of slabs and socket
> > buffer).
> 
> Yeah, having a shrinker is preferred but the memory should better be
> reclaimable in some form. If not by any other means then at least bound
> to a user process context so that it goes away with a task being killed
> by the OOM killer. If that is not the case then we cannot really charge
> it because then the memcg controller is of no use. We can tolerate it to
> some degree if the amount of memory charged like that is negligible to
> the overall size. But from the discussion it seems that these buffers
> are really large.

Yeah, oom kills should be able to reduce the usage; however, please
note that tmpfs, among other things, can already escape this
restriction and we can have cgroups which are over max and empty.
It's obviously not ideal but the system doesn't melt down from it
either.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-06 Thread Tejun Heo
Hello, Daniel.

On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> > Hmm... what'd be the fundamental difference from slab or socket memory
> > which are handled through memcg?  Does system memory used by GPUs have
> > further global restrictions in addition to the amount of physical
> > memory used?
> 
> Sometimes, but that would be specific resources (kinda like vram),
> e.g. CMA regions used by a gpu. But probably not something you'll run
> in a datacenter and want cgroups for ...
> 
> I guess we could try to integrate with the memcg group controller. One
> trouble is that aside from i915 most gpu drivers do not really have a
> full shrinker, so not sure how that would all integrate.

So, while it'd be great to have shrinkers in the longer term, it's not a
strict requirement to be accounted in memcg.  It already accounts a
lot of memory which isn't reclaimable (a lot of slabs and socket
buffer).

> The overall gpu memory controller would still be outside of memcg I
> think, since that would include swapped-out gpu objects, and stuff in
> special memory regions like vram.

Yeah, for resources which are on the GPU itself or hard limitations
arising from it.  In general, we wanna make cgroup controllers control
something real and concrete as in physical resources.

> > At the system level, it just gets folded into cpu time, which isn't
> > perfect but is usually a good enough approximation of compute related
> > dynamic resources.  Can gpu do someting similar or at least start with
> > that?
> 
> So generally there's a pile of engines, often of different type (e.g.
> amd hw has an entire pile of copy engines), with some ill-defined
> sharing characteristics for some (often compute/render engines use the
> same shader cores underneath), kinda like hyperthreading. So at that
> detail it's all extremely hw specific, and probably too hard to
> control in a useful way for users. And I'm not sure we can really do a
> reasonable knob for overall gpu usage, e.g. if we include all the copy
> engines, but the workloads are only running on compute engines, then
> you might only get 10% overall utilization by engine-time. While the
> shaders (which is most of the chip area/power consumption) are
> actually at 100%. On top, with many userspace apis those engines are
> an internal implementation detail of a more abstract gpu device (e.g.
> opengl), but with others, this is all fully exposed (like vulkan).
> 
> Plus the kernel needs to use at least copy engines for vram management
> itself, and you really can't take that away. Although Kenny here has
> some proposal for a separate cgroup resource just for that.
> 
> I just think it's all a bit too ill-defined, and we might be better
> off nailing the memory side first and get some real world experience
> on this stuff. For context, there's not even a cross-driver standard
> for how priorities are handled, that's all driver-specific interfaces.

I see.  Yeah, figuring it out as this develops makes sense to me.  One
thing I wanna raise is that in general we don't want to expose device
or implementation details in the cgroup interface.  What we want expressed
there are the intentions of the user.  The more internal details we
expose the more we end up getting tied down to the specific
implementation which we should avoid especially given the early stage
of development.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 01/16] drm: Add drm_minor_for_each

2019-09-06 Thread Tejun Heo
Hello, Daniel.

On Fri, Sep 06, 2019 at 05:36:02PM +0200, Daniel Vetter wrote:
> Block devices are a great example I think. How do you handle the
> partitions on that? For drm we also have a main minor interface, and

cgroup IO controllers only distribute hardware IO capacity and are
blind to partitions.  As there's always the whole device MAJ:MIN for
block devices, we only use that.
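
For example, an io.max limit always names the whole device (numbers
invented):

  # throttle sda (8:0) no matter which partition the I/O targets
  $ echo "8:0 wbps=1048576 riops=1200" > /sys/fs/cgroup/grp/io.max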

> then the render-only interface on drivers that support it. So if blkcg
> handles that by only exposing the primary maj:min pair, I think we can
> go with that and it's all nicely consistent.

Ah yeah, that sounds equivalent.  Great.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 01/16] drm: Add drm_minor_for_each

2019-09-06 Thread Tejun Heo
Hello,

On Wed, Sep 04, 2019 at 10:54:34AM +0200, Daniel Vetter wrote:
> Anyway, I don't think reusing the drm_minor registration makes sense,
> since we want to be on the drm_device, not on the minor. Which is a bit
> awkward for cgroups, which wants to identify devices using major.minor
> pairs. But I guess drm is the first subsystem where 1 device can be
> exposed through multiple minors ...
> 
> Tejun, any suggestions on this?

I'm not extremely attached to maj:min.  It's nice in that it'd be
consistent with blkcg but it already isn't the nicest of identifiers
for block devices.  If using maj:min is reasonably straightforward
for gpus even if not perfect, I'd prefer going with maj:min.
Otherwise, please feel free to use the ID best for GPUs - hopefully
something which is easy to understand, consistent with IDs used
elsewhere and easy to build tooling around.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-06 Thread Tejun Heo
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:48:22PM +0200, Daniel Vetter wrote:
> I think system memory separate from vram makes sense. For one, vram is
> like 10x+ faster than system memory, so we definitely want to have
> good control on that. But maybe we only want one vram bucket overall
> for the entire system?
> 
> The trouble with system memory is that gpu tasks pin that memory to
> prep execution. There's two solutions:
> - i915 has a shrinker. Lots (and I really mean lots) of pain with
> direct reclaim recursion, which often means we can't free memory, and
> we're angering the oom killer a lot. Plus it introduces real bad
> latency spikes everywhere (gpu workloads are occasionally really slow,
> think "worse than pageout to spinning rust" to get memory freed).
> - ttm just has a global limit, set to 50% of system memory.
> 
> I do think a global system memory limit to tame the shrinker, without
> the ttm approach of possible just wasting half your memory, could be
> useful.

Hmm... what'd be the fundamental difference from slab or socket memory
which are handled through memcg?  Does system memory used by GPUs have
further global restrictions in addition to the amount of physical
memory used?

> I'm also not sure of the bw limits, given all the fun we have on the
> block io cgroups side. Aside from that the current bw limit only
> controls the bw the kernel uses, userspace can submit unlimited
> amounts of copying commands that use the same pcie links directly to
> the gpu, bypassing this cg knob. Also, controlling execution time for
> gpus is very tricky, since they work a lot more like a block io device
> or maybe a network controller with packet scheduling, than a cpu.

At the system level, it just gets folded into cpu time, which isn't
perfect but is usually a good enough approximation of compute related
dynamic resources.  Can gpu do something similar or at least start with
that?

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-03 Thread Tejun Heo
Hello, Daniel.

On Tue, Sep 03, 2019 at 09:55:50AM +0200, Daniel Vetter wrote:
> > * While breaking up and applying control to different types of
> >   internal objects may seem attractive to folks who work day in and
> >   day out with the subsystem, they aren't all that useful to users and
> >   the siloed controls are likely to make the whole mechanism a lot
> >   less useful.  We had the same problem with cgroup1 memcg - putting
> >   control of different uses of memory under separate knobs.  It made
> >   the whole thing pretty useless.  e.g. if you constrain all knobs
> >   tight enough to control the overall usage, overall utilization
> >   suffers, but if you don't, you really don't have control over actual
> >   usage.  For memcg, what has to be allocated and controlled is
> >   physical memory, no matter how they're used.  It's not like you can
> >   go buy more "socket" memory.  At least from the looks of it, I'm
> >   afraid gpu controller is repeating the same mistakes.
> 
> We do have quite a pile of different memories and ranges, so I don't
> thinkt we're doing the same mistake here. But it is maybe a bit too

I see.  One thing which caught my eye was the system memory control.
Shouldn't that be controlled by memcg?  Is there something special
about system memory used by gpus?

> complicated, and exposes stuff that most users really don't care about.

Could be from me not knowing much about gpus but definitely looks too
complex to me.  I don't see how users would be able to allocate vram,
system memory and GART with reasonable accuracy.  memcg on cgroup2
deals with just single number and that's already plenty challenging.

Thanks.

-- 
tejun

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-08-30 Thread Tejun Heo
Hello,

I just glanced through the interface and don't have enough context to
give any kind of detailed review yet.  I'll try to read up and
understand more and would greatly appreciate if you can give me some
pointers to read up on the resources being controlled and what the
actual use cases would look like.  That said, I have some basic
concerns.

* The TTM vs. GEM distinction seems to be an internal implementation detail
  rather than anything relating to underlying physical resources.
  Provided that's the case, I'm afraid these internal constructs being
  used as primary resource control objects likely isn't the right
  approach.  Whether a given driver uses one or the other internal
  abstraction layer shouldn't determine how resources are represented
  at the userland interface layer.

* While breaking up and applying control to different types of
  internal objects may seem attractive to folks who work day in and
  day out with the subsystem, they aren't all that useful to users and
  the siloed controls are likely to make the whole mechanism a lot
  less useful.  We had the same problem with cgroup1 memcg - putting
  control of different uses of memory under separate knobs.  It made
  the whole thing pretty useless.  e.g. if you constrain all knobs
  tight enough to control the overall usage, overall utilization
  suffers, but if you don't, you really don't have control over actual
  usage.  For memcg, what has to be allocated and controlled is
  physical memory, no matter how they're used.  It's not like you can
  go buy more "socket" memory.  At least from the looks of it, I'm
  afraid gpu controller is repeating the same mistakes.

Thanks.

-- 
tejun

Re: [PATCH v2 4/4] drm/amdkfd: Check against device cgroup

2019-05-29 Thread Tejun Heo
On Wed, May 29, 2019 at 08:45:44PM +, Kuehling, Felix wrote:
> Just to clarify, are you saying that we should upstream this change 
> through Alex Deucher's amd-staging-drm-next and Dave Airlie's drm-next 
> trees?

Yeah, sure, whichever tree is the most convenient.

Thanks.

-- 
tejun

Re: [PATCH v2 4/4] drm/amdkfd: Check against device cgroup

2019-05-28 Thread Tejun Heo
Hello,

On Fri, May 17, 2019 at 08:12:17PM +, Kuehling, Felix wrote:
> Patches 1,2,4 will be submitted through amd-staging-drm-next. Patch 3 
> goes through the cgroup tree. Patch 4 depends on patch 3. So submitting 
> patch 4 will need to wait until we rebase amd-staging-drm-next on a new 
> enough kernel release that includes patch 3.
> 
> Patch 1 and 2 could be submitted now or wait for patch 3 as well so we 
> submit all our cgroup stuff in amdgpu and KFD together. It probably 
> makes most sense to wait since unused code tends to rot.
> 
> Patches 1,2,4 are already reviewed by me. Feel free to add my Acked-by 
> to patch 3.

Please feel free to add my acked-by and take patch 3 with the rest of
the patchset.

Thanks.

-- 
tejun

Re: [PATCH v2 0/4] AMDKFD (AMD GPU compute) support for device cgroup.

2019-05-28 Thread Tejun Heo
Hello,

On Fri, May 17, 2019 at 08:04:42PM +, Kasiviswanathan, Harish wrote:
> 1). Documentation for users on how to use the device cgroup with the amdkfd
> device. I have some more information on this in patch 4.

I see.  Yeah, I just missed that.

Thanks.

-- 
tejun

Re: [PATCH v2 0/4] AMDKFD (AMD GPU compute) support for device cgroup.

2019-05-17 Thread Tejun Heo
On Fri, May 17, 2019 at 04:14:52PM +, Kasiviswanathan, Harish wrote:
> The amdkfd (part of amdgpu) driver supports the AMD GPU compute stack.
> amdkfd exposes only a single device /dev/kfd even if multiple AMD GPU
> (compute) devices exist in a system. However, the amdgpu driver exposes a
> separate render device file /dev/dri/renderDN for each device. To participate
> in the device cgroup, the amdkfd driver will rely on these render device files.
> 
> v2: Exporting devcgroup_check_permission() instead of
> __devcgroup_check_permission() as per review comments.

Looks fine to me but given how non-obvious it is, some documentation
would be great.

Thanks.

-- 
tejun

Re: [RFC PATCH v2 4/5] drm, cgroup: Add total GEM buffer allocation limit

2019-05-16 Thread Tejun Heo
Hello,

I haven't gone through the patchset yet but some quick comments.

On Wed, May 15, 2019 at 10:29:21PM -0400, Kenny Ho wrote:
> Given this controller is specific to the drm kernel subsystem which
> uses minor to identify drm device, I don't see a need to complicate
> the interfaces more by having major and a key.  As you can see in the
> examples below, the drm device minor corresponds to the line number.
> I am not sure how strict cgroup upstream is about the convention but I

We're pretty strict.

> am hoping there is flexibility here to allow for what I have
> implemented.  There are a couple of other things I have done that are

So, please follow the interface conventions.  We can definitely add
new ones but that would need functional reasons.

> not described in the convention: 1) inclusion of read-only *.help file
> at the root cgroup, 2) use read-only (which I can potentially make rw)
> *.default file instead of having a default entries (since the default
> can be different for different devices) inside the control files (this
> way, the resetting of cgroup values for all the drm devices, can be
> done by a simple 'cp'.)

Again, please follow the existing conventions.  There's a lot more
harm than good in every controller being creative in their own way.
It's trivial to build convenience features in userspace.  Please do it
there.

> > Is this really useful for an administrator to control?
> > Isn't the resource we want to control actually the physical backing store?
> That's correct.  This is just the first level of control since the
> backing store can be backed by different type of memory.  I am in the
> process of adding at least two more resources.  Stay tuned.  I am
> doing the charge here to enforce the idea of "creator is deemed owner"
> at a place where the code is shared by all (the init function.)

Ideally, a controller should only control hard resources whose impact on
behavior and performance is immediately visible to users.

Thanks.

-- 
tejun

Re: [PATCH RFC 2/5] cgroup: Add mechanism to register vendor specific DRM devices

2018-11-20 Thread Tejun Heo
Hello,

On Tue, Nov 20, 2018 at 10:21:14PM +, Ho, Kenny wrote:
> By this reply, are you suggesting that vendor specific resources
> will never be acceptable to be managed under cgroup?  Let's say a user

I wouldn't say never, but whatever gets included as a cgroup
controller should have clearly defined resource abstractions and the
control schemes around them including support for delegation.  AFAICS,
gpu side still seems to have a long way to go (and it's not clear
whether that's somewhere it will or needs to end up).

> want to have similar functionality as what cgroup is offering but to
> manage vendor specific resources, what would you suggest as a
> solution?  When you say keeping vendor specific resource regulation
> inside drm or specific drivers, do you mean we should replicate the
> cgroup infrastructure there or do you mean either drm or specific
> driver should query existing hierarchy (such as device or perhaps
> cpu) for the process organization information?
> 
> To put the questions in more concrete terms, let's say a user wants to
> expose certain part of a gpu to a particular cgroup similar to the
> way selective cpu cores are exposed to a cgroup via cpuset, how
> should we go about enabling such functionality?

Do what the intel driver or bpf is doing?  It's not difficult to hook
into cgroup for identification purposes.
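
Identification-only hooking, in a minimal sketch against today's kernels
(no controller, no new interface; the driver keys its own policy off the
task's v2 cgroup ID):

  #include <linux/cgroup.h>
  #include <linux/rcupdate.h>

  static u64 current_cgroup_id(void)
  {
          u64 id;

          rcu_read_lock();
          /* cgroup on the default (v2) hierarchy */
          id = cgroup_id(task_dfl_cgroup(current));
          rcu_read_unlock();
          return id;
  }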

Thanks.

-- 
tejun


Re: [PATCH RFC 2/5] cgroup: Add mechanism to register vendor specific DRM devices

2018-11-20 Thread Tejun Heo
Hello,

On Tue, Nov 20, 2018 at 01:58:11PM -0500, Kenny Ho wrote:
> Since many parts of the DRM subsystem has vendor-specific
> implementations, we introduce mechanisms for vendor to register their
> specific resources and control files to the DRM cgroup subsystem.  A
> vendor will register itself with the DRM cgroup subsystem first before
> registering individual DRM devices to the cgroup subsystem.
> 
> In addition to the cgroup_subsys_state that is common to all DRM
> devices, a device-specific state is introduced and it is allocated
> according to the vendor of the device.

So, I'm still pretty negative about adding drm controller at this
point.  There isn't enough of common resource model defined yet and
until that gets sorted out I think it's in the best interest of
everyone involved to keep it inside drm or specific driver proper.

Thanks.

-- 
tejun