drm: Expose memory stats

Tvrtko Ursulin Thu, 27 Jul 2023 10:08:43 -0700


On 27/07/2023 12:54, Maarten Lankhorst wrote:

Hey,

On 2023-07-26 13:41, Tvrtko Ursulin wrote:
On 26/07/2023 11:14, Maarten Lankhorst wrote:
Hey,

On 2023-07-22 00:21, Tejun Heo wrote:
On Wed, Jul 12, 2023 at 12:46:04PM +0100, Tvrtko Ursulin wrote:
   $ cat drm.memory.stat
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
Data is generated on demand for simplicity of implementation, i.e. no running
totals are kept or accounted during migrations and such. Various
optimisations such as cheaper collection of data are possible but
deliberately left out for now.
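
For anyone consuming this file from userspace, here is a minimal Python sketch of parsing the output shown above. It assumes the format is exactly as in the sample (a card name followed by space-separated key=value pairs, one region per line); the function name parse_drm_memory_stat is made up for illustration and is not part of the series.

```python
def parse_drm_memory_stat(text):
    """Parse drm.memory.stat style lines into {(card, region): {key: int}}.

    Assumes each line looks like:
        card0 region=system total=... shared=... active=... resident=... purgeable=...
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        card = fields[0]
        pairs = dict(field.split("=", 1) for field in fields[1:])
        region = pairs.pop("region")
        stats[(card, region)] = {key: int(value) for key, value in pairs.items()}
    return stats


# Sample taken from the cover letter quoted above.
SAMPLE = """\
card0 region=system total=12898304 shared=0 active=0 resident=12111872 purgeable=167936
card0 region=stolen-system total=0 shared=0 active=0 resident=0 purgeable=0
"""

stats = parse_drm_memory_stat(SAMPLE)
```

Since the kernel generates the data on demand, a monitoring tool would simply re-read and re-parse the file at whatever interval it needs.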

Overall, the feature is deemed to be useful to container orchestration
software (and manual management).
Limits, either soft or hard, are not envisaged to be implemented on top of
this approach due to the on-demand nature of collecting the stats.
So, yeah, if you want to add memory controls, we better think through how
the fd ownership migration should work.
I've taken a look at the series, since I have been working on cgroup memory eviction.
The scheduling stuff will work for i915, since it has a purely software execlist scheduler, but I don't think it will work for GuC (firmware) scheduling or other drivers that use the generic drm scheduler.
It actually works - I used to have a blurb in the cover letter about it but apparently I dropped it. Just a bit less well with many clients, since there are fewer priority levels.
All that the design requires from the individual drivers is some way to react to the "you are over budget by this much" signal. The rest is driver and backend specific.
What I mean is that this signal may not be applicable since the drm scheduler just schedules jobs that run. Adding a weight might be done in hardware, since it's responsible for scheduling which context gets to run. The over budget signal is useless in that case, and you just need to set a scheduling priority for the hardware instead.

The over budget callback lets the driver know its assigned budget and its current utilisation. Already with that data drivers could implement something smarter than what I did in my RFC. So I don't think the callback is completely useless even for some smarter implementation which potentially ties into firmware scheduling.


Anyway, I maintain these are implementation details.

For something like this, you would probably want it to work inside the drm scheduler first. Presumably, this can be done by setting a weight on each runqueue, and perhaps adding a callback to update one for a running queue. Calculating the weights hierarchically might be fun..
It is not needed to work in drm scheduler first. In fact drm scheduler based drivers can plug into what I have since it already has the notion of scheduling priorities.
They would only need to implement a hook which allows the cgroup controller to query client GPU utilisation and another to receive the over budget signal.
Amdgpu and msm AFAIK could be easy candidates because they both support per client utilisation and priorities.
Looks like I need to put all this info back into the cover letter.
Also, hierarchic weights and time budgets are all already there. What could be done later is make this all smarter and respect the time budget with more precision. That would however, in many cases including Intel, require co-operation with the firmware. In any case it is only work in the implementation, while the cgroup control interface remains the same.
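
To make the "hierarchic weights and time budgets" idea concrete, here is a toy Python sketch of how a period's GPU time could be split down a cgroup tree by weight, and how an "over budget by this much" value could be derived from measured usage. This is not the RFC code: the Group class, assign_budgets, over_budget and the 1000 us period are all assumptions made up for illustration, and only leaf groups carry usage to keep it short.

```python
class Group:
    """A node in a hypothetical cgroup tree with a scheduling weight."""

    def __init__(self, name, weight, parent=None):
        self.name = name
        self.weight = weight
        self.children = []
        self.usage_us = 0  # GPU time consumed in the current period
        if parent is not None:
            parent.children.append(self)


def assign_budgets(group, budget_us, out):
    """Split budget_us among children in proportion to their weights."""
    out[group.name] = budget_us
    total_weight = sum(child.weight for child in group.children)
    for child in group.children:
        assign_budgets(child, budget_us * child.weight // total_weight, out)
    return out


def over_budget(groups, budgets):
    """Return {group name: microseconds over budget} for offending groups."""
    return {
        g.name: g.usage_us - budgets[g.name]
        for g in groups
        if g.usage_us > budgets[g.name]
    }


root = Group("root", 100)
a = Group("a", 100, root)
b = Group("b", 300, root)
budgets = assign_budgets(root, 1000, {})  # arbitrary 1000 us period

a.usage_us, b.usage_us = 400, 500
signals = over_budget([a, b], budgets)  # what a driver hook would receive
```

What the driver does with the signal (lower a priority, skip throttling, talk to firmware) is left out on purpose, since that is the driver and backend specific part.
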
I have taken a look at how the rest of cgroup controllers change ownership when moved to a different cgroup, and the answer was: not at all. If we attempt to create the scheduler controls only on the first time the fd is used, you could probably get rid of all the tracking.
Can you send a CPU file descriptor from process A to process B and have CPU usage belonging to process B show up in process A's cgroup, or vice-versa? Nope, I am not making any sense, am I? My point being it is not like-to-like, model is different.
No ownership transfer would mean in wide deployments all GPU utilisation would be assigned to Xorg and so there is no point to any of this. No way to throttle a cgroup with un-important GPU clients for instance.
If you just grab the current process' cgroup when a drm_sched_entity is created, you don't have everything charged to X.org. No need for complicated ownership tracking in drm_file. The same equivalent should be done in i915 as well when a context is created as it's not using the drm scheduler.
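
A toy Python model of the proposal above, to show why charging at context creation avoids both ownership tracking and the "everything is charged to Xorg" problem. The classes and the charge helper are hypothetical stand-ins, not kernel structures: Context loosely plays the role of a drm_sched_entity (or an i915 context) and DrmFile the role of drm_file.

```python
from collections import defaultdict


class Context:
    """Stand-in for an execution context; remembers its creator's cgroup."""

    def __init__(self, creator_cgroup):
        self.cgroup = creator_cgroup  # captured once, never migrated


class DrmFile:
    """Stand-in for an open DRM fd; carries no cgroup ownership of its own."""

    def __init__(self):
        self.contexts = []

    def create_context(self, current_cgroup):
        ctx = Context(current_cgroup)
        self.contexts.append(ctx)
        return ctx


def charge(usage, ctx, gpu_time_us):
    """Account GPU time to the cgroup captured at context creation."""
    usage[ctx.cgroup] += gpu_time_us


usage = defaultdict(int)

fd = DrmFile()                        # opened by the display server
server_ctx = fd.create_context("xorg")
client_ctx = fd.create_context("app")  # same fd, after being passed to a client

charge(usage, server_ctx, 10)
charge(usage, client_ctx, 90)
```

The fd itself can be handed around freely; only the cgroup of whoever creates each context matters.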

Okay so essentially nuking the concept that a DRM client belongs to one cgroup and instead tracking at the context level. That is an interesting idea. I suspect implementation could require somewhat generalizing the concept of an "execution context", or at least expressing it via the DRM cgroup controller.

I can give this a spin, or at least some more detailed thought, once we close on a few more details regarding charging in general.

This can be done very easily with the drm scheduler.
WRT memory, I think the consensus is to track system memory like normal memory. Stolen memory doesn't need to be tracked. It's kernel only memory, used for internal bookkeeping only.
The only time userspace can directly manipulate stolen memory, is by mapping the pinned initial framebuffer to its own address space. The only allocation it can do is when a framebuffer is displayed, and framebuffer compression creates some stolen memory. Userspace is not
aware of this though, and has no way to manipulate those contents.
Stolen memory is irrelevant and not something cgroup controller knows about. Point is drivers say which memory regions they have and their utilisation.
Imagine instead of stolen it said vram0, or on Intel multi-tile it shows local0 and local1. People working with containers are interested to see this breakdown. I guess the parallel and use case here is closer to memory.numa_stat.
Correct, but for the same reason, I think it might be more useful to split up the weight too.
A single scheduling weight for the global GPU might be less useful than per engine, or per tile perhaps..

Yeah, there is some complexity there for sure and could be a larger write up. In short per engine stuff tends to work out in practice as is given how each driver can decide upon receiving the signal what to do.

In the i915 RFC for instance if it gets "over budget" signal from the group, but it sees that the physical engines belonging to this specific GPU are not over-subscribed, it simply omits any throttling. Which in practice works out fine for two clients competing for different engines. Same would be for multiple GPUs (or tiles with our stuff) in the same cgroup.
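
The decision described above can be sketched roughly as follows. This is only an illustration in Python, not the i915 RFC code: the function name, the busy-fraction input and the 0.95 threshold used to decide that an engine is "over-subscribed" are all assumptions.

```python
def should_throttle(over_budget_us, engine_busy_fraction, threshold=0.95):
    """Decide whether a driver acts on an "over budget" signal.

    over_budget_us: how far over budget the group is (<= 0 means within budget).
    engine_busy_fraction: {engine name: busy fraction in [0, 1]} for this GPU.
    threshold: hypothetical cut-off above which an engine counts as
               over-subscribed.
    """
    if over_budget_us <= 0:
        return False
    # Only throttle when some physical engine is actually contended;
    # otherwise nobody else is being starved and throttling gains nothing.
    return any(busy >= threshold for busy in engine_busy_fraction.values())


# Two clients on different, lightly loaded engines: signal is ignored.
idle_case = should_throttle(500, {"rcs0": 0.4, "vcs0": 0.3})

# Same signal, but the render engine is saturated: throttle.
busy_case = should_throttle(500, {"rcs0": 1.0, "vcs0": 0.3})
```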

Going back to the single scheduling weight or more fine grained. We could choose to follow for instance io.weight format? Start with drm.weight being "default 1000" and later extend to per card (or more):


"""
default 100
card0 20
card1 50
"""

In this case I would convert drm.weight to this format straight away for the next respin, just wouldn't support per card just yet.
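
As an illustration of how the io.weight-like format above could be consumed, here is a small Python sketch. It assumes one "key value" pair per line with an optional "default" entry, as in the example; parse_drm_weight and weight_for are hypothetical helper names, and the per-card override is the possible later extension, not something the current series supports.

```python
def parse_drm_weight(text):
    """Parse 'default N' and 'cardX N' lines into (default, per_card dict)."""
    default = None
    per_card = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        key, value = parts[0], int(parts[1])
        if key == "default":
            default = value
        else:
            per_card[key] = value
    return default, per_card


def weight_for(card, default, per_card):
    """Per-card weight if present, otherwise the default."""
    return per_card.get(card, default)


# Example contents copied from the message above.
EXAMPLE = """\
default 100
card0 20
card1 50
"""

default, per_card = parse_drm_weight(EXAMPLE)
```

With only the "default" line present, every card simply resolves to the default, which matches starting with a single weight and extending later.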


Regards,

Tvrtko

Re: [PATCH 16/17] cgroup/drm: Expose memory stats
