Re: [PATCH 0/4] cgroup/dmem, drm/ttm: Improve protection in contended cases

Natalie Vock Mon, 15 Sep 2025 06:44:46 -0700

On 9/15/25 15:23, Christian König wrote:

On 15.09.25 15:17, Natalie Vock wrote:

On 9/15/25 14:48, Christian König wrote:

On 15.09.25 14:36, Natalie Vock wrote:

Hi all,


I've been looking into some cases where dmem protection fails to prevent
allocations from ending up in GTT when VRAM gets scarce and apps start
competing hard.

In short, this is because other (unprotected) applications end up
filling VRAM before protected applications do. This causes TTM to back
off and try allocating in GTT before anything else, and that is where
the allocation is placed in the end. The existing eviction protection
cannot prevent this, because no attempt at evicting is ever made
(although you could consider the backing-off as an immediate eviction to
GTT).


Well depending on what you gave as GEM flags from userspace that is expected 
behavior.

For applications using RADV we usually give GTT|VRAM as placement which 
basically tells the kernel that it shouldn't evict at all and immediately 
fall back to GTT.


Yeah, in general this behavior is completely expected - though I'd argue that 
protecting VRAM via dmemcg influences the semantics a little here.

Giving GTT|VRAM as placement from userspace essentially says "ok, please try allocating this 
in VRAM, but it's ok to fall back to GTT" - whereas specifying VRAM only essentially says 
"ok, please allocate this in VRAM, and really try hard to keep it in VRAM whatever the 
cost".

Usually, resource allocation failing is good enough of an indicator that it's 
not possible to allocate in VRAM. However, when the application's memory is 
protected by dmemcg, it essentially says that it actually should be possible to 
allocate up to that amount of memory - the cgroup is entitled to that memory, 
and the other unprotected cgroups have to make do with the rest.

I think it's a justifiable tradeoff between the intended function of VRAM|GTT 
and the intended function of dmem memory protection to evict these unprotected 
cgroups for only as long as the usage doesn't exceed the awarded protection - 
this is what this series implements (dropping the GTT flag in userspace would 
have negative implications in the case the app uses more memory than the 
protection afforded to it, and as I described, just letting protected memory 
allocations fall back to GTT is insufficient too).


Yeah, that is a really good point and the argumentation makes sense.

So the semantics of dmem should be that it basically guarantees that if 
the application requested x amount of VRAM that it gets at least x amount of 
VRAM.

The problem is where is that documented? This is part of UAPI so we pretty much 
need to nail down what should happen before we enforce it.


We do have (arguably terse) dmem cgroup documentation[1].

Perhaps we could outline this more explicitly, but the documentation for the
dmem.min/dmem.low interface files (that govern memory protection) mentions
that the same semantics as for memcg's memory.min/memory.low apply.


From memcg documentation there is this, for memory.low:
> Best-effort memory protection. If the memory usage of a cgroup is
> within its effective low boundary, the cgroup’s memory won’t be
> reclaimed unless there is no reclaimable memory available in
> unprotected cgroups.

dmem covers eviction instead of CPU memory reclaim, but if you
s/reclaim/evict/g in the doc text, you'll get exactly the behavior we
currently do: We first try to evict all unprotected memory, and memory
protected by dmem.low is only touched if we need to evict even more.

Put shortly, an (effective) dmem.low value of x guarantees that you will
get at least x bytes of vram, save for cases where the kernel really
really has to evict and has no other choice.
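
To make the two-pass ordering described above concrete, here is a toy Python
model. It is not kernel code: the Cgroup class, plan_eviction() and all byte
counts are made up for illustration. It assumes a flat set of cgroups and
uses plain numbers in place of the effective low/min values the kernel
derives from the hierarchy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cgroup:
    name: str
    usage: int         # bytes of VRAM currently charged to this cgroup
    prot_low: int = 0  # stands in for an effective dmem.low value
    prot_min: int = 0  # stands in for an effective dmem.min value


def plan_eviction(cgroups, needed):
    """Return {cgroup name: bytes to evict} freeing `needed` bytes, or None."""
    plan = {}
    remaining = needed
    # Pass 1 respects both low and min; pass 2 respects only min.
    floors = (
        lambda cg: max(cg.prot_low, cg.prot_min),
        lambda cg: cg.prot_min,
    )
    for floor in floors:
        for cg in cgroups:
            if remaining <= 0:
                return plan
            taken = plan.get(cg.name, 0)
            evictable = max(0, cg.usage - taken - floor(cg))
            take = min(evictable, remaining)
            if take:
                plan[cg.name] = taken + take
                remaining -= take
    # Not enough evictable memory even after touching low-protected usage.
    return plan if remaining <= 0 else None


game = Cgroup("game", usage=600, prot_low=800)
background = Cgroup("background", usage=400)

print(plan_eviction([game, background], 300))  # only "background" is touched
print(plan_eviction([game, background], 500))  # "game" is touched as a last resort
```

In this model, memory covered by prot_min is never selected, so a request
that can only be satisfied from min-protected usage yields None.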

There is also memory.min and dmem.min, which prevents eviction under
*any* circumstance. Of course, setting either of these (both dmem and
memcg) to a high value is dangerous because it may cause important
things to crash from OOMs.

A lot of other things from the memcg documentation don't really apply to
dmem though (for example, we can't do evictions proportional to how much
unprotected memory apps are using), so you're probably right that this
documentation could be much improved. :)


Thanks,
Natalie

[1] https://docs.kernel.org/admin-guide/cgroup-v2.html#dmem


Regards,
Christian.


Thanks,
Natalie


Regards,
Christian.


This series tries to alleviate this by adding a special case when the
allocation is protected by cgroups: Instead of backing off immediately,
TTM will try evicting unprotected buffers from the domain to make space
for the protected one. This ensures that applications can actually use
all the memory protection awarded to them by the system, without being
prone to ping-ponging (only protected allocations can evict unprotected
ones, never the other way around).
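
The special case described above can be sketched as a small decision
function. This is a toy Python model and not the TTM code path: place() and
all of its parameters are hypothetical, and it collapses min/low into a
single "protection" number for brevity.

```python
def place(size, vram_free, usage, protection, unprotected_evictable):
    """Decide where a VRAM|GTT allocation lands in this toy model.

    Returns (placement, bytes_evicted_from_unprotected_cgroups).
    """
    if size <= vram_free:
        # Uncontended case: the allocation simply fits.
        return ("VRAM", 0)

    # Contended case: only evict if the cgroup stays within its protection
    # once this allocation is charged to it.
    within_protection = usage + size <= protection
    shortfall = size - vram_free
    if within_protection and shortfall <= unprotected_evictable:
        return ("VRAM", shortfall)

    # Unprotected, over its protection, or nothing left to evict:
    # back off to GTT as before.
    return ("GTT", 0)


print(place(100, 50, usage=200, protection=512, unprotected_evictable=300))
print(place(100, 50, usage=500, protection=512, unprotected_evictable=300))
```

Because an allocation from a cgroup without protection never satisfies the
within_protection check, it cannot trigger eviction in this model, which is
the property that avoids ping-ponging.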

The first two patches just add a few small utilities needed to implement
this to the dmem controller. The last two patches are the TTM
implementation:

"drm/ttm: Be more aggressive..." decouples cgroup charging from resource
allocation to allow us to hold on to the charge even if allocation fails
on first try, and adds a path to call ttm_bo_evict_alloc when the
charged allocation falls within min/low protection limits.

"drm/ttm: Use common ancestor..." is a more general improvement in
correctly implementing cgroup protection semantics. With recursive
protection rules, unused memory protection afforded to a parent node is
transferred to children recursively, which helps protect entire
subtrees from stealing each other's memory without needing to protect
each cgroup individually. This doesn't apply when considering direct
siblings inside the same subtree, so in order to not break
prioritization between these siblings, we need to consider the
relationship of evictor and evictee when calculating protection.
In practice, this fixes cases where a protected cgroup cannot steal
memory from unprotected siblings (which, in turn, leads to eviction
failures and new allocations being placed in GTT).
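
For readers unfamiliar with the idea, finding the common ancestor of evictor
and evictee in a cgroup tree can be sketched as follows. This is a toy Python
model, not the dmem_cgroup_common_ancestor helper from the series: the Node
class and the example hierarchy are made up, and treating a node as its own
ancestor is an assumption of this sketch.

```python
class Node:
    """A cgroup in a toy hierarchy, linked to its parent only."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def ancestors(node):
    """Return the chain from `node` up to the root, starting with `node`."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def common_ancestor(a, b):
    """Return the deepest node that is on both a's and b's path to the root."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):  # walks upward, so the first hit is the deepest
        if id(n) in seen:
            return n
    return None  # the nodes belong to different trees


root = Node("root")
games = Node("games", root)
game_a = Node("game_a", games)
game_b = Node("game_b", games)
other = Node("other", root)

print(common_ancestor(game_a, game_b).name)  # games
print(common_ancestor(game_a, other).name)   # root
```

In this picture, an eviction between game_a and game_b would be judged
relative to "games", whereas an eviction between game_a and other would be
judged relative to "root".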

Thanks,
Natalie

Signed-off-by: Natalie Vock <[email protected]>
---
Natalie Vock (4):
        cgroup/dmem: Add queries for protection values
        cgroup/dmem: Add dmem_cgroup_common_ancestor helper
        drm/ttm: Be more aggressive when allocating below protection limit
        drm/ttm: Use common ancestor of evictor and evictee as limit pool

   drivers/gpu/drm/ttm/ttm_bo.c       | 79 ++++++++++++++++++++++++++++++++------
   drivers/gpu/drm/ttm/ttm_resource.c | 48 ++++++++++++++++-------
   include/drm/ttm/ttm_resource.h     |  6 ++-
   include/linux/cgroup_dmem.h        | 25 ++++++++++++
   kernel/cgroup/dmem.c               | 73 +++++++++++++++++++++++++++++++++++
   5 files changed, 205 insertions(+), 26 deletions(-)
---
base-commit: f3e82936857b3bd77b824ecd2fa7839dd99ec0c6
change-id: 20250915-dmemcg-aggressive-protect-5cf37f717cdb

Best regards,

Re: [PATCH 0/4] cgroup/dmem, drm/ttm: Improve protection in contended cases
