On 3/2/26 18:16, Shakeel Butt wrote:
> On Mon, Mar 02, 2026 at 04:51:12PM +0100, Christian König wrote:
>> On 3/2/26 16:40, Shakeel Butt wrote:
>>> +TJ
>>>
>>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>>>> On 3/2/26 15:15, Shakeel Butt wrote:
>>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>>>> [...]
>>>>>>
>>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>>>> fix it; the HPC use case is purely a driver for me doing the work.
>>>>>>
>>>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>>>
>>>>>> The last time I looked into this problem the OOM killer worked based on
>>>>>> the per task_struct stats, which couldn't be influenced this way.
>>>>>>
>>>>>
>>>>> It depends on the context of the oom-killer. If the oom-killer is
>>>>> triggered due to memcg limits then only the processes in the scope of
>>>>> the memcg will be targeted by the oom-killer. With the specific setting,
>>>>> the oom-killer can kill all the processes in the target memcg.
>>>>>
>>>>> However, nowadays the userspace oom-killer is preferred over the kernel
>>>>> oom-killer due to flexibility and configurability. Userspace oom-killers
>>>>> like systemd-oomd, Android's LMKD or fb-oomd are being used in
>>>>> containerized environments. Such oom-killers look at memcg stats, and
>>>>> hiding something from memcg, i.e. not charging it to memcg, will hide
>>>>> such usage from these oom-killers.
>>>>
>>>> Well, exactly that's the problem. Android's oom killer is *not* using
>>>> memcg, exactly because of this inflexibility.
>>>
>>> Are you sure Android's oom killer is not using memcg? From what I see in
>>> the documentation [1], it requires memcg.
>>
>> My bad, I should have worded that better.
>>
>> The Android OOM killer is not using memcg for tracking GPU memory
>> allocations, because memcg doesn't have proper support for tracking shared
>> buffers.
>
> Yes, indeed memcg is bad with buffers shared between memcgs (shmem, shared
> filesystems).
My big concern is that we create uAPI which we then (again) find to be a
blocker for further development 6 months later and have to stick with. That
has happened before, and the only reason we could remove the initial DMA-buf
sysfs uAPI (for example) was that Greg and T.J. agreed that the interface is
not something we can carry into the future.

>>
>> In other words, GPU memory allocations are shared by design and it is the
>> norm that the process which is using it is not the process which has
>> allocated it.
>
> Here the GPU memory can be system memory or the actual memory on GPU, right?

For embedded and mobile GPUs (Android) or APUs (modern laptops) it is system
memory. For dGPUs in desktop environments it is mostly device memory on the
GPU, with system memory used as swap. In HPC, cloud computing and automotive
use cases you have a mixture of both system memory and device memory.

> I think I discussed with TJ the possibility of moving the allocations into
> the context of the process using them through a custom fault handler in GPU
> drivers. I don't remember the conclusion, but I am assuming that is not
> possible.

Most HW still uses pre-allocated memory and can't do things like fault on
demand. That's why we allocate everything at once on creation, or at most on
first use.

But allocate-on-first-use has big potential for security problems. E.g.
imagine you create a 1GiB buffer and send it to your display server as window
content; the display server would be charged because it is the first one
touching the memory, but you keep a reference to it for yourself.

For Android, the only way out which has functionality similar to the BPF
approach is to move the charge when the file descriptor used as reference for
the memory is transferred between processes.

For some automotive use cases it is even worse. To fully handle those,
multiple different cgroups in the same process would be needed, e.g.
different cgroups for different threads and/or client handles in the same
QEMU process.
Long story short: it is a mess.

>>
>> What we would need (as a start) to handle all of this with memcg would be
>> to account the resources to the process which referenced them and not the
>> one which allocated them.
>
> Irrespective of the memcg charging decision, one of my requests would be to
> at least have the global counters for GPU memory which this series is
> adding. That would be very similar to NR_KERNEL_FILE_PAGES, where we
> explicitly opt out of memcg charging but keep the global counter, so the
> admin can identify the reasons behind high unaccounted memory on the
> system.

Sounds reasonable to me. I will try to give this set another review round.

Regards,
Christian.

>
>>
>> I can give a full list of requirements which would be needed by cgroups to
>> cover all the different use cases, but it basically means tons of extra
>> complexity.
>>
>> Regards,
>> Christian.
>>
>>>
>>> [1] https://source.android.com/docs/core/perf/lmkd
>>>
>>>>
>>>> See the multiple iterations we already had on that topic, even including
>>>> reverting already-upstream uAPI.
>>>>
>>>> The latest incarnation is that BPF is used for this task on Android.
>>>>
>>>> Regards,
>>>> Christian.
>>
