On 3/2/26 20:35, T.J. Mercier wrote:
> On Mon, Mar 2, 2026 at 7:51 AM Christian König <[email protected]> wrote:
>>
>> On 3/2/26 16:40, Shakeel Butt wrote:
>>> +TJ
>>>
>>> On Mon, Mar 02, 2026 at 03:37:37PM +0100, Christian König wrote:
>>>> On 3/2/26 15:15, Shakeel Butt wrote:
>>>>> On Wed, Feb 25, 2026 at 10:09:55AM +0100, Christian König wrote:
>>>>>> On 2/24/26 20:28, Dave Airlie wrote:
>>>>> [...]
>>>>>>
>>>>>>> This has been a pain in the ass for desktop for years, and I'd like to
>>>>>>> fix it; the HPC use case is purely a driver for me doing the work.
>>>>>>
>>>>>> Wait a second. How does accounting to cgroups help with that in any way?
>>>>>>
>>>>>> The last time I looked into this problem, the OOM killer worked based on
>>>>>> the per-task_struct stats, which couldn't be influenced this way.
>>>>>>
>>>>>
>>>>> It depends on the context of the oom-killer. If the oom-killer is
>>>>> triggered due to memcg limits, then only the processes in the scope of
>>>>> that memcg will be targeted by the oom-killer. With the specific setting,
>>>>> the oom-killer can kill all the processes in the target memcg.
>>>>>
>>>>> However, nowadays userspace oom-killers are preferred over the kernel
>>>>> oom-killer due to their flexibility and configurability. Userspace
>>>>> oom-killers like systemd-oomd, Android's LMKD or fb-oomd are being used
>>>>> in containerized environments. Such oom-killers look at memcg stats, and
>>>>> hiding something from memcg, i.e. not charging it to memcg, will hide
>>>>> that usage from these oom-killers.
>>>>
>>>> Well, exactly that's the problem. Android's oom killer is *not* using
>>>> memcg, exactly because of this inflexibility.
>>>
>>> Are you sure Android's oom killer is not using memcg? From what I see in
>>> the documentation [1], it requires memcg.
>
> LMKD used to use memcg v1 for memory.pressure_level, but that has been
> replaced by PSI, which is now the default configuration.
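(As an aside, the "specific setting" mentioned above is, on a cgroup v2 hierarchy, the memory.oom.group knob. A minimal sketch of flipping it, assuming the usual /sys/fs/cgroup mount and sufficient privileges; the function name and the `root` parameter are mine, the latter only so the sketch can be exercised against a scratch directory:)

```python
from pathlib import Path

def enable_group_oom_kill(cgroup: str, root: str = "/sys/fs/cgroup") -> None:
    """Set memory.oom.group=1 so that an OOM kill triggered by this memcg's
    limit takes down every task in the cgroup (and its descendants) instead
    of picking a single victim task."""
    Path(root, cgroup, "memory.oom.group").write_text("1\n")
```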
> I deprecated all configurations with memcg v1 dependencies in January. We
> plan to remove the memcg v1 support from LMKD when the 5.10 and 5.15
> kernels reach EOL.
>
>> My bad, I should have worded that better.
>>
>> The Android OOM killer is not using memcg for tracking GPU memory
>> allocations, because memcg doesn't have proper support for tracking shared
>> buffers.
>>
>> In other words, GPU memory allocations are shared by design, and it is the
>> norm that the process which is using a buffer is not the process which
>> allocated it.
>>
>> What we would need (as a start) to handle all of this with memcg would be
>> to account the resources to the process which referenced them, not the one
>> which allocated them.
>>
>> I can give a full list of requirements which would be needed by cgroups to
>> cover all the different use cases, but it basically means tons of extra
>> complexity.
>
> Yeah, this is right. We usually prioritize fast kills rather than picking
> the biggest offender, though. Application state (foreground/background) is
> the primary selector; however, LMKD does have a mode (kill_heaviest_task)
> where it will pick the largest task within a group of apps sharing the same
> application state. For this it uses RSS from /proc/<pid>/statm, and
> (prepare to avert your eyes) a new and out-of-tree interface in procfs for
> accounting dmabufs used by a process. It tracks FD references and map
> references as they come and go, and only counts any buffer once for a
> process regardless of the number and type of references a process has to
> the same buffer. I dislike it greatly.
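(For reference, both mainline inputs mentioned above are plain procfs reads: RSS is the second field of /proc/<pid>/statm, in pages, and dma-buf fds can be recognized by the exp_name key that the dma-buf core publishes in /proc/<pid>/fdinfo/<fd>. A minimal sketch, function names mine; note the fdinfo scan only sees fd references, not map-only references, which is exactly the gap the out-of-tree interface fills:)

```python
import os
from pathlib import Path

def rss_bytes_from_statm(statm_text: str, page_size: int) -> int:
    """statm fields (all in pages): size resident shared text lib data dt.
    RSS is the second field."""
    return int(statm_text.split()[1]) * page_size

def rss_bytes(pid: int) -> int:
    """RSS of a process, read the way LMKD does, from /proc/<pid>/statm."""
    return rss_bytes_from_statm(Path(f"/proc/{pid}/statm").read_text(),
                                os.sysconf("SC_PAGE_SIZE"))

def parse_fdinfo(text: str) -> dict:
    """Parse the 'key:<tab>value' lines of a /proc/<pid>/fdinfo/<fd> file."""
    fields = {}
    for line in text.splitlines():
        key, sep, val = line.partition(":")
        if sep:
            fields[key.strip()] = val.strip()
    return fields

def dmabuf_fds(pid: int):
    """Yield (fd, size_bytes, exporter) for each dma-buf fd of a process.

    dma-buf fdinfo carries 'size', 'count' and 'exp_name' keys; ordinary
    fds don't, so 'exp_name' doubles as the dma-buf detector. Buffers that
    are only mmap'ed, with no fd held, are invisible here."""
    for entry in Path(f"/proc/{pid}/fdinfo").iterdir():
        fields = parse_fdinfo(entry.read_text())
        if "exp_name" in fields:
            yield int(entry.name), int(fields["size"]), fields["exp_name"]
```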
*sigh* I was really hoping that we would have nailed it with the BPF support
for DMA-buf and would not have to rely on out-of-tree stuff any more.

We should really stop re-inventing the wheel over and over again, fix the
shortcomings cgroups has instead, and then use that.

> My original intention was to use the dmabuf BPF iterator we added to scan
> the maps and FDs of a process for dmabufs on demand. Very simple and pretty
> fast in BPF. This wouldn't support high-watermark tracking, so I was forced
> into doing something else for per-process accounting. To be fair, the HWM
> tracking has detected a few application bugs where 4GB of system memory was
> inadvertently consumed by dmabufs.
>
> The BPF iterator is currently used to support accounting of buffers not
> visible in userspace (dmabuf_dump / libdmabufinfo), and it's a nice
> improvement for that over the old sysfs interface. I hope to replace the
> slow scanning of procfs for dmabufs in libdmabufinfo with BPF programs that
> use the dmabuf iterator, but that's not a priority for this year.
>
> Independent of all of that, memcg doesn't really work well for this because
> it's shared memory that can only be attributed to a single memcg, and the
> most common allocator (gralloc) lives in a separate process and memcg than
> the processes using the buffers (camera, YouTube, etc.). I had a few
> patches that transferred the ownership of buffers to a new memcg when they
> were sent via Binder, but this used the memcg v1 charge-moving
> functionality, which is now gone because it was so complicated. But that
> only works if there is one user that should be charged for the buffer
> anyway. What if it is shared by multiple applications and services?

Well, the "usual" (e.g.
what you find in the literature and what other operating systems do) approach
is to use a proportional set size instead of the resident set size:

https://en.wikipedia.org/wiki/Proportional_set_size

The problem is that a proportional set size is usually harder to come by. So
it means additional overhead, more complex interfaces, etc.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>>
>>> [1] https://source.android.com/docs/core/perf/lmkd
>>>
>>>>
>>>> See the multiple iterations we already had on that topic, even including
>>>> reverting already-upstream uAPI.
>>>>
>>>> The latest incarnation is that BPF is used for this task on Android.
>>>>
>>>> Regards,
>>>> Christian.
>>
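(P.S.: the proportional-set-size idea mentioned above fits in a few lines: each physical page charges 1/N to each of the N processes mapping it, so private pages count fully, shared pages are split among their users, and summing PSS over all processes recovers the true total. A toy model, not a real /proc/<pid>/smaps parser; the function name is mine:)

```python
def pss_pages(share_counts) -> float:
    """PSS for one process, in pages: a page mapped by N processes
    contributes 1/N, unlike RSS, which counts every mapped page fully
    in every process that maps it."""
    return sum(1.0 / n for n in share_counts)

# Two private pages, one page shared with one other process, and one page
# shared among four processes: 1 + 1 + 1/2 + 1/4 = 2.75 pages of PSS,
# versus 4 pages of RSS.
```

The extra cost is visible even in the toy: computing this for real requires knowing each page's map count, which is why PSS needs an smaps-style walk (or new interfaces) rather than a cheap per-task counter.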
