Hello Thomas, Den 2026-05-26 kl. 11:24, skrev Thomas Hellström: > Hi, Maarten, > > Thanks for reviewing. > > On Tue, 2026-05-26 at 10:27 +0200, Maarten Lankhorst wrote: >> Hello, >> >> Den 2026-05-12 kl. 10:24, skrev Thomas Hellström: >>> Add an optional reclaim callback to struct dmem_cgroup_region. When >>> dmem.max is set below the current usage of a cgroup pool, the new >>> limit >>> is applied immediately (so that concurrent allocations are >>> throttled >>> while reclaim is in progress) and then the driver is asked to evict >>> memory to bring usage back below the limit. >>> >>> Reclaim is attempted up to a bounded number of times. No error is >>> returned to userspace if usage remains above the limit after >>> reclaim, >>> and a pending signal will abort the reclaim loop early. This >>> matches >>> the behavior of memory.max in the memory cgroup controller. >>> >>> Also honor O_NONBLOCK so that if that flag is set during the >>> max value write, no reclaim is initiated. The idea is to avoid >>> charging the reclaim cost to the writer of the max value. >>> >>> v2: >>> - Write max before reclaim is attempted (Maarten) >>> - Let signals abort the reclaim without error (Maarten) >>> - If a new max value is written with the O_NONBLOCK flag, >>> reclaim is not attempted (Maarten) >>> - Extract region from the pool parameter rather than >>> passing it explicitly to set_resource_xxx(). >>> v3: >>> - Use an rwsem to protect reclaim callback registration and >>> region unregister against concurrent reclaim invocations, >>> ensuring reclaim_priv is visible when the callback is >>> invoked. (Sashiko-bot) >>> >>> Assisted-by: GitHub_Copilot:claude-sonnet-4.6 >>> Signed-off-by: Thomas Hellström <[email protected]> >>> --- >>> include/linux/cgroup_dmem.h | 24 ++++++++ >>> kernel/cgroup/dmem.c | 106 >>> +++++++++++++++++++++++++++++++++--- >>> 2 files changed, 121 insertions(+), 9 deletions(-) >>> >>> diff --git a/include/linux/cgroup_dmem.h >>> b/include/linux/cgroup_dmem.h >>> index dd4869f1d736..c3bce21cbe80 100644 >>> --- a/include/linux/cgroup_dmem.h >>> +++ b/include/linux/cgroup_dmem.h >>> @@ -14,6 +14,21 @@ struct dmem_cgroup_pool_state; >>> /* Opaque definition of a cgroup region, used internally */ >>> struct dmem_cgroup_region; >>> >>> +/** >>> + * typedef dmem_cgroup_reclaim_fn_t - Reclaim callback for a dmem >>> cgroup region. >>> + * @pool: The cgroup pool that needs memory reclaimed. >>> + * @target_bytes: Minimum number of bytes the driver should >>> attempt to free. >>> + * @priv: Private data registered with >>> dmem_cgroup_region_set_reclaim(). >>> + * >>> + * Called by the dmem cgroup controller when dmem.max is set below >>> the current >>> + * usage of @pool. The driver should evict at least @target_bytes >>> of memory >>> + * from @pool. May be called multiple times if usage remains above >>> the limit. >>> + * >>> + * Return: 0 if progress was made, negative error code otherwise. >>> + */ >>> +typedef int (*dmem_cgroup_reclaim_fn_t)(struct >>> dmem_cgroup_pool_state *pool, >>> + u64 target_bytes, void >>> *priv); >>> + >>> #if IS_ENABLED(CONFIG_CGROUP_DMEM) >>> struct dmem_cgroup_region *dmem_cgroup_register_region(u64 size, >>> const char *name_fmt, ...) __printf(2,3); >>> void dmem_cgroup_unregister_region(struct dmem_cgroup_region >>> *region); >>> @@ -26,6 +41,9 @@ bool dmem_cgroup_state_evict_valuable(struct >>> dmem_cgroup_pool_state *limit_pool, >>> bool ignore_low, bool >>> *ret_hit_low); >>> >>> void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state >>> *pool); >>> +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region >>> *region, >>> + dmem_cgroup_reclaim_fn_t >>> reclaim, >>> + void *priv); >>> #else >>> static inline __printf(2,3) struct dmem_cgroup_region * >>> dmem_cgroup_register_region(u64 size, const char *name_fmt, ...) >>> @@ -62,5 +80,11 @@ bool dmem_cgroup_state_evict_valuable(struct >>> dmem_cgroup_pool_state *limit_pool, >>> static inline void dmem_cgroup_pool_state_put(struct >>> dmem_cgroup_pool_state *pool) >>> { } >>> >>> +static inline void >>> +dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region, >>> + dmem_cgroup_reclaim_fn_t reclaim, >>> + void *priv) >>> +{ } >>> + >>> #endif >>> #endif /* _CGROUP_DMEM_H */ >>> diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c >>> index 1ab1fb47f271..5fd5a1634d21 100644 >>> --- a/kernel/cgroup/dmem.c >>> +++ b/kernel/cgroup/dmem.c >>> @@ -51,6 +51,20 @@ struct dmem_cgroup_region { >>> * No new pools should be added to the region afterwards. >>> */ >>> bool unregistered; >>> + >>> + /** >>> + * @reclaim: Optional callback invoked when dmem.max is >>> set below the >>> + * current usage of a pool. The driver should attempt to >>> free at least >>> + * @target_bytes from @pool. May be called multiple times >>> if usage >>> + * remains above the limit after returning. >>> + */ >>> + dmem_cgroup_reclaim_fn_t reclaim; >>> + >>> + /** @reclaim_priv: Private data passed to @reclaim. */ >>> + void *reclaim_priv; >>> + >>> + /** @unregister_sem: Protect @reclaim while it is running. >>> */ >>> + struct rw_semaphore unregister_sem; >>> }; >>> >>> struct dmemcg_state { >>> @@ -145,21 +159,58 @@ static void free_cg_pool(struct >>> dmem_cgroup_pool_state *pool) >>> } >>> >>> static void >>> -set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val) >>> +set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val, >>> bool nonblock) >>> { >>> page_counter_set_min(&pool->cnt, val); >>> } >>> >>> static void >>> -set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val) >>> +set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val, >>> bool nonblock) >>> { >>> page_counter_set_low(&pool->cnt, val); >>> } >>> >>> static void >>> -set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val) >>> +set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val, >>> bool nonblock) >>> { >>> - page_counter_set_max(&pool->cnt, val); >>> + struct dmem_cgroup_region *region = pool->region; >>> + >>> + /* >>> + * Always update the limit, even if usage currently >>> exceeds it. >>> + * Concurrent allocations will be throttled against the >>> new limit >>> + * while reclaim is in progress. >>> + */ >>> + xchg(&pool->cnt.max, (unsigned long)val); >>> + >>> + if (nonblock || !READ_ONCE(region->reclaim)) >>> + return; >>> + >>> + for (int retries = 5; retries > 0; retries--) { >> Where does 5 come from? This code should retry until no longer above >> limit, otherwise you'll get some hard to debug issues. > > The memcg controller uses MAX_RECLAIM_RETRIES, although if that fails, > it will invoke the OOM killer, although if the reclaim callback makes > progress, it will not consume a retry. Perhaps we should adopt the same > behaviour except the OOM killer, at least for now. Note that if a > signal is pending, the reclaim attempt is abandoned both here and in > memcg.
Yeah, it would be good if we could re-use the same MAX_RECLAIM_RETRIES, so we will at least know where it's coming from. Perhaps move it to a slightly more public header. I'm thinking linux/mm.h, but perhaps the mm maintainers will prefer another place. It makes sense to eventually fail on no progress, especially if all remaining memory is pinned. This can happen when we pin dma-buf for example. Adding an entire DMEM OOM killer would be interesting, but probably far outside the scope of dmemcg for the foreseeable future. It's only unfortunate that we have no way of reporting or logging failure at all right now. >>> <snip> > > Let me take a look at this. > /Thomas Thanks for looking into it. Kind regards, ~Maarten Lankhorst
