On 22.09.25 14:49, Tvrtko Ursulin wrote:
>
> On 22/09/2025 10:31, Christian König wrote:
>> On 19.09.25 15:11, Tvrtko Ursulin wrote:
>>> GPUs typically benefit from contiguous memory via reduced TLB pressure and
>>> improved caching performance, where the maximum size of contiguous block
>>> which adds a performance benefit is related to hardware design.
>>>
>>> TTM pool allocator by default tries (hard) to allocate up to the system
>>> MAX_PAGE_ORDER blocks. This varies by the CPU platform and can also be
>>> configured via Kconfig.
>>>
>>> If that limit was set to be higher than the GPU can make an extra use of,
>>> let's allow the individual drivers to let TTM know over which allocation
>>> order the pool allocator can afford to make a little bit less effort.
>>>
>>> We implement this by disabling direct reclaim for those allocations, which
>>> reduces the allocation latency and lowers the demands on the page
>>> allocator, in cases where expending this effort is not critical for the
>>> GPU in question.
>>>
>>> Signed-off-by: Tvrtko Ursulin <[email protected]>
>>> Cc: Christian König <[email protected]>
>>> Cc: Thadeu Lima de Souza Cascardo <[email protected]>
>>> ---
>>>  drivers/gpu/drm/ttm/ttm_pool.c | 15 +++++++++++++--
>>>  include/drm/ttm/ttm_pool.h     | 10 ++++++++++
>>>  2 files changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>> index c5eb2e28ca9d..3bf7b6bd96a3 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>> @@ -726,8 +726,16 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>>  	page_caching = tt->caching;
>>>  	allow_pools = true;
>>> -	for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> -	     alloc->remaining_pages;
>>> +
>>> +	order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
>>> +	/*
>>> +	 * Do not add latency to the allocation path for allocations orders
>>> +	 * device tolds us do not bring additional performance gains.
>>> +	 */
>>> +	if (order > pool->max_beneficial_order)
>>> +		gfp_flags &= ~__GFP_DIRECT_RECLAIM;
>>> +
>>> +	for (; alloc->remaining_pages;
>>
>> Move that into ttm_pool_alloc_page(), the other code to adjust the gfp_flags
>> based on the order is there as well.
>
> I can do that, no problem.
>
>>
>>> 	     order = ttm_pool_alloc_find_order(order, alloc)) {
>>> 		struct ttm_pool_type *pt;
>>>
>>> @@ -745,6 +753,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>>> 		if (!p) {
>>> 			page_caching = ttm_cached;
>>> 			allow_pools = false;
>>> +			if (order <= pool->max_beneficial_order)
>>> +				gfp_flags |= __GFP_DIRECT_RECLAIM;
>>
>> That makes this superfluous as well.
>>
>>> 			p = ttm_pool_alloc_page(pool, gfp_flags, order);
>>> 		}
>>>
>>> 		/* If that fails, lower the order if possible and retry.
>>> 		 */
>>>
>>> @@ -1076,6 +1086,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>> 	pool->nid = nid;
>>> 	pool->use_dma_alloc = use_dma_alloc;
>>> 	pool->use_dma32 = use_dma32;
>>> +	pool->max_beneficial_order = MAX_PAGE_ORDER;
>>>
>>> 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
>>> 		for (j = 0; j < NR_PAGE_ORDERS; ++j) {
>>>
>>> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>>> index 54cd34a6e4c0..24d3285c9aad 100644
>>> --- a/include/drm/ttm/ttm_pool.h
>>> +++ b/include/drm/ttm/ttm_pool.h
>>> @@ -66,6 +66,7 @@ struct ttm_pool_type {
>>>   * @nid: which numa node to use
>>>   * @use_dma_alloc: if coherent DMA allocations should be used
>>>   * @use_dma32: if GFP_DMA32 should be used
>>> + * @max_beneficial_order: allocations above this order do not bring performance gains
>>>   * @caching: pools for each caching/order
>>>   */
>>>  struct ttm_pool {
>>> @@ -74,6 +75,7 @@ struct ttm_pool {
>>>  	bool use_dma_alloc;
>>>  	bool use_dma32;
>>> +	unsigned int max_beneficial_order;
>>>
>>>  	struct {
>>>  		struct ttm_pool_type orders[NR_PAGE_ORDERS];
>>> @@ -88,6 +90,14 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>  		   int nid, bool use_dma_alloc, bool use_dma32);
>>>  void ttm_pool_fini(struct ttm_pool *pool);
>>> +static inline unsigned int
>>> +ttm_pool_set_max_beneficial_order(struct ttm_pool *pool, unsigned int order)
>>> +{
>>> +	pool->max_beneficial_order = min(MAX_PAGE_ORDER, order);
>>> +
>>> +	return pool->max_beneficial_order;
>>> +}
>>> +
>>
>> Just make that a parameter to ttm_pool_init(), it should be static for all
>> devices I know about anyway.
>
> I wanted to avoid changing the signature of ttm_device_init().
>
> But if we will be churning those anyway, should I replace the two bool
> arguments in ttm_pool_init() and ttm_device_init() with a pool init
> parameters structure, so that two nameless booleans in a row are avoided?
> I do not feel strongly about it either way, though, since there is not a
> huge number of callers.
Well, we recently figured out why we had to use GFP_DMA32 on all the older HW
generations and 32bit boxes. The problem is that nobody has time to look into
that and set up a test system so that we can finally remove that workaround.

The other parameter should probably be replaced by using
dma_addressing_limited() on the device, so that we have that consistent for
all devices as well. But here again nobody had the time and interest to do
that.

So when we clean that stuff up, let's do it completely.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>
>> Apart from that looks good to me,
>> Christian.
>>
>>> int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>>> void ttm_pool_drop_backed_up(struct ttm_tt *tt);
>>
>
