On 22/09/2025 14:30, Christian König wrote:
On 22.09.25 14:49, Tvrtko Ursulin wrote:

On 22/09/2025 10:31, Christian König wrote:
On 19.09.25 15:11, Tvrtko Ursulin wrote:
GPUs typically benefit from contiguous memory via reduced TLB pressure and
improved caching performance, where the maximum contiguous block size that
still adds a performance benefit depends on the hardware design.

TTM pool allocator by default tries (hard) to allocate up to the system
MAX_PAGE_ORDER blocks. This varies by the CPU platform and can also be
configured via Kconfig.

If that limit is set higher than the GPU can benefit from, let the
individual drivers tell TTM above which allocation order the pool allocator
can afford to make a little less effort.

We implement this by disabling direct reclaim for those allocations, which
reduces the allocation latency and lowers the demands on the page
allocator, in cases where expending this effort is not critical for the
GPU in question.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Christian König <[email protected]>
Cc: Thadeu Lima de Souza Cascardo <[email protected]>
---
   drivers/gpu/drm/ttm/ttm_pool.c | 15 +++++++++++++--
   include/drm/ttm/ttm_pool.h     | 10 ++++++++++
   2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index c5eb2e28ca9d..3bf7b6bd96a3 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -726,8 +726,16 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
         page_caching = tt->caching;
       allow_pools = true;
-    for (order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
-         alloc->remaining_pages;
+
+    order = ttm_pool_alloc_find_order(MAX_PAGE_ORDER, alloc);
+    /*
+     * Do not add latency to the allocation path for allocation orders
+     * the device told us do not bring additional performance gains.
+     */
+    if (order > pool->max_beneficial_order)
+        gfp_flags &= ~__GFP_DIRECT_RECLAIM;
+
+    for (; alloc->remaining_pages;

Move that into ttm_pool_alloc_page(), the other code to adjust the gfp_flags 
based on the order is there as well.

I can do that, no problem.


            order = ttm_pool_alloc_find_order(order, alloc)) {
           struct ttm_pool_type *pt;
   @@ -745,6 +753,8 @@ static int __ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
           if (!p) {
               page_caching = ttm_cached;
               allow_pools = false;
+            if (order <= pool->max_beneficial_order)
+                gfp_flags |= __GFP_DIRECT_RECLAIM;

That makes this superfluous as well.

               p = ttm_pool_alloc_page(pool, gfp_flags, order);
           }
           /* If that fails, lower the order if possible and retry. */
@@ -1076,6 +1086,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
       pool->nid = nid;
       pool->use_dma_alloc = use_dma_alloc;
       pool->use_dma32 = use_dma32;
+    pool->max_beneficial_order = MAX_PAGE_ORDER;
         for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
           for (j = 0; j < NR_PAGE_ORDERS; ++j) {
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 54cd34a6e4c0..24d3285c9aad 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -66,6 +66,7 @@ struct ttm_pool_type {
    * @nid: which numa node to use
    * @use_dma_alloc: if coherent DMA allocations should be used
    * @use_dma32: if GFP_DMA32 should be used
+ * @max_beneficial_order: allocations above this order do not bring performance gains
    * @caching: pools for each caching/order
    */
   struct ttm_pool {
@@ -74,6 +75,7 @@ struct ttm_pool {
         bool use_dma_alloc;
       bool use_dma32;
+    unsigned int max_beneficial_order;
         struct {
           struct ttm_pool_type orders[NR_PAGE_ORDERS];
@@ -88,6 +90,14 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
              int nid, bool use_dma_alloc, bool use_dma32);
   void ttm_pool_fini(struct ttm_pool *pool);
 
+static inline unsigned int
+ttm_pool_set_max_beneficial_order(struct ttm_pool *pool, unsigned int order)
+{
+    pool->max_beneficial_order = min(MAX_PAGE_ORDER, order);
+
+    return pool->max_beneficial_order;
+}
+

Just make that a parameter to ttm_pool_init(), it should be static for all 
devices I know about anyway.

I wanted to avoid changing the signature of ttm_device_init().

But if we are churning those anyway, should I replace the two bool arguments in 
ttm_pool_init() and ttm_device_init() with a pool init parameters structure, so 
that two nameless booleans in a row are avoided? I do not feel strongly about it 
either way, though, since there is not a huge number of callers.

Well, we have only recently figured out why we had to use GFP_DMA32 on all the older 
HW generations and 32-bit boxes. The problem is that nobody has time to look 
into that and set up a test system so that we can finally remove that workaround.

The other parameter should probably be replaced by using 
dma_addressing_limited() on the device, so that we have that consistent for all 
devices as well. But here again nobody has had the time or interest to do that.

I am a bit uncertain which is which in your description, because both GFP_DMA32 and dma_addressing_limited() appear to concern the same use_dma32 parameter?

So when we clean that stuff up, let's do it completely.

I am happy to write patches once I understand the ask. Only that I will not be able to test them beyond the hardware I have locally, which is just one APU device (Steam Deck).

Regards,

Tvrtko