On 07/11/2025 15:12, Kim, Jonathan wrote:
[Public]

-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Friday, November 7, 2025 9:36 AM
To: Kim, Jonathan <[email protected]>
Cc: [email protected]; Kuehling, Felix <[email protected]>;
Six, Lancelot <[email protected]>; Yang, Philip <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area

Caution: This message originated from an External Source. Use proper caution
when opening attachments, clicking links, or responding.


On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <[email protected]> wrote:

Over allocation of save area is not fatal, only under allocation is.
ROCm has various components that independently claim authority over save
area size.

Unless KFD decides to claim single authority, relax size checks with a
warning on over allocation.

Do we want any sort of upper limit?

Mmm.  I'd expect early failure on unreasonable user request for over allocation 
at the allocation stage prior to acquire on queue create stage and acquire 
should be bound to what was registered/allocated.
So I'm not sure what an upper bound acquire limit check could serve.


One thing to consider is that at the end of the CWSR area, there is space dedicated for the debugger (used to implement displaced-stepping buffers). Initially, when KFD did not check the CWSR area size, it was just a userspace matter to increase that size if the debugger needed extensions. With this new change, we can't extend this buffer without a KFD update.

Having a "check that the buffer the thunk provided is big enough" policy would still allow us to extend this allocation if necessary.

Best,
Lancelot.



Signed-off-by: Jonathan Kim <[email protected]>
---
  drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
  1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index a65c67cf56ff..448043bc2937 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct
kfd_process_device *pdd, struct queue_prope
                 goto out_err_unreserve;
         }

-       if (properties->ctx_save_restore_area_size != topo_dev-
node_props.cwsr_size) {
-               pr_debug("queue cwsr size 0x%x not equal to node cwsr size 
0x%x\n",
+       if (properties->ctx_save_restore_area_size < topo_dev-
node_props.cwsr_size) {
+               pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size
0x%x\n",
                         properties->ctx_save_restore_area_size,
                         topo_dev->node_props.cwsr_size);
                 err = -EINVAL;
                 goto out_err_unreserve;
         }

-       total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
node_props.debug_memory_size)
-                         * NUM_XCC(pdd->dev->xcc_mask);
+       if (properties->ctx_save_restore_area_size > topo_dev-
node_props.cwsr_size)
+               pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended
node cwsr size 0x%x\n",
+                                   properties->ctx_save_restore_area_size,
+                                   topo_dev->node_props.cwsr_size);

We can probably drop the warning here.

Acked.

Thanks.

Jon


Alex

+
+       total_cwsr_size = (properties->ctx_save_restore_area_size +
+                          topo_dev->node_props.debug_memory_size) * 
NUM_XCC(pdd-
dev->xcc_mask);
         total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);

         err = kfd_queue_buffer_get(vm, (void *)properties-
ctx_save_restore_area_address,
@@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct
kfd_process_device *pdd, struct queue_prope
         topo_dev = kfd_topology_device_by_id(pdd->dev->id);
         if (!topo_dev)
                 return -EINVAL;
-       total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
node_props.debug_memory_size)
-                         * NUM_XCC(pdd->dev->xcc_mask);
+       total_cwsr_size = (properties->ctx_save_restore_area_size +
+                          topo_dev->node_props.debug_memory_size) * 
NUM_XCC(pdd-
dev->xcc_mask);
         total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);

         kfd_queue_buffer_svm_put(pdd, properties-
ctx_save_restore_area_address, total_cwsr_size);
--
2.34.1


Reply via email to