On 2025-10-15 18:46, Chen, Xiaogang wrote:

On 10/15/2025 4:45 PM, Philip Yang wrote:

On 2025-10-15 17:01, Chen, Xiaogang wrote:

On 10/15/2025 3:11 PM, Philip Yang wrote:
Only show the warning message if the process mm is still alive when the
queue buffer is freed and the queues are evicted.

If kfd_lookup_process_by_mm returns NULL, the process has already
exited and the mm is gone, so it is fine to free the queue buffer.

But another question: why is a prange still alive when its kfd process is gone?
The application process has exited, but the kfd process structure still exists and is available. The issue is a race condition:

   do_exit
      exit_mmap
a.          mmu mm release notifier, schedule kfd release wq to destroy queue
             unmap_vmas
b.                mmu_notifier_range(.. MMU_NOTIFY_UNMAP...)

Step b is executed to unmap the CWSR svm range before the kfd release wq scheduled in step a destroys the queue.
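
The ordering above can be modelled in user space. This is only a rough sketch, not the kernel code: every name in it (model_state, model_mm_release, model_release_work, model_unmap_warns and so on) is hypothetical, and it assumes, as the commit message states, that a lookup by mm returns NULL once the release notifier has run.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel structures. */
struct model_process { int id; };

struct model_state {
	struct model_process *table_entry; /* what a lookup by mm would return */
	int queue_refcount;                /* queues still referencing the range */
	bool destroy_scheduled;            /* deferred queue teardown is pending */
};

/* Step a: the release notifier unpublishes the process and only
 * schedules the queue teardown; it does not run it. */
static void model_mm_release(struct model_state *s)
{
	s->table_entry = NULL;
	s->destroy_scheduled = true;
}

/* Deferred work: runs some time later and drops the queue references. */
static void model_release_work(struct model_state *s)
{
	if (s->destroy_scheduled) {
		s->queue_refcount = 0;
		s->destroy_scheduled = false;
	}
}

/* Step b with the old check: warn whenever queues still use the range. */
static bool model_unmap_warns_old(const struct model_state *s)
{
	return s->queue_refcount != 0;
}

/* Step b with the patched check: warn only if the process is still found. */
static bool model_unmap_warns(const struct model_state *s)
{
	return s->table_entry != NULL && s->queue_refcount != 0;
}
```

If step b runs between model_mm_release() and model_release_work(), the old check still fires because the refcount has not been dropped yet, while the patched check stays quiet because the lookup already fails.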



When a prange is unmapped, the queues that use it should already have been stopped. If not, there is a problem somewhere. This warning message needs to be sent whether or not the kfd process exists.

I think the real problem here is that the kfd process needs to stay alive as long as any of its resources is still alive. In this case, since the prange is still alive, its kfd process should not be released (p should not be NULL). Otherwise we need to wait until all pranges from this process have been released, and then release the kfd process.

The kfd process structure is freed in kfd_process_wq_release, after svm_range_list_fini.

What I meant is: delay removing kfd process p from kfd_processes_table until all resources of p have been released, so that p is still available while any of its resources is being released. That needs a change to the kfd process release logic.

That would complicate the cleanup a lot, because now other threads can still look up the kfd_process and use or modify it concurrently while the cleanup is happening. We remove the process from the kfd_processes_table first to ensure that it is safe to clean up all the process resources.
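
The "unpublish first, then clean up" order described here can be sketched as follows. This is a single-threaded user-space illustration with hypothetical names (model_proc, model_table, model_teardown and so on); it does not use the kernel hash table or its locking, and it only shows that no new lookup can find the process once teardown has started.

```c
#include <stddef.h>

#define MODEL_TABLE_SIZE 4

/* Hypothetical stand-in for a process entry; not the kernel structure. */
struct model_proc {
	int key;       /* stands in for the mm pointer used as the lookup key */
	int resources; /* number of resources the process still owns */
};

static struct model_proc *model_table[MODEL_TABLE_SIZE];

/* Make the process visible to lookups; returns -1 if the table is full. */
static int model_publish(struct model_proc *p)
{
	for (size_t i = 0; i < MODEL_TABLE_SIZE; i++) {
		if (!model_table[i]) {
			model_table[i] = p;
			return 0;
		}
	}
	return -1;
}

/* Return the published process for this key, or NULL if none is found. */
static struct model_proc *model_lookup(int key)
{
	for (size_t i = 0; i < MODEL_TABLE_SIZE; i++) {
		if (model_table[i] && model_table[i]->key == key)
			return model_table[i];
	}
	return NULL;
}

/* Remove the process from the table so later lookups fail. */
static void model_unpublish(struct model_proc *p)
{
	for (size_t i = 0; i < MODEL_TABLE_SIZE; i++) {
		if (model_table[i] == p)
			model_table[i] = NULL;
	}
}

/* Teardown order from the thread: unpublish first, then release resources. */
static void model_teardown(struct model_proc *p)
{
	model_unpublish(p);
	p->resources = 0;
}
```

With this order, code that reaches the process only through model_lookup() sees NULL for the whole duration of the cleanup, which is the same condition the patch tests before printing the warning.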

Regards,
  Felix




Regards

Xiaogang





Regards,

Philip


Regards

Xiaogang


Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency")
Signed-off-by: Philip Yang <[email protected]>
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 4d4a47313f5b..d1b2f8525f80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
      bool unmap_parent;
      uint32_t i;
-    if (atomic_read(&prange->queue_refcount)) {
+    p = kfd_lookup_process_by_mm(mm);
+
+    if (p && atomic_read(&prange->queue_refcount)) {
          int r;
            pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n",
@@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
              pr_debug("failed %d to quiesce KFD queues\n", r);
      }
-    p = kfd_lookup_process_by_mm(mm);
      if (!p)
          return;
      svms = &p->svms;
