On 2025-10-15 18:46, Chen, Xiaogang wrote:
On 10/15/2025 4:45 PM, Philip Yang wrote:
On 2025-10-15 17:01, Chen, Xiaogang wrote:
On 10/15/2025 3:11 PM, Philip Yang wrote:
Only show the warning message if the process mm is still alive when the
queue buffer is freed to evict the queues.
If kfd_lookup_process_by_mm returns NULL, the process has already
exited and its mm is gone, so it is fine to free the queue buffer.
But another question is: why is a prange still alive when its kfd
process is gone?
The application process has exited, but the kfd process structure still
exists and is available. The issue is a race condition:
do_exit
  exit_mmap
    a. mmu mm release notifier, schedule kfd release wq to destroy queue
    unmap_vmas
      b. mmu_notifier_range(.. MMU_NOTIFY_UNMAP...)
Step b executes to unmap the CWSR svm range before the kfd release wq
scheduled in step a destroys the queue.
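
The ordering above can be modelled in plain user-space C. This is only
a sketch, not kernel code: every model_* name is hypothetical, and it
assumes (as the thread implies) that the mm release notifier removes
the process from kfd_processes_table and only schedules the queue
destruction. The check_process flag stands in for the "p &&" test that
the patch adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kfd process, not struct kfd_process. */
struct model_process {
	int id;
};

struct model_state {
	struct model_process proc;
	bool in_table;       /* models membership in kfd_processes_table */
	bool release_queued; /* models the scheduled kfd release work */
	int queue_refcount;  /* models prange->queue_refcount */
};

/* Models kfd_lookup_process_by_mm(): NULL once the process left the table. */
static struct model_process *model_lookup(struct model_state *s)
{
	return s->in_table ? &s->proc : NULL;
}

/* Step a: the release notifier only schedules the queue destruction. */
static void model_mm_release(struct model_state *s)
{
	s->in_table = false;     /* assumption: lookup fails from here on */
	s->release_queued = true;
}

/* The deferred release work: queues are destroyed, refcount drops to 0. */
static void model_release_work(struct model_state *s)
{
	if (!s->release_queued)
		return;
	s->queue_refcount = 0;
	s->release_queued = false;
}

/* Step b: returns true if the unmap path would print the warning. */
static bool model_unmap_warns(struct model_state *s, bool check_process)
{
	if (check_process && !model_lookup(s))
		return false;    /* process already exited: stay quiet */
	return s->queue_refcount != 0;
}
```

Calling model_mm_release() and then model_unmap_warns() before
model_release_work() reproduces the false warning when check_process
is false, and suppresses it when check_process is true.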
When a prange is unmapped, the queues that use it should already have
been stopped. If not, there is a problem somewhere. This warning message
needs to be sent whether or not the kfd process exists.
I think the real problem here is that the kfd process needs to stay
alive as long as any of its resources is still alive. In this case,
since the prange is still alive, its kfd process should not be released
(p should not be NULL). Otherwise we need to wait until all pranges from
this process are released, and then release the kfd process.
The kfd process structure is freed in kfd_process_wq_release, after
svm_range_list_fini.
What I meant was: delay removing kfd process p from kfd_processes_table
until all resources of p have been released, so that p is available
whenever any of its resources is being released. That needs a change to
the kfd process release logic.
prange->queue_refcount will be 0 after the queue is destroyed (not
evicted); we should warn user space and evict the queues if a prange is
freed with prange->queue_refcount not zero. This patch fixes the race
that generates a false warning when the prange is freed after the
process has exited. I don't think that keeping the kfd process in
kfd_processes_table after the mmu release notifier will solve this race.
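
To make the destroyed-versus-evicted distinction concrete, here is a
small user-space sketch. It is not the driver code: the sketch_* names
are hypothetical, and it assumes the count is taken when a queue starts
using the range, which the thread does not state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parts of a prange discussed here. */
struct sketch_range {
	int queue_refcount;  /* models prange->queue_refcount */
	bool queues_evicted; /* queues stopped but still existing */
};

/* Assumption: a queue that uses the range takes a reference. */
static void sketch_queue_create(struct sketch_range *r)
{
	r->queue_refcount++;
}

/* Eviction stops the queue but keeps its reference on the range. */
static void sketch_queue_evict(struct sketch_range *r)
{
	r->queues_evicted = true;
}

/* Only destroying the queue drops the reference. */
static void sketch_queue_destroy(struct sketch_range *r)
{
	if (r->queue_refcount > 0)
		r->queue_refcount--;
}

/* Freeing the range is only expected once no queue references it. */
static bool sketch_free_needs_warning(const struct sketch_range *r)
{
	return r->queue_refcount != 0;
}
```

In this model, freeing the range after sketch_queue_evict() still
needs the warning, while freeing it after sketch_queue_destroy() does
not.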
Regards,
Philip
Regards
Xiaogang
Regards,
Philip
Regards
Xiaogang
Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency")
Signed-off-by: Philip Yang <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 4d4a47313f5b..d1b2f8525f80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2487,7 +2487,9 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
bool unmap_parent;
uint32_t i;
- if (atomic_read(&prange->queue_refcount)) {
+ p = kfd_lookup_process_by_mm(mm);
+
+ if (p && atomic_read(&prange->queue_refcount)) {
int r;
pr_warn("Freeing queue vital buffer 0x%lx, queue evicted\n",
@@ -2497,7 +2499,6 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct svm_range *prange,
pr_debug("failed %d to quiesce KFD queues\n", r);
}
- p = kfd_lookup_process_by_mm(mm);
if (!p)
return;
svms = &p->svms;