Am 2021-08-12 um 11:49 a.m. schrieb Zhang, Yifan: > [AMD Official Use Only] > > Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute > call. There is time slot when Process vma has been unmapped from CPU side, > but prange is not removed in deferred list. If get-attribute is called in > this time slot, then there will be a problem. > > This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after > KFDSVMRangeTest.BasicSystemMemTest test. > > TEST_F(KFDSVMRangeTest, SetGetAttributesTest) { > TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX); > TEST_START(TESTPROFILE_RUNALL) > ...... > HSAint32 enable = -1; > EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable)); > expectedDefaultResults[4] = > (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS; > sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the > last test case since the addr is already munmap for CPU. > char *pBuf = sysBuffer->As<char *>(); > > LOG() << "Get default atrributes" << std::endl; > memcpy(outputAttributes, inputAttributes, nAttributes * > sizeof(HSA_SVM_ATTRIBUTE)); > EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the > problematic get-attribute. It returns the the prange attributes in the last > test case since prange is not removed. > nAttributes, outputAttributes)); > > sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should > return the default attributes. But there are some corner cases, deferred list > from last test case is not finished when get attribute is called, so svm > range list remains the contents of the last test. In such cases, > hsaKmtSVMGetAttr returns the attributes in the last test instead of default > values. (The two test cases are in the same process, mmap in HsaSVMRange > constructor return the same addr) and makes test case fail.
OK. Thanks for investigating further. I think your patch is the the correct solution after all. I'd just add a carefully worded comment: /* Flush pending deferred work to avoid racing with deferred actions from * previous memory map changes (e.g. munmap). Concurrent memory map changes * can still race with get_attr because we don't hold the mmap lock. But that * would be a race condition in the application anyway, and undefined * behaviour is acceptable in that case. */ With that, the patch is Reviewed-by: Felix Kuehling <felix.kuehl...@amd.com> Regards, Felix > > > -----Original Message----- > From: Kuehling, Felix <felix.kuehl...@amd.com> > Sent: Tuesday, August 10, 2021 11:57 PM > To: Zhang, Yifan <yifan1.zh...@amd.com>; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdkfd: fix random > KFDSVMRangeTest.SetGetAttributesTest test failure > > Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang: >> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test. >> >> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18 >> tests from 1 test case. >> [----------] Global test environment set-up. >> [----------] 18 tests from KFDSVMRangeTest >> [ RUN ] KFDSVMRangeTest.BasicSystemMemTest >> [ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms) >> [ RUN ] KFDSVMRangeTest.SetGetAttributesTest >> [ ] Get default atrributes >> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154 >> : Failure Value of: expectedDefaultResults[i] >> Actual: 4294967295 >> Expected: outputAttributes[i].value >> Which is: 0 >> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154 >> : Failure Value of: expectedDefaultResults[i] >> Actual: 4294967295 >> Expected: outputAttributes[i].value >> Which is: 0 >> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152 >> : Failure Value of: expectedDefaultResults[i] >> Actual: 4 >> Expected: outputAttributes[i].type >> Which is: 2 >> [ ] Setting/Getting atrributes >> [ FAILED ] >> >> the root cause is that svm work queue has not finished when >> svm_range_get_attr is called, thus some garbage svm interval tree data >> make svm_range_get_attr get wrong result. Flush work queue before iterate >> svm interval tree. >> >> Signed-off-by: Yifan Zhang <yifan1.zh...@amd.com> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> index f811a3a24cd2..192e9401bed5 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t >> start, uint64_t size, >> pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start, >> start + size - 1, nattr); >> >> + /* flush pending deferred work */ >> + flush_work(&p->svms.deferred_list_work); >> + > There is still a race condition here. More work can be added to the > deferred_list_work after the flush call. > > Work gets added to the deferred_list asynchronously, for example in MMU > notifiers. Trying to synchronize with asynchronous events is inherently > problematic. It appears that the test is making some assumptions about things > happening asynchronously (page faults or MMU notifiers) and that's probably a > problem with the test, not with the driver. > > Alternatively, there may be a problem with a set-attribute call that leaves > some operations on the deferred list and results in unexpected get-attribute > results. If that's the problem, we may need to add a flush-call to the end of > the set-attributes function. > > Can you provide more details about the exact sequence of set-attribute and > get-attribute calls that is causing the problem? > > Regards, > Felix > > >> mmap_read_lock(mm); >> r = svm_range_is_valid(p, start, size); >> mmap_read_unlock(mm);