[Public]
Hi Felix,
Thanks for comments. I just found a similar race condition in KFD test. Sent it
in another thread "[PATCH] drm/amdkfd: fix
KFDSVMRangeTest.PartialUnmapSysMemTest fails"
BRs,
Yifan
-Original Message-
From: Kuehling, Felix
Sent: Friday, August 13, 2021 4:23 AM
To: Zhang, Yifan ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: fix random
KFDSVMRangeTest.SetGetAttributesTest test failure
Am 2021-08-12 um 11:49 a.m. schrieb Zhang, Yifan:
> [AMD Official Use Only]
>
> Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute
> call. There is time slot when Process vma has been unmapped from CPU side,
> but prange is not removed in deferred list. If get-attribute is called in
> this time slot, then there will be a problem.
>
> This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after
> KFDSVMRangeTest.BasicSystemMemTest test.
>
> TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
> TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
> TEST_START(TESTPROFILE_RUNALL)
> ..
> HSAint32 enable = -1;
> EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
> expectedDefaultResults[4] =
> (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
> sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the
> last test case since the addr is already munmap for CPU.
> char *pBuf = sysBuffer->As();
>
> LOG() << "Get default atrributes" << std::endl;
> memcpy(outputAttributes, inputAttributes, nAttributes *
> sizeof(HSA_SVM_ATTRIBUTE));
> EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the
> problematic get-attribute. It returns the the prange attributes in the last
> test case since prange is not removed.
> nAttributes, outputAttributes));
>
> sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should
> return the default attributes. But there are some corner cases, deferred list
> from last test case is not finished when get attribute is called, so svm
> range list remains the contents of the last test. In such cases,
> hsaKmtSVMGetAttr returns the attributes in the last test instead of default
> values. (The two test cases are in the same process, mmap in HsaSVMRange
> constructor return the same addr) and makes test case fail.
OK. Thanks for investigating further. I think your patch is the the correct
solution after all. I'd just add a carefully worded comment:
/* Flush pending deferred work to avoid racing with deferred actions from
* previous memory map changes (e.g. munmap). Concurrent memory map changes
* can still race with get_attr because we don't hold the mmap lock. But that
* would be a race condition in the application anyway, and undefined
* behaviour is acceptable in that case.
*/
With that, the patch is
Reviewed-by: Felix Kuehling
Regards,
Felix
>
>
> -Original Message-
> From: Kuehling, Felix
> Sent: Tuesday, August 10, 2021 11:57 PM
> To: Zhang, Yifan ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: fix random
> KFDSVMRangeTest.SetGetAttributesTest test failure
>
> Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
>> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>>
>> Note: Google Test filter = KFDSVMRangeTest.* [==] Running 18
>> tests from 1 test case.
>> [--] Global test environment set-up.
>> [--] 18 tests from KFDSVMRangeTest
>> [ RUN ] KFDSVMRangeTest.BasicSystemMemTest
>> [ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
>> [ RUN ] KFDSVMRangeTest.SetGetAttributesTest
>> [ ] Get default atrributes
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:15
>> 4
>> : Failure Value of: expectedDefaultResults[i]
>> Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:15
>> 4
>> : Failure Value of: expectedDefaultResults[i]
>> Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:15
>> 2
>> : Failure Value of: expectedDefaultResults[i]
>> Actual: 4
>> Expected: outputAttributes[i].type
>> Which is: 2
>> [ ] Setting/Getting atrributes
>> [ FAILED ]
>>
>> the root cause is that svm work queue has not finished when
>> svm_range_get_attr is called, thus some garbage svm interval tree
>> data make svm_range_get_attr get wrong result. Flush work queue before
>> iterate svm interval tree.
>>
>> Signed-off-by: Yifan Zhang
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index f811a3a24cd2..192e9401bed5 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3072,6 +3072,9 @@ svm_range_get_a