Am 2021-08-12 um 11:49 a.m. schrieb Zhang, Yifan:
> [AMD Official Use Only]
>
> Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute 
> call. There is time slot when Process vma has been unmapped from CPU side, 
> but prange is not removed in deferred list. If get-attribute is called  in 
> this time slot, then there will be a problem.
>
> This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after 
> KFDSVMRangeTest.BasicSystemMemTest test.
>
> TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
>      TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
>      TEST_START(TESTPROFILE_RUNALL)
> ......
>      HSAint32 enable = -1;
>      EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
>      expectedDefaultResults[4] = 
> (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
>      sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the 
> last test case since the addr is already munmap for CPU.
>      char *pBuf = sysBuffer->As<char *>();
>  
>      LOG() << "Get default atrributes" << std::endl;
>      memcpy(outputAttributes, inputAttributes, nAttributes * 
> sizeof(HSA_SVM_ATTRIBUTE));
>      EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the 
> problematic get-attribute. It returns the the prange attributes in the last 
> test case since prange is not removed.
>                                      nAttributes, outputAttributes));
>
> sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should 
> return the default attributes. But there are some corner cases, deferred list 
> from last test case is not finished when get attribute is called, so svm 
> range list remains the contents of the last test. In such cases, 
> hsaKmtSVMGetAttr returns the attributes in the last test instead of default 
> values. (The two test cases are in the same process, mmap in HsaSVMRange 
> constructor return the same addr) and makes test case fail.

OK. Thanks for investigating further. I think your patch is the the
correct solution after all. I'd just add a carefully worded comment:

/* Flush pending deferred work to avoid racing with deferred actions from
 * previous memory map changes (e.g. munmap). Concurrent memory map changes
 * can still race with get_attr because we don't hold the mmap lock. But that
 * would be a race condition in the application anyway, and undefined
 * behaviour is acceptable in that case.
 */

With that, the patch is

Reviewed-by: Felix Kuehling <felix.kuehl...@amd.com>

Regards,
  Felix


>
>
> -----Original Message-----
> From: Kuehling, Felix <felix.kuehl...@amd.com> 
> Sent: Tuesday, August 10, 2021 11:57 PM
> To: Zhang, Yifan <yifan1.zh...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: fix random 
> KFDSVMRangeTest.SetGetAttributesTest test failure
>
> Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
>> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>>
>> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18 
>> tests from 1 test case.
>> [----------] Global test environment set-up.
>> [----------] 18 tests from KFDSVMRangeTest
>> [ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
>> [       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
>> [ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
>> [          ] Get default atrributes
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4
>> Expected: outputAttributes[i].type
>> Which is: 2
>> [          ] Setting/Getting atrributes
>> [  FAILED  ]
>>
>> the root cause is that svm work queue has not finished when 
>> svm_range_get_attr is called, thus some garbage svm interval tree data 
>> make svm_range_get_attr get wrong result. Flush work queue before iterate 
>> svm interval tree.
>>
>> Signed-off-by: Yifan Zhang <yifan1.zh...@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index f811a3a24cd2..192e9401bed5 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t 
>> start, uint64_t size,
>>      pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>>               start + size - 1, nattr);
>>  
>> +    /* flush pending deferred work */
>> +    flush_work(&p->svms.deferred_list_work);
>> +
> There is still a race condition here. More work can be added to the 
> deferred_list_work after the flush call.
>
> Work gets added to the deferred_list asynchronously, for example in MMU 
> notifiers. Trying to synchronize with asynchronous events is inherently 
> problematic. It appears that the test is making some assumptions about things 
> happening asynchronously (page faults or MMU notifiers) and that's probably a 
> problem with the test, not with the driver.
>
> Alternatively, there may be a problem with a set-attribute call that leaves 
> some operations on the deferred list and results in unexpected get-attribute 
> results. If that's the problem, we may need to add a flush-call to the end of 
> the set-attributes function.
>
> Can you provide more details about the exact sequence of set-attribute and 
> get-attribute calls that is causing the problem?
>
> Regards,
>   Felix
>
>
>>      mmap_read_lock(mm);
>>      r = svm_range_is_valid(p, start, size);
>>      mmap_read_unlock(mm);

Reply via email to