On 2024-05-31 2:52, Christian König wrote:
> On 2024-05-31 00:02, Felix Kuehling wrote:
>> On 2024-05-28 13:23, Yunxiang Li wrote:
>>> These functions are missing the lock for reset domain.
>>>
>>> Signed-off-by: Yunxiang Li <yunxiang...@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c               | 4 +++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c                | 8 ++++++--
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++++++--
>>>   3 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> index eb172388d99e..ddc5e9972da8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c
>>> @@ -34,6 +34,7 @@
>>>   #include <asm/set_memory.h>
>>>   #endif
>>>   #include "amdgpu.h"
>>> +#include "amdgpu_reset.h"
>>>   #include <drm/drm_drv.h>
>>>   #include <drm/ttm/ttm_tt.h>
>>>  
>>> @@ -401,13 +402,14 @@ void amdgpu_gart_invalidate_tlb(struct amdgpu_device *adev)
>>>   {
>>>       int i;
>>>  
>>> -    if (!adev->gart.ptr)
>>> +    if (!adev->gart.ptr || !down_read_trylock(&adev->reset_domain->sem))
>>>           return;
>>>  
>>>       mb();
>>>       amdgpu_device_flush_hdp(adev, NULL);
>>>       for_each_set_bit(i, adev->vmhubs_mask, AMDGPU_MAX_VMHUBS)
>>>           amdgpu_gmc_flush_gpu_tlb(adev, 0, i, 0);
>>> +    up_read(&adev->reset_domain->sem);
>>>   }
>>>  
>>>   /**
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index e4742b65032d..52a3170d15b7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -307,8 +307,12 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>>>           dev_dbg(adev->dev, "Skip scheduling IBs in ring(%s)",
>>>               ring->name);
>>>       } else {
>>> -        r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
>>> -                       &fence);
>>> +        r = -ETIME;
>>> +        if (down_read_trylock(&adev->reset_domain->sem)) {
>>> +            r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>> +                           job, &fence);
>>> +            up_read(&adev->reset_domain->sem);
>>> +        }
>>>           if (r)
>>>               dev_err(adev->dev,
>>>                   "Error scheduling IBs (%d) in ring(%s)", r,
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> index 86ea610b16f3..21f5a1fb3bf8 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>> @@ -28,6 +28,7 @@
>>>   #include "kfd_priv.h"
>>>   #include "kfd_kernel_queue.h"
>>>   #include "amdgpu_amdkfd.h"
>>> +#include "amdgpu_reset.h"
>>>  
>>>   static inline struct process_queue_node *get_queue_by_qid(
>>>               struct process_queue_manager *pqm, unsigned int qid)
>>> @@ -87,8 +88,12 @@ void kfd_process_dequeue_from_device(struct kfd_process_device *pdd)
>>>           return;
>>>  
>>>       dev->dqm->ops.process_termination(dev->dqm, &pdd->qpd);
>>> -    if (dev->kfd->shared_resources.enable_mes)
>>> -        amdgpu_mes_flush_shader_debugger(dev->adev, pdd->proc_ctx_gpu_addr);
>>> +    if (dev->kfd->shared_resources.enable_mes &&
>>> +        down_read_trylock(&dev->adev->reset_domain->sem)) {
>>> +        amdgpu_mes_flush_shader_debugger(dev->adev,
>>> +                         pdd->proc_ctx_gpu_addr);
>>> +        up_read(&dev->adev->reset_domain->sem);
>>> +    }
>>>       pdd->already_dequeued = true;
>>>   }
>>
>> It's not clear to me what the requirement is for reset domain locking around 
>> MES calls. We have a lot more of them in kfd_device_queue_manager.c (mostly 
>> calling adev->mes.funcs->... directly). Do they all need to be wrapped 
>> individually?
> 
> Whenever you call a MES function (or any other function that directly 
> interacts with the rings or the HW registers), you need to make sure that at 
> least the read side of the reset lock is held.
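
Concretely, that would mean something like this at every call site, if I
understand correctly. This is a sketch only; amdgpu_mes_do_something() is a
made-up stand-in for any adev->mes.funcs-> call, and -EIO is an arbitrary
choice of error code:

    /* Pattern for any function that touches MES or HW registers. */
    static int kfd_mes_call_example(struct amdgpu_device *adev)
    {
        int r;

        /* Hold the read side of the reset lock across the HW access;
         * if a reset is in flight, don't touch the HW at all.
         */
        if (!down_read_trylock(&adev->reset_domain->sem))
            return -EIO;

        r = amdgpu_mes_do_something(adev);

        up_read(&adev->reset_domain->sem);
        return r;
    }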

Having to do that for each caller of the amdgpu_mes functions seems 
error-prone.

Would it make sense to wrap that inside amdgpu_mes_lock/unlock? Maybe turn it 
into amdgpu_mes_trylock/unlock and make sure that all the amdgpu_mes functions 
that take that lock can fail and return an error code. Add an attribute (e.g. 
__must_check) so the compiler flags callers that ignore the return value. That 
would make it easier to spot places that don't handle errors caused by 
reset-lock failures. A rough sketch of what I mean is below.
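
Something like this, as a sketch only (not tested; it assumes the mutex the 
MES code already uses and leaves out anything else the current 
amdgpu_mes_lock/unlock may need to do):

    static inline int __must_check amdgpu_mes_trylock(struct amdgpu_mes *mes)
    {
        struct amdgpu_device *adev =
            container_of(mes, struct amdgpu_device, mes);

        /* Refuse MES access while a reset owns the domain. */
        if (!down_read_trylock(&adev->reset_domain->sem))
            return -EIO;

        mutex_lock(&mes->mutex_hidden);
        return 0;
    }

    static inline void amdgpu_mes_unlock(struct amdgpu_mes *mes)
    {
        struct amdgpu_device *adev =
            container_of(mes, struct amdgpu_device, mes);

        mutex_unlock(&mes->mutex_hidden);
        up_read(&adev->reset_domain->sem);
    }

Then every amdgpu_mes_* entry point would start with 
r = amdgpu_mes_trylock(&adev->mes) and bail out on failure, and the compiler 
would warn about any caller that drops the error on the floor.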

Regards,
  Felix

> 
> Regards,
> Christian.
> 
>>
>> Regards,
>>   Felix
>>
> 
