On 23.05.24 at 13:36, Li, Yunxiang (Teddy) wrote:

+void amdgpu_lock_hw_access(struct amdgpu_device *adev);
+void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
+int amdgpu_begin_hw_access(struct amdgpu_device *adev);
+void amdgpu_end_hw_access(struct amdgpu_device *adev);
Don't add anything to amdgpu.h. We are slowly decomposing that file.
Where would be a better place? I just wanted to have them next to
amdgpu_in_reset().

amdgpu_reset.h; if you have time, feel free to move amdgpu_in_reset() over there as well.


@@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
             goto skip_hw_reset;
     }

+   amdgpu_lock_hw_access(adev);
That should already be locked here. So this will most likely deadlock.

   retry:    /* Rest of adevs pre asic reset from XGMI hive. */
     list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
            r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
@@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
              */
             amdgpu_device_stop_pending_resets(tmp_adev);
     }
+   amdgpu_unlock_hw_access(adev);

   skip_hw_reset:
Don't add helpers for that, especially not with that name.

We don't block HW access, but just prevent concurrent resets.
This takes a different lock than reset_domain->sem. It is a separate
reset_domain->gpu_sem that is only locked when we will actually do the reset;
it is not taken in the skip_hw_reset path.

Exactly that is what you should *not* do. Please don't add any new lock to the code. This is already complicated enough.

If you think that adding wrappers for the reset lock makes sense then we can probably do that, but don't add any lock for HW access.



     uint32_t seq;

-   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
-       !down_read_trylock(&adev->reset_domain->sem)) {
+   /*
+    * A GPU reset should flush all TLBs anyway, so no need to do
+    * this while one is ongoing.
+    */
+   if (!amdgpu_begin_hw_access(adev))
+           return 0;

+   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
+       amdgpu_in_reset(adev)) {
That check doesn't make sense any more.
Same here, the two checks have different scope. Although I wasn't sure if the
original check is correct: is there a possibility that
!adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but
amdgpu_in_reset(adev) is true? And do we want to take this branch when that happens?

amdgpu_in_reset() is used incorrectly in quite a lot of places. It should only be used inside the HW backend code to distinguish between initial load and reset.

All other use cases, especially the ones in the IOCTL front end functions as well as here in the midlayer which isn't used by GPU reset, are incorrect.


@@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
     struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
     struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
     unsigned int ndw;
-   signed long r;
+   signed long r = 0;
Please don't initialize local variables if it isn't necessary.

             if (adev->gmc.flush_tlb_needs_extra_type_2)
                     adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
                                                              2, all_hub,
@@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
             adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
                                                      flush_type, all_hub,
                                                      inst);
-           return 0;
-   }
+   } else {
That the locking is missing here should be a separate bug fix independent of
the other changes.
I will split this off into a separate patch. Initializing r is needed because I
consolidated the return paths to drop the read lock.

In that case, better to set r on the path where it is otherwise not initialized.

Regards,
Christian.
