RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
[AMD Official Use Only - AMD Internal Distribution Only]

Yes, the two places are 1. in debugfs and 2. in MI100's en/disable_debug_trap, and evidently someone is testing the debugfs interface, because there is a bug fix for a race condition in it.

Teddy
Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
Am 24.05.24 um 15:35 schrieb Li, Yunxiang (Teddy):
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> If that is true you could in theory lower the locked area of the existing
>> lock, but adding a new one is a strict no-go from my side.
>
> I'll try this. Right now I see two places where this would be problematic,
> because they try to suspend either the gfx or kfd scheduler and later
> resume it, which would race with the same in reset. It is exactly these
> two functions I'm trying to avoid putting in the scope of the second lock,
> because they are common to both inside and outside reset.

I don't think that avoiding suspending and resuming the schedulers is worth moving the locks. That is something done so rarely that it shouldn't matter.

Christian.

> Teddy
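[Editor's illustration] To make the race concrete: a minimal sketch of the kind of shared helper Teddy means, using a plain rwsem standing in for reset_domain->sem. The name and body are illustrative only, not the actual driver code:

#include <linux/rwsem.h>

static DECLARE_RWSEM(reset_sem);	/* stand-in for adev->reset_domain->sem */

/*
 * A helper common to the reset path and normal paths. If it is put
 * under the reset lock, the reset worker -- which already holds the
 * lock for writing -- deadlocks the moment it calls the helper.
 */
static int suspend_scheduler_common(void)
{
	if (!down_read_trylock(&reset_sem))
		return -EBUSY;		/* a reset is in flight */
	/* ... suspend the gfx/kfd scheduler here ... */
	up_read(&reset_sem);
	return 0;
}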
RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
[AMD Official Use Only - AMD Internal Distribution Only]

> If that is true you could in theory lower the locked area of the existing
> lock, but adding a new one is strict no-go from my side.

I'll try this. Right now I see two places where this would be problematic, because they try to suspend either the gfx or kfd scheduler and later resume it, which would race with the same in reset. It is exactly these two functions I'm trying to avoid putting in the scope of the second lock, because they are common to both inside and outside reset.

Teddy
Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
Am 23.05.24 um 17:35 schrieb Li, Yunxiang (Teddy):
> [Public]
>
>> Here is taking a different lock than the reset_domain->sem. It is a
>> separate reset_domain->gpu_sem that is only locked when we will actually
>> do a reset; it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock
> to the code. This is already complicated enough.
>
> If you think that adding wrappers for the reset lock makes sense then we
> can probably do that, but don't add any lock for hw access.
>
> The two locks protect very different things though. The first case is
> that we need to block two resets running in parallel,

No, that's not correct. Two parallel resets are prevented by using a worker queue for the resets.

The reset lock is here exactly to give external threads the opportunity to make their operation mutually exclusive with the reset.

> this does not only cover the GPU reset itself but also any teardown that
> happens before the GPU reset. The second case is that we need to ensure
> exclusive access to the GPU between GPU reset and GPU init; concurrent
> access is fine before the GPU is reset.

If that is true you could in theory lower the locked area of the existing lock, but adding a new one is a strict no-go from my side.

Regards,
Christian.

> Theoretically, the second case happens within the first case, so locking
> the first case would protect against both. But with the current
> implementation this is infeasible: all the generic functions called
> between amdgpu_device_lock/unlock_reset_domain would need to be swapped
> out for special versions so that the reset thread does not deadlock
> itself. It is much simpler to have a second, much narrower lock that only
> covers GPU reset<->GPU init, because all the accesses there are very low
> level anyway.
>
> Teddy
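[Editor's illustration] Boiled down, the split Teddy is proposing looks like this. The gpu_sem field follows patch 4/4; this is a sketch of the intent, not the final code:

/*
 * Reset side: take gpu_sem only around the actual HW reset and re-init,
 * inside the wider reset_domain->sem scope.
 */
down_write(&adev->reset_domain->gpu_sem);
/* ... asic reset and hardware re-init ... */
up_write(&adev->reset_domain->gpu_sem);

/* Normal side: opportunistic access; failure means a reset owns the HW. */
if (!down_read_trylock(&adev->reset_domain->gpu_sem))
	return 0;	/* treat the HW access as skipped/failed */
/* ... program the hardware ... */
up_read(&adev->reset_domain->gpu_sem);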
Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
On Thu, May 23, 2024 at 11:32 AM Christian König wrote:
>
> Am 23.05.24 um 13:36 schrieb Li, Yunxiang (Teddy):
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >>> +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> >>> +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> >>> +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> >>> +void amdgpu_end_hw_access(struct amdgpu_device *adev);
> >>
> >> Don't add anything to amdgpu.h. We are slowly decomposing that file.
> >
> > Where would be a better place? I just wanted to have them next to
> > amdgpu_in_reset
>
> amdgpu_reset.h, if you have time feel free to move amdgpu_in_reset() over
> there as well.
>
> >>> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>> 			goto skip_hw_reset;
> >>> 	}
> >>>
> >>> +	amdgpu_lock_hw_access(adev);
> >>
> >> That should already be locked here. So this will most likely deadlock.
> >>
> >>> retry:	/* Rest of adevs pre asic reset from XGMI hive. */
> >>> 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> >>> 		r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> >>> @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>> 		 */
> >>> 		amdgpu_device_stop_pending_resets(tmp_adev);
> >>> 	}
> >>> +	amdgpu_unlock_hw_access(adev);
> >>>
> >>> skip_hw_reset:
> >>
> >> Don't add helpers for that, especially not with that name.
> >>
> >> We don't block HW access, but just prevent concurrent resets.
> >
> > Here is taking a different lock than the reset_domain->sem. It is a
> > separate reset_domain->gpu_sem that is only locked when we will actually
> > do a reset; it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock
> to the code. This is already complicated enough.
>
> If you think that adding wrappers for the reset lock makes sense then we
> can probably do that, but don't add any lock for hw access.
>
> >>> 	uint32_t seq;
> >>>
> >>> -	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> >>> -	    !down_read_trylock(&adev->reset_domain->sem)) {
> >>> +	/*
> >>> +	 * A GPU reset should flush all TLBs anyway, so no need to do
> >>> +	 * this while one is ongoing.
> >>> +	 */
> >>> +	if (!amdgpu_begin_hw_access(adev))
> >>> +		return 0;
> >>>
> >>> +	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> >>> +	    amdgpu_in_reset(adev)) {
> >>
> >> That check doesn't make sense anymore.
> >
> > Same here, the two checks are for different scopes, although I wasn't
> > sure if the original check is correct or not. Is there a possibility
> > that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false
> > but amdgpu_in_reset(adev) is true? And do we want to take this branch
> > when that happens?
>
> amdgpu_in_reset() is used incorrectly in quite a lot of places. It should
> only be used inside the HW backend code to distinguish between initial
> load and reset.
>
> All other use cases, especially the ones in the IOCTL front end functions
> as well as here in the midlayer (which isn't used by GPU reset), are
> incorrect.

FWIW, I started to clean this up earlier this year, but never got around to finishing up the patches:
https://lists.freedesktop.org/archives/amd-gfx/2024-February/104582.html

Alex

> >>> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >>> 	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> >>> 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> >>> 	unsigned int ndw;
> >>> -	signed long r;
> >>> +	signed long r = 0;
> >>
> >> Please don't initialize local variables if it isn't necessary.
> >>
> >>> 	if (adev->gmc.flush_tlb_needs_extra_type_2)
> >>> 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >>> 							 2, all_hub,
> >>> @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >>> 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >>> 							 flush_type, all_hub,
> >>> 							 inst);
> >>> -	return 0;
> >>> -	}
> >>> +	} else {
> >>
> >> That the locking is missing here should be a separate bug fix
> >> independent of other changes.
> >
> > I will split this off into a separate patch; initializing r is needed
> > because I consolidated the return paths to drop the read lock.
>
> In that case better set r where it's not initialized in some path.
>
> Regards,
> Christian.
>
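[Editor's illustration] For reference, the one pattern Christian describes as legitimate for amdgpu_in_reset(): HW backend init code distinguishing first driver load from re-init after reset. A minimal sketch with a hypothetical IP block, not actual driver code:

static int example_ip_hw_init(struct amdgpu_device *adev)
{
	if (amdgpu_in_reset(adev)) {
		/* re-init after reset: skip one-time setup, restore saved state */
	} else {
		/* first driver load: full bring-up */
	}
	return 0;
}

Front-end or midlayer code checking amdgpu_in_reset() to decide whether to touch the hardware is exactly the misuse being called out above.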
RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
[Public]

>> Here is taking a different lock than the reset_domain->sem. It is a
>> separate reset_domain->gpu_sem that is only locked when we will actually
>> do a reset; it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock to
> the code. This is already complicated enough.
>
> If you think that adding wrappers for the reset lock makes sense then we
> can probably do that, but don't add any lock for hw access.

The two locks protect very different things though. The first case is that we need to block two resets running in parallel; this does not only cover the GPU reset itself but also any teardown that happens before the GPU reset. The second case is that we need to ensure exclusive access to the GPU between GPU reset and GPU init; concurrent access is fine before the GPU is reset.

Theoretically, the second case happens within the first case, so locking the first case would protect against both. But with the current implementation this is infeasible: all the generic functions called between amdgpu_device_lock/unlock_reset_domain would need to be swapped out for special versions so that the reset thread does not deadlock itself. It is much simpler to have a second, much narrower lock that only covers GPU reset<->GPU init, because all the accesses there are very low level anyway.

Teddy
Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
Am 23.05.24 um 13:36 schrieb Li, Yunxiang (Teddy):
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>>> +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
>>> +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
>>> +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
>>> +void amdgpu_end_hw_access(struct amdgpu_device *adev);
>>
>> Don't add anything to amdgpu.h. We are slowly decomposing that file.
>
> Where would be a better place? I just wanted to have them next to
> amdgpu_in_reset

amdgpu_reset.h, if you have time feel free to move amdgpu_in_reset() over there as well.

>>> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> 			goto skip_hw_reset;
>>> 	}
>>>
>>> +	amdgpu_lock_hw_access(adev);
>>
>> That should already be locked here. So this will most likely deadlock.
>>
>>> retry:	/* Rest of adevs pre asic reset from XGMI hive. */
>>> 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
>>> 		r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
>>> @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>> 		 */
>>> 		amdgpu_device_stop_pending_resets(tmp_adev);
>>> 	}
>>> +	amdgpu_unlock_hw_access(adev);
>>>
>>> skip_hw_reset:
>>
>> Don't add helpers for that, especially not with that name.
>>
>> We don't block HW access, but just prevent concurrent resets.
>
> Here is taking a different lock than the reset_domain->sem. It is a
> separate reset_domain->gpu_sem that is only locked when we will actually
> do a reset; it is not taken in the skip_hw_reset path.

Exactly that is what you should *not* do. Please don't add any new lock to the code. This is already complicated enough.

If you think that adding wrappers for the reset lock makes sense then we can probably do that, but don't add any lock for hw access.

>>> 	uint32_t seq;
>>>
>>> -	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
>>> -	    !down_read_trylock(&adev->reset_domain->sem)) {
>>> +	/*
>>> +	 * A GPU reset should flush all TLBs anyway, so no need to do
>>> +	 * this while one is ongoing.
>>> +	 */
>>> +	if (!amdgpu_begin_hw_access(adev))
>>> +		return 0;
>>>
>>> +	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
>>> +	    amdgpu_in_reset(adev)) {
>>
>> That check doesn't make sense anymore.
>
> Same here, the two checks are for different scopes, although I wasn't sure
> if the original check is correct or not. Is there a possibility that
> !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but
> amdgpu_in_reset(adev) is true? And do we want to take this branch when
> that happens?

amdgpu_in_reset() is used incorrectly in quite a lot of places. It should only be used inside the HW backend code to distinguish between initial load and reset.

All other use cases, especially the ones in the IOCTL front end functions as well as here in the midlayer (which isn't used by GPU reset), are incorrect.

>>> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>>> 	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
>>> 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
>>> 	unsigned int ndw;
>>> -	signed long r;
>>> +	signed long r = 0;
>>
>> Please don't initialize local variables if it isn't necessary.
>>
>>> 	if (adev->gmc.flush_tlb_needs_extra_type_2)
>>> 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>>> 							 2, all_hub,
>>> @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>>> 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
>>> 							 flush_type, all_hub,
>>> 							 inst);
>>> -	return 0;
>>> -	}
>>> +	} else {
>>
>> That the locking is missing here should be a separate bug fix independent
>> of other changes.
>
> I will split this off into a separate patch; initializing r is needed
> because I consolidated the return paths to drop the read lock.

In that case better set r where it's not initialized in some path.

Regards,
Christian.
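[Editor's illustration] What Christian is asking for, roughly: drop the initializer and set r explicitly on the path that doesn't otherwise assign it, so the consolidated unlock still returns a defined value. A sketch of the shape, abridged from the patch context with the kiq submission elided:

	signed long r;		/* no initializer needed */

	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
	    amdgpu_in_reset(adev)) {
		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
							 flush_type, all_hub,
							 inst);
		r = 0;		/* set explicitly on this path */
	} else {
		/* ... emit the flush on the kiq ring and wait, setting r ... */
	}

	up_read(&adev->reset_domain->sem);	/* single unlock point */
	return r;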
RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
[AMD Official Use Only - AMD Internal Distribution Only]

> > +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> > +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_end_hw_access(struct amdgpu_device *adev);
>
> Don't add anything to amdgpu.h. We are slowly decomposing that file.

Where would be a better place? I just wanted to have them next to amdgpu_in_reset.

> > @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > 			goto skip_hw_reset;
> > 	}
> >
> > +	amdgpu_lock_hw_access(adev);
>
> That should already be locked here. So this will most likely deadlock.
>
> > retry:	/* Rest of adevs pre asic reset from XGMI hive. */
> > 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> > 		r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> > @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > 		 */
> > 		amdgpu_device_stop_pending_resets(tmp_adev);
> > 	}
> > +	amdgpu_unlock_hw_access(adev);
> >
> > skip_hw_reset:
>
> Don't add helpers for that, especially not with that name.
>
> We don't block HW access, but just prevent concurrent resets.

Here is taking a different lock than the reset_domain->sem. It is a separate reset_domain->gpu_sem that is only locked when we will actually do a reset; it is not taken in the skip_hw_reset path.

> > 	uint32_t seq;
> >
> > -	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > -	    !down_read_trylock(&adev->reset_domain->sem)) {
> > +	/*
> > +	 * A GPU reset should flush all TLBs anyway, so no need to do
> > +	 * this while one is ongoing.
> > +	 */
> > +	if (!amdgpu_begin_hw_access(adev))
> > +		return 0;
> >
> > +	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > +	    amdgpu_in_reset(adev)) {
>
> That check doesn't make sense anymore.

Same here, the two checks are for different scopes, although I wasn't sure if the original check is correct or not. Is there a possibility that !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but amdgpu_in_reset(adev) is true? And do we want to take this branch when that happens?

> > @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > 	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> > 	struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> > 	unsigned int ndw;
> > -	signed long r;
> > +	signed long r = 0;
>
> Please don't initialize local variables if it isn't necessary.
>
> > 	if (adev->gmc.flush_tlb_needs_extra_type_2)
> > 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> > 							 2, all_hub,
> > @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > 		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> > 							 flush_type, all_hub,
> > 							 inst);
> > -	return 0;
> > -	}
> > +	} else {
>
> That the locking is missing here should be a separate bug fix independent of
> other changes.

I will split this off into a separate patch; initializing r is needed because I consolidated the return paths to drop the read lock.
Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery
Am 22.05.24 um 19:27 schrieb Yunxiang Li:
> Random accesses to the GPU while it is not re-initialized can lead to a
> bad time. So add an rwsem to prevent such accesses. Normal accesses will
> now take the read lock for shared GPU access; reset takes the write lock
> for exclusive GPU access. Care needs to be taken so that the recovery
> thread does not take the read lock and deadlock itself, and normal access
> should avoid waiting on the reset to finish and should instead treat the
> hardware access as failed.
>
> Signed-off-by: Yunxiang Li
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  5 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 22 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c       | 74 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c     |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h     |  1 +
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  7 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c         |  7 +-
>  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  3 +-
>  .../amd/amdkfd/kfd_process_queue_manager.c    |  8 +-
>  9 files changed, 79 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 1f71c7b98d77..5a089e2dec2a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1632,6 +1632,11 @@ static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)
>
>  int amdgpu_in_reset(struct amdgpu_device *adev);
>
> +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> +void amdgpu_end_hw_access(struct amdgpu_device *adev);
> +

Don't add anything to amdgpu.h. We are slowly decomposing that file.

>  extern const struct attribute_group amdgpu_vram_mgr_attr_group;
>  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
>  extern const struct attribute_group amdgpu_flash_attr_group;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e74789691070..057d735c7cae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  			goto skip_hw_reset;
>  	}
>
> +	amdgpu_lock_hw_access(adev);

That should already be locked here. So this will most likely deadlock.

> retry:	/* Rest of adevs pre asic reset from XGMI hive. */
>  	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
>  		r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  		 */
>  		amdgpu_device_stop_pending_resets(tmp_adev);
>  	}
> +	amdgpu_unlock_hw_access(adev);
>
> skip_hw_reset:
> @@ -6449,6 +6451,26 @@ int amdgpu_in_reset(struct amdgpu_device *adev)
>  	return atomic_read(&adev->reset_domain->in_gpu_reset);
>  }
>
> +void amdgpu_lock_hw_access(struct amdgpu_device *adev)
> +{
> +	down_write(&adev->reset_domain->gpu_sem);
> +}
> +
> +void amdgpu_unlock_hw_access(struct amdgpu_device *adev)
> +{
> +	up_write(&adev->reset_domain->gpu_sem);
> +}
> +
> +int amdgpu_begin_hw_access(struct amdgpu_device *adev)
> +{
> +	return down_read_trylock(&adev->reset_domain->gpu_sem);
> +}
> +
> +void amdgpu_end_hw_access(struct amdgpu_device *adev)
> +{
> +	up_read(&adev->reset_domain->gpu_sem);
> +}
> +

Don't add helpers for that, especially not with that name.

We don't block HW access, but just prevent concurrent resets.

>  /**
>   * amdgpu_device_halt() - bring hardware to some kind of halt state
>   *
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index 603c0738fd03..098755db9d20 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -623,12 +623,11 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>  	    !adev->mman.buffer_funcs_enabled || !adev->ib_pool_ready ||
>  	    amdgpu_in_reset(adev) || !ring->sched.ready) {
> -
>  		/*
>  		 * A GPU reset should flush all TLBs anyway, so no need to do
>  		 * this while one is ongoing.
>  		 */
> -		if (!down_read_trylock(&adev->reset_domain->sem))
> +		if (!amdgpu_begin_hw_access(adev))
>  			return;
>
>  		if (adev->gmc.flush_tlb_needs_extra_type_2)
> @@ -641,7 +640,8 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
>  			adev->gmc.gmc_funcs->flush_gpu_tlb(adev, vmid,
>  							   vmhub, flush_type);
> -		up_read(&adev->reset_domain->sem);
> +
> +		amdgpu_end_hw_access(adev);
>  		return;
>  	}
>
> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
>  	struct amdgpu_ring *ring =
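[Editor's illustration] For completeness, the caller-side shape the patch's helpers imply. This is a sketch; the register read is a placeholder, not a path the patch touches:

static int example_read_hw(struct amdgpu_device *adev, u32 reg, u32 *val)
{
	/*
	 * amdgpu_begin_hw_access() returns nonzero when the read lock was
	 * taken, 0 when a reset currently owns the GPU exclusively.
	 */
	if (!amdgpu_begin_hw_access(adev))
		return -EBUSY;	/* don't wait for reset; report failure */

	*val = RREG32(reg);	/* placeholder HW access */

	amdgpu_end_hw_access(adev);
	return 0;
}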