RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-24 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only]

Yes, the two places are 1. in debugfs and 2. in MI100's en/disable_debug_trap,
and evidently someone is testing the debugfs interface, because there is a bug
fix for a race condition in it.

Teddy


Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-24 Thread Christian König

On 24.05.24 15:35, Li, Yunxiang (Teddy) wrote:

[AMD Official Use Only - AMD Internal Distribution Only]


If that is true you could in theory lower the locked area of the existing lock,
but adding a new one is a strict no-go from my side.

I'll try this. Right now I see two places where this would be problematic,
because they are trying to suspend either the gfx or the kfd scheduler and later
resume it, which would race with the same operation in reset. It is exactly these
two functions I'm trying to avoid putting in the scope of the second lock, because
they are common to both inside and outside of reset.


I don't think that avoiding suspending and resuming the schedulers
is worth moving the locks.


That is something so rarely done that it shouldn't matter.

Christian.



Teddy




RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-24 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only]

> If that is true you could in theory lower the locked area of the existing
> lock, but adding a new one is a strict no-go from my side.

I'll try this. Right now I see two places where this would be problematic,
because they are trying to suspend either the gfx or the kfd scheduler and later
resume it, which would race with the same operation in reset. It is exactly these
two functions I'm trying to avoid putting in the scope of the second lock, because
they are common to both inside and outside of reset.

Teddy


Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-24 Thread Christian König

On 23.05.24 17:35, Li, Yunxiang (Teddy) wrote:

[Public]


Here we are taking a different lock than reset_domain->sem. It is a separate
reset_domain->gpu_sem that is only locked when we will actually do a reset; it is
not taken in the skip_hw_reset path.

Exactly that is what you should *not* do. Please don't add any new lock to the 
code. This is already complicated enough.

If you think that adding wrappers for the reset lock makes sense then we can
probably do that, but don't add any lock for hw access.

The two locks protect very different things though. The first case is that we need
to block two resets running in parallel,


No, that's not correct. Two parallel resets are prevented by using a
work queue for the resets.


The reset lock is here exactly to give external threads the
opportunity to make their operations mutually exclusive with the reset.



  this does not only cover the GPU reset itself but also any teardown that happens
before the GPU reset. The second case is that we need to ensure exclusive access to
the GPU between GPU reset and GPU init; concurrent access is fine before the GPU is
reset.


If that is true you could in theory lower the locked area of the
existing lock, but adding a new one is a strict no-go from my side.


Regards,
Christian.



Theoretically, the second case happens within the first case, so locking the first
case would protect against both. But with the current implementation this is
infeasible: all the generic functions called between
amdgpu_device_lock/unlock_reset_domain would need to be swapped out for special
versions so the reset thread does not deadlock itself. It is much simpler to have a
second, much narrower lock that only covers GPU reset <-> GPU init, because all
the accesses there are very low level anyway.

Teddy
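
For context, a minimal sketch of the gpu_sem scheme Teddy describes above (and
Christian objects to). The rwsem and the down/up calls mirror the posted diff;
the two example functions and their call sites are illustrative assumptions,
not code from the patch:

/* Reset side: gpu_sem is write-locked only around the actual HW reset and
 * re-init, while reset_domain->sem keeps serializing the whole recovery work.
 */
static void example_do_hw_reset(struct amdgpu_device *adev)
{
	down_write(&adev->reset_domain->gpu_sem);
	/* hard reset + minimal re-init; nothing here may take gpu_sem again */
	up_write(&adev->reset_domain->gpu_sem);
}

/* Everyone else try-locks and treats a failure as "HW access skipped"
 * instead of waiting for the reset to finish.
 */
static int example_hw_access(struct amdgpu_device *adev)
{
	if (!down_read_trylock(&adev->reset_domain->gpu_sem))
		return 0;	/* reset in flight; recovery will redo this work */
	/* ... register/ring access ... */
	up_read(&adev->reset_domain->gpu_sem);
	return 0;
}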




Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Alex Deucher
On Thu, May 23, 2024 at 11:32 AM Christian König wrote:
>
> On 23.05.24 13:36, Li, Yunxiang (Teddy) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> >>> +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> >>> +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> >>> +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> >>> +void amdgpu_end_hw_access(struct amdgpu_device *adev);
> >> Don't add anything to amdgpu.h. We are slowly decomposing that file.
> > Where would be a better place? I just wanted to have them next to 
> > amdgpu_in_reset
>
> amdgpu_reset.h. If you have time, feel free to move amdgpu_in_reset() over
> there as well.
>
> >
> >>> @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>  goto skip_hw_reset;
> >>>  }
> >>>
> >>> +   amdgpu_lock_hw_access(adev);
> >> That should already be locked here. So this will most likely deadlock.
> >>
> >>>retry:/* Rest of adevs pre asic reset from XGMI hive. */
> >>>  list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> >>>  r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> >>> @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>   */
> >>>  amdgpu_device_stop_pending_resets(tmp_adev);
> >>>  }
> >>> +   amdgpu_unlock_hw_access(adev);
> >>>
> >>>skip_hw_reset:
> >> Don't add helpers for that, especially not with that name.
> >>
> >> We don't block HW access, but just prevent concurrent resets.
> > Here we are taking a different lock than reset_domain->sem. It is a
> > separate reset_domain->gpu_sem that is only locked when we will actually do
> > a reset; it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock
> to the code. This is already complicated enough.
>
> If you think that adding wrappers for the reset lock makes sense then we can
> probably do that, but don't add any lock for hw access.
>
>
> >
> >>>  uint32_t seq;
> >>>
> >>> -   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> >>> -   !down_read_trylock(&adev->reset_domain->sem)) {
> >>> +   /*
> >>> +   * A GPU reset should flush all TLBs anyway, so no need to do
> >>> +   * this while one is ongoing.
> >>> +   */
> >>> +   if (!amdgpu_begin_hw_access(adev))
> >>> +   return 0;
> >>>
> >>> +   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> >>> +   amdgpu_in_reset(adev)) {
> >> That check doesn't make sense anymore.
> > Same here, the two checks are for different scopes, although I wasn't sure
> > whether the original check is correct: is there a possibility that
> > !adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but
> > amdgpu_in_reset(adev) is true? And do we want to take this branch when that
> > happens?
>
> amdgpu_in_reset() is used incorrectly in quite a lot of places. It should
> only be used inside the HW backend code to distinguish between initial load
> and reset.
>
> All other use cases, especially the ones in the IOCTL front end functions
> as well as here in the midlayer which isn't used by GPU reset, are incorrect.

FWIW, I started to clean this up earlier this year, but never got
around to finishing up the patches:
https://lists.freedesktop.org/archives/amd-gfx/2024-February/104582.html

Alex

>
> >
> >>> @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >>>  struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> >>>  struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> >>>  unsigned int ndw;
> >>> -   signed long r;
> >>> +   signed long r = 0;
> >> Please don't initialize local variables if it isn't necessary.
> >>
> >>>  if (adev->gmc.flush_tlb_needs_extra_type_2)
> >>>  adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >>>   2, all_hub,
> >>> @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> >>>  adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >>>   flush_type, all_hub,
> >>>   inst);
> >>> -   return 0;
> >>> -   }
> >>> +   } else {
> >> That the locking is missing here should be a separate bug fix independent 
> >> of other changes.
> > I will split this off into a separate patch; initializing r is needed
> > because I consolidated the return paths to drop the read lock.
>
> In that case better set r in the path where it's not initialized.
>
> Regards,
> Christian.
>
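
For reference, the usage Christian describes above boils down to the pattern
below: amdgpu_in_reset() only inside IP backend init code, to tell a re-init
after GPU reset apart from the initial driver load. This is a sketch under
that assumption, not code taken from any specific IP block:

static int example_ip_hw_init(struct amdgpu_device *adev)
{
	if (!amdgpu_in_reset(adev)) {
		/* first load only: allocate objects, load firmware, etc. */
	}

	/* programming that is done both on first load and on reset recovery */
	return 0;
}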


RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Li, Yunxiang (Teddy)
[Public]

> > Here we are taking a different lock than reset_domain->sem. It is a
> > separate reset_domain->gpu_sem that is only locked when we will actually do
> > a reset; it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock to 
> the code. This is already complicated enough.
>
> If you think that adding wrappers for the reset lock makes sense then we can
> probably do that, but don't add any lock for hw access.

The two locks protect very different things though. The first case is that we need
to block two resets running in parallel; this does not only cover the GPU reset
itself but also any teardown that happens before the GPU reset. The second case is
that we need to ensure exclusive access to the GPU between GPU reset and GPU init;
concurrent access is fine before the GPU is reset.

Theoretically, the second case happens within the first case, so locking the
first case would protect against both. But with the current implementation this
is infeasible: all the generic functions called between
amdgpu_device_lock/unlock_reset_domain would need to be swapped out for special
versions so the reset thread does not deadlock itself. It is much simpler to
have a second, much narrower lock that only covers GPU reset <-> GPU init, because
all the accesses there are very low level anyway.

Teddy


Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Christian König

On 23.05.24 13:36, Li, Yunxiang (Teddy) wrote:

[AMD Official Use Only - AMD Internal Distribution Only]


+void amdgpu_lock_hw_access(struct amdgpu_device *adev);
+void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
+int amdgpu_begin_hw_access(struct amdgpu_device *adev);
+void amdgpu_end_hw_access(struct amdgpu_device *adev);

Don't add anything to amdgpu.h. We are slowly decomposing that file.

Where would be a better place? I just wanted to have them next to 
amdgpu_in_reset


amdgpu_reset.h. If you have time, feel free to move amdgpu_in_reset() over
there as well.





@@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 goto skip_hw_reset;
 }

+   amdgpu_lock_hw_access(adev);

That should already be locked here. So this will most likely deadlock.


   retry:/* Rest of adevs pre asic reset from XGMI hive. */
 list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
 r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
@@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  */
 amdgpu_device_stop_pending_resets(tmp_adev);
 }
+   amdgpu_unlock_hw_access(adev);

   skip_hw_reset:

Don't add helpers for that, especially not with that name.

We don't block HW access, but just prevent concurrent resets.

Here we are taking a different lock than reset_domain->sem. It is a separate
reset_domain->gpu_sem that is only locked when we will actually do a reset; it is
not taken in the skip_hw_reset path.


Exactly that is what you should *not* do. Please don't add any new lock 
to the code. This is already complicated enough.


If you think that adding wrappers for the reset lock makes sense then we can
probably do that, but don't add any lock for hw access.






 uint32_t seq;

-   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
-   !down_read_trylock(&adev->reset_domain->sem)) {
+   /*
+   * A GPU reset should flush all TLBs anyway, so no need to do
+   * this while one is ongoing.
+   */
+   if (!amdgpu_begin_hw_access(adev))
+   return 0;

+   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
+   amdgpu_in_reset(adev)) {

That check doesn't make sense anymore.

Same here, the two checks are for different scopes, although I wasn't sure whether
the original check is correct: is there a possibility that
!adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but
amdgpu_in_reset(adev) is true? And do we want to take this branch when that happens?


amdgpu_in_reset() is used incorrectly in quite a lot of places. It should
only be used inside the HW backend code to distinguish between initial load
and reset.


All other use cases, especially the ones in the IOCTL front end functions
as well as here in the midlayer which isn't used by GPU reset, are incorrect.





@@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
 struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
 unsigned int ndw;
-   signed long r;
+   signed long r = 0;

Please don't initialize local variables if it isn't necessary.


 if (adev->gmc.flush_tlb_needs_extra_type_2)
 adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
  2, all_hub,
@@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
 adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
  flush_type, all_hub,
  inst);
-   return 0;
-   }
+   } else {

That the locking is missing here should be a separate bug fix independent of 
other changes.

I will split this off into a separate patch; initializing r is needed because I
consolidated the return paths to drop the read lock.


In that case better set r in the path where it's not initialized.

Regards,
Christian.
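
Roughly what "set r in the path where it's not initialized" looks like with the
consolidated return path (variable names follow the quoted hunks; the fence wait
and the error handling are assumptions about the surrounding function, not the
actual patch):

	signed long r;

	if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
	    amdgpu_in_reset(adev)) {
		/* direct flush: nothing to wait on, so set r here */
		adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
							 flush_type, all_hub,
							 inst);
		r = 0;
	} else {
		/* KIQ flush: wait on the emitted fence and map a timeout
		 * (return value < 1) to an error code
		 */
		r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);
		r = r < 1 ? -ETIME : 0;
	}

	/* single exit: both paths drop the read lock before returning r */
	amdgpu_end_hw_access(adev);
	return r;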



RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only]

> > +void amdgpu_lock_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
> > +int amdgpu_begin_hw_access(struct amdgpu_device *adev);
> > +void amdgpu_end_hw_access(struct amdgpu_device *adev);
>
> Don't add anything to amdgpu.h. We are slowly decomposing that file.

Where would be a better place? I just wanted to have them next to 
amdgpu_in_reset

> > @@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > goto skip_hw_reset;
> > }
> >
> > +   amdgpu_lock_hw_access(adev);
>
> That should already be locked here. So this will most likely deadlock.
>
> >   retry:/* Rest of adevs pre asic reset from XGMI hive. */
> > list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> > r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
> > @@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >  */
> > amdgpu_device_stop_pending_resets(tmp_adev);
> > }
> > +   amdgpu_unlock_hw_access(adev);
> >
> >   skip_hw_reset:
>
> Don't add helpers for that, especially not with that name.
>
> We don't block HW access, but just prevent concurrent resets.

Here we are taking a different lock than reset_domain->sem. It is a separate
reset_domain->gpu_sem that is only locked when we will actually do a reset; it is
not taken in the skip_hw_reset path.

> > uint32_t seq;
> >
> > -   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > -   !down_read_trylock(&adev->reset_domain->sem)) {
> > +   /*
> > +   * A GPU reset should flush all TLBs anyway, so no need to do
> > +   * this while one is ongoing.
> > +   */
> > +   if (!amdgpu_begin_hw_access(adev))
> > +   return 0;
> >
> > +   if (!adev->gmc.flush_pasid_uses_kiq || !ring->sched.ready ||
> > +   amdgpu_in_reset(adev)) {
>
> That check doesn't make sense anymore.

Same here, the two checks are for different scopes, although I wasn't sure
whether the original check is correct: is there a possibility that
!adev->gmc.flush_pasid_uses_kiq and !ring->sched.ready are false but
amdgpu_in_reset(adev) is true? And do we want to take this branch when that
happens?

> >
> > @@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;
> > struct amdgpu_kiq *kiq = &adev->gfx.kiq[inst];
> > unsigned int ndw;
> > -   signed long r;
> > +   signed long r = 0;
>
> Please don't initialize local variables if it isn't necessary.
>
> > if (adev->gmc.flush_tlb_needs_extra_type_2)
> > adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >  2, all_hub,
> > @@ -703,46 +709,42 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,
> > adev->gmc.gmc_funcs->flush_gpu_tlb_pasid(adev, pasid,
> >  flush_type, all_hub,
> >  inst);
> > -   return 0;
> > -   }
> > +   } else {
>
> That the locking is missing here should be a separate bug fix independent of 
> other changes.

I will split this off into a separate patch; initializing r is needed because I
consolidated the return paths to drop the read lock.


Re: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Christian König

On 22.05.24 19:27, Yunxiang Li wrote:

Random accesses to the GPU while it is not re-initialized can lead to a
bad time. So add an rwsem to prevent such accesses. Normal accesses will
now take the read lock for shared GPU access; reset takes the write lock
for exclusive GPU access.

Care needs to be taken so that the recovery thread does not take the read
lock and deadlock itself, and normal accesses should avoid waiting on the
reset to finish and should instead treat the hardware access as failed.

Signed-off-by: Yunxiang Li 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  5 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 22 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c   | 74 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  1 +
  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |  7 +-
  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |  7 +-
  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  3 +-
  .../amd/amdkfd/kfd_process_queue_manager.c|  8 +-
  9 files changed, 79 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 1f71c7b98d77..5a089e2dec2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1632,6 +1632,11 @@ static inline bool amdgpu_is_tmz(struct amdgpu_device *adev)
  
  int amdgpu_in_reset(struct amdgpu_device *adev);
  
+void amdgpu_lock_hw_access(struct amdgpu_device *adev);

+void amdgpu_unlock_hw_access(struct amdgpu_device *adev);
+int amdgpu_begin_hw_access(struct amdgpu_device *adev);
+void amdgpu_end_hw_access(struct amdgpu_device *adev);
+


Don't add anything to amdgpu.h. We are slowly decomposing that file.


  extern const struct attribute_group amdgpu_vram_mgr_attr_group;
  extern const struct attribute_group amdgpu_gtt_mgr_attr_group;
  extern const struct attribute_group amdgpu_flash_attr_group;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e74789691070..057d735c7cae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5816,6 +5816,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
goto skip_hw_reset;
}
  
+	amdgpu_lock_hw_access(adev);


That should already be locked here. So this will most likely deadlock.


  retry:/* Rest of adevs pre asic reset from XGMI hive. */
list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
@@ -5852,6 +5853,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 */
amdgpu_device_stop_pending_resets(tmp_adev);
}
+   amdgpu_unlock_hw_access(adev);
  
  skip_hw_reset:
  
@@ -6449,6 +6451,26 @@ int amdgpu_in_reset(struct amdgpu_device *adev)

	return atomic_read(&adev->reset_domain->in_gpu_reset);
  }
  
+void amdgpu_lock_hw_access(struct amdgpu_device *adev)
+{
+	down_write(&adev->reset_domain->gpu_sem);
+}
+
+void amdgpu_unlock_hw_access(struct amdgpu_device *adev)
+{
+	up_write(&adev->reset_domain->gpu_sem);
+}
+
+int amdgpu_begin_hw_access(struct amdgpu_device *adev)
+{
+	return down_read_trylock(&adev->reset_domain->gpu_sem);
+}
+
+void amdgpu_end_hw_access(struct amdgpu_device *adev)
+{
+	up_read(&adev->reset_domain->gpu_sem);
+}
+


Don't add helpers for that, especially not with that name.

We don't block HW access, but just prevent concurrent resets.


  /**
   * amdgpu_device_halt() - bring hardware to some kind of halt state
   *
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 603c0738fd03..098755db9d20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -623,12 +623,11 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
!adev->mman.buffer_funcs_enabled ||
!adev->ib_pool_ready || amdgpu_in_reset(adev) ||
!ring->sched.ready) {
-
/*
 * A GPU reset should flush all TLBs anyway, so no need to do
 * this while one is ongoing.
 */
-	if (!down_read_trylock(&adev->reset_domain->sem))
+   if (!amdgpu_begin_hw_access(adev))
return;
  
  		if (adev->gmc.flush_tlb_needs_extra_type_2)

@@ -641,7 +640,8 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
  
 		adev->gmc.gmc_funcs->flush_gpu_tlb(adev, vmid, vmhub,
 						   flush_type);
-		up_read(&adev->reset_domain->sem);
+
+   amdgpu_end_hw_access(adev);
return;
}
  
@@ -684,12 +684,18 @@ int amdgpu_gmc_flush_gpu_tlb_pasid(struct amdgpu_device *adev, uint16_t pasid,

	struct amdgpu_ring *ring = &adev->gfx.kiq[inst].ring;