Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-20 Thread Andrey Grodzovsky

On 2021-12-20 12:06 p.m., Liu, Shaoyun wrote:


[AMD Official Use Only]


Hi , Andrey
I actually has some concerns about this  change .
1.  on SRIOV configuration , the reset notify coming  from host , and driver 
already trigger a work queue to handle the reset (check 
xgpu_*_mailbox_flr_work) , is it a good idea to trigger another work queue 
inside the work queue ?  Can  we just use the  new one  you added ?



Shouldn't be a problem,  i will change. In fact it's a great idea 
because then it looks like we can totally drop 'adev->in_gpu_reset' 
since we don't need to lock again concurrent resets anymore




2. For KFD,  the rocm use the user queue for the submission and it won't call 
the drm scheduler  and hence no job timeout.  Can  we handle that with  your 
new change ?



I think that not a problem - a lot of places use direct submissions and 
not scheduler, in case they need to synchronize against concurrent GPU 
resets they lock adev->reset_sem. Nothing changes in this sense.



  
3 . For XGMI  hive, there is only hive  reset for all devices on bare-metal  config ,  but for SRIOV config , the VF will support VF FLR, which means host might only need to reset specific device instead trigger whole hive reset . So we might still need  reset_domain for individual device within the hive for SRIOV configuration.



This is something future right ? I don't see it in the code - in this 
case we will have to account for this as part of the generic design for 
this kind of single device reset within XGMI hive. It should require 
only a minor addition to current design in creating 2 parallel reset 
domains - one for hive and one per device.





Anyway I think this change need to be verified on sriov configuration on XGMI 
with  some rocm use app is running .



I do have XGMI setup where I still test XGMI resets. It has ROCm stack 
there - can you please login there and tell me if I have what needed 
there to do the tests you advise ? I am not very familiar with ROCm 
tools as i usually test using libdrm. (Ping me on teams for the device 
ip and user name)


Andrey




Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Monday, December 20, 2021 2:25 AM
To: Grodzovsky, Andrey ; 
dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org
Cc: dan...@ffwll.ch; Chen, Horace ; Koenig, Christian 
; Liu, Monk 
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:

This patchset is based on earlier work by Boris[1] that allowed to
have an ordered workqueue at the driver level that will be used by the
different schedulers to queue their timeout work. On top of that I
also serialized any GPU reset we trigger from within amdgpu code to
also go through the same ordered wq and in this way simplify somewhat
our GPU reset code so we don't need to protect from concurrency by
multiple GPU reset triggeres such as TDR on one hand and sysfs trigger or RAS 
trigger on the other hand.

As advised by Christian and Daniel I defined a reset_domain struct
such that all the entities that go through reset together will be
serialized one against another.

TDR triggered by multiple entities within the same domain due to the
same reason will not be triggered as the first such reset will cancel
all the pending resets. This is relevant only to TDR timers and not to
triggered resets coming from RAS or SYSFS, those will still happen after the in 
flight resets finishes.

[1]
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatc
hwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210629073510.276439
1-3-boris.brezillon%40collabora.com%2Fdata=04%7C01%7CShaoyun.Liu%
40amd.com%7C1d2b07ad556b4da5d58808d9c389decf%7C3dd8961fe4884e608e11a82
d994e183d%7C0%7C0%7C637755819206627827%7CUnknown%7CTWFpbGZsb3d8eyJWIjo
iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000
;sdata=8C8UbdPmM%2FH6sdTYDP5lZfRfBdQ%2B%2FN7m6s%2FREW8%2BsoM%3Dre
served=0

P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work 
hasn't landed yet there.

Patches #1 and #5, #6 are Reviewed-by: Christian König 


Some minor comments on the rest, but in general absolutely looks like the way 
we want to go.

Regards,
Christian.


Andrey Grodzovsky (6):
drm/amdgpu: Init GPU reset single threaded wq
drm/amdgpu: Move scheduler init to after XGMI is ready
drm/amdgpu: Fix crash on modprobe
drm/amdgpu: Serialize non TDR gpu recovery with TDRs
drm/amdgpu: Drop hive->in_reset
drm/amdgpu: Drop concurrent GPU reset protection for device

   drivers/gpu/drm/amd/amdgpu/amdgpu.h|   9 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++--
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|   2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
   

RE: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-20 Thread Liu, Shaoyun
[AMD Official Use Only]


Hi , Andrey 
I actually has some concerns about this  change . 
1.  on SRIOV configuration , the reset notify coming  from host , and driver 
already trigger a work queue to handle the reset (check 
xgpu_*_mailbox_flr_work) , is it a good idea to trigger another work queue 
inside the work queue ?  Can  we just use the  new one  you added ? 
2. For KFD,  the rocm use the user queue for the submission and it won't call 
the drm scheduler  and hence no job timeout.  Can  we handle that with  your 
new change ? 
3 . For XGMI  hive, there is only hive  reset for all devices on bare-metal  
config ,  but for SRIOV config , the VF will support VF FLR, which means host 
might only need to reset specific device instead trigger whole hive reset . So 
we might still need  reset_domain for individual device within the hive for 
SRIOV configuration. 

Anyway I think this change need to be verified on sriov configuration on XGMI 
with  some rocm use app is running . 

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Monday, December 20, 2021 2:25 AM
To: Grodzovsky, Andrey ; 
dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org
Cc: dan...@ffwll.ch; Chen, Horace ; Koenig, Christian 
; Liu, Monk 
Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:
> This patchset is based on earlier work by Boris[1] that allowed to 
> have an ordered workqueue at the driver level that will be used by the 
> different schedulers to queue their timeout work. On top of that I 
> also serialized any GPU reset we trigger from within amdgpu code to 
> also go through the same ordered wq and in this way simplify somewhat 
> our GPU reset code so we don't need to protect from concurrency by 
> multiple GPU reset triggeres such as TDR on one hand and sysfs trigger or RAS 
> trigger on the other hand.
>
> As advised by Christian and Daniel I defined a reset_domain struct 
> such that all the entities that go through reset together will be 
> serialized one against another.
>
> TDR triggered by multiple entities within the same domain due to the 
> same reason will not be triggered as the first such reset will cancel 
> all the pending resets. This is relevant only to TDR timers and not to 
> triggered resets coming from RAS or SYSFS, those will still happen after the 
> in flight resets finishes.
>
> [1] 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatc
> hwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210629073510.276439
> 1-3-boris.brezillon%40collabora.com%2Fdata=04%7C01%7CShaoyun.Liu%
> 40amd.com%7C1d2b07ad556b4da5d58808d9c389decf%7C3dd8961fe4884e608e11a82
> d994e183d%7C0%7C0%7C637755819206627827%7CUnknown%7CTWFpbGZsb3d8eyJWIjo
> iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000
> ;sdata=8C8UbdPmM%2FH6sdTYDP5lZfRfBdQ%2B%2FN7m6s%2FREW8%2BsoM%3Dre
> served=0
>
> P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work 
> hasn't landed yet there.

Patches #1 and #5, #6 are Reviewed-by: Christian König 


Some minor comments on the rest, but in general absolutely looks like the way 
we want to go.

Regards,
Christian.

>
> Andrey Grodzovsky (6):
>drm/amdgpu: Init GPU reset single threaded wq
>drm/amdgpu: Move scheduler init to after XGMI is ready
>drm/amdgpu: Fix crash on modprobe
>drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>drm/amdgpu: Drop hive->in_reset
>drm/amdgpu: Drop concurrent GPU reset protection for device
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h|   9 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
>   7 files changed, 132 insertions(+), 136 deletions(-)
>


Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-20 Thread Daniel Vetter
On Mon, Dec 20, 2021 at 08:25:05AM +0100, Christian König wrote:
> Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:
> > This patchset is based on earlier work by Boris[1] that allowed to have an
> > ordered workqueue at the driver level that will be used by the different
> > schedulers to queue their timeout work. On top of that I also serialized
> > any GPU reset we trigger from within amdgpu code to also go through the same
> > ordered wq and in this way simplify somewhat our GPU reset code so we don't 
> > need
> > to protect from concurrency by multiple GPU reset triggeres such as TDR on 
> > one
> > hand and sysfs trigger or RAS trigger on the other hand.
> > 
> > As advised by Christian and Daniel I defined a reset_domain struct such that
> > all the entities that go through reset together will be serialized one 
> > against
> > another.
> > 
> > TDR triggered by multiple entities within the same domain due to the same 
> > reason will not
> > be triggered as the first such reset will cancel all the pending resets. 
> > This is
> > relevant only to TDR timers and not to triggered resets coming from RAS or 
> > SYSFS,
> > those will still happen after the in flight resets finishes.
> > 
> > [1] 
> > https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezil...@collabora.com/
> > 
> > P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work 
> > hasn't landed yet there.
> 
> Patches #1 and #5, #6 are Reviewed-by: Christian König
> 
> 
> Some minor comments on the rest, but in general absolutely looks like the
> way we want to go.

I only scrolled through quickly, but yeah I'm concurring.
-Daniel
> 
> Regards,
> Christian.
> 
> > 
> > Andrey Grodzovsky (6):
> >drm/amdgpu: Init GPU reset single threaded wq
> >drm/amdgpu: Move scheduler init to after XGMI is ready
> >drm/amdgpu: Fix crash on modprobe
> >drm/amdgpu: Serialize non TDR gpu recovery with TDRs
> >drm/amdgpu: Drop hive->in_reset
> >drm/amdgpu: Drop concurrent GPU reset protection for device
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h|   9 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++--
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|   2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
> >   7 files changed, 132 insertions(+), 136 deletions(-)
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-19 Thread Christian König

Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky:

This patchset is based on earlier work by Boris[1] that allowed to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. On top of that I also serialized
any GPU reset we trigger from within amdgpu code to also go through the same
ordered wq and in this way simplify somewhat our GPU reset code so we don't need
to protect from concurrency by multiple GPU reset triggeres such as TDR on one
hand and sysfs trigger or RAS trigger on the other hand.

As advised by Christian and Daniel I defined a reset_domain struct such that
all the entities that go through reset together will be serialized one against
another.

TDR triggered by multiple entities within the same domain due to the same 
reason will not
be triggered as the first such reset will cancel all the pending resets. This is
relevant only to TDR timers and not to triggered resets coming from RAS or 
SYSFS,
those will still happen after the in flight resets finishes.

[1] 
https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezil...@collabora.com/

P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work 
hasn't landed yet there.


Patches #1 and #5, #6 are Reviewed-by: Christian König 



Some minor comments on the rest, but in general absolutely looks like 
the way we want to go.


Regards,
Christian.



Andrey Grodzovsky (6):
   drm/amdgpu: Init GPU reset single threaded wq
   drm/amdgpu: Move scheduler init to after XGMI is ready
   drm/amdgpu: Fix crash on modprobe
   drm/amdgpu: Serialize non TDR gpu recovery with TDRs
   drm/amdgpu: Drop hive->in_reset
   drm/amdgpu: Drop concurrent GPU reset protection for device

  drivers/gpu/drm/amd/amdgpu/amdgpu.h|   9 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|   2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
  7 files changed, 132 insertions(+), 136 deletions(-)





[RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-17 Thread Andrey Grodzovsky
This patchset is based on earlier work by Boris[1] that allowed to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. On top of that I also serialized
any GPU reset we trigger from within amdgpu code to also go through the same
ordered wq and in this way simplify somewhat our GPU reset code so we don't need
to protect from concurrency by multiple GPU reset triggeres such as TDR on one
hand and sysfs trigger or RAS trigger on the other hand.

As advised by Christian and Daniel I defined a reset_domain struct such that
all the entities that go through reset together will be serialized one against
another. 

TDR triggered by multiple entities within the same domain due to the same 
reason will not
be triggered as the first such reset will cancel all the pending resets. This is
relevant only to TDR timers and not to triggered resets coming from RAS or 
SYSFS,
those will still happen after the in flight resets finishes.

[1] 
https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezil...@collabora.com/

P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work 
hasn't landed yet there.

Andrey Grodzovsky (6):
  drm/amdgpu: Init GPU reset single threaded wq
  drm/amdgpu: Move scheduler init to after XGMI is ready
  drm/amdgpu: Fix crash on modprobe
  drm/amdgpu: Serialize non TDR gpu recovery with TDRs
  drm/amdgpu: Drop hive->in_reset
  drm/amdgpu: Drop concurrent GPU reset protection for device

 drivers/gpu/drm/amd/amdgpu/amdgpu.h|   9 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 206 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  36 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |   3 +-
 7 files changed, 132 insertions(+), 136 deletions(-)

-- 
2.25.1