Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-02-09 Thread Andrey Grodzovsky

Thanks a lot!

Andrey

On 2022-02-09 01:06, JingWen Chen wrote:

Hi Andrey,

I have been testing your patch and it seems fine so far.

Best Regards,

Jingwen Chen

On 2022/2/3 2:57 AM, Andrey Grodzovsky wrote:

Just another ping. With Shyun's help I was able to do some smoke testing on an
XGMI SRIOV system (booting and triggering hive reset), and so far it looks good.

Andrey

On 2022-01-28 14:36, Andrey Grodzovsky wrote:

Just a gentle ping in case people have more comments on this patch set, especially
the last 5 patches, as the first 7 are exactly the same as V2 and we already mostly
went over them.

Andrey

On 2022-01-25 17:37, Andrey Grodzovsky wrote:

This patchset is based on earlier work by Boris[1] that allowed having an
ordered workqueue at the driver level that the different schedulers use to
queue their timeout work. On top of that I also serialized any GPU reset we
trigger from within amdgpu code through the same ordered wq, which somewhat
simplifies our GPU reset code: we no longer need to protect against concurrency
between multiple GPU reset triggers, such as TDR on one hand and the sysfs or
RAS triggers on the other.

As advised by Christian and Daniel, I defined a reset_domain struct so that all
the entities that go through reset together are serialized against one another.

TDRs triggered by multiple entities within the same domain for the same reason
will not all run: the first such reset cancels all the pending ones. This is
relevant only to TDR timers and not to resets triggered from RAS or sysfs;
those will still happen after the in-flight resets finish.

v2:
Add handling for the SRIOV configuration: the reset notification coming from
the host and driver already triggers a work queue to handle the reset, so drop
this intermediate wq and send directly to the timeout wq. (Shaoyun)

v3:
Lijo suggested putting 'adev->in_gpu_reset' in the amdgpu_reset_domain struct.
I followed his advice and also moved adev->reset_sem into the same place. This
in turn required some follow-up refactoring of the original patches, where I
decoupled the amdgpu_reset_domain life cycle from the XGMI hive: the hive is
destroyed and reconstructed when resetting the devices in the XGMI hive during
probe for SRIOV (see [2]), while we need the reset sem and gpu_reset flag to
always be present. This was attained by adding a refcount to
amdgpu_reset_domain so each device can safely point to it as long as it needs
to.


[1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezil...@collabora.com/
[2] https://www.spinics.net/lists/amd-gfx/msg58836.html

P.S. Going through drm-misc-next and not amd-staging-drm-next, as Boris' work
hasn't landed there yet.

P.P.S. Patches 8-12 are the refactor on top of the original V2 patchset.

P.P.P.S. I wasn't yet able to test the reworked code on an XGMI SRIOV system
because drm-misc-next fails to load there. I would appreciate it if jingwech
could try it on his system like he tested V2.

Andrey Grodzovsky (12):
    drm/amdgpu: Introduce reset domain
    drm/amdgpu: Move scheduler init to after XGMI is ready
    drm/amdgpu: Fix crash on modprobe
    drm/amdgpu: Serialize non TDR gpu recovery with TDRs
    drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
    drm/amdgpu: Drop hive->in_reset
    drm/amdgpu: Drop concurrent GPU reset protection for device
    drm/amdgpu: Rework reset domain to be refcounted.
    drm/amdgpu: Move reset sem into reset_domain
    drm/amdgpu: Move in_gpu_reset into reset_domain
    drm/amdgpu: Rework amdgpu_device_lock_adev
    Revert 'drm/amdgpu: annotate a false positive recursive locking'

   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  15 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  10 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 275 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  43 +--
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   2 +-
   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    |  18 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c |  39 +++
   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  12 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |   2 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c  |  24 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h  |   3 +-
   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c    |   6 +-
   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |  14 +-
   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |  19 +-
   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |  19 +-
   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c |  11 +-
   16 files changed, 313 insertions(+), 199 deletions(-)



Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-02-08 Thread JingWen Chen
Hi Andrey,

I have been testing your patch and it seems fine so far.

Best Regards,

Jingwen Chen


Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-02-02 Thread Andrey Grodzovsky
Just another ping. With Shyun's help I was able to do some smoke testing on an
XGMI SRIOV system (booting and triggering hive reset), and so far it looks good.

Andrey



Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-01-28 Thread Andrey Grodzovsky
Just a gentle ping in case people have more comments on this patch set, especially
the last 5 patches, as the first 7 are exactly the same as V2 and we already mostly
went over them.

Andrey



[RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-01-25 Thread Andrey Grodzovsky