Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-05-11 Thread Andrey Grodzovsky



On 2022-05-11 12:49, Felix Kuehling wrote:

Am 2022-05-11 um 09:49 schrieb Andrey Grodzovsky:





[snip]

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f1a225a20719..4b789bec9670 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,16 +714,37 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+   struct kfd_process *p;
+   struct amdkfd_process_info *p_info;
+   unsigned int temp;
+
   if (!kfd->init_complete)
   return;

   /* for runtime suspend, skip locking kfd */
-   if (!run_pm) {
+   if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
   	/* For first KFD device suspend all the KFD processes */
   	if (atomic_inc_return(&kfd_locked) == 1)
   		kfd_suspend_all_processes();
   }

+	if (drm_dev_is_unplugged(kfd->ddev)) {
+		int idx = srcu_read_lock(&kfd_processes_srcu);
+		pr_debug("cancel restore_userptr_work\n");
+		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+			if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
+				p_info = p->kgd_process_info;
+				pr_debug("cancel processes, pid = %d for gpu_id = %d",
+					 pid_nr(p_info->pid), kfd->id);
+				cancel_delayed_work_sync(&p_info->restore_userptr_work);


Is this really necessary? If it is, there are probably other workers,
e.g. related to our SVM code, that would need to be canceled as well.



I deleted this and it seems to be OK. It was previously added to 
suppress restore_userptr_work, which keeps updating PTEs.

Now this is gone with Fix 3. Please let us know if it is OK :) @Felix


Sounds good to me.







+
+				/* send exception signals to the kfd events
+				 * waiting in user space */
+				kfd_signal_hw_exception_event(p->pasid);


This makes sense. It basically tells user mode that the application's
GPU state is lost due to a RAS error or a GPU reset, or now a GPU
hot-unplug.


The problem is that it cannot find an event whose type matches 
HW_EXCEPTION_TYPE, so with the default parameter value of 
send_sigterm = false the driver does **nothing**.
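For reference, a paraphrased sketch of the dispatch logic in kfd_events.c (not 
verbatim kernel code) that shows why: only a process that actually created an 
event of the matching type gets signaled, and the SIGTERM fallback is off by 
default:

/* Paraphrased sketch: signal every event of the requested type; if
 * none is found for a HW exception, fall back to SIGTERM only when
 * the module parameter send_sigterm is set (default false). */
static void sketch_signal_events_by_type(struct kfd_process *p, int type)
{
	struct kfd_event *ev;
	uint32_t id = KFD_FIRST_NONSIGNAL_EVENT_ID;
	bool found = false;

	idr_for_each_entry_continue(&p->event_idr, ev, id) {
		if (ev->type == type) {
			found = true;
			set_event(ev);	/* wakes waiters in user space */
		}
	}

	if (!found && type == KFD_EVENT_TYPE_HW_EXCEPTION && send_sigterm)
		send_sig(SIGTERM, p->lead_thread, 0);
}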
After all, if a “zombie” process (zombie in the sense that it no 
longer has a GPU dev) does not exit, kfd resources seem not to be 
released properly and a new kfd process cannot run after plug-back.
(I still need to look hard into the rocr/hsakmt/kfd driver code to 
understand the reason. At least I am seeing that the kfd topology 
won’t be cleaned up without the process exiting, so there would be 
a “zombie” kfd node in the topology, which may or may not cause 
issues in hsakmt.)
@Felix Do you have any suggestion/insight on this “zombie”-process 
issue? @Andrey suggests it should be OK to have a “zombie” kfd 
process and a “zombie” kfd dev, and a new kfd process should be able 
to run on the new kfd dev after plugback.



My experience with the graphics stack at least showed that. At least 
in a setup with 2 GPUs, if I removed a secondary GPU which had a 
rendering process on it, I could plug the secondary GPU back and 
start a new rendering process while the old zombie process was still 
present. It could be that in the KFD case there are some obstacles to 
this that need to be resolved.


I think this may be related to how KFD is tracking GPU resources. Do 
we actually destroy the KFD device structure when the GPU is unplugged?



No, all the device hierarchy (drm_device, amdgpu_device and hence, I 
assume, kfd_device) is kept around until the last drm_put drops the 
refcount to 0 - which happens when the process dies and drops its drm 
file descriptor.



If not, it's still tracking process resource usage of the hanging 
process. This may be a bigger issue here and the solution is probably 
quite involved because of how all the process and device structures 
are related to each other.


Normally the KFD process cleanup is triggered by an MMU notifier when 
the process address space is destroyed. 



Note that the only thing we do is invalidate all MMIO mappings within 
all the processes that have the GPU mapped into their address space 
(amdgpu_pci_remove->...->amdgpu_device_unmap_mmio) - this prevents the 
zombie process from subsequently writing into physical addresses that 
are no longer assigned to the removed GPU.
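As a rough sketch of the mechanism (not the exact call chain), the core of 
amdgpu_device_unmap_mmio boils down to zapping the CPU page tables of every 
user-space mapping of the device file:

/* Rough sketch: drop every user-space PTE that maps this device's
 * BARs/doorbells, so a zombie process takes a CPU fault instead of
 * writing to bus addresses that may have been reassigned. */
static void sketch_unmap_all_device_mappings(struct drm_device *ddev)
{
	unmap_mapping_range(ddev->anon_inode->i_mapping, 0, 0, 1);
}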


Andrey


The kfd_process structure is also reference counted. I'll need to 
check if there is a way to force-delete the KFD process structure when 
a GPU is unplugged. That's going to be tricky, because of how the KFD 
process struct ties together several GPUs.


Regards,
  Felix



Andrey




May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel 
restore_userptr_work
May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw 
exception to pasid = 0x800
May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-05-11 Thread Felix Kuehling

Am 2022-05-11 um 09:49 schrieb Andrey Grodzovsky:





[snip]

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index f1a225a20719..4b789bec9670 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -714,16 +714,37 @@ bool kfd_is_locked(void)

void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm)
{
+   struct kfd_process *p;
+   struct amdkfd_process_info *p_info;
+   unsigned int temp;
+
   if (!kfd->init_complete)
   return;

   /* for runtime suspend, skip locking kfd */
-   if (!run_pm) {
+   if (!run_pm && !drm_dev_is_unplugged(kfd->ddev)) {
   /* For first KFD device suspend all the KFD processes */
   if (atomic_inc_return(&kfd_locked) == 1)
   kfd_suspend_all_processes();
   }

+	if (drm_dev_is_unplugged(kfd->ddev)) {
+		int idx = srcu_read_lock(&kfd_processes_srcu);
+		pr_debug("cancel restore_userptr_work\n");
+		hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+			if (kfd_process_gpuidx_from_gpuid(p, kfd->id) >= 0) {
+				p_info = p->kgd_process_info;
+				pr_debug("cancel processes, pid = %d for gpu_id = %d",
+					 pid_nr(p_info->pid), kfd->id);
+				cancel_delayed_work_sync(&p_info->restore_userptr_work);


Is this really necessary? If it is, there are probably other workers,
e.g. related to our SVM code, that would need to be canceled as well.



I deleted this and it seems to be OK. It was previously added to 
suppress restore_userptr_work, which keeps updating PTEs.

Now this is gone with Fix 3. Please let us know if it is OK :) @Felix


Sounds good to me.







+
+				/* send exception signals to the kfd events
+				 * waiting in user space */
+				kfd_signal_hw_exception_event(p->pasid);


This makes sense. It basically tells user mode that the application's
GPU state is lost due to a RAS error or a GPU reset, or now a GPU
hot-unplug.


The problem is that it cannot find an event whose type matches 
HW_EXCEPTION_TYPE, so with the default parameter value of 
send_sigterm = false the driver does **nothing**.
After all, if a “zombie” process (zombie in the sense that it no 
longer has a GPU dev) does not exit, kfd resources seem not to be 
released properly and a new kfd process cannot run after plug-back.
(I still need to look hard into the rocr/hsakmt/kfd driver code to 
understand the reason. At least I am seeing that the kfd topology 
won’t be cleaned up without the process exiting, so there would be a 
“zombie” kfd node in the topology, which may or may not cause issues 
in hsakmt.)
@Felix Do you have any suggestion/insight on this “zombie”-process issue? 
@Andrey suggests it should be OK to have a “zombie” kfd process and a 
“zombie” kfd dev, and a new kfd process should be able to run on the 
new kfd dev after plugback.



My experience with the graphics stack at least showed that. At least in 
a setup with 2 GPUs, if I removed a secondary GPU which had a rendering 
process on it, I could plug the secondary GPU back and start a new 
rendering process while the old zombie process was still present. It 
could be that in the KFD case there are some obstacles to this that need 
to be resolved.


I think this may be related to how KFD is tracking GPU resources. Do we 
actually destroy the KFD device structure when the GPU is unplugged? If 
not, it's still tracking process resource usage of the hanging process. 
This may be a bigger issue here and the solution is probably quite 
involved because of how all the process and device structures are 
related to each other.


Normally the KFD process cleanup is triggered by an MMU notifier when 
the process address space is destroyed. The kfd_process structure is 
also reference counted. I'll need to check if there is a way to 
force-delete the KFD process structure when a GPU is unplugged. That's 
going to be tricky, because of how the KFD process struct ties together 
several GPUs.


Regards,
  Felix



Andrey




May 11 09:52:07 NETSYS26 kernel: [52604.845400] amdgpu: cancel 
restore_userptr_work
May 11 09:52:07 NETSYS26 kernel: [52604.845405] amdgpu: sending hw 
exception to pasid = 0x800
May 11 09:52:07 NETSYS26 kernel: [52604.845414] kfd kfd: amdgpu: 
Process 25894 (pasid 0x8001) got unhandled exception






+ kfd_signal_vm_fault_event(kfd, p->pasid, NULL);


This does not make sense. A VM fault indicates an access to a bad
virtual address by the GPU. If a debugger is attached to the process, it
notifies the debugger to investigate what went wrong. If the GPU is
gone, that doesn't make any sense. There is no GPU that could have
issued a bad memory request. And the debugger won't be happy either to
find a VM fault from a GPU that doesn't exist any more.


OK understood.



If the HW-exception event doesn't terminate your process, we may need to
look into how ROCr h

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-05-10 Thread Shuotao Xu


On May 11, 2022, at 4:31 AM, Felix Kuehling <felix.kuehl...@amd.com> wrote:


Am 2022-05-10 um 07:03 schrieb Shuotao Xu:


On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:

On 2022-04-27 05:20, Shuotao Xu wrote:

Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it
worked. It did not pass the last hotplug test, but my version has 4
tests instead of your 3.


That's because the 4th one is only enabled when there are 2 cards in the
system - to test DRI_PRIME export. I tested this time with only one card.

Yes, I only had one Radeon VII in my system, so this 4th test should
have been skipped. I am ignoring this issue.



Suite: Hotunplug Tests
Test: Unplug card and rescan the bus to plug it back
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Same as first test but with command submission
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported bo
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported fence
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)


on the kernel side - the IOCTL returning this is drm_getclient -
maybe take a look at why it can't find the client? I didn't have such
an issue, as far as I remember, when testing.


FAILED
1. ../tests/amdgpu/hotunplug_tests.c:368 - CU_ASSERT_EQUAL(r,0)
2. ../tests/amdgpu/hotunplug_tests.c:411 -
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd,
&sync_obj_handle2),0)
3. ../tests/amdgpu/hotunplug_tests.c:423 -
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2,
1, 1, 0, NULL),0)
4. ../tests/amdgpu/hotunplug_tests.c:425 -
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)

Run Summary: Type Total Ran Passed Failed Inactive
suites 14 1 n/a 0 0
tests 71 4 3 1 0
asserts 39 39 35 4 n/a

Elapsed time = 17.321 seconds

For kfd compute, there is a problem that I did not see on MI100
after I killed the hung application after hot plugout. I was using
the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a
regression from the newer driver.
After pkill, one of the children of the user process would be stuck in
zombie state (Z), understandably because of the bug, and a future rocm
application after plug-back would be in uninterruptible sleep state (D)
because it would not return from its syscall into kfd.

The drm tests for amdgpu, though, would run just fine after plug-back
even with the dangling kfd state.


I am not clear on when the crash below happens. Is it related to what
you describe above?



I don’t know if there is a quick fix for it. I was thinking of adding
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.


Try adding a drm_dev_enter/exit pair at the highest level of attempting
to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We
always try to avoid accessing any HW functions after the backing device
is gone.
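Roughly something like the following - a minimal sketch of that guard, where
the power-profile call in the body is an assumption about what
amdgpu_amdkfd_set_compute_idle does and is only there for illustration:

void amdgpu_amdkfd_set_compute_idle(struct amdgpu_device *adev, bool idle)
{
	int idx;

	/* Bail out if the backing device is already gone; everything
	 * inside the enter/exit scope is safe against hot-unplug. */
	if (!drm_dev_enter(adev_to_drm(adev), &idx))
		return;

	amdgpu_dpm_switch_power_profile(adev,
					PP_SMC_POWER_PROFILE_COMPUTE,
					!idle);

	drm_dev_exit(idx);
}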


Also, I have been trying to fix the hotplug issue for kfd applications
for a long time now.
I don’t know 1) if I will be able to get to MI100 (fixing Radeon
VII would mean something, but MI100 is more important for us); 2)
in what direction the patch for this issue will move forward.


I will go to the office tomorrow to pick up the MI-100. With time and
priorities permitting, I will then try to test it and fix any
bugs such that it passes all hotplug libdrm tests at the
tip of the public amd-staging-drm-next -
https://gitlab.freedesktop.org/agd5f/linux - after that you can try
to continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon 7 as your development
ASIC and use the fix I mentioned above.

I finally got some time to continue on the kfd hotplug patch attempt.
The following patch seems to work for kfd hotplug on Radeon VII. After
hot plugout, the tf process exits because of a vm fault.
A new tf process runs without issues after plugback.

It has the following fixes.

1. ras sysfs regression;
2. skip setting compute idle after the dev is unplugged, otherwise it will
   try to write the pci bar and thus fault in the driver
3. stop the actual work of invalidating the memory map triggered by
   userptrs; (returning false will trigger a warning, so I returned true.
   Not sure if it is correct)
4. It sends exceptions to all th

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-05-10 Thread Felix Kuehling



Am 2022-05-10 um 07:03 schrieb Shuotao Xu:



On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky 
 wrote:


On 2022-04-27 05:20, Shuotao Xu wrote:


Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it 
worked. It did not pass the last hotplug test, but my version has 4 
tests instead of your 3.



That's because the 4th one is only enabled when there are 2 cards in the 
system - to test DRI_PRIME export. I tested this time with only one card.


Yes, I only had one Radeon VII in my system, so this 4th test should 
have been skipped. I am ignoring this issue.





Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed
  Test: Same as first test but with command submission 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed
  Test: Unplug with exported bo 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed
  Test: Unplug with exported fence 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)



on the kernel side - the IOCTL returning this is drm_getclient - 
maybe take a look at why it can't find the client? I didn't have such 
an issue, as far as I remember, when testing.




FAILED
    1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
    2. ../tests/amdgpu/hotunplug_tests.c:411  - 
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
&sync_obj_handle2),0)
    3. ../tests/amdgpu/hotunplug_tests.c:423  - 
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 
1, 1, 0, NULL),0)
    4. ../tests/amdgpu/hotunplug_tests.c:425  - 
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)


Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a

Elapsed time =   17.321 seconds

For kfd compute, there is a problem that I did not see on MI100 
after I killed the hung application after hot plugout. I was using 
the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a 
regression from the newer driver.
After pkill, one of the children of the user process would be stuck in 
zombie state (Z), understandably because of the bug, and a future rocm 
application after plug-back would be in uninterruptible sleep state (D) 
because it would not return from its syscall into kfd.


The drm tests for amdgpu, though, would run just fine after plug-back 
even with the dangling kfd state.



I am not clear on when the crash below happens. Is it related to what 
you describe above?





I don’t know if there is a quick fix for it. I was thinking of adding 
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.



Try adding a drm_dev_enter/exit pair at the highest level of attempting 
to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We 
always try to avoid accessing any HW functions after the backing device 
is gone.



Also, I have been trying to fix the hotplug issue for kfd applications 
for a long time now.
I don’t know 1) if I will be able to get to MI100 (fixing Radeon 
VII would mean something, but MI100 is more important for us); 2) 
in what direction the patch for this issue will move forward.



I will go to the office tomorrow to pick up the MI-100. With time and 
priorities permitting, I will then try to test it and fix any 
bugs such that it passes all hotplug libdrm tests at the 
tip of the public amd-staging-drm-next - 
https://gitlab.freedesktop.org/agd5f/linux - after that you can try 
to continue working on ROCm enabling on top of that.


For now I suggest you move on with Radeon 7 as your development 
ASIC and use the fix I mentioned above.



I finally got some time to continue on the kfd hotplug patch attempt.
The following patch seems to work for kfd hotplug on Radeon VII. After 
hot plugout, the tf process exits because of a vm fault.

A new tf process runs without issues after plugback.

It has the following fixes.

 1. ras sysfs regression;
 2. skip setting compute idle after the dev is unplugged, otherwise it will
    try to write the pci bar and thus fault in the driver
 3. stop the actual work of invalidating the memory map triggered by
    userptrs; (returning false will trigger a warning, so I returned true.
    Not sure if it is correct - see the sketch after this list)
 4. It sends exceptions to all the events/signals that a “zombie”
    process is waiting for. (Not sure if the hw_exception is
    worthwhile; it did not do anything in my case since there is no such
    event type associated with that process)
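For fix 3, a minimal sketch of what such an early-out in the userptr
MMU-notifier callback could look like - the callback name and the surrounding
structures are assumptions for illustration, and the real invalidation work is
elided:

/* Hedged sketch for fix 3: with the device gone there are no GPU
 * mappings left to tear down, so return true ("done") rather than
 * false, which would trip the WARN_ON in the blockable-notifier path. */
static bool sketch_userptr_invalidate(struct mmu_interval_notifier *mni,
				      const struct mmu_notifier_range *range,
				      unsigned long cur_seq)
{
	struct amdgpu_bo *bo = container_of(mni, struct amdgpu_bo, notifier);
	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);

	if (drm_dev_is_unplugged(adev_to_drm(adev)))
		return true;

	/* ... the normal userptr eviction/invalidation would follow ... */
	return true;
}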

Please take a look and let me know if it is acceptable.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c

index 1f8161cd507f..2f7858692067 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/driver

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-05-10 Thread Shuotao Xu


On Apr 28, 2022, at 12:04 AM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:


On 2022-04-27 05:20, Shuotao Xu wrote:

Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it worked. It 
did not pass the last hotplug test, but my version has 4 tests instead of your 
3.


That's because the 4th one is only enabled when there are 2 cards in the system - 
to test DRI_PRIME export. I tested this time with only one card.

Yes, I only had one Radeon VII in my system, so this 4th test should have been 
skipped. I am ignoring this issue.



Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such 
file or directory
passed
  Test: Unplug with exported fence .../usr/local/share/libdrm/amdgpu.ids: No 
such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

on the kernel side - the IOCTL returning this is drm_getclient - maybe take a 
look at why it can't find the client? I didn't have such an issue, as far as I 
remember, when testing.


FAILED
1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
2. ../tests/amdgpu/hotunplug_tests.c:411  - 
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
&sync_obj_handle2),0)
3. ../tests/amdgpu/hotunplug_tests.c:423  - 
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 
1, 0, NULL),0)
4. ../tests/amdgpu/hotunplug_tests.c:425  - 
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a

Elapsed time =   17.321 seconds

For kfd compute, there is a problem that I did not see on MI100 after I 
killed the hung application after hot plugout. I was using the rocm5.0.2 driver 
for the MI100 card, and I am not sure if it is a regression from the newer driver.
After pkill, one of the children of the user process would be stuck in zombie 
state (Z), understandably because of the bug, and a future rocm application 
after plug-back would be in uninterruptible sleep state (D) because it would not 
return from its syscall into kfd.

The drm tests for amdgpu, though, would run just fine after plug-back even with 
the dangling kfd state.


I am not clear on when the crash below happens. Is it related to what you 
describe above?


I don’t know if there is a quick fix for it. I was thinking of adding 
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.


Try adding a drm_dev_enter/exit pair at the highest level of attempting to access 
HW - in this case it's amdgpu_amdkfd_set_compute_idle. We always try to avoid 
accessing any HW functions after the backing device is gone.


Also, I have been trying to fix the hotplug issue for kfd applications for a 
long time now.
I don’t know 1) if I will be able to get to MI100 (fixing Radeon VII would 
mean something, but MI100 is more important for us); 2) in what direction the 
patch for this issue will move forward.


I will go to the office tomorrow to pick up the MI-100. With time and priorities 
permitting, I will then try to test it and fix any bugs such that it passes 
all hotplug libdrm tests at the tip of the public amd-staging-drm-next - 
https://gitlab.freedesktop.org/agd5f/linux,
 after that you can try to continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon 7 as your development ASIC and 
use the fix I mentioned above.

I finally got some time to continue on the kfd hotplug patch attempt.
The following patch seems to work for kfd hotplug on Radeon VII. After hot 
plugout, the tf process exits because of a vm fault.
A new tf process runs without issues after plugback.

It has the following fixes.

  1.  ras sysfs regression;
  2.  skip setting compute idle after the dev is unplugged, otherwise it will try 
to write the pci bar and thus fault in the driver
  3.  stop the actual work of invalidating the memory map triggered by userptrs; 
(returning false will trigger a warning, so I returned true. Not sure if it is 
correct)
  4.  It sends exceptions to all the events/signals that a “zombie” process 
is waiting for. (Not sure if the hw_exception is w

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-27 Thread Andrey Grodzovsky

On 2022-04-27 05:20, Shuotao Xu wrote:


Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it 
worked. It did not pass the last hotplug test, but my version has 4 
tests instead of your 3.



That's because the 4th one is only enabled when there are 2 cards in the 
system - to test DRI_PRIME export. I tested this time with only one card.





Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed
  Test: Same as first test but with command submission 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: 
No such file or directory

passed
  Test: Unplug with exported fence 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory

amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)



on the kernel side - the IOCTL returning this is drm_getclient - maybe 
take a look at why it can't find the client? I didn't have such an issue, 
as far as I remember, when testing.




FAILED
    1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
    2. ../tests/amdgpu/hotunplug_tests.c:411  - 
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
&sync_obj_handle2),0)
    3. ../tests/amdgpu/hotunplug_tests.c:423  - 
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 
1, 0, NULL),0)
    4. ../tests/amdgpu/hotunplug_tests.c:425  - 
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)


Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0      0
               tests     71      4      3      1      0
             asserts     39     39     35      4    n/a

Elapsed time =   17.321 seconds

For kfd compute, there is a problem that I did not see on MI100 
after I killed the hung application after hot plugout. I was using 
the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a 
regression from the newer driver.
After pkill, one of the children of the user process would be stuck in 
zombie state (Z), understandably because of the bug, and a future rocm 
application after plug-back would be in uninterruptible sleep state (D) 
because it would not return from its syscall into kfd.


The drm tests for amdgpu, though, would run just fine after plug-back 
even with the dangling kfd state.



I am not clear on when the crash below happens. Is it related to what you 
describe above?





I don’t know if there is a quick fix for it. I was thinking of adding 
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.



Try adding a drm_dev_enter/exit pair at the highest level of attempting to 
access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We always 
try to avoid accessing any HW functions after the backing device is gone.



Also, I have been trying to fix the hotplug issue for kfd applications 
for a long time now.
I don’t know 1) if I will be able to get to MI100 (fixing Radeon VII 
would mean something, but MI100 is more important for us); 2) in what 
direction the patch for this issue will move forward.



I will go to the office tomorrow to pick up the MI-100. With time and priorities 
permitting, I will then try to test it and fix any bugs such that it 
passes all hotplug libdrm tests at the tip of the public 
amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux, after 
that you can try to continue working on ROCm enabling on top of that.


For now I suggest you move on with Radeon 7 as your development 
ASIC and use the fix I mentioned above.


Andrey




Regards,
Shuotao

[  +0.001645] BUG: unable to handle page fault for address: 
00058a68

[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x) - not-present page
[  +0.001248] PGD 800115806067 P4D 800115806067 PUD 109b2d067 
PMD 0

[  +0.001270] Oops:  [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G       
 W   E     5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 
1.5.4 [FPGA Test BIOS] 10/002/2015

[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 
00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 
00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85

[  +0.002751] RSP: 0018:b58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: c09a4270 RBX: 8b0c9c84 RCX: 

[  +0.001402] RDX:  RSI: 0001629a RDI: 
8b0c9c84
[  +0.001418] RBP: b58fac313948 R08: 0021 R09: 
0001
[  +0.001421] R10: b58fac313b30 R11: 8c065b00 R12: 
00058a68
[  +0.001400] R13: 0001629a R14:  R15: 
0001629a
[  +0.001397] FS:  0

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-27 Thread Shuotao Xu
Hi Andrey,

Sorry that I did not have time to work on this for a few days.

I just tried the sysfs crash fix on Radeon VII and it seems that it worked. It 
did not pass the last hotplug test, but my version has 4 tests instead of your 
3.

root@NETSYS26:/home/shuotaoxu/workspace/drm/build# ./tests/amdgpu/amdgpu_test 
-s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCN, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support JPEG, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip all Peer 
to Peer tests.


 CUnit - A unit testing framework for C - Version 2.1-3
 http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such 
file or directory
passed
  Test: Unplug with exported fence .../usr/local/share/libdrm/amdgpu.ids: No 
such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
FAILED
1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
2. ../tests/amdgpu/hotunplug_tests.c:411  - 
CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, 
&sync_obj_handle2),0)
3. ../tests/amdgpu/hotunplug_tests.c:423  - 
CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 
1, 0, NULL),0)
4. ../tests/amdgpu/hotunplug_tests.c:425  - 
CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0        0
               tests     71      4      3      1        0
             asserts     39     39     35      4      n/a

Elapsed time =   17.321 seconds

For kfd compute, there is a problem that I did not see on MI100 after I 
killed the hung application after hot plugout. I was using the rocm5.0.2 driver 
for the MI100 card, and I am not sure if it is a regression from the newer driver.
After pkill, one of the children of the user process would be stuck in zombie 
state (Z), understandably because of the bug, and a future rocm application 
after plug-back would be in uninterruptible sleep state (D) because it would not 
return from its syscall into kfd.

The drm tests for amdgpu, though, would run just fine after plug-back even with 
the dangling kfd state.

I don’t know if there is a quick fix for it. I was thinking of adding 
drm_dev_enter/drm_dev_exit to amdgpu_device_rreg.
Also, I have been trying to fix the hotplug issue for kfd applications for a 
long time now.
I don’t know 1) if I will be able to get to MI100 (fixing Radeon VII would 
mean something, but MI100 is more important for us); 2) in what direction the 
patch for this issue will move forward.

Regards,
Shuotao

[  +0.001645] BUG: unable to handle page fault for address: 00058a68
[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x) - not-present page
[  +0.001248] PGD 800115806067 P4D 800115806067 PUD 109b2d067 PMD 0
[  +0.001270] Oops:  [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: GW   E   
  5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA 
Test BIOS] 10/002/2015
[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb 
a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 
eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.002751] RSP: 0018:b58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: c09a4270 RBX: 8b0c9c84 RCX: 
[  +0.001402] RDX:  RSI: 0001629a RDI: 8b0c9c84
[  +0.001418] RBP: b58fac313948 R08: 0021 R09: 0001
[  +0.001421] R10: b58fac313b30 R11: 8c065b00 R12: 00058a68
[  +0.001400] R13: 0001629a R14:  R15: 

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-20 Thread Andrey Grodzovsky
I retested the hot plug tests at the commit I mentioned below - looks ok; 
my ASIC is Navi 10, and I also tested using Vega 10 and older Polaris ASICs 
(whatever I had at home at the time). It's possible there are extra 
issues in ASICs like yours which I didn't cover during tests.


andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCE, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD ENC, suite disabled.
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip 
all Peer to Peer tests.



 CUnit - A unit testing framework for C - Version 2.1-3
http://cunit.sourceforge.net/


Suite: Hotunplug Tests
  Test: Unplug card and rescan the bus to plug it back 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Same as first test but with command submission 
.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
  Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: 
No such file or directory
passed

Run Summary:    Type  Total    Ran Passed Failed Inactive
  suites 14  1    n/a  0    0
   tests 71  3  3  0    1
 asserts 21 21 21  0  n/a

Elapsed time =    9.195 seconds


Andrey

On 2022-04-20 11:44, Andrey Grodzovsky wrote:


The only one in the Radeon 7 log I see is the same sysfs crash we already 
fixed, so you can use the same fix. The MI200 issue I haven't seen yet, 
but I also haven't tested MI200, so I never saw it before. I need to test 
when I get the time.


So try that fix with Radeon 7 again to see if you pass the tests (the 
warnings should all be minor issues).


Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:


That's a problem; the latest working baseline I tested and confirmed 
passing the hotplug tests is this branch and commit 
https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
which is amd-staging-drm-next. 5.14 was the branch we upstreamed the 
hotplug code on, but it had a lot of regressions over time due to new 
changes (that's why I added the hotplug test, to try and catch them 
early). It would be best to run this branch on MI-100 so we have a 
clean baseline, and only after confirming that this particular branch 
at this commit passes the libdrm tests should you start adding the KFD-
specific addons. Another option, if you can't work with MI-100 and 
this branch, is to try a different ASIC that does work with this 
branch (if possible).


Andrey

OK, I tried both this commit and the HEAD of amd-staging-drm-next on 
two GPUs (MI100 and Radeon VII); both did not pass the hot-plugout libdrm 
test. I might be able to gain access to MI200, but I suspect it would 
not work either.


I copied the complete dmesgs as follows. I highlighted the OOPSES for 
you.


Radeon VII:

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-20 Thread Andrey Grodzovsky
The only one in the Radeon 7 log I see is the same sysfs crash we already 
fixed, so you can use the same fix. The MI200 issue I haven't seen yet, but I 
also haven't tested MI200, so I never saw it before. I need to test when I 
get the time.


So try that fix with Radeon 7 again to see if you pass the tests (the 
warnings should all be minor issues).


Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:


That's a problem; the latest working baseline I tested and confirmed 
passing the hotplug tests is this branch and commit 
https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 
which is amd-staging-drm-next. 5.14 was the branch we upstreamed the 
hotplug code on, but it had a lot of regressions over time due to new 
changes (that's why I added the hotplug test, to try and catch them 
early). It would be best to run this branch on MI-100 so we have a 
clean baseline, and only after confirming that this particular branch at 
this commit passes the libdrm tests should you start adding the KFD-
specific addons. Another option, if you can't work with MI-100 and 
this branch, is to try a different ASIC that does work with this 
branch (if possible).


Andrey

OK, I tried both this commit and the HEAD of amd-staging-drm-next on 
two GPUs (MI100 and Radeon VII); both did not pass the hot-plugout libdrm 
test. I might be able to gain access to MI200, but I suspect it would 
not work either.


I copied the complete dmesgs as follows. I highlighted the OOPSES for you.

Radeon VII:

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-19 Thread Felix Kuehling

Am 2022-04-19 um 12:01 schrieb Andrey Grodzovsky:

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {
 	/* MMU-notifier related fields */
 	atomic_t evicted_bos;
+	atomic_t invalid;
 	struct delayed_work restore_userptr_work;
 	struct pid *pid;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
 	info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
 	atomic_set(&info->evicted_bos, 0);
+	atomic_set(&info->invalid, 0);
 	INIT_DELAYED_WORK(&info->restore_userptr_work,
 			  amdgpu_amdkfd_restore_userptr_worker);
@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
 	struct mm_struct *mm;
 	int evicted_bos;
+	if (atomic_read(&process_info->invalid))
+		return;
+



Probably better to again use a drm_dev_enter/exit guard pair instead 
of this flag.





I don’t know if I could use drm_dev_enter/exit efficiently because a 
process can have multiple drm_devs open. And I don’t know how I can 
efficiently recover/refer to the drm_dev(s) in the worker function in 
order to use drm_dev_enter/exit.



I think that within the KFD code each kfd device belongs or points to 
one specific drm_device so I don't think this is a problem.


Sorry, I haven't been following this discussion in all its details. But 
I don't see why you need to check a flag in the worker. If the GPU is 
unplugged you already cancel any pending work. How is new work getting 
scheduled after the GPU is unplugged? Is it due to pending interrupts or 
something? Can you instead invalidate process_info->restore_userptr_work 
to prevent it from being scheduled again? Or add some check where it's 
scheduling the work, instead of in the worker.
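A minimal sketch of that suggestion - guarding the site that re-arms the work
instead of checking a flag in the worker; the helper name and the drm_device
parameter are assumptions for illustration, and the delay macro follows the
existing worker's:

/* Hypothetical wrapper around the existing re-arm site: once the
 * device is unplugged, never queue the restore worker again. */
static void sketch_rearm_restore_userptr(struct amdkfd_process_info *process_info,
					 struct drm_device *ddev)
{
	int idx;

	if (!drm_dev_enter(ddev, &idx))
		return;	/* unplugged: skip re-scheduling entirely */

	schedule_delayed_work(&process_info->restore_userptr_work,
			      msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));

	drm_dev_exit(idx);
}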


Regards,
  Felix




Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-14 Thread Shuotao Xu


On Apr 14, 2022, at 1:31 AM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:



On 2022-04-13 12:03, Shuotao Xu wrote:


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:


On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix :) I was not familiar with the drm APIs, so I only half 
understood your comment last time.

BTW, I tried hot-plugging out a GPU while a rocm application was still running.
From dmesg, the application is still trying to access the removed kfd device, 
and is met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some 
warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at 
.../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) 
amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat 
nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT 
nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw 
intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl 
joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei 
intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter 
acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs 
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib 
ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm 
drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops 
crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper 
mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: GW  OE 
5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 
2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 
ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb 
cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:a8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: ffea RBX: 89980e792058 RCX: 
[  711.780147] RDX:  RSI: 89a8f9ad8870 RDI: 89a8f9ad8870
[  711.780148] RBP: a8100dd67c50 R08:  R09: fff99b18
[  711.780149] R10: a8100dd67bd0 R11: a8100dd67908 R12: 89980e792000
[  711.780151] R13: 89980e792058 R14: 89980e7921bc R15: dead0100
[  711.780152] FS:  () GS:89a8f9ac() 
knlGS:
[  711.780154] CS:  0010 DS:  ES:  CR0: 80050033
[  711.780156] CR2: 7ffddac6f71f CR3: 0030bb80a003 CR4: 007706e0
[  711.780157] DR0:  DR1:  DR2: 
[  711.780159] DR3:  DR6: fffe0ff0 DR7: 0400
[  711.780160] PKRU: 5554
[  

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-13 Thread Andrey Grodzovsky


On 2022-04-13 12:03, Shuotao Xu wrote:



On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky 
 wrote:




On 2022-04-08 21:28, Shuotao Xu wrote:


On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky 
 wrote:




On 2022-04-08 04:45, Shuotao Xu wrote:

Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.

I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix :) I was not familiar with the drm APIs, so I only 
half understood your comment last time.


BTW, I tried hot-plugging out a GPU while a rocm application was still 
running.
From dmesg, the application is still trying to access the removed kfd 
device, and is met with some errors.



The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.


The application would hang and not exit in this case.




Actually I tried kill -7 $pid, and the process exits. The dmesg has 
some warnings though.


[  711.769977] WARNING: CPU: 23 PID: 344 at 
.../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) 
amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE 
iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter 
overlay binfmt_misc intel_rapl_msr i40iw intel_rapl_common skx_edac 
nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel rpcrdma 
kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl joydev 
acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei 
intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca 
acpi_power_meter acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm 
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi pci_stub 
ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear 
mlx5_ib ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit 
drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel sysfillrect uas hid_generic sysimgblt 
aesni_intel mlx5_core fb_sys_fops crypto_simd usbhid cryptd drm i40e 
pci_hyperv_intf usb_storage glue_helper mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: G        W 
 OE     5.11.0+ #1
[  711.779755] Hardware name: Supermicro 
SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.1 08/14/2018

[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 
0b e9 69 ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd 
f3 eb cf <0f> 0b eb cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 
00 55

[  711.780143] RSP: 0018:a8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: ffea RBX: 89980e792058 RCX: 

[  711.780147] RDX:  RSI: 89a8f9ad8870 RDI: 
89a8f9ad8870
[  711.780148] RBP: a8100dd67c50 R08:  R09: 
fff99b18
[  711.780149] R10: a8100dd67bd0 R11: a8100dd67908 R12: 
89980e792000
[  711.780151] R13: 89980e792058 R14: 89980e7921bc R15: 
dead0100
[  711.780152] FS:  () GS:89a8f9ac() 
knlGS:

[  711.780154] CS:  0010 DS:  ES:  CR0: 80050033
[  711.780156] CR2: 7ffddac6f71f CR3: 0030bb80a003 CR4: 
007706e0
[  711.780157] DR0:  DR1:  DR2: 

[  711.780159] DR3:  DR6: fffe0ff0 DR7: 
0400

[  711.780160] PKRU: 5554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.780357]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[  711.780543]  amdgpu_gem

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-13 Thread Shuotao Xu


On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:


On 2022-04-08 21:28, Shuotao Xu wrote:

On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:
Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.
I assumed you read my comment last time; still, you take the same approach.
More details below.
Aha, I like your fix :) I was not familiar with the drm APIs, so I only half 
understood your comment last time.

BTW, I tried hot-plugging out a GPU while a rocm application was still running.
From dmesg, the application is still trying to access the removed kfd device, 
and is met with some errors.


The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.

The application would hang and not exit in this case.


Actually I tried kill -7 $pid, and the process exits. The dmesg has some 
warnings though.

[  711.769977] WARNING: CPU: 23 PID: 344 at 
.../amdgpu-rocm5.0.2/src/amd/amdgpu/amdgpu_object.c:1336 
amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.770528] Modules linked in: amdgpu(OE) amdttm(OE) amd_sched(OE) 
amdkcl(OE) iommu_v2 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo 
xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat 
nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT 
nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter overlay binfmt_misc intel_rapl_msr i40iw 
intel_rapl_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel rpcrdma kvm sunrpc ipmi_ssif ib_umad ib_ipoib rdma_ucm irqbypass rapl 
joydev acpi_ipmi input_leds intel_cstate ipmi_si ipmi_devintf mei_me mei 
intel_pch_thermal ipmi_msghandler ioatdma mac_hid lpc_ich dca acpi_power_meter 
acpi_pad sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp 
libiscsi scsi_transport_iscsi pci_stub ip_tables x_tables autofs4 btrfs 
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy 
async_pq async_xor async_tx xor
[  711.779359]  raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib 
ib_uverbs ib_core ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm 
drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
sysfillrect uas hid_generic sysimgblt aesni_intel mlx5_core fb_sys_fops 
crypto_simd usbhid cryptd drm i40e pci_hyperv_intf usb_storage glue_helper 
mlxfw hid ahci libahci wmi
[  711.779752] CPU: 23 PID: 344 Comm: kworker/23:1 Tainted: GW  OE 
5.11.0+ #1
[  711.779755] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 
2.1 08/14/2018
[  711.779756] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  711.779955] RIP: 0010:amdgpu_bo_release_notify+0x150/0x160 [amdgpu]
[  711.780141] Code: e8 b5 af 34 f4 e9 1f ff ff ff 48 39 c2 74 07 0f 0b e9 69 
ff ff ff 4c 89 e7 e8 3c b4 16 00 e9 5c ff ff ff e8 a2 ce fd f3 eb cf <0f> 0b eb 
cb e8 d7 02 34 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55
[  711.780143] RSP: 0018:a8100dd67c30 EFLAGS: 00010282
[  711.780145] RAX: ffea RBX: 89980e792058 RCX: 
[  711.780147] RDX:  RSI: 89a8f9ad8870 RDI: 89a8f9ad8870
[  711.780148] RBP: a8100dd67c50 R08:  R09: fff99b18
[  711.780149] R10: a8100dd67bd0 R11: a8100dd67908 R12: 89980e792000
[  711.780151] R13: 89980e792058 R14: 89980e7921bc R15: dead0100
[  711.780152] FS:  () GS:89a8f9ac() 
knlGS:
[  711.780154] CS:  0010 DS:  ES:  CR0: 80050033
[  711.780156] CR2: 7ffddac6f71f CR3: 0030bb80a003 CR4: 007706e0
[  711.780157] DR0:  DR1:  DR2: 
[  711.780159] DR3:  DR6: fffe0ff0 DR7: 0400
[  711.780160] PKRU: 5554
[  711.780161] Call Trace:
[  711.780163]  ttm_bo_release+0x2ae/0x320 [amdttm]
[  711.780169]  amdttm_bo_put+0x30/0x40 [amdttm]
[  711.78

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-11 Thread Andrey Grodzovsky



On 2022-04-08 21:28, Shuotao Xu wrote:



On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky  
wrote:


On 2022-04-08 04:45, Shuotao Xu wrote:

Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU
devices can open doors for many advanced applications in data center
in the next few years, such as for GPU resource
disaggregation. Current AMDKFD does not support hotplug out b/o the
following reasons:

1. During PCIe removal, decrement KFD lock which was incremented at
the beginning of hw fini; otherwise kfd_open later is going to
fail.

I assumed you read my comment last time, yet you still take the same approach.
More details below.

Aha, I like your fix :) I was not familiar with the drm APIs, so I only half
understood your comment last time.

BTW, I tried hot-plugging out a GPU while a rocm application was still running.
From dmesg, the application is still trying to access the removed kfd device, and
is met with some errors.



The application is supposed to keep running; it holds the drm_device
reference as long as it has an open
FD to the device, and final cleanup will come only after the app dies,
thus releasing the FD and the last
drm_device reference.
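
In other words (a simplified sketch of the DRM lifetime rule described above,
not the verbatim drm_open()/drm_release() code - drm_dev_get()/drm_dev_put()
are the real refcount helpers):

#include <drm/drm_drv.h>

static int example_open(struct drm_device *dev)
{
	/* each open FD pins the drm_device */
	drm_dev_get(dev);
	return 0;
}

static void example_release(struct drm_device *dev)
{
	/* dropping the last reference triggers final device cleanup */
	drm_dev_put(dev);
}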


The application would hang and not exit in this case.



For graphics apps, what I usually see is a crash because of a SIGSEGV when
the app tries to access
an unmapped MMIO region on the device. I haven't tested the compute
stack, so there might
be something I haven't covered. A hang could mean, for example, waiting on a
fence which is not being

signaled - please provide the full dmesg from this case.



Do you have any good suggestions on how to fix it down the line? (HIP 
runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed



This probably just means trying to update PTEs after the physical device
is gone - we usually avoid this by
first trying to do all HW shutdowns early, before PCI remove completes,
and when that is really tricky, by

protecting HW access sections with a drm_dev_enter/exit scope.
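
For reference, a minimal sketch of that drm_dev_enter()/drm_dev_exit()
pattern (an illustrative helper, not actual amdgpu code):

#include <drm/drm_drv.h>

static void example_mmio_access(struct drm_device *ddev)
{
	int idx;

	/* drm_dev_enter() returns false once the device is unplugged */
	if (!drm_dev_enter(ddev, &idx))
		return;

	/* ... HW register access is only safe inside this scope ... */

	drm_dev_exit(idx);
}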

For this particular error it would be best to flush
info->restore_userptr_work before the end of
amdgpu_pci_remove (rejecting new process creation and calling
cancel_delayed_work_sync(&process_info->restore_userptr_work) for all
running processes).
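
Something along these lines would do it - a sketch only, the helper name is
hypothetical, but the identifiers follow the existing kfd code
(kfd_processes_srcu, kfd_processes_table, kgd_process_info):

static void cancel_all_restore_userptr_work(void)
{
	struct kfd_process *p;
	unsigned int tmp;
	int idx = srcu_read_lock(&kfd_processes_srcu);

	/* synchronously cancel the userptr restore worker of every
	 * running KFD process before the device goes away
	 */
	hash_for_each_rcu(kfd_processes_table, tmp, p, kfd_processes)
		cancel_delayed_work_sync(
			&p->kgd_process_info->restore_userptr_work);

	srcu_read_unlock(&kfd_processes_srcu, idx);
}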

Andrey




Really appreciate your help!

Best,
Shuotao
  

2. Remove redundant p2p/io links in sysfs when a device is hot-plugged
out.

3. A new kfd node_id is not properly assigned after a new device is
added following a GPU hot-unplug in a system. libhsakmt will
find this anomaly (i.e. node_from !=  in iolinks)
when taking a topology_snapshot, and thus returns a fault to the rocm
stack.

-- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
-- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
5.16.0-kfd is unstable out of the box for MI100.
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
 4 files changed, 26 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index c18c4be1e4ac..d50011bdb5c4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool 
run_pm)
  return r;
  }

+int amdgpu_amdkfd_resume_processes(void)
+{
+ return kgd2kfd_resume_processes();
+}
+
  int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
  {
  int r = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..803306e011c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
+int amdgpu_amdkfd_resume_processes(void);
  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
  const void *ih_ring_entry);
  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
@@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
  void kgd2kfd_suspend(struct kfd_dev *kfd, bool
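
(The quoted diff is truncated here. Going by point 1 of the commit message -
undoing the kfd_locked increment taken at hw fini - the new
kgd2kfd_resume_processes() would plausibly be a sketch like this, reusing the
existing kfd_locked counter and kfd_resume_all_processes() helper:)

int kgd2kfd_resume_processes(void)
{
	/* drop the lock taken at hw fini; resume all KFD processes
	 * once the last locker is gone so kfd_open() works again
	 */
	if (atomic_dec_return(&kfd_locked) == 0)
		return kfd_resume_all_processes();

	return 0;
}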

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-11 Thread Shuotao Xu



> On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky  
> wrote:
> 
> 
> On 2022-04-08 04:45, Shuotao Xu wrote:
>> Adding PCIe Hotplug Support for AMDKFD: support for hot-plug of GPU
>> devices can open doors for many advanced applications in the data center
>> in the next few years, such as GPU resource
>> disaggregation. Current AMDKFD does not support hot-unplug because of the
>> following reasons:
>> 
>> 1. During PCIe removal, decrement KFD lock which was incremented at
>>the beginning of hw fini; otherwise kfd_open later is going to
>>fail.
> 
> I assumed you read my comment last time, yet you still take the same approach.
> More details below.

Aha, I like your fix :) I was not familiar with the drm APIs, so I only half
understood your comment last time.

BTW, I tried hot-plugging out a GPU while a rocm application was still running.
From dmesg, the application is still trying to access the removed kfd device, and
is met with some errors.
The application would hang and not exit in this case.

Do you have any good suggestions on how to fix it down the line? (HIP 
runtime/libhsakmt or driver)

[64036.631333] amdgpu: amdgpu_vm_bo_update failed
[64036.631702] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.640754] amdgpu: amdgpu_vm_bo_update failed
[64036.641120] amdgpu: validate_invalid_user_pages: update PTE failed
[64036.650394] amdgpu: amdgpu_vm_bo_update failed
[64036.650765] amdgpu: validate_invalid_user_pages: update PTE failed

Really appreciate your help!

Best,
Shuotao
 
> 
>> 
>> 2. Remove redundant p2p/io links in sysfs when a device is hot-plugged
>>out.
>> 
>> 3. A new kfd node_id is not properly assigned after a new device is
>>added following a GPU hot-unplug in a system. libhsakmt will
>>find this anomaly (i.e. node_from !=  in iolinks)
>>when taking a topology_snapshot, and thus returns a fault to the rocm
>>stack.
>> 
>> -- This patch fixes issue 1; another patch by Mukul fixes issues 2&3.
>> -- Tested on a 4-GPU MI100 node with kernel 5.13.0-kfd; kernel
>>5.16.0-kfd is unstable out of the box for MI100.
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  5 +++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  7 +++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c    | 13 +++++++++++++
>>  4 files changed, 26 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index c18c4be1e4ac..d50011bdb5c4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -213,6 +213,11 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev, 
>> bool run_pm)
>>  return r;
>>  }
>> 
>> +int amdgpu_amdkfd_resume_processes(void)
>> +{
>> + return kgd2kfd_resume_processes();
>> +}
>> +
>>  int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev)
>>  {
>>  int r = 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index f8b9f27adcf5..803306e011c3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -140,6 +140,7 @@ void amdgpu_amdkfd_fini(void);
>>  void amdgpu_amdkfd_suspend(struct amdgpu_device *adev, bool run_pm);
>>  int amdgpu_amdkfd_resume_iommu(struct amdgpu_device *adev);
>>  int amdgpu_amdkfd_resume(struct amdgpu_device *adev, bool run_pm);
>> +int amdgpu_amdkfd_resume_processes(void);
>>  void amdgpu_amdkfd_interrupt(struct amdgpu_device *adev,
>>  const void *ih_ring_entry);
>>  void amdgpu_amdkfd_device_probe(struct amdgpu_device *adev);
>> @@ -347,6 +348,7 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd);
>>  void kgd2kfd_suspend(struct kfd_dev *kfd, bool run_pm);
>>  int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
>>  int kgd2kfd_resume(struct kfd_dev *kfd, bool run_pm);
>> +int kgd2kfd_resume_processes(void);
>>  int kgd2kfd_pre_reset(struct kfd_dev *kfd);
>>  int kgd2kfd_post_reset(struct kfd_dev *kfd);
>>  void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
>> @@ -393,6 +395,11 @@ static inline int kgd2kfd_resume(struct kfd_dev *kfd, 
>> bool run_pm)
>>  return 0;
>>  }
>> 
>> +static inline int kgd2kfd_resume_processes(void)
>> +{
>> + return 0;
>> +}
>> +
>>  static inline int kgd2kfd_pre_reset(struct kfd_dev *kfd)
>>  {
>>  return 0;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index fa4a9f13c922..5827b65b7489 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4004,6 +4004,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>  if (drm_dev_is_unplugged(adev_to_drm(adev)))
>>