amdkfd: Add PCIe Hotplug Support for AMDKFD

Andrey Grodzovsky Wed, 27 Apr 2022 09:04:15 -0700

On 2022-04-27 05:20, Shuotao Xu wrote:

Hi Andrey,
Sorry that I did not have time to work on this for a few days.
I just tried the sysfs crash fix on Radeon VII and it seems that it worked. It did not pass the last hotplug test, but my version has 4 tests instead of 3 in your case.

That's because the 4th one is only enabled when there are 2 cards in the system - to test DRI_PRIME export. I tested this time with only one card.

Suite: Hotunplug Tests
Test: Unplug card and rescan the bus to plug it back.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Same as first test but with command submission.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported bo .../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported fence.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

On the kernel side, the IOCTL returning this is drm_getclient. Maybe take a look at why it can't find the client? I didn't have such an issue, as far as I remember, when testing.

FAILED
    1. ../tests/amdgpu/hotunplug_tests.c:368  - CU_ASSERT_EQUAL(r,0)
    2. ../tests/amdgpu/hotunplug_tests.c:411  - CU_ASSERT_EQUAL(amdgpu_cs_import_syncobj(device2, shared_fd, &sync_obj_handle2),0)
    3. ../tests/amdgpu/hotunplug_tests.c:423  - CU_ASSERT_EQUAL(amdgpu_cs_syncobj_wait(device2, &sync_obj_handle2, 1, 100000000, 0, NULL),0)
    4. ../tests/amdgpu/hotunplug_tests.c:425  - CU_ASSERT_EQUAL(amdgpu_cs_destroy_syncobj(device2, sync_obj_handle2),0)
Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0      0
               tests     71      4      3      1      0
             asserts     39     39     35      4    n/a

Elapsed time =   17.321 seconds
For kfd compute, there is some problem which I did not see on MI100 after I killed the hung application after hot plugout. I was using the rocm5.0.2 driver for the MI100 card, and I am not sure if it is a regression from the newer driver. After pkill, one child of the user process would be stuck in zombie mode (Z), understandably because of the bug, and future rocm applications after plug-back would be in uninterruptible sleep mode (D) because they would not return from the syscall to kfd.
Although the drm test for amdgpu would run just fine without issues after plug-back with dangling kfd state.
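
To check from a script or test harness whether a process has ended up in the Z or D state described above, the state letter can be read from /proc/<pid>/stat. The following is a minimal userspace sketch, not part of the patch or of libdrm; the function names proc_state_from_stat and proc_state are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the state letter ('R', 'S', 'D', 'Z', ...) from the contents of
 * /proc/<pid>/stat, or '\0' if the line is malformed. The comm field is
 * wrapped in parentheses and may itself contain spaces or ')', so the
 * state is taken from just after the LAST ')'.
 */
char proc_state_from_stat(const char *stat_line)
{
    const char *close = strrchr(stat_line, ')');

    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* Read /proc/<pid>/stat for a live process; '\0' if it cannot be read. */
char proc_state(int pid)
{
    char path[64];
    char buf[1024];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '\0';
    if (fgets(buf, sizeof(buf), f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return proc_state_from_stat(buf);
}
```

A 'Z' result means the task has exited but has not been reaped; 'D' means it is blocked in the kernel in uninterruptible sleep, which is what a syscall into kfd that never returns would look like.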

I am not clear on when the crash below happens. Is it related to what you describe above?

I don’t know if there is a quick fix for it. I was thinking of adding drm_enter/drm_exit to amdgpu_device_rreg.

Try adding a drm_dev_enter/exit pair at the highest level of attempting to access HW - in this case it's amdgpu_amdkfd_set_compute_idle. We always try to avoid accessing any HW functions after the backing device is gone.
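
As an illustration of the guard pattern suggested above, here is a small userspace model, not the actual patch. In the kernel the real calls are drm_dev_enter(struct drm_device *dev, int *idx) and drm_dev_exit(int idx), and the former returns false once the device has been unplugged. Every mock_* name below is a hypothetical stand-in so that the sketch compiles outside the kernel; the real amdgpu function body may well need a different shape.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins, NOT the kernel structures. */
struct mock_drm_device {
    bool unplugged;          /* set when the PCIe device is removed */
};

struct mock_adev {
    struct mock_drm_device ddev;
    int hw_accesses;         /* counts simulated register accesses */
};

/* Models drm_dev_enter(): refuses entry once the device is unplugged. */
bool mock_drm_dev_enter(struct mock_drm_device *dev, int *idx)
{
    if (dev->unplugged)
        return false;
    *idx = 0;                /* the kernel returns an SRCU index here */
    return true;
}

/* Models drm_dev_exit(): closes the critical section. */
void mock_drm_dev_exit(int idx)
{
    (void)idx;
}

/* Models the power-profile path that ends in register reads. */
void mock_switch_power_profile(struct mock_adev *adev, bool enable)
{
    (void)enable;
    adev->hw_accesses++;
}

/* Guard at the highest-level entry point, as suggested in the mail. */
void mock_set_compute_idle(struct mock_adev *adev, bool idle)
{
    int idx;

    if (!mock_drm_dev_enter(&adev->ddev, &idx))
        return;              /* device is gone: skip all HW access */

    mock_switch_power_profile(adev, !idle);
    mock_drm_dev_exit(idx);
}
```

Placing the check at the top-level entry point means none of the callees on the way down to the register read need their own check, which is the reason given above for guarding amdgpu_amdkfd_set_compute_idle rather than amdgpu_device_rreg.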

Also, this has been a long-running attempt of mine to fix the hotplug issue for kfd applications. I don’t know 1) if I would be able to get to MI100 (fixing Radeon VII would mean something, but MI100 is more important for us); 2) in what direction the patch for this issue will move forward.

I will go to the office tomorrow to pick up the MI-100. With time and priorities permitting, I will then try to test it and fix any bugs so that it passes all hot plug libdrm tests at the tip of public amd-staging-drm-next - https://gitlab.freedesktop.org/agd5f/linux. After that you can try to continue working on ROCm enabling on top of that.

For now I suggest you move on with Radeon 7 as your development ASIC and use the fix I mentioned above.


Andrey

Regards,
Shuotao
[  +0.001645] BUG: unable to handle page fault for address: 0000000000058a68
[  +0.001298] #PF: supervisor read access in kernel mode
[  +0.001252] #PF: error_code(0x0000) - not-present page
[  +0.001248] PGD 8000000115806067 P4D 8000000115806067 PUD 109b2d067 PMD 0
[  +0.001270] Oops: 0000 [#1] PREEMPT SMP PTI
[  +0.001256] CPU: 5 PID: 13818 Comm: tf_cnn_benchmar Tainted: G W E 5.16.0+ #3
[  +0.001290] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 [FPGA Test BIOS] 10/002/2015
[  +0.001309] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001562] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.002751] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001388] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001402] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001418] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001421] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001400] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001397] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001405] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001422] Call Trace:
[  +0.001407]  <TASK>
[  +0.001391]  amdgpu_device_rreg+0x17/0x20 [amdgpu]
[  +0.001614]  amdgpu_cgs_read_register+0x14/0x20 [amdgpu]
[  +0.001735]  phm_wait_for_register_unequal.part.1+0x58/0x90 [amdgpu]
[  +0.001790]  phm_wait_for_register_unequal+0x1a/0x30 [amdgpu]
[  +0.001800]  vega20_wait_for_response+0x28/0x80 [amdgpu]
[  +0.001757]  vega20_send_msg_to_smc_with_parameter+0x21/0x110 [amdgpu]
[  +0.001838]  smum_send_msg_to_smc_with_parameter+0xcd/0x100 [amdgpu]
[  +0.001829]  ? kvfree+0x1e/0x30
[  +0.001462]  vega20_set_power_profile_mode+0x58/0x330 [amdgpu]
[  +0.001868]  ? kvfree+0x1e/0x30
[  +0.001462]  ? ttm_bo_release+0x261/0x370 [ttm]
[  +0.001467]  pp_dpm_switch_power_profile+0xc2/0x170 [amdgpu]
[  +0.001863]  amdgpu_dpm_switch_power_profile+0x6b/0x90 [amdgpu]
[  +0.001866]  amdgpu_amdkfd_set_compute_idle+0x1a/0x20 [amdgpu]
[  +0.001784]  kfd_dec_compute_active+0x2c/0x50 [amdgpu]
[  +0.001744]  process_termination_cpsch+0x2f9/0x3a0 [amdgpu]
[  +0.001728]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
[  +0.001730]  kfd_process_notifier_release+0x91/0xe0 [amdgpu]
[  +0.001718]  __mmu_notifier_release+0x77/0x1f0
[  +0.001411]  exit_mmap+0x1b5/0x200
[  +0.001396]  ? __switch_to+0x12d/0x3e0
[  +0.001388]  ? __switch_to_asm+0x36/0x70
[  +0.001372]  ? preempt_count_add+0x74/0xc0
[  +0.001364]  mmput+0x57/0x110
[  +0.001349]  do_exit+0x33d/0xc20
[  +0.001337]  ? _raw_spin_unlock+0x1a/0x30
[  +0.001346]  do_group_exit+0x43/0xa0
[  +0.001341]  get_signal+0x131/0x920
[  +0.001295]  arch_do_signal_or_restart+0xb1/0x870
[  +0.001303]  ? do_futex+0x125/0x190
[  +0.001285]  exit_to_user_mode_prepare+0xb1/0x1c0
[  +0.001282]  syscall_exit_to_user_mode+0x2a/0x40
[  +0.001264]  do_syscall_64+0x46/0xb0
[  +0.001236]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.001219] RIP: 0033:0x7f6aff1d2ad3
[  +0.001177] Code: Unable to access opcode bytes at RIP 0x7f6aff1d2aa9.
[  +0.001166] RSP: 002b:00007f6ab2029d20 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[  +0.001170] RAX: fffffffffffffe00 RBX: 0000000004f542b0 RCX: 00007f6aff1d2ad3
[  +0.001168] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000004f542d8
[  +0.001162] RBP: 0000000004f542d4 R08: 0000000000000000 R09: 0000000000000000
[  +0.001152] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000004f542d8
[  +0.001176] R13: 0000000000000000 R14: 0000000004f54288 R15: 0000000000000000
[  +0.001152]  </TASK>
[  +0.001113] Modules linked in: veth amdgpu(E) nf_conntrack_netlink nfnetlink xfrm_user xt_addrtype br_netfilter xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter overlay esp6_offload esp6 esp4_offload esp4 xfrm_algo intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_hdmi snd_hda_intel ipmi_ssif snd_intel_dspcfg coretemp snd_hda_codec kvm_intel snd_hda_core snd_hwdep snd_pcm snd_timer snd kvm soundcore irqbypass ftdi_sio usbserial input_leds iTCO_wdt iTCO_vendor_support joydev mei_me rapl lpc_ich intel_cstate mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
[  +0.000102] async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear iommu_v2 gpu_sched drm_ttm_helper mgag200 ttm drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel usbhid uas aesni_intel crypto_simd igb ahci hid drm usb_storage cryptd libahci dca megaraid_sas i2c_algo_bit wmi [last unloaded: amdgpu]
[  +0.016626] CR2: 0000000000058a68
[  +0.001550] ---[ end trace ff90849fe0a8b3b4 ]---
[  +0.024953] RIP: 0010:amdgpu_device_rreg.part.24+0xa9/0xe0 [amdgpu]
[  +0.001814] Code: e8 8c 7d 02 00 65 ff 0d 65 e0 7f 3f 75 ae 0f 1f 44 00 00 eb a7 83 e2 02 75 09 f6 87 10 69 01 00 10 75 0d 4c 03 a3 a0 09 00 00 <45> 8b 24 24 eb 8a 4c 8d b7 b0 6b 01 00 4c 89 f7 e8 a2 4c 2e ca 85
[  +0.003255] RSP: 0018:ffffb58fac313928 EFLAGS: 00010202
[  +0.001641] RAX: ffffffffc09a4270 RBX: ffff8b0c9c840000 RCX: 00000000ffffffff
[  +0.001656] RDX: 0000000000000000 RSI: 000000000001629a RDI: ffff8b0c9c840000
[  +0.001681] RBP: ffffb58fac313948 R08: 0000000000000021 R09: 0000000000000001
[  +0.001662] R10: ffffb58fac313b30 R11: ffffffff8c065b00 R12: 0000000000058a68
[  +0.001650] R13: 000000000001629a R14: 0000000000000000 R15: 000000000001629a
[  +0.001648] FS:  0000000000000000(0000) GS:ffff8b4b7fa80000(0000) knlGS:0000000000000000
[  +0.001668] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.001673] CR2: 0000000000058a68 CR3: 000000010a2c8001 CR4: 00000000001706e0
[  +0.001740] Fixing recursive fault but reboot is needed!
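
One possible reading of the register dump above, offered as an interpretation rather than a confirmed diagnosis: RSI, R13 and R15 all hold 0x1629a, while the faulting address in CR2 and R12 is 0x58a68, which is exactly 0x1629a * 4. That is what a dword-indexed MMIO read (base + reg * 4) would produce if the MMIO base pointer were NULL, for example because the mapping was already torn down by the unplug. The helper name mmio_read_address is hypothetical and only restates the arithmetic.

```c
#include <stdint.h>

/*
 * Address that a dword-indexed MMIO register read would touch:
 * base + reg * 4. Hypothetical helper, not driver code.
 */
uint64_t mmio_read_address(uint64_t mmio_base, uint32_t reg)
{
    return mmio_base + (uint64_t)reg * 4;
}
```

With mmio_base = 0 and reg = 0x1629a this yields 0x58a68, matching CR2 in the oops.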
On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky <andrey.grodzov...@amd.com> wrote:
I retested the hot plug tests at the commit I mentioned below - looks ok. My ASIC is Navi 10; I also tested using Vega 10 and older Polaris ASICs (whatever I had at home at the time). It's possible there are extra issues in ASICs like yours which I didn't cover during tests.
andrey@andrey-test:~/drm$ sudo ./build/tests/amdgpu/amdgpu_test -s 13
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support VCE, suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


The ASIC NOT support UVD ENC, suite disabled.
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory


Don't support TMZ (trust memory zone), security suite disabled
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
/usr/local/share/libdrm/amdgpu.ids: No such file or directory
Peer device is not opened or has ASIC not supported by the suite, skip all Peer to Peer tests.
     CUnit - A unit testing framework for C - Version 2.1-3
http://cunit.sourceforge.net/


Suite: Hotunplug Tests
Test: Unplug card and rescan the bus to plug it back.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Same as first test but with command submission.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed
Test: Unplug with exported bo.../usr/local/share/libdrm/amdgpu.ids: No such file or directory
passed

Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites     14      1    n/a      0      0
               tests     71      3      3      0      1
             asserts     21     21     21      0    n/a

Elapsed time =    9.195 seconds


Andrey

On 2022-04-20 11:44, Andrey Grodzovsky wrote:
The only one on Radeon 7 I see is the same sysfs crash we already fixed, so you can use the same fix. The MI 200 issue I haven't seen yet, but I also haven't tested MI200, so I never saw it before. I need to test when I get the time.
So try that fix with Radeon 7 again to see if you pass the tests (the warnings should all be minor issues).
Andrey


On 2022-04-20 05:24, Shuotao Xu wrote:
That's a problem. The latest working baseline I tested and confirmed passing the hotplug tests is this branch and commit https://gitlab.freedesktop.org/agd5f/linux/-/commit/86e12a53b73135806e101142e72f3f1c0e6fa8e6 which is amd-staging-drm-next. 5.14 was the branch where we upstreamed the hotplug code, but it had a lot of regressions over time due to new changes (that's why I added the hotplug test to try and catch them early). It would be best to run this branch on MI-100 so we have a clean baseline, and only after confirming that this particular branch at this commit passes the libdrm tests start adding the KFD specific addons. Another option, if you can't work with MI-100 and this branch, is to try a different ASIC that does work with this branch (if possible).
Andrey
OK, I tried both this commit and the HEAD of amd-staging-drm-next on two GPUs (MI100 and Radeon VII); both did not pass the hot plugout libdrm test. I might be able to gain access to MI200, but I suspect it would work.
I copied the complete dmesgs as follows. I highlighted the OOPSES for you.
Radeon VII:

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD
