On 1/18/23 15:03, Étienne Mollier wrote:
2. The symbol tracking needs to be reviewed by somebody more experienced
than me. I think that anything in the rocrand::detail or
rocrand_host::detail namespace should be marked optional, as those symbols
are not intended for use by library users.

Ideally they should not be exposed at all (for example by means of
the -fvisibility=hidden build flag, but to be honest I'm not sure of
the implementation details on the upstream side).  If the symbols are
not part of the public interface but are still referenced, and we are
sure they are unused by reverse dependencies, they can probably
be marked (optional).  The library soversion suggests the stable
part of the ABI has not been broken, so I guess the
(optional) marker is fine.

After further thought, I've added a patch to hide the kernels by marking them as static and removed all optional symbols from the symbol tracking. The ABI is a lot easier to verify with all that junk hidden.
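
For anyone following along, the idea can be sketched in plain C++ as below. This is only an illustration under my own assumptions, not the actual patch: the names example_detail, scale_impl and example_scale are hypothetical and do not come from rocRAND, an ordinary host function stands in for a GPU kernel, and the visibility attribute is the GCC/Clang spelling. A static function has internal linkage, so it is not exported from the shared library, whereas the function with default visibility stays part of the ABI.

```cpp
// Sketch only: every name here is hypothetical, not taken from rocRAND.

namespace example_detail {

// Internal linkage: `static` keeps this helper out of the shared
// library's dynamic symbol table, so it cannot leak into the ABI.
static int scale_impl(int x)
{
    return 2 * x;
}

} // namespace example_detail

// Public entry point: explicitly given default visibility so that it is
// still exported when the library is built with -fvisibility=hidden.
extern "C" __attribute__((visibility("default"))) int example_scale(int x)
{
    return example_detail::scale_impl(x);
}
```

Building this as a shared library and listing its dynamic symbols (for example with nm -D) should show example_scale but not scale_impl.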

I wanted to take that opportunity to stabilize the test suite of
rocm-hipamd, but I'm currently failing on:

        test 103
                Start 103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst
        
        103: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/ipc/hipMultiProcIpcMem " " "--N" "4"
        103: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        103: Environment variables:
        103:  HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        103: Test timeout computed to be: 1500
        103: KFD does not support xnack mode query.
        103: ROCr must assume xnack is disabled.
        103: error: 'hipErrorInvalidDevicePointer'(17) from hipIpcGetMemHandle(&ipc_handle, ipc_offset_dptr) at /<<PKGBUILDDIR>>/hip/tests/src/ipc/hipMultiProcIpcMem.cpp:55
        103: error: API returned error code.
        103: error: TEST FAILED
        103:
        103/414 Test #103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst .......................................................................................Subprocess aborted***Exception: 792.07 sec

A later test then crashes:

        test 126
                Start 126: directed_tests/printf/hipPrintfManyWaves.tst
        
        126: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/printf/hipPrintfManyWaves " "
        126: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        126: Environment variables:
        126:  HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        126: Test timeout computed to be: 1500
        126: KFD does not support xnack mode query.
        126: ROCr must assume xnack is disabled.
        126: Memory access fault by GPU node-1 (Agent handle: 0x562e8fbb1e00) on address (nil)(may not be exact address). Reason: DRAM ECC failure.
        126: Nearby memory map:
        126: 0x7f5497800000, 0x78c000, System
        126: 0x7f549ac00000, 0x100000, VRAM
        126: 0x7f549af00000, 0x80000, System
        126:
        126: PtrInfo:
        126:    Address: 0x7f5497800000-0x7f5497f8c000/0x7f5497800000-0x7f5497f8c000
        126:    Size: 0x78c000
        126:    Type: 1
        126:    Owner: 0x562e8fbac4b0
        126:    CanAccess: 1
        126:            0x562e8fbb1e00
        126:    In block: 0x7f5497800000, 0x78c000
        126: PtrInfo:
        126:    Address: 0x7f549ac00000-0x7f549ad00000/0x7f549ac00000-0x7f549ad00000
        126:    Size: 0x100000
        126:    Type: 1
        126:    Owner: 0x562e8fbb1e00
        126:    CanAccess: 1
        126:            0x562e8fbb1e00
        126:    In block: 0x7f549ac00000, 0x200000
        126: PtrInfo:
        126:    Address: 0x7f549af00000-0x7f549af80000/0x7f549af00000-0x7f549af80000
        126:    Size: 0x80000
        126:    Type: 1
        126:    Owner: 0x562e8fbac4b0
        126:    CanAccess: 1
        126:            0x562e8fbb1e00
        126:    In block: 0x7f549af00000, 0x80000
        126: hipPrintfManyWaves: ./src/core/runtime/runtime.cpp:1276: static bool rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.
        126/414 Test #126: directed_tests/printf/hipPrintfManyWaves.tst ........................................................................................Subprocess aborted***Exception:   0.64 sec

That's unfortunate. It can be quite difficult to tell why these things fail.

At about the same time as test #126, I get a kernel NULL pointer
dereference:

        amdgpu: sq_intr: error, se 2, data 0x25, sh 0, priv 0, wave_id 0, simd_id 0, cu_id 0, err_type 4
        amdgpu 0000:0b:00.0: amdgpu: RAS poison consumption, unmap queue flow succeeded: client id 10
        BUG: kernel NULL pointer dereference, address: 00000000000001b0
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP NOPTI
        CPU: 7 PID: 206 Comm: kworker/7:1H Not tainted 6.1.0-1-amd64 #1  Debian 6.1.4-1
        Hardware name: Gigabyte Technology Co., Ltd. X570 UD/X570 UD, BIOS F3 09/04/2019
        Workqueue: KFD IH interrupt_wq [amdgpu]
        RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
        Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
        RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
        RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
        RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
        RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
        R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
        R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
        FS:  0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         smu_get_ecc_info+0x1f/0x30 [amdgpu]
         amdgpu_dpm_get_ecc_info+0x39/0x60 [amdgpu]
         amdgpu_umc_do_page_retirement.constprop.0+0x38/0x170 [amdgpu]
         amdgpu_umc_poison_handler+0x64/0xb0 [amdgpu]
         amdgpu_amdkfd_ras_poison_consumption_handler+0x48/0x70 [amdgpu]
         interrupt_wq+0xcf/0x120 [amdgpu]
         process_one_work+0x1c7/0x380
         worker_thread+0x4d/0x380
         ? _raw_spin_lock_irqsave+0x23/0x50
         ? rescuer_thread+0x3a0/0x3a0
         kthread+0xe9/0x110
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork+0x22/0x30
         </TASK>
        Modules linked in: overlay cpufreq_userspace cpufreq_powersave cpufreq_ondemand cpufreq_conservative binfmt_misc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd snd_hda_codec_realtek kvm snd_hda_codec_generic ledtrig_audio irqbypass snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 gpu_sched snd_hda_intel sha512_generic snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi video snd_hda_codec drm_display_helper snd_hda_core cec rc_core snd_hwdep aesni_intel snd_pcm drm_ttm_helper crypto_simd ttm cryptd snd_timer drm_kms_helper gigabyte_wmi rapl snd pcspkr ccp wmi_bmof i2c_algo_bit sp5100_tco watchdog k10temp soundcore rng_core evdev button acpi_cpufreq sg parport_pc ppdev lp drm parport fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs btrfs zstd_compress raid1 dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod xor raid6_pq libcrc32c crc32c_generic dm_mod sd_mod hid_generic usbhid hid ahci nvme
         xhci_pci libahci xhci_hcd nvme_core libata r8169 t10_pi realtek mdio_devres crc32_pclmul crc64_rocksoft crc32c_intel crc64 usbcore libphy crc_t10dif scsi_mod i2c_piix4 crct10dif_generic scsi_common usb_common crct10dif_pclmul crct10dif_common wmi
        CR2: 00000000000001b0
        ---[ end trace 0000000000000000 ]---
        RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
        Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
        RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
        RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
        RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
        RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
        R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
        R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
        FS:  0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0

If there's a NULL pointer dereference in amdgpu, I would presume that's a kernel bug. I don't think the HIP library has any special powers that another user program wouldn't have.

This is the sort of thing where we would really benefit from a CI for the GPU code. We run the tests so infrequently that there are a lot of changes in the components and their dependencies between runs. If we reran the tests every time there was a relevant change, like a kernel update, it would help to narrow down the source of these problems. Also, if wishes were fishes, I'd never go hungry.

Thanks,
Cory Bloor
