Hi Cory,

Cordell Bloor, on 2023-01-18:
> I've updated the rocrand package sources on Salsa to rocrand 5.3.3 and
> transformed it into a MUT package. I've confirmed that the resulting library
> works correctly using it to configure rocfft 5.4.2 (which was how I
> discovered this bug originally).

Thanks for this. I began doing the same yesterday, but shifted my
attention to rocm-hipamd for the reason you mention below.

> rocrand 5.3.3-1 just needs three things to be ready for upload:
>
> 1. The d/copyright file needs to be updated for the new version.

Acknowledged. Note for later: in particular, the inclusion of the
hipRAND directory needs to be checked. (I'll be travelling this
weekend, so I won't be very responsive until next week.)

> 2. The symbol tracking needs to be reviewed by somebody more experienced
> than me. I think that anything in the rocrand::detail or
> rocrand_host::detail namespace should be marked optional, as those symbols
> are not intended for use by library users.

Ideally they should not be exposed at all (by the means that the
build flag -fvisibility=hidden provides, but I'm not sure of the
implementation details on the upstream side, to be honest). If the
symbols are not part of the public interface but are still
referenced, and we are sure they are unused by reverse dependencies,
they can probably be marked (optional). The library soversion
suggests that the stable part of the ABI has not been broken, so I
guess the (optional) marker is fine.

> 3. The rocm-hipamd 5.2.3-3 package needs to be uploaded or
> libclang-rt-15-dev must be added to the rocrand build dependencies.

I wanted to take that opportunity to stabilize the test suite of
rocm-hipamd, but I'm currently failing on:

test 103
Start 103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst
103: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/ipc/hipMultiProcIpcMem " " "--N" "4"
103: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
103: Environment variables:
103: HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
103: Test timeout computed to be: 1500
103: KFD does not support xnack mode query.
103: ROCr must assume xnack is disabled.
103: error: 'hipErrorInvalidDevicePointer'(17) from hipIpcGetMemHandle(&ipc_handle, ipc_offset_dptr) at /<<PKGBUILDDIR>>/hip/tests/src/ipc/hipMultiProcIpcMem.cpp:55
103: error: API returned error code.
103: error: TEST FAILED
103:
103/414 Test #103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst ...Subprocess aborted***Exception: 792.07 sec

A later test then crashes:

test 126
Start 126: directed_tests/printf/hipPrintfManyWaves.tst
126: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/printf/hipPrintfManyWaves " "
126: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
126: Environment variables:
126: HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
126: Test timeout computed to be: 1500
126: KFD does not support xnack mode query.
126: ROCr must assume xnack is disabled.
126: Memory access fault by GPU node-1 (Agent handle: 0x562e8fbb1e00) on address (nil)(may not be exact address). Reason: DRAM ECC failure.
126: Nearby memory map:
126: 0x7f5497800000, 0x78c000, System
126: 0x7f549ac00000, 0x100000, VRAM
126: 0x7f549af00000, 0x80000, System
126:
126: PtrInfo:
126: Address: 0x7f5497800000-0x7f5497f8c000/0x7f5497800000-0x7f5497f8c000
126: Size: 0x78c000
126: Type: 1
126: Owner: 0x562e8fbac4b0
126: CanAccess: 1
126: 0x562e8fbb1e00
126: In block: 0x7f5497800000, 0x78c000
126: PtrInfo:
126: Address: 0x7f549ac00000-0x7f549ad00000/0x7f549ac00000-0x7f549ad00000
126: Size: 0x100000
126: Type: 1
126: Owner: 0x562e8fbb1e00
126: CanAccess: 1
126: 0x562e8fbb1e00
126: In block: 0x7f549ac00000, 0x200000
126: PtrInfo:
126: Address: 0x7f549af00000-0x7f549af80000/0x7f549af00000-0x7f549af80000
126: Size: 0x80000
126: Type: 1
126: Owner: 0x562e8fbac4b0
126: CanAccess: 1
126: 0x562e8fbb1e00
126: In block: 0x7f549af00000, 0x80000
126: hipPrintfManyWaves: ./src/core/runtime/runtime.cpp:1276: static bool rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.
126/414 Test #126: directed_tests/printf/hipPrintfManyWaves.tst ...Subprocess aborted***Exception: 0.64 sec

At about the same time as #126, I get a kernel NULL pointer
dereference:

amdgpu: sq_intr: error, se 2, data 0x25, sh 0, priv 0, wave_id 0, simd_id 0, cu_id 0, err_type 4
amdgpu 0000:0b:00.0: amdgpu: RAS poison consumption, unmap queue flow succeeded: client id 10
BUG: kernel NULL pointer dereference, address: 00000000000001b0
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 206 Comm: kworker/7:1H Not tainted 6.1.0-1-amd64 #1 Debian 6.1.4-1
Hardware name: Gigabyte Technology Co., Ltd. X570 UD/X570 UD, BIOS F3 09/04/2019
Workqueue: KFD IH interrupt_wq [amdgpu]
RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
FS: 0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0
Call Trace:
<TASK>
smu_get_ecc_info+0x1f/0x30 [amdgpu]
amdgpu_dpm_get_ecc_info+0x39/0x60 [amdgpu]
amdgpu_umc_do_page_retirement.constprop.0+0x38/0x170 [amdgpu]
amdgpu_umc_poison_handler+0x64/0xb0 [amdgpu]
amdgpu_amdkfd_ras_poison_consumption_handler+0x48/0x70 [amdgpu]
interrupt_wq+0xcf/0x120 [amdgpu]
process_one_work+0x1c7/0x380
worker_thread+0x4d/0x380
? _raw_spin_lock_irqsave+0x23/0x50
? rescuer_thread+0x3a0/0x3a0
kthread+0xe9/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Modules linked in: overlay cpufreq_userspace cpufreq_powersave cpufreq_ondemand cpufreq_conservative binfmt_misc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd snd_hda_codec_realtek kvm snd_hda_codec_generic ledtrig_audio irqbypass snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 gpu_sched snd_hda_intel sha512_generic snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi video snd_hda_codec drm_display_helper snd_hda_core cec rc_core snd_hwdep aesni_intel snd_pcm drm_ttm_helper crypto_simd ttm cryptd snd_timer drm_kms_helper gigabyte_wmi rapl snd pcspkr ccp wmi_bmof i2c_algo_bit sp5100_tco watchdog k10temp soundcore rng_core evdev button acpi_cpufreq sg parport_pc ppdev lp drm parport fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs btrfs zstd_compress raid1 dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod xor raid6_pq libcrc32c crc32c_generic dm_mod sd_mod hid_generic usbhid hid ahci nvme xhci_pci libahci xhci_hcd nvme_core libata r8169 t10_pi realtek mdio_devres crc32_pclmul crc64_rocksoft crc32c_intel crc64 usbcore libphy crc_t10dif scsi_mod i2c_piix4 crct10dif_generic scsi_common usb_common crct10dif_pclmul crct10dif_common wmi
CR2: 00000000000001b0
---[ end trace 0000000000000000 ]---
RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
FS: 0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0

I'm redoing a build without running the test suite for upload, but I
had to forcefully reboot the workstation, so this doesn't feel ideal
for now. Thankfully, this doesn't seem to affect the test suites of
reverse dependencies, as far as I have seen so far.

The kernel version, for future reference:

$ uname -srv
Linux 6.1.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 (2023-01-07)

Have a nice day, :)
-- 
Étienne Mollier <emoll...@emlwks999.eu>
Fingerprint: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
Sent from /dev/tty1, please excuse my verbosity.
On air: Spock's Beard - When She's Gone