Hi Cory,

Cordell Bloor, on 2023-01-18:
> I've updated the rocrand package sources on Salsa to rocrand 5.3.3 and
> transformed it into a MUT package. I've confirmed that the resulting library
> works correctly using it to configure rocfft 5.4.2 (which was how I
> discovered this bug originally).

Thanks for this.  I began doing the same yesterday, but shifted
my attention to rocm-hipamd for the reason you mention below.

> rocrand 5.3.3-1 just needs three things to be ready for upload:
> 
> 1. The d/copyright file needs to be updated for the new version.

Acknowledged.  Note for later: the inclusion of the hipRAND
directory, notably, needs to be checked.  (I'll be travelling
this weekend, so I won't be very responsive until next week.)

> 2. The symbol tracking needs to be reviewed by somebody more experienced
> than me. I think that anything in the rocrand::detail or
> rocrand_host::detail namespace should be marked optional, as those symbols
> are not intended for use by library users.

Ideally they should not be exposed at all (for instance by means
of the -fvisibility=hidden build flag, though to be honest I'm
not sure of the implementation details on the upstream side).
If the symbols are not part of the public interface yet are
still referenced, and we are sure they are unused by reverse
dependencies, they can probably be marked (optional).  The
library soversion suggests the stable part of the ABI has not
been broken, so I guess the (optional) marker is fine.
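
To make the (optional) tagging more concrete, here is a rough
Python sketch of how the affected lines of a d/symbols file could
be tagged automatically.  It is only an illustration, not tooling
that exists in the package: the function names are hypothetical,
and matching the mangled namespace prefix by substring is my own
assumption about what is a good enough heuristic.  It only looks
at mangled names, so entries already written in demangled (c++)
form are left alone.

```python
import re

# Itanium C++ name mangling writes each namespace as <length><name>, so
# rocrand::detail shows up as "7rocrand6detail" in a mangled symbol.
# Assumption: a plain substring test is a good enough heuristic here.
INTERNAL_NAMESPACES = ("7rocrand6detail", "12rocrand_host6detail")

# Symbol lines start with a single space, then optional "(tag|tag)" tags.
SYMBOL_LINE = re.compile(r"^ (?:\((?P<tags>[^)]*)\))?(?P<rest>\S.*)$")


def mark_internal_optional(line: str) -> str:
    """Return `line` (no trailing newline) with the optional tag added
    when it is a mangled symbol from an internal namespace."""
    match = SYMBOL_LINE.match(line)
    if match is None:
        return line  # library header, "* field" or "| alternative" line
    rest = match.group("rest")
    if not any(ns in rest for ns in INTERNAL_NAMESPACES):
        return line
    tags = match.group("tags").split("|") if match.group("tags") else []
    if any(tag.split("=")[0] == "optional" for tag in tags):
        return line  # already tagged
    return " (" + "|".join(tags + ["optional"]) + ")" + rest


def mark_symbols_file(text: str) -> str:
    """Apply mark_internal_optional() to every line of a symbols file."""
    lines = (mark_internal_optional(l) for l in text.splitlines())
    return "\n".join(lines) + "\n"
```

For instance, a made-up entry such as
" _ZN7rocrand6detail3fooEv@Base 5.3.3" would come out as
" (optional)_ZN7rocrand6detail3fooEv@Base 5.3.3", while symbols
outside the two detail namespaces are returned unchanged.  The
result would still need a manual review before upload.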

> 3. The rocm-hipamd 5.2.3-3 package needs to be uploaded or
> libclang-rt-15-dev must be added to the rocrand build dependencies.

I wanted to take that opportunity to stabilize the test suite of
rocm-hipamd, but it is currently failing on:

        test 103
                Start 103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst
        
        103: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/ipc/hipMultiProcIpcMem " " "--N" "4"
        103: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        103: Environment variables: 
        103:  HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        103: Test timeout computed to be: 1500
        103: KFD does not support xnack mode query.
        103: ROCr must assume xnack is disabled.
        103: error: 'hipErrorInvalidDevicePointer'(17) from hipIpcGetMemHandle(&ipc_handle, ipc_offset_dptr) at /<<PKGBUILDDIR>>/hip/tests/src/ipc/hipMultiProcIpcMem.cpp:55
        103: error: API returned error code.
        103: error: TEST FAILED
        103: 
        103/414 Test #103: directed_tests/ipc/hipMultiProcIpcMem--N4.tst ........Subprocess aborted***Exception: 792.07 sec

A later test then crashes:

        test 126
                Start 126: directed_tests/printf/hipPrintfManyWaves.tst
        
        126: Test command: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/directed_tests/printf/hipPrintfManyWaves " "
        126: Working Directory: /<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        126: Environment variables: 
        126:  HIP_PATH=/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu
        126: Test timeout computed to be: 1500
        126: KFD does not support xnack mode query.
        126: ROCr must assume xnack is disabled.
        126: Memory access fault by GPU node-1 (Agent handle: 0x562e8fbb1e00) on address (nil)(may not be exact address). Reason: DRAM ECC failure.
        126: Nearby memory map:
        126: 0x7f5497800000, 0x78c000, System
        126: 0x7f549ac00000, 0x100000, VRAM
        126: 0x7f549af00000, 0x80000, System
        126: 
        126: PtrInfo:
        126:    Address: 0x7f5497800000-0x7f5497f8c000/0x7f5497800000-0x7f5497f8c000
        126:    Size: 0x78c000
        126:    Type: 1
        126:    Owner: 0x562e8fbac4b0
        126:    CanAccess: 1
        126:            0x562e8fbb1e00
        126:    In block: 0x7f5497800000, 0x78c000
        126: PtrInfo:
        126:    Address: 0x7f549ac00000-0x7f549ad00000/0x7f549ac00000-0x7f549ad00000
        126:    Size: 0x100000
        126:    Type: 1
        126:    Owner: 0x562e8fbb1e00
        126:    CanAccess: 1
        126:            0x562e8fbb1e00
        126:    In block: 0x7f549ac00000, 0x200000
        126: PtrInfo:
        126:    Address: 0x7f549af00000-0x7f549af80000/0x7f549af00000-0x7f549af80000
        126:    Size: 0x80000
        126:    Type: 1
        126:    Owner: 0x562e8fbac4b0
        126:    CanAccess: 1
        126:            0x562e8fbb1e00
        126:    In block: 0x7f549af00000, 0x80000
        126: hipPrintfManyWaves: ./src/core/runtime/runtime.cpp:1276: static bool rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.
        126/414 Test #126: directed_tests/printf/hipPrintfManyWaves.tst ........Subprocess aborted***Exception:   0.64 sec

At about the same time as test #126, I get a kernel NULL pointer
dereference:

        amdgpu: sq_intr: error, se 2, data 0x25, sh 0, priv 0, wave_id 0, simd_id 0, cu_id 0, err_type 4
        amdgpu 0000:0b:00.0: amdgpu: RAS poison consumption, unmap queue flow succeeded: client id 10
        BUG: kernel NULL pointer dereference, address: 00000000000001b0
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0 
        Oops: 0002 [#1] PREEMPT SMP NOPTI
        CPU: 7 PID: 206 Comm: kworker/7:1H Not tainted 6.1.0-1-amd64 #1  Debian 6.1.4-1
        Hardware name: Gigabyte Technology Co., Ltd. X570 UD/X570 UD, BIOS F3 09/04/2019
        Workqueue: KFD IH interrupt_wq [amdgpu]
        RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
        Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
        RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
        RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
        RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
        RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
        R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
        R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
        FS:  0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         smu_get_ecc_info+0x1f/0x30 [amdgpu]
         amdgpu_dpm_get_ecc_info+0x39/0x60 [amdgpu]
         amdgpu_umc_do_page_retirement.constprop.0+0x38/0x170 [amdgpu]
         amdgpu_umc_poison_handler+0x64/0xb0 [amdgpu]
         amdgpu_amdkfd_ras_poison_consumption_handler+0x48/0x70 [amdgpu]
         interrupt_wq+0xcf/0x120 [amdgpu]
         process_one_work+0x1c7/0x380
         worker_thread+0x4d/0x380
         ? _raw_spin_lock_irqsave+0x23/0x50
         ? rescuer_thread+0x3a0/0x3a0
         kthread+0xe9/0x110
         ? kthread_complete_and_exit+0x20/0x20
         ret_from_fork+0x22/0x30
         </TASK>
        Modules linked in: overlay cpufreq_userspace cpufreq_powersave 
cpufreq_ondemand cpufreq_conservative binfmt_misc nls_ascii nls_cp437 vfat fat 
intel_rapl_msr intel_rapl_common amdgpu edac_mce_amd kvm_amd 
snd_hda_codec_realtek kvm snd_hda_codec_generic ledtrig_audio irqbypass 
snd_hda_codec_hdmi ghash_clmulni_intel sha512_ssse3 gpu_sched snd_hda_intel 
sha512_generic snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi video 
snd_hda_codec drm_display_helper snd_hda_core cec rc_core snd_hwdep aesni_intel 
snd_pcm drm_ttm_helper crypto_simd ttm cryptd snd_timer drm_kms_helper 
gigabyte_wmi rapl snd pcspkr ccp wmi_bmof i2c_algo_bit sp5100_tco watchdog 
k10temp soundcore rng_core evdev button acpi_cpufreq sg parport_pc ppdev lp drm 
parport fuse efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs btrfs 
zstd_compress raid1 dm_raid raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx md_mod xor raid6_pq libcrc32c crc32c_generic dm_mod sd_mod 
hid_generic usbhid hid ahci nvme
         xhci_pci libahci xhci_hcd nvme_core libata r8169 t10_pi realtek 
mdio_devres crc32_pclmul crc64_rocksoft crc32c_intel crc64 usbcore libphy 
crc_t10dif scsi_mod i2c_piix4 crct10dif_generic scsi_common usb_common 
crct10dif_pclmul crct10dif_common wmi
        CR2: 00000000000001b0
        ---[ end trace 0000000000000000 ]---
        RIP: 0010:sienna_cichlid_get_ecc_info+0x8c/0xe0 [amdgpu]
        Code: e8 d9 cf 01 00 85 c0 0f 85 58 f4 2c 00 48 8b 83 18 01 00 00 48 89 ea 48 8d b0 80 01 00 00 0f b7 48 10 48 83 c0 18 48 83 c2 20 <66> 89 4a e0 0f b7 48 fa 66 89 4a e2 48 8b 48 e8 48 89 4a e8 48 8b
        RSP: 0018:ffff9bf540b17d30 EFLAGS: 00010202
        RAX: ffff891a4ae66018 RBX: ffff891a4c33f000 RCX: 0000000000000000
        RDX: 00000000000001d0 RSI: ffff891a4ae66180 RDI: ffff891a4ae66180
        RBP: 00000000000001b0 R08: 0000000000000000 R09: ffff9bf540b17ba8
        R10: 0000000000000003 R11: ffff89395f2f1c28 R12: ffff891a4c33f000
        R13: 0000000000000000 R14: ffff891a40e5a840 R15: ffff891a59ccce18
        FS:  0000000000000000(0000) GS:ffff8938debc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000001b0 CR3: 000000011bcc2000 CR4: 0000000000350ee0

I'm redoing a build without running the test suite for upload,
but I had to forcefully reboot the workstation, so this doesn't
feel ideal for now.  Thankfully this doesn't seem to affect
reverse dependencies' test suites, as far as I could witness so
far.  The kernel version, for future reference:

        $ uname -srv
        Linux 6.1.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 (2023-01-07)

Have a nice day,  :)
-- 
Étienne Mollier <emoll...@emlwks999.eu>
Fingerprint:  8f91 b227 c7d6 f2b1 948c  8236 793c f67e 8f0d 11da
Sent from /dev/tty1, please excuse my verbosity.
On air: Spock's Beard - When She's Gone
