On Mon, Nov 01 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
> On Mon, Nov 01 2021, Martin Pieuchot <m...@openbsd.org> wrote:
>> On 31/10/21(Sun) 15:57, Jeremie Courreges-Anglas wrote:
>>> On Fri, Oct 08 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>>> > riscv64.ports was running dpb(1) with two other members in the build
>>> > cluster.  A few minutes ago I found it in ddb(4).  The report is short,
>>> > sadly, as the machine doesn't return from the 'bt' command.
>>> >
>>> > The machine is acting both as an NFS server and an NFS client.
>>> >
>>> > OpenBSD/riscv64 (riscv64.ports.openbsd.org) (console)
>>> >
>>> > login: panic: pool_anic:t: pol_ free l: p mod fiee liat m oxifief:c a2e
>>> > 07ff0ff fte21ade0 00f ifem c0d
>>> > 1 07f1f0ffcf2177 010=0 c16ce6 7x090xc52c !
>>> > 0x9066d21 919 xc1521
>>> > Stopped at      panic+0xfe:     addi    a0,zero,256
>>> >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>>> >   24243  43192     55         0x2          0    0  cc
>>> > *480349  52543      0        0x11          0    1  perl
>>> >  480803  72746     55         0x2          0    3  c++
>>> >  366351   3003     55         0x2          0    2K c++
>>> > panic() at panic+0xfa
>>> > panic() at pool_do_get+0x29a
>>> > pool_do_get() at pool_get+0x76
>>> > pool_get() at pmap_enter+0x128
>>> > pmap_enter() at uvm_fault_upper+0x1c2
>>> > uvm_fault_upper() at uvm_fault+0xb2
>>> > uvm_fault() at do_trap_user+0x120
>>> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
>>> > reports.  Insufficient info makes it difficult to find and fix bugs.
>>> > ddb{1}> bt
>>> > panic() at panic+0xfa
>>> > panic() at pool_do_get+0x29a
>>> > pool_do_get() at pool_get+0x76
>>> > pool_get() at pmap_enter+0x128
>>> > pmap_enter() at uvm_fault_upper+0x1c2
>>> > uvm_fault_upper() at uvm_fault+0xb2
>>> > uvm_fault() at do_trap_user+0x120
>>> > do_trap_user() at cpu_exception_handler_user+0x7a
>>> > <hangs>
>>>
>>> Another panic on riscv64-1, a new board which doesn't have RTC/I2C
>>> problems anymore and is acting as a dpb(1) cluster member/NFS client.
>>
>> Why are both traces ending in pool_do_get()?  Are CPU0 and CPU1 there at
>> the same time?
>>
>> This corruption as well as the one above arise in the top part of the
>> fault handler which already runs concurrently.  Did you try putting
>> KERNEL_LOCK/UNLOCK() dances around uvm_fault() in trap.c?  That could
>> help figure out if something is still unsafe in riscv64's pmap.
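
For readers following along, the suggested experiment amounts to roughly
the pattern below.  This is only a user-space sketch, not the actual
riscv64 trap.c change: uvm_fault_stub() and handle_user_fault() are
hypothetical names, and an atomic_flag spinlock merely stands in for the
real kernel lock behind KERNEL_LOCK()/KERNEL_UNLOCK().

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: a spinlock models the big kernel lock for this sketch. */
static atomic_flag kernel_flag = ATOMIC_FLAG_INIT;
static int kernel_lock_held;	/* only written while the lock is owned */
static int faults_handled;	/* counts calls into the stub below */

#define KERNEL_LOCK() do {						\
	while (atomic_flag_test_and_set(&kernel_flag))			\
		;	/* spin until the previous owner releases */	\
	kernel_lock_held = 1;						\
} while (0)

#define KERNEL_UNLOCK() do {						\
	kernel_lock_held = 0;						\
	atomic_flag_clear(&kernel_flag);				\
} while (0)

/*
 * Hypothetical stand-in for uvm_fault(); the real function takes a map,
 * a faulting address, a fault type and an access type.
 */
static int
uvm_fault_stub(unsigned long va)
{
	(void)va;
	assert(kernel_lock_held);	/* caller must have serialized us */
	faults_handled++;
	return 0;
}

/* The "dance": take the lock, run the fault handler, drop the lock. */
int
handle_user_fault(unsigned long va)
{
	int error;

	KERNEL_LOCK();
	error = uvm_fault_stub(va);
	KERNEL_UNLOCK();
	return error;
}
```

If the panics disappear with the fault path serialized like this, that
would point at something in the pmap still relying on the kernel lock.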

I'll try that on the ports bulk build machines.  After all, that's where
I hit most/all the panics and clang crashes.

> On my riscv64 I did add locking around the two uvm_fault() calls as
> suggested, rebooted, then started building libcrypto and libssl and left
> the place.  Sadly the box is now unreachable (panic?) and will stay as
> is for the next days.  I'll get back to it on Sunday.

That was a bit premature, I finally managed to remotely connect to the
machine.  No idea why I couldn't connect to it for so long.  Either
a problem with the provider/router, or something wrong regarding
riscv64, slaacd and the router?

  slaacd[28738]: sendmsg: Can't assign requested address

seems to happen at each reboot.  Will have to investigate.

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF  DDCC 0DFA 74AE 1524 E7EE