On Mon, Nov 01 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
> On Mon, Nov 01 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>> On Mon, Nov 01 2021, Martin Pieuchot <m...@openbsd.org> wrote:
>>> On 31/10/21(Sun) 15:57, Jeremie Courreges-Anglas wrote:
>>>> On Fri, Oct 08 2021, Jeremie Courreges-Anglas <j...@wxcvbn.org> wrote:
>>>> > riscv64.ports was running dpb(1) with two other members in the build
>>>> > cluster. A few minutes ago I found it in ddb(4). The report is short,
>>>> > sadly, as the machine doesn't return from the 'bt' command.
>>>> >
>>>> > The machine is acting both as an NFS server and an NFS client.
>>>> >
>>>> > OpenBSD/riscv64 (riscv64.ports.openbsd.org) (console)
>>>> >
>>>> > login: panic: pool_anic:t: pol_ free l: p mod fiee liat m oxifief:c a2e
>>>> > 07ff0ff fte21ade0 00f ifem c0d
>>>> > 1 07f1f0ffcf2177 010=0 c16ce6 7x090xc52c !
>>>> > 0x9066d21 919 xc1521
>>>> > Stopped at panic+0xfe: addi a0,zero,256
>>>> > TID PID UID PR FLAGS PFLAGS CPU COMMAND
>>>> > 24243 43192 55 0x2 0 0 cc
>>>> > *480349 52543 0 0x11 0 1 perl
>>>> > 480803 72746 55 0x2 0 3 c++
>>>> > 366351 3003 55 0x2 0 2K c++
>>>> > panic() at panic+0xfa
>>>> > panic() at pool_do_get+0x29a
>>>> > pool_do_get() at pool_get+0x76
>>>> > pool_get() at pmap_enter+0x128
>>>> > pmap_enter() at uvm_fault_upper+0x1c2
>>>> > uvm_fault_upper() at uvm_fault+0xb2
>>>> > uvm_fault() at do_trap_user+0x120
>>>> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
>>>> > reports. Insufficient info makes it difficult to find and fix bugs.
>>>> > ddb{1}> bt
>>>> > panic() at panic+0xfa
>>>> > panic() at pool_do_get+0x29a
>>>> > pool_do_get() at pool_get+0x76
>>>> > pool_get() at pmap_enter+0x128
>>>> > pmap_enter() at uvm_fault_upper+0x1c2
>>>> > uvm_fault_upper() at uvm_fault+0xb2
>>>> > uvm_fault() at do_trap_user+0x120
>>>> > do_trap_user() at cpu_exception_handler_user+0x7a
>>>> > <hangs>
>>>>
>>>> Another panic on riscv64-1, a new board which doesn't have RTC/I2C
>>>> problems anymore and is acting as a dpb(1) cluster member/NFS client.
>>>
>>> Why are both traces ending in pool_do_get()? Are CPU0 and CPU1 there at
>>> the same time?
>>>
>>> This corruption as well as the one above arise in the top part of the
>>> fault handler which already runs concurrently. Did you try putting
>>> KERNEL_LOCK/UNLOCK() dances around uvm_fault() in trap.c? That could
>>> help figure out if something is still unsafe in riscv64's pmap.
>
> I'll try that on the ports bulk build machines. After all, that's where
> I hit most/all the panics and clang crashes.
>
>> On my riscv64 I did add locking around the two uvm_fault() calls as
>> suggested, rebooted, then started building libcrypto and libssl and left
>> the place. Sadly the box is now unreachable (panic?) and will stay as
>> is for the next days. I'll get back to it on sunday.
>
> That was a bit premature, I finally managed to remotely connect to the
> machine. No idea why I couldn't connect to it for so long.

[...]
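To make the experiment concrete: the "KERNEL_LOCK/UNLOCK() dance" is just
bracketing the uvm_fault() calls in the user fault path of
sys/arch/riscv64/riscv64/trap.c with the big lock.  A minimal sketch of the
idea, not the exact diff I ran; 'map', 'va', 'access_type' and 'error' stand
for whatever the surrounding handler already uses, and KERNEL_LOCK() /
KERNEL_UNLOCK() come from <sys/systm.h>:

	/*
	 * Experiment: serialize the user page-fault path on the kernel
	 * lock so uvm_fault() no longer runs concurrently, and see
	 * whether the pool/pmap corruption goes away.
	 */
	KERNEL_LOCK();
	error = uvm_fault(map, va, 0, access_type);
	KERNEL_UNLOCK();

The same bracket went around the second uvm_fault() call in that file.
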
In the end I really did run a kernel with KERNEL_LOCK/UNLOCK added around
uvm_fault() in trap.c, on all riscv64*.p machines.  The result was a crash
of both riscv64-1.p and riscv64.p in less than two hours, something I had
never seen before.  So while kernel-locking uvm_fault() again didn't fix
the crashes, maybe it pushed uvm into crashing more consistently?  One way
to know would be to run more experiments, but I can't reboot those
machines at will... :-/

-- 
jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7EE