Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-09-12 Thread Mark Millard
On 2022-Sep-12, at 05:10, Dmitry Salychev  wrote:

> 
> Hi,
> 
> It seems that the recent 14-CURRENT/aarch64 (866e021) with DPAA2 drivers
> panics under network throughtput stress test in random places

3 of your examples get a signal handler called at the
exact same instruction:

#6  0x004ced5c in witness_lock

The parameters vary, as do the callers:

#7  0x0043a3a8 in __mtx_lock_flags
(twice)
vs.
#7  0x0047d4ec in callout_lock
(once)

Showing one more level, where all are distinct:

#8  0x007d60a8 in dpaa2_swp_enq_mult (swp=swp@entry=0xa056ca00, 
ed=ed@entry=0xbcda2c70, fd=fd@entry=0xbcda2df8, 
flags=flags@entry=0xbcda2c6c, frames_n=frames_n@entry=1) at 
/usr/src/sys/dev/dpaa2/dpaa2_swp.c:795
vs.
#8  0x00508f54 in soreceive_generic (so=0x00011d2c2200, psa=0x0, 
uio=, mp0=, controlp=0x0, flagsp=) 
at /usr/src/sys/kern/uipc_socket.c:2240
vs.
#8  callout_reset_sbt_on (c=0x0001121792c0, sbt=, 
prec=, ftn=0x0047d4ec , 
arg=0x000112179000, cpu=0, flags=256) at 
/usr/src/sys/kern/kern_timeout.c:962
(no address shown)

Perhaps looking at what the code at 0x004ced5c
(and before) is doing with what kinds of data would be
useful compared to the less frequent example signal
handler invocations. It is common to all 3 call-chains
above. If dumps for them are around, more than the code
might be able to be looked into.


> with
> unknown kernel exception 0 esr_el1 200 on Ten64 board (based on
> NXP's LS1088A, Cortex-A53), but the same code doesn't panic on HoneyComb
> (NXP LX2160A, Cortex-A72) even after ~10h long tests.
> 
> I've gathered some stack backtraces from ddb and kgdb (attached).
> Panic itself can easily be reproduced after several minutes from the
> start of the test. I've tried to change PCPU_PTR macro to use get_pcpu
> again (as discussed in the thread earlier), but it didn't help.
> 
> If you want to get your hands dirty, DPAA2 stuff I'm using is at
> https://github.com/mcusim/freebsd-src/tree/lx2160acex7-exp (branch is
> lx2160acex7-exp!)
> 
> Any ideas or places to check would be really helpful.



===
Mark Millard
marklmi at yahoo.com




Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-09-12 Thread Dmitry Salychev
(kgdb) bt
#0  breakpoint () at /usr/src/sys/arm64/include/cpufunc.h:36
#1  kdb_enter (why=, msg=) at 
/usr/src/sys/kern/subr_kdb.c:508
#2  0x00460268 in vpanic (fmt=, ap=...) at 
/usr/src/sys/kern/kern_shutdown.c:967
#3  0x00460018 in panic (fmt=0x12 ) at /usr/src/sys/kern/kern_shutdown.c:903
#4  0x0077c05c in do_el1h_sync (td=0xa06d8000, 
frame=0xb23a24c0) at /usr/src/sys/arm64/arm64/trap.c:532
#5  
#6  0x00760d40 in arm_gic_v3_intr (arg=0xa21ab000) at 
/usr/src/sys/arm64/arm64/gic_v3.c:505
#7  0x0074be4c in intr_irq_handler (tf=0xb23a26e0) at 
/usr/src/sys/kern/subr_intr.c:327
#8  0x0074be4c in intr_irq_handler (tf=0xa06d8000) at 
/usr/src/sys/kern/subr_intr.c:327
#9  
#10 cpu_idle (busy=) at /usr/src/sys/arm64/arm64/machdep.c:257
#11 0x00491ab4 in sched_idletd (dummy=dummy@entry=0x0) at 
/usr/src/sys/kern/sched_ule.c:3070
#12 0x00416440 in fork_exit (callout=0x0049163c , 
arg=0x0, frame=0xb23a2990) at /usr/src/sys/kern/kern_fork.c:1102
#13 

---

(kgdb) bt
#0  breakpoint () at /usr/src/sys/arm64/include/cpufunc.h:36
#1  kdb_enter (why=, msg=) at 
/usr/src/sys/kern/subr_kdb.c:508
#2  0x00460268 in vpanic (fmt=, ap=...) at 
/usr/src/sys/kern/kern_shutdown.c:967
#3  0x00460018 in panic (fmt=0x12 ) at /usr/src/sys/kern/kern_shutdown.c:903
#4  0x0077c05c in do_el1h_sync (td=0xa24b5600, 
frame=0xbcda29c0) at /usr/src/sys/arm64/arm64/trap.c:532
#5  
#6  0x004ced5c in witness_lock (lock=lock@entry=0xa056ca20, 
flags=8, file=file@entry=0x0093011e 
"/usr/src/sys/dev/dpaa2/dpaa2_swp.c", line=line@entry=795) at 
/usr/src/sys/kern/subr_witness.c:1505
#7  0x0043a3a8 in __mtx_lock_flags (c=0xa056ca38, opts=0, 
file=0x0093011e "/usr/src/sys/dev/dpaa2/dpaa2_swp.c", line=795) at 
/usr/src/sys/kern/kern_mutex.c:290
#8  0x007d60a8 in dpaa2_swp_enq_mult (swp=swp@entry=0xa056ca00, 
ed=ed@entry=0xbcda2c70, fd=fd@entry=0xbcda2df8, 
flags=flags@entry=0xbcda2c6c, frames_n=frames_n@entry=1) at 
/usr/src/sys/dev/dpaa2/dpaa2_swp.c:795
#9  0x007d445c in dpaa2_io_enq_multiple_fq (iodev=, 
fqid=, fd=0x0093011e, frames_n=1473863424) at 
/usr/src/sys/dev/dpaa2/dpaa2_io.c:343
#10 0x007d7e3c in DPAA2_SWP_ENQ_MULTIPLE_FQ (fqid=205, 
fd=0xbcda2df8, frames_n=1, dev=) at ./dpaa2_swp_if.h:45
#11 dpaa2_ni_tx (sc=, tx=, m=) at 
/usr/src/sys/dev/dpaa2/dpaa2_ni.c:2717
#12 dpaa2_ni_transmit (ifp=, m=) at 
/usr/src/sys/dev/dpaa2/dpaa2_ni.c:2374
#13 0x005878bc in ether_output_frame (ifp=ifp@entry=0xa2550800, 
m=0x8, m@entry=0xa00032552900) at /usr/src/sys/net/if_ethersubr.c:513
#14 0x005876f4 in ether_output (ifp=0xa2550800, 
m=0xa00032552900, dst=0xa00028010b68, ro=0xa00028010b48) at 
/usr/src/sys/net/if_ethersubr.c:436
#15 0x005dff2c in ip_output (m=0x0093011e, 
m@entry=0xa00032552900, opt=, ro=0xa00028010b48, 
flags=, imo=0x0, inp=0xa000280109f0) at 
/usr/src/sys/netinet/ip_output.c:793
#16 0x005f7204 in tcp_default_output (tp=0x000111f72590) at 
/usr/src/sys/netinet/tcp_output.c:1542
#17 0x005f0ca4 in tcp_output (tp=0x000111f72590) at 
/usr/src/sys/netinet/tcp_var.h:407
#18 tcp_do_segment (m=0xa08004b26d00, th=0x00011d8ec0e2, 
so=0x000111f27d80, tp=0x000111f72590, drop_hdrlen=52, tlen=, iptos=) at /usr/src/sys/netinet/tcp_input.c:3294
#19 0x005ee094 in tcp_input_with_port (mp=, 
offp=, proto=, port=0) at 
/usr/src/sys/netinet/tcp_input.c:1425
#20 0x005eee70 in tcp_input (mp=0xa056ca20, offp=0x8, 
proto=9634078) at /usr/src/sys/netinet/tcp_input.c:1520
#21 0x005dbed4 in ip_input (m=0x0) at 
/usr/src/sys/netinet/ip_input.c:838
#22 0x005a4110 in netisr_dispatch_src (proto=1, source=0, 
m=0xa08004b26d00) at /usr/src/sys/net/netisr.c:1153
#23 0x005a44c0 in netisr_dispatch (proto=5687840, m=0x0093011e) 
at /usr/src/sys/net/netisr.c:1244
#24 0x00587a64 in ether_demux (ifp=ifp@entry=0xa2550800, m=0x8) 
at /usr/src/sys/net/if_ethersubr.c:925
#25 0x005890e0 in ether_input_internal (ifp=0xa2550800, m=0x8) 
at /usr/src/sys/net/if_ethersubr.c:711
#26 ether_nh_input (m=) at /usr/src/sys/net/if_ethersubr.c:741
#27 0x005a4110 in netisr_dispatch_src (proto=proto@entry=5, source=0, 
m=m@entry=0xa08004b26d00) at /usr/src/sys/net/netisr.c:1153
#28 0x005a44c0 in netisr_dispatch (proto=5687840, proto@entry=5, 
m=0x0093011e, m@entry=0xa08004b26d00) at 
/usr/src/sys/net/netisr.c:1244
#29 0x00587f24 in ether_input (ifp=0xa2550800, 
m=0xa08004b26d00) at /usr/src/sys/net/if_ethersubr.c:832
#30 0x007dc3c8 in dpaa2_ni_rx (chan=0xbe8f5000, fq=, fd=0xbefb1360) at /usr/src/sys/dev/dpaa2/dpaa2_ni.c:2870
#31 0x007dbf40

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-08 Thread bob prohaska
On Mon, Mar 07, 2022 at 11:45:02AM -0500, Mark Johnston wrote:
> On Mon, Mar 07, 2022 at 04:25:22PM +, Andrew Turner wrote:
> > 
> > > On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> > > ...
> > > A (the?) problem is that the compiler is treating "pc" as an alias
> > > for x18, but the rmlock code assumes that the pcpu pointer is loaded
> > > once, as it dereferences "pc" outside of the critical section.  On
> > > arm64, if a context switch occurs between the store at _rm_rlock+144 and
> > > the load at +152, and the thread is migrated to another CPU, then we'll
> > > end up using the wrong CPU ID in the rm->rm_writecpus test.
> > > 
> > > I suspect the problem is unique to arm64 as its get_pcpu()
> > > implementation is different from the others in that it doesn't use
> > > volatile-qualified inline assembly.  This has been the case since
> > > https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> > >  
> > > 
> > > .
> > > 
> > > I haven't been able to reproduce any crashes running poudriere in an
> > > arm64 AWS instance, though.  Could you please try the patch below and
> > > confirm whether it fixes your panics?  I verified that the apparent
> > > problem described above is gone with the patch.
> > 
> > Alternatively (or additionally) we could do something like the following. 
> > There are only a few MI users of get_pcpu with the main place being in rm 
> > locks.
> > 
> > diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> > index 09f6361c651c..59b890e5c2ea 100644
> > --- a/sys/arm64/include/pcpu.h
> > +++ b/sys/arm64/include/pcpu.h
> > @@ -58,7 +58,14 @@ struct pcpu;
> > 
> >  register struct pcpu *pcpup __asm ("x18");
> > 
> > -#defineget_pcpu()  pcpup
> > +static inline struct pcpu *
> > +get_pcpu(void)
> > +{
> > +   struct pcpu *pcpu;
> > +
> > +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
> > +   return (pcpu);
> > +}
> > 
> >  static inline struct thread *
> >  get_curthread(void)
> 
> Indeed, I think this is probably the best solution.

Just for fun I tried the patch on a Pi3 running -current, updated a day or two
prior. The patch applied, compiled and seemed to run acceptably, but when I 
left a -j2 -DWITH_META_MODE buildworld running it crashed overnight, reporting


login: panic: rm_rlock: recursed on non-recursive rmlock sysctl lock @ 
/usr/src/sys/kern/kern_sysctl.c:193

cpuid = 0
time = 1646720264
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
_rm_rlock_debug() at _rm_rlock_debug+0x214
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x140
sysctl_root() at sysctl_root+0x1ac
userland_sysctl() at userland_sysctl+0x140
sys___sysctl() at sys___sysctl+0x68
do_el0_sync() at do_el0_sync+0x520
handle_el0_sync() at handle_el0_sync+0x40
--- exception, esr 0x5600
KDB: enter: panic
[ thread pid 869 tid 100091 ]
Stopped at  kdb_enter+0x44: undefined   f902011f


I tried typing bt at the debugger prompt but got no more output. 

I've put the buildworld log file at
http://www.zefox.net/~fbsd/rpi3/crashes/20220307/

Hope this is of some use

bob prohaska





Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-08 Thread Andrew Turner


> On 7 Mar 2022, at 19:04, Mark Johnston  wrote:
> 
> On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard wrote:
>> 
>> 
>> On 2022-Mar-7, at 08:45, Mark Johnston  wrote:
>> 
>>> On Mon, Mar 07, 2022 at 04:25:22PM +, Andrew Turner wrote:
 
> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> ...
> A (the?) problem is that the compiler is treating "pc" as an alias
> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> once, as it dereferences "pc" outside of the critical section.  On
> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> the load at +152, and the thread is migrated to another CPU, then we'll
> end up using the wrong CPU ID in the rm->rm_writecpus test.
> 
> I suspect the problem is unique to arm64 as its get_pcpu()
> implementation is different from the others in that it doesn't use
> volatile-qualified inline assembly.  This has been the case since
> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
>  
> 
> .
> 
> I haven't been able to reproduce any crashes running poudriere in an
> arm64 AWS instance, though.  Could you please try the patch below and
> confirm whether it fixes your panics?  I verified that the apparent
> problem described above is gone with the patch.
 
 Alternatively (or additionally) we could do something like the following. 
 There are only a few MI users of get_pcpu with the main place being in rm 
 locks.
 
 diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
 index 09f6361c651c..59b890e5c2ea 100644
 --- a/sys/arm64/include/pcpu.h
 +++ b/sys/arm64/include/pcpu.h
 @@ -58,7 +58,14 @@ struct pcpu;
 
 register struct pcpu *pcpup __asm ("x18");
 
 -#defineget_pcpu()  pcpup
 +static inline struct pcpu *
 +get_pcpu(void)
 +{
 +   struct pcpu *pcpu;
 +
 +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
 +   return (pcpu);
 +}
 
 static inline struct thread *
 get_curthread(void)
>>> 
>>> Indeed, I think this is probably the best solution.

I’ve pushed the above to git in ed3066342660 & will MFC in a few days.

> 
> Thinking a bit more, even with that patch, code like this may not behave
> the same on arm64 as on other platforms:
> 
> critical_enter();
> ptr = &PCPU_GET(foo);
> critical_exit();
> bar = *ptr;
> 
> since as far as I can see the compiler may translate it to
> 
> critical_enter();
> critical_exit();
> bar = PCPU_GET(foo);

If we think this will be a problem we could change the PCPU_PTR macro to use 
get_pcpu again, however I only see two places it’s used in the MI code in 
subr_witness.c and kern_clock.c. Neither of these appear to be problematic from 
a quick look as there are no critical sections, although I’m not familiar 
enough with the code to know for certain.

Andrew

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 09:54:26PM +0100, Ronald Klop wrote:
>  
> Van: Mark Johnston 
> Datum: maandag, 7 maart 2022 16:13
> Aan: Ronald Klop 
> CC: bob prohaska , Mark Millard , 
> freebsd-...@freebsd.org, freebsd-current 
> > I haven't been able to reproduce any crashes running poudriere in an
> > arm64 AWS instance, though.  Could you please try the patch below and
> > confirm whether it fixes your panics?  I verified that the apparent
> > problem described above is gone with the patch.
> > 
> > diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
> > index 0cdcfb8fec62..e51c25136ae0 100644
> > --- a/sys/kern/kern_rmlock.c
> > +++ b/sys/kern/kern_rmlock.c
> > @@ -437,6 +437,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker 
> > *tracker, int trylock)
> >  {
> > struct thread *td = curthread;
> > struct pcpu *pc;
> > +   int cpuid;
> >  
> > if (SCHEDULER_STOPPED())
> > return (1);
> > @@ -452,6 +453,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker 
> > *tracker, int trylock)
> > atomic_interrupt_fence();
> >  
> > pc = get_pcpu();
> > +   cpuid = pc->pc_cpuid;
> > rm_tracker_add(pc, tracker);
> > sched_pin();
> >  
> > @@ -463,7 +465,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker 
> > *tracker, int trylock)
> >  * conditional jump.
> >  */
> > if (__predict_true(0 == (td->td_owepreempt |
> > -   CPU_ISSET(pc->pc_cpuid, &rm->rm_writecpus
> > +   CPU_ISSET(cpuid, &rm->rm_writecpus
> > return (1);
> >  
> > /* We do not have a read token and need to acquire one. */
> > 
> > 
> > 
> 
> Hi,
> 
> This patch paniced again:
> x0: a5a31500  
>
>   x1: a5a0e000
> 
>   x2:2
> 
>   x3: a00076c4e9a0
> 
>   x4:0
> 
>   x5:e672743c8f9e5
> 
>   x6:dc89f70500ab1
>   x7:   14
>   x8: a5a31518
>   x9:1
>  x10: a5a0e000
>  x11:0
>  x12:0
>  x13:a
>  x14: 1013e6b85a8ecbe4
>  x15: 1dce740d11a5
>  x16: 3ea86e2434bf
>  x17: fff2
>  x18: fe661800 (g_ctx + fcf9fa54)
>  x19: a00076c4e9a0
>  x20: fec39000 (g_ctx + fd577254)
>  x21:2
>  x22: 419b6090 (g_ctx + 402f42e4)
>  x23: 00c0b137 (lockstat_enabled + 0)
>  x24:  100
>  x25: 00c0b000 (version + a0)
>  x26: 00c0b000 (version + a0)
>  x27: 00c0b000 (version + a0)
>  x28:0
>  x29: fe661800 (g_ctx + fcf9fa54)
>   sp: fe661800
>   lr: 0154ea50 (zio_dva_throttle + 154)
>  elr: 0154ea80 (zio_dva_throttle + 184)
> spsr: 6045
>  far: 2b753286b0b8
> panic: Unknown kernel exception 0 esr_el1 200
> cpuid = 1
> time = 1646685857
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> do_el1h_sync() at do_el1h_sync+0x184
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x200
> zio_dva_throttle() at zio_dva_throttle+0x184
> zio_execute() at zio_execute+0x58
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f901c11f
> db>  

ZFS doesn't make use of rm locks as far as I can see, so this is a
little weird.  I reverted the original rmlock commit in main, so it may
be worth verifying that the problem really is gone before digging
deeper.  In other words, I'm a bit suspicious that this is a different
bug.



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Millard



On 2022-Mar-7, at 12:54, Ronald Klop  wrote:

> Van: Mark Johnston 
> Datum: maandag, 7 maart 2022 16:13
> Aan: Ronald Klop 
> CC: bob prohaska , Mark Millard , 
> freebsd-...@freebsd.org, freebsd-current 
> Onderwerp: Re: panic: data abort in critical section or under mutex (was: Re: 
> panic: Unknown kernel exception 0 esr_el1 200 (on 14-CURRENT/aarch64 Feb 
> 28))
> 
> On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:
> > Dear Mark Johnston,
> >
> > I did some binary search in the kernels and came to the conclusion that 
> > https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
> >  still works and 
> > https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
> >  panics.
> >
> > I suspect your commit in 
> > https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
> >
> > Last panic:
> >
> > panic: vm_fault failed: 0046e708 error 1
> > cpuid = 1
> > time = 1646660058
> > KDB: stack backtrace:
> > db_trace_self() at db_trace_self
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> > vpanic() at vpanic+0x174
> > panic() at panic+0x44
> > data_abort() at data_abort+0x2e8
> > handle_el1h_sync() at handle_el1h_sync+0x10
> > --- exception, esr 0x9604
> > _rm_rlock_debug() at _rm_rlock_debug+0x8c
> > osd_get() at osd_get+0x5c
> > zio_execute() at zio_execute+0xf8
> > taskqueue_run_locked() at taskqueue_run_locked+0x178
> > taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> > fork_exit() at fork_exit+0x74
> > fork_trampoline() at fork_trampoline+0x14
> > KDB: enter: panic
> > [ thread pid 0 tid 100129 ]
> > Stopped at  kdb_enter+0x44: undefined   f902011f
> > db>
> >
> > A more recent kernel (912df91) still panics. See below.
> >
> > Do you have time to look into this? What can I provide in information to 
> > help?
> 
> Hmm.  So after my rmlock commits, we have the following disassembly for
> _rm_rlock() (with a few annotations added by me).  Note that the pcpu
> pointer is stored in register x18 by convention.
> 
>0x0046e304 <+0>: stp x29, x30, [sp, #-16]!
>0x0046e308 <+4>: mov x29, sp
>0x0046e30c <+8>: ldr x8, [x18]
>0x0046e310 <+12>:ldr x9, [x18]
>0x0046e314 <+16>:ldr x10, [x18]
>0x0046e318 <+20>:cmp x9, x10
>0x0046e31c <+24>:b.ne0x0046e3cc <_rm_rlock+200>  
> // b.any
>0x0046e320 <+28>:ldr x9, [x18]
>0x0046e324 <+32>:ldrhw9, [x9, #314]
>0x0046e328 <+36>:cbnzw9, 0x0046e3c0 <_rm_rlock+188>
>0x0046e32c <+40>:str wzr, [x1, #32]
>0x0046e330 <+44>:stp x0, x8, [x1, #16]
>0x0046e334 <+48>:ldrbw9, [x0, #10]
>0x0046e338 <+52>:tbz w9, #4, 0x0046e358 
> <_rm_rlock+84>
>0x0046e33c <+56>:ldr x9, [x18]
>0x0046e340 <+60>:ldr w10, [x9, #888]
>0x0046e344 <+64>:add w10, w10, #0x1
>0x0046e348 <+68>:str w10, [x9, #888]
>0x0046e34c <+72>:ldr x9, [x18]
>0x0046e350 <+76>:ldr w9, [x9, #888]
>0x0046e354 <+80>:cbz w9, 0x0046e3f4 <_rm_rlock+240>
>0x0046e358 <+84>:ldr w9, [x8, #1212]
>0x0046e35c <+88>:add x10, x18, #0x90
>0x0046e360 <+92>:add w9, w9, #0x1
>0x0046e364 <+96>:str w9, [x8, #1212]  <--- 
> critical_enter
>0x0046e368 <+100>:   str x10, [x1, #8]
>0x0046e36c <+104>:   ldr x9, [x18, #144]
>0x0046e370 <+108>:   str x9, [x1]
>0x0046e374 <+112>:   str x1, [x9, #8]
>0x0046e378 <+116>:   str x1, [x18, #144]
>0x0046e37c <+120>:   ldr x9, [x18]
>0x0046e380 <+124>:   ldr w10, [x9, #356]
>0x0046e384 <+128>:   add w10, w10, #0x1
>0x0046e388 <+132>:   str w10, [x9, #356]
>0x0046e38c <+136>:   ldr w9, [x8, #1212]
>0x0046e390 <+140>:   sub w9, w9, #0x1
>0x0046e394 <+144>:   str w9, [x8, #1212]  <--- 
> critical_exit
>0x0046e398 &

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Ronald Klop


Van: Mark Johnston 
Datum: maandag, 7 maart 2022 16:13
Aan: Ronald Klop 
CC: bob prohaska , Mark Millard , 
freebsd-...@freebsd.org, freebsd-current 
Onderwerp: Re: panic: data abort in critical section or under mutex (was: Re: 
panic: Unknown kernel exception 0 esr_el1 200 (on 14-CURRENT/aarch64 Feb 
28))


On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:
> Dear Mark Johnston,
>
> I did some binary search in the kernels and came to the conclusion that 
https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e 
still works and 
https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a 
panics.
>
> I suspect your commit in 
https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
>
> Last panic:
>
> panic: vm_fault failed: 0046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x9604
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f902011f
> db>
>
> A more recent kernel (912df91) still panics. See below.
>
> Do you have time to look into this? What can I provide in information to help?

Hmm.  So after my rmlock commits, we have the following disassembly for
_rm_rlock() (with a few annotations added by me).  Note that the pcpu
pointer is stored in register x18 by convention.

   0x0046e304 <+0>: stp x29, x30, [sp, #-16]!
   0x0046e308 <+4>: mov x29, sp
   0x0046e30c <+8>: ldr x8, [x18]
   0x0046e310 <+12>:ldr x9, [x18]
   0x0046e314 <+16>:ldr x10, [x18]
   0x0046e318 <+20>:cmp x9, x10
   0x0046e31c <+24>:b.ne0x0046e3cc <_rm_rlock+200>  // 
b.any
   0x0046e320 <+28>:ldr x9, [x18]
   0x0046e324 <+32>:ldrhw9, [x9, #314]
   0x0046e328 <+36>:cbnzw9, 0x0046e3c0 <_rm_rlock+188>
   0x0046e32c <+40>:str wzr, [x1, #32]
   0x0046e330 <+44>:stp x0, x8, [x1, #16]
   0x0046e334 <+48>:ldrbw9, [x0, #10]
   0x0046e338 <+52>:tbz w9, #4, 0x0046e358 
<_rm_rlock+84>
   0x0046e33c <+56>:ldr x9, [x18]
   0x0046e340 <+60>:ldr w10, [x9, #888]
   0x0046e344 <+64>:add w10, w10, #0x1
   0x0046e348 <+68>:str w10, [x9, #888]
   0x0046e34c <+72>:ldr x9, [x18]
   0x0046e350 <+76>:ldr w9, [x9, #888]
   0x0046e354 <+80>:cbz w9, 0x0046e3f4 <_rm_rlock+240>
   0x0046e358 <+84>:ldr w9, [x8, #1212]
   0x0046e35c <+88>:add x10, x18, #0x90
   0x0046e360 <+92>:add w9, w9, #0x1
   0x0046e364 <+96>:str w9, [x8, #1212]  <--- critical_enter
   0x0046e368 <+100>:   str x10, [x1, #8]
   0x0046e36c <+104>:   ldr x9, [x18, #144]
   0x0046e370 <+108>:   str x9, [x1]
   0x0046e374 <+112>:   str x1, [x9, #8]
   0x0046e378 <+116>:   str x1, [x18, #144]
   0x0046e37c <+120>:   ldr x9, [x18]
   0x0046e380 <+124>:   ldr w10, [x9, #356]
   0x0046e384 <+128>:   add w10, w10, #0x1
   0x0046e388 <+132>:   str w10, [x9, #356]
   0x0046e38c <+136>:   ldr w9, [x8, #1212]
   0x0046e390 <+140>:   sub w9, w9, #0x1
   0x0046e394 <+144>:   str w9, [x8, #1212]  <--- critical_exit
   0x0046e398 <+148>:   ldrbw8, [x8, #304]
   0x0046e39c <+152>:   ldr w9, [x18, #60]   <--- loading 
&pc->pc_cpuid
   ...

A (the?) problem is that the compiler is treating "pc" as an alias
for x18, but the rmlock code assumes that the pcpu pointer is loaded
once, as it dereferences "pc" outside of the critical section.  On
arm64, if a context switch occurs between the store at _rm_rlock+144 and
the load at +152, and the thread is migrated to another CPU, then we'll
end up using the wrong CPU ID in the rm-

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard wrote:
> 
> 
> On 2022-Mar-7, at 08:45, Mark Johnston  wrote:
> 
> > On Mon, Mar 07, 2022 at 04:25:22PM +, Andrew Turner wrote:
> >> 
> >>> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> >>> ...
> >>> A (the?) problem is that the compiler is treating "pc" as an alias
> >>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> >>> once, as it dereferences "pc" outside of the critical section.  On
> >>> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> >>> the load at +152, and the thread is migrated to another CPU, then we'll
> >>> end up using the wrong CPU ID in the rm->rm_writecpus test.
> >>> 
> >>> I suspect the problem is unique to arm64 as its get_pcpu()
> >>> implementation is different from the others in that it doesn't use
> >>> volatile-qualified inline assembly.  This has been the case since
> >>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> >>>  
> >>> 
> >>> .
> >>> 
> >>> I haven't been able to reproduce any crashes running poudriere in an
> >>> arm64 AWS instance, though.  Could you please try the patch below and
> >>> confirm whether it fixes your panics?  I verified that the apparent
> >>> problem described above is gone with the patch.
> >> 
> >> Alternatively (or additionally) we could do something like the following. 
> >> There are only a few MI users of get_pcpu with the main place being in rm 
> >> locks.
> >> 
> >> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> >> index 09f6361c651c..59b890e5c2ea 100644
> >> --- a/sys/arm64/include/pcpu.h
> >> +++ b/sys/arm64/include/pcpu.h
> >> @@ -58,7 +58,14 @@ struct pcpu;
> >> 
> >> register struct pcpu *pcpup __asm ("x18");
> >> 
> >> -#defineget_pcpu()  pcpup
> >> +static inline struct pcpu *
> >> +get_pcpu(void)
> >> +{
> >> +   struct pcpu *pcpu;
> >> +
> >> +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
> >> +   return (pcpu);
> >> +}
> >> 
> >> static inline struct thread *
> >> get_curthread(void)
> > 
> > Indeed, I think this is probably the best solution.

Thinking a bit more, even with that patch, code like this may not behave
the same on arm64 as on other platforms:

critical_enter();
ptr = &PCPU_GET(foo);
critical_exit();
bar = *ptr;

since as far as I can see the compiler may translate it to

critical_enter();
critical_exit();
bar = PCPU_GET(foo);

> Is this just partially reverting:
> 
> https://cgit.freebsd.org/src/commit/?id=63c858a04d56
> 
> If so, there might need to be comments about why the updated
> code is as it will be.
> 
> Looks like stable/13 picked up sensitivity to the get_pcpu
> details in rmlock in:
> 
> https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5
> 
> (a 2022-03-04 commit) and stable/13 also has the get_pcpu
> misdefinition in:
> 
> https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56
> 
> . So an MFC would be appropriate in order for aarch64
> to be reliable for any variations in get_pcpu in stable/13
> (and for 13.1 to be so as well).

I reverted the rmlock commit in stable/13 already.  Either get_pcpu()
will be fixed shortly or 13.1 will ship without the rmlock commit.



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Millard



On 2022-Mar-7, at 08:45, Mark Johnston  wrote:

> On Mon, Mar 07, 2022 at 04:25:22PM +, Andrew Turner wrote:
>> 
>>> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
>>> ...
>>> A (the?) problem is that the compiler is treating "pc" as an alias
>>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
>>> once, as it dereferences "pc" outside of the critical section.  On
>>> arm64, if a context switch occurs between the store at _rm_rlock+144 and
>>> the load at +152, and the thread is migrated to another CPU, then we'll
>>> end up using the wrong CPU ID in the rm->rm_writecpus test.
>>> 
>>> I suspect the problem is unique to arm64 as its get_pcpu()
>>> implementation is different from the others in that it doesn't use
>>> volatile-qualified inline assembly.  This has been the case since
>>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
>>>  
>>> 
>>> .
>>> 
>>> I haven't been able to reproduce any crashes running poudriere in an
>>> arm64 AWS instance, though.  Could you please try the patch below and
>>> confirm whether it fixes your panics?  I verified that the apparent
>>> problem described above is gone with the patch.
>> 
>> Alternatively (or additionally) we could do something like the following. 
>> There are only a few MI users of get_pcpu with the main place being in rm 
>> locks.
>> 
>> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
>> index 09f6361c651c..59b890e5c2ea 100644
>> --- a/sys/arm64/include/pcpu.h
>> +++ b/sys/arm64/include/pcpu.h
>> @@ -58,7 +58,14 @@ struct pcpu;
>> 
>> register struct pcpu *pcpup __asm ("x18");
>> 
>> -#defineget_pcpu()  pcpup
>> +static inline struct pcpu *
>> +get_pcpu(void)
>> +{
>> +   struct pcpu *pcpu;
>> +
>> +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
>> +   return (pcpu);
>> +}
>> 
>> static inline struct thread *
>> get_curthread(void)
> 
> Indeed, I think this is probably the best solution.

Is this just partially reverting:

https://cgit.freebsd.org/src/commit/?id=63c858a04d56

If so, there might need to be comments about why the updated
code is as it will be.

Looks like stable/13 picked up sensitivity to the get_pcpu
details in rmlock in:

https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5

(a 2022-03-04 commit) and stable/13 also has the get_pcpu
misdefinition in:

https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56

. So an MFC would be appropriate in order for aarch64
to be reliable for any variations in get_pcpu in stable/13
(and for 13.1 to be so as well).

===
Mark Millard
marklmi at yahoo.com




Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 04:25:22PM +, Andrew Turner wrote:
> 
> > On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> > ...
> > A (the?) problem is that the compiler is treating "pc" as an alias
> > for x18, but the rmlock code assumes that the pcpu pointer is loaded
> > once, as it dereferences "pc" outside of the critical section.  On
> > arm64, if a context switch occurs between the store at _rm_rlock+144 and
> > the load at +152, and the thread is migrated to another CPU, then we'll
> > end up using the wrong CPU ID in the rm->rm_writecpus test.
> > 
> > I suspect the problem is unique to arm64 as its get_pcpu()
> > implementation is different from the others in that it doesn't use
> > volatile-qualified inline assembly.  This has been the case since
> > https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> >  
> > 
> > .
> > 
> > I haven't been able to reproduce any crashes running poudriere in an
> > arm64 AWS instance, though.  Could you please try the patch below and
> > confirm whether it fixes your panics?  I verified that the apparent
> > problem described above is gone with the patch.
> 
> Alternatively (or additionally) we could do something like the following. 
> There are only a few MI users of get_pcpu with the main place being in rm 
> locks.
> 
> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> index 09f6361c651c..59b890e5c2ea 100644
> --- a/sys/arm64/include/pcpu.h
> +++ b/sys/arm64/include/pcpu.h
> @@ -58,7 +58,14 @@ struct pcpu;
> 
>  register struct pcpu *pcpup __asm ("x18");
> 
> -#defineget_pcpu()  pcpup
> +static inline struct pcpu *
> +get_pcpu(void)
> +{
> +   struct pcpu *pcpu;
> +
> +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
> +   return (pcpu);
> +}
> 
>  static inline struct thread *
>  get_curthread(void)

Indeed, I think this is probably the best solution.



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Andrew Turner

> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> ...
> A (the?) problem is that the compiler is treating "pc" as an alias
> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> once, as it dereferences "pc" outside of the critical section.  On
> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> the load at +152, and the thread is migrated to another CPU, then we'll
> end up using the wrong CPU ID in the rm->rm_writecpus test.
> 
> I suspect the problem is unique to arm64 as its get_pcpu()
> implementation is different from the others in that it doesn't use
> volatile-qualified inline assembly.  This has been the case since
> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
>  
> 
> .
> 
> I haven't been able to reproduce any crashes running poudriere in an
> arm64 AWS instance, though.  Could you please try the patch below and
> confirm whether it fixes your panics?  I verified that the apparent
> problem described above is gone with the patch.

Alternatively (or additionally) we could do something like the following. There 
are only a few MI users of get_pcpu with the main place being in rm locks.

diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
index 09f6361c651c..59b890e5c2ea 100644
--- a/sys/arm64/include/pcpu.h
+++ b/sys/arm64/include/pcpu.h
@@ -58,7 +58,14 @@ struct pcpu;

 register struct pcpu *pcpup __asm ("x18");

-#defineget_pcpu()  pcpup
+static inline struct pcpu *
+get_pcpu(void)
+{
+   struct pcpu *pcpu;
+
+   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
+   return (pcpu);
+}

 static inline struct thread *
 get_curthread(void)

Andrew



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:
> Dear Mark Johnston,
> 
> I did some binary search in the kernels and came to the conclusion that 
> https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
>  still works and 
> https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
>  panics.
> 
> I suspect your commit in 
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
> 
> Last panic:
> 
> panic: vm_fault failed: 0046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x9604
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f902011f
> db>
> 
> A more recent kernel (912df91) still panics. See below.
> 
> Do you have time to look into this? What can I provide in information to help?

Hmm.  So after my rmlock commits, we have the following disassembly for
_rm_rlock() (with a few annotations added by me).  Note that the pcpu
pointer is stored in register x18 by convention.

   0x0046e304 <+0>: stp x29, x30, [sp, #-16]!
   0x0046e308 <+4>: mov x29, sp
   0x0046e30c <+8>: ldr x8, [x18]
   0x0046e310 <+12>:ldr x9, [x18]
   0x0046e314 <+16>:ldr x10, [x18]
   0x0046e318 <+20>:cmp x9, x10
   0x0046e31c <+24>:b.ne0x0046e3cc <_rm_rlock+200>  // 
b.any
   0x0046e320 <+28>:ldr x9, [x18]
   0x0046e324 <+32>:ldrhw9, [x9, #314]
   0x0046e328 <+36>:cbnzw9, 0x0046e3c0 <_rm_rlock+188>
   0x0046e32c <+40>:str wzr, [x1, #32]
   0x0046e330 <+44>:stp x0, x8, [x1, #16]
   0x0046e334 <+48>:ldrbw9, [x0, #10]
   0x0046e338 <+52>:tbz w9, #4, 0x0046e358 
<_rm_rlock+84>
   0x0046e33c <+56>:ldr x9, [x18]
   0x0046e340 <+60>:ldr w10, [x9, #888]
   0x0046e344 <+64>:add w10, w10, #0x1
   0x0046e348 <+68>:str w10, [x9, #888]
   0x0046e34c <+72>:ldr x9, [x18]
   0x0046e350 <+76>:ldr w9, [x9, #888]
   0x0046e354 <+80>:cbz w9, 0x0046e3f4 <_rm_rlock+240>
   0x0046e358 <+84>:ldr w9, [x8, #1212]
   0x0046e35c <+88>:add x10, x18, #0x90
   0x0046e360 <+92>:add w9, w9, #0x1
   0x0046e364 <+96>:str w9, [x8, #1212]  <--- critical_enter
   0x0046e368 <+100>:   str x10, [x1, #8]
   0x0046e36c <+104>:   ldr x9, [x18, #144]
   0x0046e370 <+108>:   str x9, [x1]
   0x0046e374 <+112>:   str x1, [x9, #8]
   0x0046e378 <+116>:   str x1, [x18, #144]
   0x0046e37c <+120>:   ldr x9, [x18]
   0x0046e380 <+124>:   ldr w10, [x9, #356]
   0x0046e384 <+128>:   add w10, w10, #0x1
   0x0046e388 <+132>:   str w10, [x9, #356]
   0x0046e38c <+136>:   ldr w9, [x8, #1212]
   0x0046e390 <+140>:   sub w9, w9, #0x1
   0x0046e394 <+144>:   str w9, [x8, #1212]  <--- critical_exit
   0x0046e398 <+148>:   ldrbw8, [x8, #304]
   0x0046e39c <+152>:   ldr w9, [x18, #60]   <--- loading 
&pc->pc_cpuid
   ...

A (the?) problem is that the compiler is treating "pc" as an alias
for x18, but the rmlock code assumes that the pcpu pointer is loaded
once, as it dereferences "pc" outside of the critical section.  On
arm64, if a context switch occurs between the store at _rm_rlock+144 and
the load at +152, and the thread is migrated to another CPU, then we'll
end up using the wrong CPU ID in the rm->rm_writecpus test.

I suspect the problem is unique to arm64 as its get_pcpu()
implementation is different from the others in that it doesn't use
volatile-qualified inline assembly.  This has been the case since
https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
.

I haven't been able to reproduce any crashes running poudriere in an
arm64 AWS instance, though.  Could you please try the patch below and
confirm whether it fixes your panics?  I verified that the apparent
problem described above is gone with the patch.

diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
index 0cdcfb8fec62..e51c25136ae0 100644
--- a/sys/kern/kern_rmlock.c
+++ b/sys/kern/kern_rml

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Millard


On 2022-Mar-7, at 05:46, Ronald Klop  wrote:

> Dear Mark Johnston,
> 
> I did some binary search in the kernels and came to the conclusion that 
> https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
>  still works and 
> https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
>  panics.
> 
> I suspect your commit in 
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
> 
> Last panic:
> 
> panic: vm_fault failed: 0046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x9604
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f902011f
> db>

Was this a WITNESS/DEBUG kernel? Non-WITNESS? Non-debug?

Which aarch64 variant? Bob's was Cortex-A53 (RPi3).

> A more recent kernel (912df91) still panics. See below.
> 
> Do you have time to look into this? What can I provide in information to help?
> 
> Regards,
> Ronald.
> 
> Van: Ronald Klop 
> Datum: maandag, 7 maart 2022 11:38
> Aan: Mark Millard 
> CC: bob prohaska , freebsd-current 
> , freebsd-...@freebsd.org
> Onderwerp: Re: panic: data abort in critical section or under mutex (was: Re: 
> panic: Unknown kernel exception 0 esr_el1 200 (on 14-CURRENT/aarch64 Feb 
> 28))
> 
> Yes, I spoke to soon too. Often it crashes as soon as I start a parallel 
> poudriere build. But this time it went very far. As soon as nightly backups 
> kicked in it was game over again.
> I had read the mail of Bob on the arm@ ML. But I wanted to let the conclusion 
> that it is about the same problem to the developers. (Have seen enough of 
> wrong guessing of causes in my work. )
> 
> I will need to go further into the binary search of working kernels.
> 
> This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022
> Fatal data abort: 
>   
>   x0: 00f1efd8  x0: 00f1efd8 (mac_policy_rm + 0) 
> (mac_policy_rm + 0)   
>   
>   
>   x1:2  x1:2  
>  
>   
>   
>   x2: 0087dcf2  x2: 0087dcf2 (cam_status_table + 2f28a)   
>   
>  (cam_status_table + 2f28a)  x3: 0087dcf2 
>   
>   x3: 0087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)  
>  
>   
>   
>   x4:  102  x4:  102  
>  
>   
>   
>   x5:7  x5:1  
>  
>   
>   
>   x6:0  x6:   ff  
>  
>   
>   
>   x7:0  x7: a00011fc2800  
>  
>   x8:1   

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Ronald Klop

Dear Mark Johnston,

I did some binary search in the kernels and came to the conclusion that 
https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
 still works and 
https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
 panics.

I suspect your commit in 
https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.

Last panic:

panic: vm_fault failed: 0046e708 error 1
cpuid = 1
time = 1646660058
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2e8
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x9604
_rm_rlock_debug() at _rm_rlock_debug+0x8c
osd_get() at osd_get+0x5c
zio_execute() at zio_execute+0xf8
taskqueue_run_locked() at taskqueue_run_locked+0x178
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 100129 ]
Stopped at  kdb_enter+0x44: undefined   f902011f
db>

A more recent kernel (912df91) still panics. See below.

Do you have time to look into this? What can I provide in information to help?

Regards,
Ronald.


Van: Ronald Klop 
Datum: maandag, 7 maart 2022 11:38
Aan: Mark Millard 
CC: bob prohaska , freebsd-current 
, freebsd-...@freebsd.org
Onderwerp: Re: panic: data abort in critical section or under mutex (was: Re: 
panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 
28))


Yes, I spoke to soon too. Often it crashes as soon as I start a parallel 
poudriere build. But this time it went very far. As soon as nightly backups 
kicked in it was game over again.
I had read the mail of Bob on the arm@ ML. But I wanted to let the conclusion 
that it is about the same problem to the developers. (Have seen enough of wrong 
guessing of causes in my work. )

I will need to go further into the binary search of working kernels.

This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022
Fatal data abort:   
  x0: 00f1efd8  x0: 00f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)

  x1:2  x1:2

  x2: 0087dcf2  x2: 0087dcf2 (cam_status_table + 2f28a) 
 (cam_status_table + 2f28a)  x3: 0087dcf2   
  x3: 0087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)

  x4:  102  x4:  102

  x5:7  x5:1

  x6:0  x6:   ff

  x7:0  x7: a00011fc2800
  x8:1  

  x8:1  x9: 00f37c10
  x9: 419d9090 (pcpu0 + 90) (g_ctx + 40278fe4)  

 x10: a0017be2a600 x10: a10fa

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Ronald Klop
)
   
x17: 0044a998 x17: 0058ff30 (free + 0) (if_inc_counter + 0)
   
x18: b49a23c0 x18: 000103f11b80 (g_ctx + b3242314) 
(next_index + 3a228c0) x19:  102   

  
x19:  102 x20: b49a2428
x20: 000103f11be8 (g_ctx + b324237c) (next_index + 3a22928)


x21: 0087dcf2 x21: 0087dcf2 (cam_status_table + 2f28a) 
(cam_status_table + 2f28a)

x22: 00f1efd8 x22: 00f1efd8 (mac_policy_rm + 0) (mac_policy_rm 
+ 0)

x23: 0086f107 x23:0 (cam_status_table + 2069f)

x24: a0001fbde6c8 x24: a0008cba0d00
x25:0

x25: 0088aa11 x26:4 (do_execve.fexecv_proc_title + 76b7)

x27:0 x26: a0017be2a600
x28: 00010209fcf0
x27: a00025626a80 (next_index + 1bb0a30)

x28: 000103f11ce0 x29: b49a23e0 (next_index + 3a22a20) (g_ctx + 
b3242334)

x29: 000103f11ba0  sp: b49a23c0
(next_index + 3a228e0)  lr: 0046ef98
 sp: 000103f11b80
(_rm_runlock_debug + 60)  lr: 0046ef98
elr: 0046dc0c (_rm_runlock_debug + 60) (_rm_assert + a4)

elr: 0046dc0cspsr:   45
(_rm_assert + a4) far:   10

esr: 9604
spsr:   45

panic: data abort in critical section or under mutex
cpuid = 1
time = 1646609483
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2d4
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x9604
_rm_assert() at _rm_assert+0xa4
_rm_runlock_debug() at _rm_runlock_debug+0x5c
mac_inpcb_check_deliver() at mac_inpcb_check_deliver+0x74
tcp_input_with_port() at tcp_input_with_port+0xab4
tcp_input() at tcp_input+0xc
ip_input() at ip_input+0x2e8
netisr_dispatch_src() at netisr_dispatch_src+0xe4
ether_demux() at ether_demux+0x178
ether_nh_input() at ether_nh_input+0x3e8
netisr_dispatch_src() at netisr_dispatch_src+0xe4
ether_input() at ether_input+0x80
if_input() at if_input+0xc
gen_intr() at gen_intr+0x444
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100063 ]
Stopped at  kdb_enter+0x44: undefined   f902011f
db>

NB: db> reboot/reset/halt does not work on my RPI4. Luckily I have a wifi 
connected power switch on it.

Regards,
Ronald.


Van: Mark Millard 
Datum: maandag, 7 maart 2022 02:01
Aan: Ronald Klop 
CC: freebsd-current , bob prohaska 

Onderwerp: Re: panic: data abort in critical section or under mutex (was: Re: 
panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 
28))


From: Ronald Klop  wrote on
Date: Sun, 6 Mar 2022 23:22:42 +0100 (CET) :

> Did some binary search with kernels from artifact.ci.freebsd.org.
>
> I suspect "rmlock: Micro-optimize read locking" as cause.
>
> 
https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a
>
>
> And "rmlock: Add required compiler barriers to _rm_runlock()" as solution.
>
> 
https://cgit.freebsd.org/src/commit/?id=89ae8eb74e87ac19aa2d7abe4ba16bcccd32bb9f
>
>
> So I probably just had a bad day.

Well, there is a report of a buildkernel crash after that pair:

https://lists.freebsd.org/archives/freebsd-arm/2022-March/001078.html

that references additional information at:

http://www.zefox.net/~fbsd/rpi3/crashes/20220304/readme

and reported:

QUOTE
The console connection dropped before the crash (unrelated) I didn't
get the preamble, all  I have is the backtrace and buildkernel log.
Here's the backtrace:
db> bt
Tracing pid 14795 tid 100098 td 0xa00017815600
db_trace_self() at db_trace_self
db_stack_trace() at db_stack_trace+0x11c
db_command() at db_command+0x368
db_command_loop() at db_command_loop+0x54
db_trap() at db_trap+0xf8
kdb_trap() at kdb_trap+0x1cc
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0xf200
kdb_enter() at kdb_enter+0x44
vpanic() at vpanic+0x1b0
panic() at panic+0x44
data_abort() at data_abort+0x2e8
handle_el1h_sync() at handle_el1h_syn

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-06 Thread Mark Millard
From: Ronald Klop  wrote on
Date: Sun, 6 Mar 2022 23:22:42 +0100 (CET) :

> Did some binary search with kernels from artifact.ci.freebsd.org.
> 
> I suspect "rmlock: Micro-optimize read locking" as cause.
> 
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a
> 
> 
> And "rmlock: Add required compiler barriers to _rm_runlock()" as solution.
> 
> https://cgit.freebsd.org/src/commit/?id=89ae8eb74e87ac19aa2d7abe4ba16bcccd32bb9f
> 
> 
> So I probably just had a bad day.

Well, there is a report of a buildkernel crash after that pair:

https://lists.freebsd.org/archives/freebsd-arm/2022-March/001078.html

that references additional information at:

http://www.zefox.net/~fbsd/rpi3/crashes/20220304/readme

and reported:

QUOTE
The console connection dropped before the crash (unrelated) I didn't
get the preamble, all  I have is the backtrace and buildkernel log. 
Here's the backtrace:
db> bt
Tracing pid 14795 tid 100098 td 0xa00017815600
db_trace_self() at db_trace_self
db_stack_trace() at db_stack_trace+0x11c
db_command() at db_command+0x368
db_command_loop() at db_command_loop+0x54
db_trap() at db_trap+0xf8
kdb_trap() at kdb_trap+0x1cc
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0xf200
kdb_enter() at kdb_enter+0x44
vpanic() at vpanic+0x1b0
panic() at panic+0x44
data_abort() at data_abort+0x2e8
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x9604
_rm_rlock_debug() at _rm_rlock_debug+0x8c
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x140
sysctl_root() at sysctl_root+0x1ac
userland_sysctl() at userland_sysctl+0x140
sys___sysctl() at sys___sysctl+0x68
do_el0_sync() at do_el0_sync+0x520
handle_el0_sync() at handle_el0_sync+0x40
--- exception, esr 0x5600
END QUOTE

The above material does reference _rm_rlock_debug . Might be
related?

The readme reports:

main-n253603-0b25cbc79d3: Thu Mar  3 22:48:31 PST 2022

for the system doing the buildkernel. This is after
89ae8eb74e8 .

(It also mentions another panic earlier in the week,
apparently not reported to the lists at the time.)

===
Mark Millard
marklmi at yahoo.com




Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-06 Thread Ronald Klop

Hi,

Did some binary search with kernels from artifact.ci.freebsd.org.

I suspect "rmlock: Micro-optimize read locking" as cause.
https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a

And "rmlock: Add required compiler barriers to _rm_runlock()" as solution.
https://cgit.freebsd.org/src/commit/?id=89ae8eb74e87ac19aa2d7abe4ba16bcccd32bb9f

So I probably just had a bad day.

Regards,
Ronald.


Van: Ronald Klop 
Datum: zaterdag, 5 maart 2022 16:09
Aan: FreeBSD Current 
Onderwerp: panic: data abort in critical section or under mutex (was: Re: 
panic: Unknown kernel exception 0 esr_el1 200 (on 14-CURRENT/aarch64 Feb 
28))


Hi,

Another panic while building world/kernel. Different panic message and trace.
 
  x0: 1f5e152c32cc   
  x1: b630a000 (g_ctx + b4c4a254)   
  x2:1  
  x3:   2e  
  x4: a000bb46d600  
  x5:0  
  x6:0  x7: 00f05104 (has_pan + 0)

  x8:1
  x9: 809c227c
 x10:   bd
 x11:   40
 x12:0
 x13:1
 x14: 1782f000
 x15: 1001
 x16: 1782f003
 x17: 1f5e957392f0
 x18: 00010719e630 (next_index + 2cac528)
 x19: 00010719e768 (next_index + 2cac660)
 x20:1
 x21: b630a000 (g_ctx + b4c4a254)
 x22:1
 x23: ffbf
 x24: 00010719e758 (next_index + 2cac650)
 x25: a00026cdd160
 x26:1
 x27: a000bb46d600
 x28: 0092815a (do_execve.fexecv_proc_title + 5483)
 x29: 00010719e630 (next_index + 2cac528)
  sp: 00010719e630
  lr: 0053e890 (uiomove_faultflag + 128)
 elr: 00804f80 (byte_by_byte + 4)
spsr:   45
 far: b630a000 (g_ctx + b4c4a254)
 esr: 9647
panic: data abort in critical section or under mutex
cpuid = 2
time = 1646489189
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2a8
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x9647
byte_by_byte() at byte_by_byte+0x4
pipe_write() at pipe_write+0x668
KDB: enter: panic
[ thread pid 68336 tid 100593 ]
Stopped at  kdb_enter+0x44: undefined   f901c11f
db>



Regards,
Ronald.


 
Van: Ronald Klop 

Datum: zaterdag, 5 maart 2022 12:16
Aan: FreeBSD Current 
Onderwerp: panic: Unknown kernel exception 0 esr_el1 200 (on 
14-CURRENT/aarch64 Feb 28)


Hi,

Repeated panics on 14-CURRENT/aarch64. This happens e.g. when the nigthly 
backup is started.
# uname -a
FreeBSD rpi4 14.0-CURRENT FreeBSD 14.0-CURRENT #22 main-5f702d6d9a: Mon Feb 28 
06:12:48 CET 2022 
ronald@rpi4:/home/ronald/dev/obj/home/ronald/dev/freebsd/arm64.aarch64/sys/GENERIC-NODEBUG
 arm64

It was stable with all kernels until (and including) "FreeBSD 14.0-CURRENT #21 
main-e11ad014d1-dirty: Sat Feb  5 00:09:08 CET 2022".

It runs ZFS-on-root via an USB disk. No other FS involved.
# gpart show
=>40  1953458096  da0  GPT  (931G)
  40  1024001  efi  (50M)
  102440 83886082  freebsd-swap  (4.0G)
 8491048  19449670883  freebsd-zfs  (927G)


Output on serial console:
x0: a59c1380   
  x1: a59b1600  
  x2:3  
  x3: a001862779a0  
  x4:0
  x5:9438238792a1a

  x6:d217e9df58308
  x7:   14
  x8: a59c1398
  x9:1
 x10: a59b1600
 x11:2
 x12:1
 x13: f2557a42c5b0f240
 x14: 1013e6b85a8ecbe4
 x15: 24f981889f30
 x16: 4afedeb89cb8
 x17: fff2
 x18: fe666800 (g_ctx + fcfa6a54)
 x19:0
 x20: fec41000 (g_ctx + fd581254)
 x21:3
 x22: 419bb090 (g_ctx + 402fb2e4)
 x23: 00c09bb7 (lockstat_enabled + 0)
 x24:  18