Re: panic: kernel diagnostic assertion "l == curlwp"
> On Jun 6, 2020, at 10:19 AM, Jared McNeill wrote:
>
> On Sat, 6 Jun 2020, Jason Thorpe wrote:
>
>>
>>> On Jun 6, 2020, at 9:21 AM, Jared McNeill wrote:
>>>
>>> KASSERT-aside, I am left wondering based on that stack trace how copyout on
>>> aarch64 can fail here at all (that seems to be the only way that lwp_exit
>>> from sys__lwp_create can happen).
>>
>> What were you doing at the time? I guess obviously running a program that
>> uses pthreads...
>
> pkgsrc bulk build, so ¯\_(ツ)_/¯

Alright, well, I checked in a unit test for it, as well as a fix. You'll have to figure out why the copyout() failed, though.

-- thorpej
Re: panic: kernel diagnostic assertion "l == curlwp"
On Sat, 6 Jun 2020, Jason Thorpe wrote:

>> On Jun 6, 2020, at 9:21 AM, Jared McNeill wrote:
>>
>> KASSERT-aside, I am left wondering based on that stack trace how copyout on
>> aarch64 can fail here at all (that seems to be the only way that lwp_exit
>> from sys__lwp_create can happen).
>
> What were you doing at the time? I guess obviously running a program that
> uses pthreads...

pkgsrc bulk build, so ¯\_(ツ)_/¯
Re: panic: kernel diagnostic assertion "l == curlwp"
> On Jun 6, 2020, at 10:19 AM, Jared McNeill wrote:
>
> On Sat, 6 Jun 2020, Jason Thorpe wrote:
>
>>
>>> On Jun 6, 2020, at 9:21 AM, Jared McNeill wrote:
>>>
>>> KASSERT-aside, I am left wondering based on that stack trace how copyout on
>>> aarch64 can fail here at all (that seems to be the only way that lwp_exit
>>> from sys__lwp_create can happen).
>>
>> What were you doing at the time? I guess obviously running a program that
>> uses pthreads...
>
> pkgsrc bulk build, so ¯\_(ツ)_/¯

Well, at the very least we should make a unit test for _lwp_create(2) that exercises the error path :-)

-- thorpej
Re: panic: kernel diagnostic assertion "l == curlwp"
> On Jun 6, 2020, at 9:21 AM, Jared McNeill wrote:
>
> KASSERT-aside, I am left wondering based on that stack trace how copyout on
> aarch64 can fail here at all (that seems to be the only way that lwp_exit
> from sys__lwp_create can happen).

What were you doing at the time? I guess obviously running a program that uses pthreads...

-- thorpej
Re: panic: kernel diagnostic assertion "l == curlwp"
KASSERT-aside, I am left wondering based on that stack trace how copyout on aarch64 can fail here at all (that seems to be the only way that lwp_exit from sys__lwp_create can happen).

On Sat, 6 Jun 2020, Jason Thorpe wrote:

> +rmind@ +ad@
>
>> On Jun 6, 2020, at 5:00 AM, Jared McNeill wrote:
>>
>> Looks like I hit another one on this path with your latest fix in place:
>>
>> [ 3737.4034537] panic: kernel diagnostic assertion "l == curlwp || ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)" failed: file "/home/source/ab/HEAD/src/sys/kern/subr_pcu.c", line 133
>
> This assertion was added in rev 1.18 by Mindaugas, 6 years ago. It obviously
> needs to be adjusted to handle the "lwp creation failed", but I'd have to
> study that code a lot more before I feel comfortable doing so.
>
>> [ 3737.4134523] cpu18: Begin traceback...
>> [ 3737.4234525] trace fp c0086cad0c50
>> [ 3737.4234525] fp c0086cad0c70 vpanic() at c04b230c netbsd:vpanic+0x15c
>> [ 3737.4334526] fp c0086cad0ce0 kern_assert() at c07d052c netbsd:kern_assert+0x5c
>> [ 3737.4434535] fp c0086cad0d70 pcu_discard_all() at c04a9a58 netbsd:pcu_discard_all+0x58
>> [ 3737.4534583] fp c0086cad0d90 lwp_exit() at c0461558 netbsd:lwp_exit+0x1b0
>> [ 3737.4534583] fp c0086cad0dd0 sys__lwp_create() at c04c4ed8 netbsd:sys__lwp_create+0xe8
>> [ 3737.4634613] fp c0086cad0e20 syscall() at c008a624 netbsd:syscall+0x18c
>> [ 3737.4734599] tf c0086cad0ed0 el0_trap() at c0088c34 netbsd:el0_trap
>
> -- thorpej
Re: panic: kernel diagnostic assertion "l == curlwp"
+rmind@ +ad@

> On Jun 6, 2020, at 5:00 AM, Jared McNeill wrote:
>
> Looks like I hit another one on this path with your latest fix in place:
>
> [ 3737.4034537] panic: kernel diagnostic assertion "l == curlwp || ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)" failed: file "/home/source/ab/HEAD/src/sys/kern/subr_pcu.c", line 133

This assertion was added in rev 1.18 by Mindaugas, 6 years ago. It obviously needs to be adjusted to handle the "lwp creation failed", but I'd have to study that code a lot more before I feel comfortable doing so.

> [ 3737.4134523] cpu18: Begin traceback...
> [ 3737.4234525] trace fp c0086cad0c50
> [ 3737.4234525] fp c0086cad0c70 vpanic() at c04b230c netbsd:vpanic+0x15c
> [ 3737.4334526] fp c0086cad0ce0 kern_assert() at c07d052c netbsd:kern_assert+0x5c
> [ 3737.4434535] fp c0086cad0d70 pcu_discard_all() at c04a9a58 netbsd:pcu_discard_all+0x58
> [ 3737.4534583] fp c0086cad0d90 lwp_exit() at c0461558 netbsd:lwp_exit+0x1b0
> [ 3737.4534583] fp c0086cad0dd0 sys__lwp_create() at c04c4ed8 netbsd:sys__lwp_create+0xe8
> [ 3737.4634613] fp c0086cad0e20 syscall() at c008a624 netbsd:syscall+0x18c
> [ 3737.4734599] tf c0086cad0ed0 el0_trap() at c0088c34 netbsd:el0_trap

-- thorpej
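
To make the failing condition concrete, here is a small userspace sketch of the predicate quoted in the panic message. It is only a model: struct model_lwp, MODEL_LW_SYSTEM and pcu_assert_holds() are hypothetical names made up for illustration, the flag value is not the kernel's, and pcu_valid is assumed to be a plain bitmask. It just shows that a freshly created, non-system LWP that is not curlwp cannot satisfy the expression.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value for illustration; not the kernel's LW_SYSTEM. */
#define MODEL_LW_SYSTEM 0x1u

/* Hypothetical stand-in for struct lwp; only the field the predicate reads. */
struct model_lwp {
	uint32_t l_flag;
};

/*
 * Mirrors the asserted expression from the panic message:
 *   l == curlwp || ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)
 * Returns true when the assertion would hold.
 */
static bool
pcu_assert_holds(const struct model_lwp *l, const struct model_lwp *cur,
    uint32_t pcu_valid)
{
	return l == cur ||
	    ((l->l_flag & MODEL_LW_SYSTEM) != 0 && pcu_valid == 0);
}
```

With a user LWP l2 (l_flag == 0) and a different curlwp, pcu_assert_holds(&l2, &cur, 0) is false, which is the situation in the trace above; it is true when l is curlwp, or when l is a system LWP with no valid PCU state.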
Re: panic: kernel diagnostic assertion "l == curlwp"
Looks like I hit another one on this path with your latest fix in place:

[ 3737.4034537] panic: kernel diagnostic assertion "l == curlwp || ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)" failed: file "/home/source/ab/HEAD/src/sys/kern/subr_pcu.c", line 133
[ 3737.4134523] cpu18: Begin traceback...
[ 3737.4234525] trace fp c0086cad0c50
[ 3737.4234525] fp c0086cad0c70 vpanic() at c04b230c netbsd:vpanic+0x15c
[ 3737.4334526] fp c0086cad0ce0 kern_assert() at c07d052c netbsd:kern_assert+0x5c
[ 3737.4434535] fp c0086cad0d70 pcu_discard_all() at c04a9a58 netbsd:pcu_discard_all+0x58
[ 3737.4534583] fp c0086cad0d90 lwp_exit() at c0461558 netbsd:lwp_exit+0x1b0
[ 3737.4534583] fp c0086cad0dd0 sys__lwp_create() at c04c4ed8 netbsd:sys__lwp_create+0xe8
[ 3737.4634613] fp c0086cad0e20 syscall() at c008a624 netbsd:syscall+0x18c
[ 3737.4734599] tf c0086cad0ed0 el0_trap() at c0088c34 netbsd:el0_trap
[ 3737.4834609] trapframe 0xc0086cad0ed0 (304 bytes)
[ 3737.4834609] pc=fdbd319c8958, spsr=2000
[ 3737.4934603] esr=56000135, far=fdbd319eb030
[ 3737.4934603] x0=ffbc41c0, x1=fdbd31db2260
[ 3737.5034629] x2=fdbd31d8e200, x3=
[ 3737.5134632] x4=ffbc4190, x5=0030
[ 3737.5134632] x6=ffbc41c0, x7=fdbd31af0208
[ 3737.5234664] x8=0001, x9=1003
[ 3737.5234664] x10=000c, x11=0001
[ 3737.5334641] x12=fdbd31da5c18, x13=
[ 3737.5334641] x14=, x15=fdbd31400980
[ 3737.5434637] x16=fdbd31da22e0, x17=fdbd319c8954
[ 3737.5534632] x18=0001, x19=ffbc41c0
[ 3737.5534632] x20=ec172e60, x21=0002
[ 3737.5634733] x22=fdbd31db2260, x23=fdbd31d8e200
[ 3737.5634733] x24=ec172000, x25=ffbc4190
[ 3737.5734687] x26=ec1732a8, x27=ec173070
[ 3737.5734687] x28=fdbd31db3400, fp=x29=ffbc41d0
[ 3737.5834824] lr=x30=fdbd31d8e224, sp=ffbc40e0
[ 3737.5934691]
[ 3737.5934691] cpu18: End traceback...
Stopped in pid 14611.14611 (conftest) at netbsd:cpu_Debugger+0x4: ret
db{18}>

On Mon, 1 Jun 2020, Jason Thorpe wrote:

>> On Jun 1, 2020, at 6:36 AM, Kamil Rytarowski wrote:
>>
>> lwp_exit() used to work for curlwp and !curlwp.
>>
>> There is a regression: a call to lwp_thread_cleanup(l) was introduced in
>> lwp_exit(), and it asserts l == curlwp, effectively restricting
>> lwp_exit() to curlwp only.
>
> I just reviewed the entire code path below that assertion again, and nothing
> in the current incarnation of the code relies on l == curlwp. I've removed
> the assertion just now.
>
> -- thorpej
Re: panic: kernel diagnostic assertion "l == curlwp"
> On Jun 1, 2020, at 6:36 AM, Kamil Rytarowski wrote:
>
> lwp_exit() used to work for curlwp and !curlwp.
>
> There is a regression: a call to lwp_thread_cleanup(l) was introduced in
> lwp_exit(), and it asserts l == curlwp, effectively restricting
> lwp_exit() to curlwp only.

I just reviewed the entire code path below that assertion again, and nothing in the current incarnation of the code relies on l == curlwp. I've removed the assertion just now.

-- thorpej
Re: panic: kernel diagnostic assertion "l == curlwp"
lwp_exit() used to work for curlwp and !curlwp.

There is a regression: a call to lwp_thread_cleanup(l) was introduced in lwp_exit(), and it asserts l == curlwp, effectively restricting lwp_exit() to curlwp only.

This was introduced by:

commit 2de05dfb9c516db5509614b7be366e31205ceeaa
Author: thorpej
Date: Sat Apr 4 20:20:12 2020 +

    Add support for lazily generating a "global thread ID" for a LWP. This
    identifier uniquely identifies an LWP across the entire system, and will
    be used in future improvements in user-space synchronization primitives.

    (Test disabled and libc stub not included intentionally so as to avoid
    multiple libc version bumps.)

On 01.06.2020 14:45, Jared McNeill wrote:
> Looks like lwp_exit is called with something other than curlwp in the
> sys__lwp_create error path:
>
> https://nxr.netbsd.org/xref/src/sys/kern/sys_lwp.c#156
>
>
> On Mon, 1 Jun 2020, Jared McNeill wrote:
>
>> Just hit this panic on 9.99.64:
>>
>> [ 6717.5700161] panic: kernel diagnostic assertion "l == curlwp" failed: file "/home/source/ab/HEAD/src/sys/kern/kern_lwp.c", line 2063
>> [ 6717.5800161] cpu18: Begin traceback...
>> [ 6717.5900170] trace fp c0088f5d3c50
>> [ 6717.5900170] fp c0088f5d3c70 vpanic() at c04b0334 netbsd:vpanic+0x15c
>> [ 6717.6000167] fp c0088f5d3ce0 kern_assert() at c07ce26c netbsd:kern_assert+0x5c
>> [ 6717.6100211] fp c0088f5d3d70 lwp_thread_cleanup() at c045f368 netbsd:lwp_thread_cleanup+0x80
>> [ 6717.6200270] fp c0088f5d3d90 lwp_exit() at c045f4ac netbsd:lwp_exit+0xcc
>> [ 6717.6200270] fp c0088f5d3dd0 sys__lwp_create() at c04c2f00 netbsd:sys__lwp_create+0xe8
>> [ 6717.6300215] fp c0088f5d3e20 syscall() at c008a63c netbsd:syscall+0x18c
>> [ 6717.6400221] tf c0088f5d3ed0 el0_trap() at c0088c74 netbsd:el0_trap
>> [ 6717.6500214] trapframe 0xc0088f5d3ed0 (304 bytes)
>> [ 6717.6500214] pc=f8abc4638958, spsr=2000
>> [ 6717.6600309] esr=56000135, far=f8abc465b030
>> [ 6717.6600309] x0=ffe68500, x1=f8abc4a1a260
>> [ 6717.6700228] x2=f8abc49fe200, x3=
>> [ 6717.6800338] x4=ffe684d0, x5=0030
>> [ 6717.6800338] x6=ffe68500, x7=f8abc4760208
>> [ 6717.6900234] x8=0001, x9=1003
>> [ 6717.6900234] x10=000c, x11=0001
>> [ 6717.7000252] x12=f8abc49e91d8, x13=
>> [ 6717.7000252] x14=, x15=f8abc4000980
>> [ 6717.7100242] x16=f8abc4a122e0, x17=f8abc4638954
>> [ 6717.7200306] x18=0001, x19=ffe68500
>> [ 6717.7200306] x20=f9a72e60, x21=0002
>> [ 6717.7300283] x22=f8abc4a1a260, x23=f8abc49fe200
>> [ 6717.7300283] x24=f9a72000, x25=ffe684d0
>> [ 6717.7400357] x26=f9a732a8, x27=f9a73070
>> [ 6717.7400357] x28=f8abc4a1b400, fp=x29=ffe68510
>> [ 6717.7500291] lr=x30=f8abc49fe224, sp=ffe68420
>> [ 6717.7600300]
>> [ 6717.7600300] cpu18: End traceback...
>> Stopped in pid 152.152 (conftest) at netbsd:cpu_Debugger+0x4: ret
>> db{18}>
Re: panic: kernel diagnostic assertion "l == curlwp"
Looks like lwp_exit is called with something other than curlwp in the sys__lwp_create error path:

https://nxr.netbsd.org/xref/src/sys/kern/sys_lwp.c#156

On Mon, 1 Jun 2020, Jared McNeill wrote:

> Just hit this panic on 9.99.64:
>
> [ 6717.5700161] panic: kernel diagnostic assertion "l == curlwp" failed: file "/home/source/ab/HEAD/src/sys/kern/kern_lwp.c", line 2063
> [ 6717.5800161] cpu18: Begin traceback...
> [ 6717.5900170] trace fp c0088f5d3c50
> [ 6717.5900170] fp c0088f5d3c70 vpanic() at c04b0334 netbsd:vpanic+0x15c
> [ 6717.6000167] fp c0088f5d3ce0 kern_assert() at c07ce26c netbsd:kern_assert+0x5c
> [ 6717.6100211] fp c0088f5d3d70 lwp_thread_cleanup() at c045f368 netbsd:lwp_thread_cleanup+0x80
> [ 6717.6200270] fp c0088f5d3d90 lwp_exit() at c045f4ac netbsd:lwp_exit+0xcc
> [ 6717.6200270] fp c0088f5d3dd0 sys__lwp_create() at c04c2f00 netbsd:sys__lwp_create+0xe8
> [ 6717.6300215] fp c0088f5d3e20 syscall() at c008a63c netbsd:syscall+0x18c
> [ 6717.6400221] tf c0088f5d3ed0 el0_trap() at c0088c74 netbsd:el0_trap
> [ 6717.6500214] trapframe 0xc0088f5d3ed0 (304 bytes)
> [ 6717.6500214] pc=f8abc4638958, spsr=2000
> [ 6717.6600309] esr=56000135, far=f8abc465b030
> [ 6717.6600309] x0=ffe68500, x1=f8abc4a1a260
> [ 6717.6700228] x2=f8abc49fe200, x3=
> [ 6717.6800338] x4=ffe684d0, x5=0030
> [ 6717.6800338] x6=ffe68500, x7=f8abc4760208
> [ 6717.6900234] x8=0001, x9=1003
> [ 6717.6900234] x10=000c, x11=0001
> [ 6717.7000252] x12=f8abc49e91d8, x13=
> [ 6717.7000252] x14=, x15=f8abc4000980
> [ 6717.7100242] x16=f8abc4a122e0, x17=f8abc4638954
> [ 6717.7200306] x18=0001, x19=ffe68500
> [ 6717.7200306] x20=f9a72e60, x21=0002
> [ 6717.7300283] x22=f8abc4a1a260, x23=f8abc49fe200
> [ 6717.7300283] x24=f9a72000, x25=ffe684d0
> [ 6717.7400357] x26=f9a732a8, x27=f9a73070
> [ 6717.7400357] x28=f8abc4a1b400, fp=x29=ffe68510
> [ 6717.7500291] lr=x30=f8abc49fe224, sp=ffe68420
> [ 6717.7600300]
> [ 6717.7600300] cpu18: End traceback...
> Stopped in pid 152.152 (conftest) at netbsd:cpu_Debugger+0x4: ret
> db{18}>
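
For readers who do not have sys_lwp.c open, the shape of that error path can be sketched in userspace C. This is a rough model and not the kernel source: every model_* name is hypothetical, model_copyout() just treats a NULL destination as a fault, and the new LWP is a static object rather than something the scheduler knows about. What it shows is the one point from this thread: when the copyout of the new LWP ID fails, the exit routine is invoked on the new LWP, which is not curlwp.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct lwp; not the real layout. */
struct model_lwp {
	int  lid;     /* LWP ID that would be copied out to userspace */
	bool exited;  /* set once model_lwp_exit() has run on this LWP */
};

static struct model_lwp model_lwp0 = { 1, false };   /* the calling LWP */
static struct model_lwp *model_curlwp = &model_lwp0; /* stand-in for curlwp */

/* Counts calls to model_lwp_exit() made on an LWP other than curlwp. */
static int exits_on_other_lwp;

/* Stand-in for copyout(): a NULL user pointer is treated as a fault. */
static int
model_copyout(const int *ksrc, int *udst)
{
	if (udst == NULL)
		return EFAULT;
	*udst = *ksrc;
	return 0;
}

/* Stand-in for lwp_exit(): anything here asserting l == curlwp would fire. */
static void
model_lwp_exit(struct model_lwp *l)
{
	if (l != model_curlwp)
		exits_on_other_lwp++;
	l->exited = true;
}

/* Models the create path: make the LWP, then report its ID to the caller. */
static int
model_lwp_create(int *user_lidp)
{
	static struct model_lwp newl;
	int error;

	newl.lid = 2;
	newl.exited = false;

	error = model_copyout(&newl.lid, user_lidp);
	if (error != 0) {
		/* Error path: tear down the new LWP, which is not curlwp. */
		model_lwp_exit(&newl);
		return error;
	}
	return 0;
}
```

Calling model_lwp_create() with a valid pointer returns 0 and leaves exits_on_other_lwp untouched; calling it with NULL returns EFAULT and bumps the counter, which corresponds to the lwp_exit() frame under sys__lwp_create() in the trace.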
panic: kernel diagnostic assertion "l == curlwp"
Just hit this panic on 9.99.64:

[ 6717.5700161] panic: kernel diagnostic assertion "l == curlwp" failed: file "/home/source/ab/HEAD/src/sys/kern/kern_lwp.c", line 2063
[ 6717.5800161] cpu18: Begin traceback...
[ 6717.5900170] trace fp c0088f5d3c50
[ 6717.5900170] fp c0088f5d3c70 vpanic() at c04b0334 netbsd:vpanic+0x15c
[ 6717.6000167] fp c0088f5d3ce0 kern_assert() at c07ce26c netbsd:kern_assert+0x5c
[ 6717.6100211] fp c0088f5d3d70 lwp_thread_cleanup() at c045f368 netbsd:lwp_thread_cleanup+0x80
[ 6717.6200270] fp c0088f5d3d90 lwp_exit() at c045f4ac netbsd:lwp_exit+0xcc
[ 6717.6200270] fp c0088f5d3dd0 sys__lwp_create() at c04c2f00 netbsd:sys__lwp_create+0xe8
[ 6717.6300215] fp c0088f5d3e20 syscall() at c008a63c netbsd:syscall+0x18c
[ 6717.6400221] tf c0088f5d3ed0 el0_trap() at c0088c74 netbsd:el0_trap
[ 6717.6500214] trapframe 0xc0088f5d3ed0 (304 bytes)
[ 6717.6500214] pc=f8abc4638958, spsr=2000
[ 6717.6600309] esr=56000135, far=f8abc465b030
[ 6717.6600309] x0=ffe68500, x1=f8abc4a1a260
[ 6717.6700228] x2=f8abc49fe200, x3=
[ 6717.6800338] x4=ffe684d0, x5=0030
[ 6717.6800338] x6=ffe68500, x7=f8abc4760208
[ 6717.6900234] x8=0001, x9=1003
[ 6717.6900234] x10=000c, x11=0001
[ 6717.7000252] x12=f8abc49e91d8, x13=
[ 6717.7000252] x14=, x15=f8abc4000980
[ 6717.7100242] x16=f8abc4a122e0, x17=f8abc4638954
[ 6717.7200306] x18=0001, x19=ffe68500
[ 6717.7200306] x20=f9a72e60, x21=0002
[ 6717.7300283] x22=f8abc4a1a260, x23=f8abc49fe200
[ 6717.7300283] x24=f9a72000, x25=ffe684d0
[ 6717.7400357] x26=f9a732a8, x27=f9a73070
[ 6717.7400357] x28=f8abc4a1b400, fp=x29=ffe68510
[ 6717.7500291] lr=x30=f8abc49fe224, sp=ffe68420
[ 6717.7600300]
[ 6717.7600300] cpu18: End traceback...
Stopped in pid 152.152 (conftest) at netbsd:cpu_Debugger+0x4: ret
db{18}>