Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jason Thorpe


> On Jun 6, 2020, at 10:19 AM, Jared McNeill  wrote:
> 
> On Sat, 6 Jun 2020, Jason Thorpe wrote:
> 
>> 
>>> On Jun 6, 2020, at 9:21 AM, Jared McNeill  wrote:
>>> 
>>> KASSERT-aside, I am left wondering based on that stack trace how copyout on 
>>> aarch64 can fail here at all (that seems to be the only way that lwp_exit 
>>> from sys__lwp_create can happen).
>> 
>> What were you doing at the time?  I guess obviously running a program that 
>> uses pthreads...
> 
> pkgsrc bulk build, so ¯\_(ツ)_/¯

Alright, well, I checked in a unit test for it, as well as a fix.  You'll have 
to figure out why the copyout() failed, though.

-- thorpej



Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jared McNeill

On Sat, 6 Jun 2020, Jason Thorpe wrote:

>> On Jun 6, 2020, at 9:21 AM, Jared McNeill  wrote:
>>
>> KASSERT-aside, I am left wondering based on that stack trace how copyout on
>> aarch64 can fail here at all (that seems to be the only way that lwp_exit
>> from sys__lwp_create can happen).
>
> What were you doing at the time?  I guess obviously running a program that
> uses pthreads...

pkgsrc bulk build, so ¯\_(ツ)_/¯

Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jason Thorpe


> On Jun 6, 2020, at 10:19 AM, Jared McNeill  wrote:
> 
> On Sat, 6 Jun 2020, Jason Thorpe wrote:
> 
>> 
>>> On Jun 6, 2020, at 9:21 AM, Jared McNeill  wrote:
>>> 
>>> KASSERT-aside, I am left wondering based on that stack trace how copyout on 
>>> aarch64 can fail here at all (that seems to be the only way that lwp_exit 
>>> from sys__lwp_create can happen).
>> 
>> What were you doing at the time?  I guess obviously running a program that 
>> uses pthreads...
> 
> pkgsrc bulk build, so ¯\_(ツ)_/¯

Well, at the very least we should make a unit test for _lwp_create(2) that 
exercises the error path :-)

-- thorpej



Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jason Thorpe


> On Jun 6, 2020, at 9:21 AM, Jared McNeill  wrote:
> 
> KASSERT-aside, I am left wondering based on that stack trace how copyout on 
> aarch64 can fail here at all (that seems to be the only way that lwp_exit 
> from sys__lwp_create can happen).

What were you doing at the time?  I guess obviously running a program that uses 
pthreads...

-- thorpej



Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jared McNeill
KASSERT-aside, I am left wondering based on that stack trace how copyout 
on aarch64 can fail here at all (that seems to be the only way that 
lwp_exit from sys__lwp_create can happen).


On Sat, 6 Jun 2020, Jason Thorpe wrote:

> +rmind@
> +ad@
>
>> On Jun 6, 2020, at 5:00 AM, Jared McNeill  wrote:
>>
>> Looks like I hit another one on this path with your latest fix in place:
>>
>> [ 3737.4034537] panic: kernel diagnostic assertion "l == curlwp || ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)" failed: file "/home/source/ab/HEAD/src/sys/kern/subr_pcu.c", line 133
>
> This assertion was added in rev 1.18 by Mindaugas, 6 years ago.  It obviously
> needs to be adjusted to handle the "lwp creation failed" case, but I'd have
> to study that code a lot more before I feel comfortable doing so.
>
>> [ 3737.4134523] cpu18: Begin traceback...
>> [ 3737.4234525] trace fp c0086cad0c50
>> [ 3737.4234525] fp c0086cad0c70 vpanic() at c04b230c netbsd:vpanic+0x15c
>> [ 3737.4334526] fp c0086cad0ce0 kern_assert() at c07d052c netbsd:kern_assert+0x5c
>> [ 3737.4434535] fp c0086cad0d70 pcu_discard_all() at c04a9a58 netbsd:pcu_discard_all+0x58
>> [ 3737.4534583] fp c0086cad0d90 lwp_exit() at c0461558 netbsd:lwp_exit+0x1b0
>> [ 3737.4534583] fp c0086cad0dd0 sys__lwp_create() at c04c4ed8 netbsd:sys__lwp_create+0xe8
>> [ 3737.4634613] fp c0086cad0e20 syscall() at c008a624 netbsd:syscall+0x18c
>> [ 3737.4734599] tf c0086cad0ed0 el0_trap() at c0088c34 netbsd:el0_trap
>
> -- thorpej






Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jason Thorpe
+rmind@
+ad@

> On Jun 6, 2020, at 5:00 AM, Jared McNeill  wrote:
> 
> Looks like I hit another one on this path with your latest fix in place:
> 
> [ 3737.4034537] panic: kernel diagnostic assertion "l == curlwp || 
> ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)" failed: file 
> "/home/source/ab/HEAD/src/sys/kern/subr_pcu.c", line 133

This assertion was added in rev 1.18 by Mindaugas, 6 years ago.  It obviously 
needs to be adjusted to handle the "lwp creation failed" case, but I'd have to 
study that code a lot more before I feel comfortable doing so.

> [ 3737.4134523] cpu18: Begin traceback...
> [ 3737.4234525] trace fp c0086cad0c50
> [ 3737.4234525] fp c0086cad0c70 vpanic() at c04b230c 
> netbsd:vpanic+0x15c
> [ 3737.4334526] fp c0086cad0ce0 kern_assert() at c07d052c 
> netbsd:kern_assert+0x5c
> [ 3737.4434535] fp c0086cad0d70 pcu_discard_all() at c04a9a58 
> netbsd:pcu_discard_all+0x58
> [ 3737.4534583] fp c0086cad0d90 lwp_exit() at c0461558 
> netbsd:lwp_exit+0x1b0
> [ 3737.4534583] fp c0086cad0dd0 sys__lwp_create() at c04c4ed8 
> netbsd:sys__lwp_create+0xe8
> [ 3737.4634613] fp c0086cad0e20 syscall() at c008a624 
> netbsd:syscall+0x18c
> [ 3737.4734599] tf c0086cad0ed0 el0_trap() at c0088c34 
> netbsd:el0_trap

-- thorpej



Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-06 Thread Jared McNeill

Looks like I hit another one on this path with your latest fix in place:

[ 3737.4034537] panic: kernel diagnostic assertion "l == curlwp || ((l->l_flag & LW_SYSTEM) && pcu_valid == 0)" failed: file "/home/source/ab/HEAD/src/sys/kern/subr_pcu.c", line 133
[ 3737.4134523] cpu18: Begin traceback...
[ 3737.4234525] trace fp c0086cad0c50
[ 3737.4234525] fp c0086cad0c70 vpanic() at c04b230c netbsd:vpanic+0x15c
[ 3737.4334526] fp c0086cad0ce0 kern_assert() at c07d052c netbsd:kern_assert+0x5c
[ 3737.4434535] fp c0086cad0d70 pcu_discard_all() at c04a9a58 netbsd:pcu_discard_all+0x58
[ 3737.4534583] fp c0086cad0d90 lwp_exit() at c0461558 netbsd:lwp_exit+0x1b0
[ 3737.4534583] fp c0086cad0dd0 sys__lwp_create() at c04c4ed8 netbsd:sys__lwp_create+0xe8
[ 3737.4634613] fp c0086cad0e20 syscall() at c008a624 netbsd:syscall+0x18c
[ 3737.4734599] tf c0086cad0ed0 el0_trap() at c0088c34 netbsd:el0_trap
[ 3737.4834609]  trapframe 0xc0086cad0ed0 (304 bytes)
[ 3737.4834609] pc=fdbd319c8958,   spsr=2000
[ 3737.4934603]    esr=56000135,    far=fdbd319eb030
[ 3737.4934603] x0=ffbc41c0, x1=fdbd31db2260
[ 3737.5034629] x2=fdbd31d8e200, x3=
[ 3737.5134632] x4=ffbc4190, x5=0030
[ 3737.5134632] x6=ffbc41c0, x7=fdbd31af0208
[ 3737.5234664] x8=0001, x9=1003
[ 3737.5234664]    x10=000c,    x11=0001
[ 3737.5334641]    x12=fdbd31da5c18,    x13=
[ 3737.5334641]    x14=,    x15=fdbd31400980
[ 3737.5434637]    x16=fdbd31da22e0,    x17=fdbd319c8954
[ 3737.5534632]    x18=0001,    x19=ffbc41c0
[ 3737.5534632]    x20=ec172e60,    x21=0002
[ 3737.5634733]    x22=fdbd31db2260,    x23=fdbd31d8e200
[ 3737.5634733]    x24=ec172000,    x25=ffbc4190
[ 3737.5734687]    x26=ec1732a8,    x27=ec173070
[ 3737.5734687]    x28=fdbd31db3400, fp=x29=ffbc41d0
[ 3737.5834824] lr=x30=fdbd31d8e224, sp=ffbc40e0
[ 3737.5934691] 
[ 3737.5934691] cpu18: End traceback...
Stopped in pid 14611.14611 (conftest) at   netbsd:cpu_Debugger+0x4:    ret
db{18}>


On Mon, 1 Jun 2020, Jason Thorpe wrote:

>> On Jun 1, 2020, at 6:36 AM, Kamil Rytarowski  wrote:
>>
>> lwp_exit() used to work for curlwp and !curlwp.
>>
>> There is a regression: lwp_exit() now calls newly introduced code,
>> lwp_thread_cleanup(l), that asserts l == curlwp, effectively
>> restricting lwp_exit() to operating on curlwp only.
>
> I just reviewed the entire code path below that assertion again, and nothing
> in the current incarnation of the code relies on l == curlwp.  I've removed
> the assertion just now.
>
> -- thorpej




Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-01 Thread Jason Thorpe



> On Jun 1, 2020, at 6:36 AM, Kamil Rytarowski  wrote:
> 
> lwp_exit() used to work for curlwp and !curlwp.
> 
> There is a regression: lwp_exit() now calls newly introduced code,
> lwp_thread_cleanup(l), that asserts l == curlwp, effectively
> restricting lwp_exit() to operating on curlwp only.

I just reviewed the entire code path below that assertion again, and nothing in 
the current incarnation of the code relies on l == curlwp.  I've removed the 
assertion just now.

-- thorpej



Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-01 Thread Kamil Rytarowski
lwp_exit() used to work for curlwp and !curlwp.

There is a regression: lwp_exit() now calls newly introduced code,
lwp_thread_cleanup(l), that asserts l == curlwp, effectively
restricting lwp_exit() to operating on curlwp only.

This was introduced by:

commit 2de05dfb9c516db5509614b7be366e31205ceeaa
Author: thorpej 
Date:   Sat Apr 4 20:20:12 2020 +

Add support for lazily generating a "global thread ID" for a LWP.  This
identifier uniquely identifies an LWP across the entire system, and will
be used in future improvements in user-space synchronization primitives.

(Test disabled and libc stub not included intentionally so as to avoid
multiple libc version bumps.)

On 01.06.2020 14:45, Jared McNeill wrote:
> Looks like lwp_exit is called with something other than curlwp in the
> sys__lwp_create error path:
> 
> https://nxr.netbsd.org/xref/src/sys/kern/sys_lwp.c#156
> 
> 
> On Mon, 1 Jun 2020, Jared McNeill wrote:
> 
>> Just hit this panic on 9.99.64:
>>
>> [ 6717.5700161] panic: kernel diagnostic assertion "l == curlwp"
>> failed: file "/home/source/ab/HEAD/src/sys/kern/kern_lwp.c", line 2063
>> [ 6717.5800161] cpu18: Begin traceback...
>> [ 6717.5900170] trace fp c0088f5d3c50
>> [ 6717.5900170] fp c0088f5d3c70 vpanic() at c04b0334
>> netbsd:vpanic+0x15c
>> [ 6717.6000167] fp c0088f5d3ce0 kern_assert() at c07ce26c
>> netbsd:kern_assert+0x5c
>> [ 6717.6100211] fp c0088f5d3d70 lwp_thread_cleanup() at
>> c045f368 netbsd:lwp_thread_cleanup+0x80
>> [ 6717.6200270] fp c0088f5d3d90 lwp_exit() at c045f4ac
>> netbsd:lwp_exit+0xcc
>> [ 6717.6200270] fp c0088f5d3dd0 sys__lwp_create() at
>> c04c2f00 netbsd:sys__lwp_create+0xe8
>> [ 6717.6300215] fp c0088f5d3e20 syscall() at c008a63c
>> netbsd:syscall+0x18c
>> [ 6717.6400221] tf c0088f5d3ed0 el0_trap() at c0088c74
>> netbsd:el0_trap
>> [ 6717.6500214]  trapframe 0xc0088f5d3ed0 (304 bytes) 
>> [ 6717.6500214] pc=f8abc4638958,   spsr=2000
>> [ 6717.6600309]    esr=56000135,    far=f8abc465b030
>> [ 6717.6600309] x0=ffe68500, x1=f8abc4a1a260
>> [ 6717.6700228] x2=f8abc49fe200, x3=
>> [ 6717.6800338] x4=ffe684d0, x5=0030
>> [ 6717.6800338] x6=ffe68500, x7=f8abc4760208
>> [ 6717.6900234] x8=0001, x9=1003
>> [ 6717.6900234]    x10=000c,    x11=0001
>> [ 6717.7000252]    x12=f8abc49e91d8,    x13=
>> [ 6717.7000252]    x14=,    x15=f8abc4000980
>> [ 6717.7100242]    x16=f8abc4a122e0,    x17=f8abc4638954
>> [ 6717.7200306]    x18=0001,    x19=ffe68500
>> [ 6717.7200306]    x20=f9a72e60,    x21=0002
>> [ 6717.7300283]    x22=f8abc4a1a260,    x23=f8abc49fe200
>> [ 6717.7300283]    x24=f9a72000,    x25=ffe684d0
>> [ 6717.7400357]    x26=f9a732a8,    x27=f9a73070
>> [ 6717.7400357]    x28=f8abc4a1b400, fp=x29=ffe68510
>> [ 6717.7500291] lr=x30=f8abc49fe224, sp=ffe68420
>> [ 6717.7600300] 
>> [ 6717.7600300] cpu18: End traceback...
>> Stopped in pid 152.152 (conftest) at   
>> netbsd:cpu_Debugger+0x4:    ret
>> db{18}>
>>
>>
>>






Re: panic: kernel diagnostic assertion "l == curlwp"

2020-06-01 Thread Jared McNeill
Looks like lwp_exit is called with something other than curlwp in the 
sys__lwp_create error path:


https://nxr.netbsd.org/xref/src/sys/kern/sys_lwp.c#156


On Mon, 1 Jun 2020, Jared McNeill wrote:

> Just hit this panic on 9.99.64:
>
> [ 6717.5700161] panic: kernel diagnostic assertion "l == curlwp" failed: file "/home/source/ab/HEAD/src/sys/kern/kern_lwp.c", line 2063
> [ 6717.5800161] cpu18: Begin traceback...
> [ 6717.5900170] trace fp c0088f5d3c50
> [ 6717.5900170] fp c0088f5d3c70 vpanic() at c04b0334 netbsd:vpanic+0x15c
> [ 6717.6000167] fp c0088f5d3ce0 kern_assert() at c07ce26c netbsd:kern_assert+0x5c
> [ 6717.6100211] fp c0088f5d3d70 lwp_thread_cleanup() at c045f368 netbsd:lwp_thread_cleanup+0x80
> [ 6717.6200270] fp c0088f5d3d90 lwp_exit() at c045f4ac netbsd:lwp_exit+0xcc
> [ 6717.6200270] fp c0088f5d3dd0 sys__lwp_create() at c04c2f00 netbsd:sys__lwp_create+0xe8
> [ 6717.6300215] fp c0088f5d3e20 syscall() at c008a63c netbsd:syscall+0x18c
> [ 6717.6400221] tf c0088f5d3ed0 el0_trap() at c0088c74 netbsd:el0_trap
> [ 6717.6500214]  trapframe 0xc0088f5d3ed0 (304 bytes)
> [ 6717.6500214] pc=f8abc4638958,   spsr=2000
> [ 6717.6600309]    esr=56000135,    far=f8abc465b030
> [ 6717.6600309] x0=ffe68500, x1=f8abc4a1a260
> [ 6717.6700228] x2=f8abc49fe200, x3=
> [ 6717.6800338] x4=ffe684d0, x5=0030
> [ 6717.6800338] x6=ffe68500, x7=f8abc4760208
> [ 6717.6900234] x8=0001, x9=1003
> [ 6717.6900234]    x10=000c,    x11=0001
> [ 6717.7000252]    x12=f8abc49e91d8,    x13=
> [ 6717.7000252]    x14=,    x15=f8abc4000980
> [ 6717.7100242]    x16=f8abc4a122e0,    x17=f8abc4638954
> [ 6717.7200306]    x18=0001,    x19=ffe68500
> [ 6717.7200306]    x20=f9a72e60,    x21=0002
> [ 6717.7300283]    x22=f8abc4a1a260,    x23=f8abc49fe200
> [ 6717.7300283]    x24=f9a72000,    x25=ffe684d0
> [ 6717.7400357]    x26=f9a732a8,    x27=f9a73070
> [ 6717.7400357]    x28=f8abc4a1b400, fp=x29=ffe68510
> [ 6717.7500291] lr=x30=f8abc49fe224, sp=ffe68420
> [ 6717.7600300] 
> [ 6717.7600300] cpu18: End traceback...
> Stopped in pid 152.152 (conftest) at   netbsd:cpu_Debugger+0x4:    ret
> db{18}>





panic: kernel diagnostic assertion "l == curlwp"

2020-06-01 Thread Jared McNeill

Just hit this panic on 9.99.64:

[ 6717.5700161] panic: kernel diagnostic assertion "l == curlwp" failed: file "/home/source/ab/HEAD/src/sys/kern/kern_lwp.c", line 2063
[ 6717.5800161] cpu18: Begin traceback...
[ 6717.5900170] trace fp c0088f5d3c50
[ 6717.5900170] fp c0088f5d3c70 vpanic() at c04b0334 netbsd:vpanic+0x15c
[ 6717.6000167] fp c0088f5d3ce0 kern_assert() at c07ce26c netbsd:kern_assert+0x5c
[ 6717.6100211] fp c0088f5d3d70 lwp_thread_cleanup() at c045f368 netbsd:lwp_thread_cleanup+0x80
[ 6717.6200270] fp c0088f5d3d90 lwp_exit() at c045f4ac netbsd:lwp_exit+0xcc
[ 6717.6200270] fp c0088f5d3dd0 sys__lwp_create() at c04c2f00 netbsd:sys__lwp_create+0xe8
[ 6717.6300215] fp c0088f5d3e20 syscall() at c008a63c netbsd:syscall+0x18c
[ 6717.6400221] tf c0088f5d3ed0 el0_trap() at c0088c74 netbsd:el0_trap
[ 6717.6500214]  trapframe 0xc0088f5d3ed0 (304 bytes)
[ 6717.6500214] pc=f8abc4638958,   spsr=2000
[ 6717.6600309]    esr=56000135,    far=f8abc465b030
[ 6717.6600309] x0=ffe68500, x1=f8abc4a1a260
[ 6717.6700228] x2=f8abc49fe200, x3=
[ 6717.6800338] x4=ffe684d0, x5=0030
[ 6717.6800338] x6=ffe68500, x7=f8abc4760208
[ 6717.6900234] x8=0001, x9=1003
[ 6717.6900234]    x10=000c,    x11=0001
[ 6717.7000252]    x12=f8abc49e91d8,    x13=
[ 6717.7000252]    x14=,    x15=f8abc4000980
[ 6717.7100242]    x16=f8abc4a122e0,    x17=f8abc4638954
[ 6717.7200306]    x18=0001,    x19=ffe68500
[ 6717.7200306]    x20=f9a72e60,    x21=0002
[ 6717.7300283]    x22=f8abc4a1a260,    x23=f8abc49fe200
[ 6717.7300283]    x24=f9a72000,    x25=ffe684d0
[ 6717.7400357]    x26=f9a732a8,    x27=f9a73070
[ 6717.7400357]    x28=f8abc4a1b400, fp=x29=ffe68510
[ 6717.7500291] lr=x30=f8abc49fe224, sp=ffe68420
[ 6717.7600300] 
[ 6717.7600300] cpu18: End traceback...
Stopped in pid 152.152 (conftest) at   netbsd:cpu_Debugger+0x4:    ret
db{18}>