https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=293382
Bug ID: 293382
Summary: Dead lock and kernel crash around closefp_impl
Product: Base System
Version: 14.3-STABLE
Hardware: Any
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: [email protected]
Reporter: [email protected]
Hi!
We've been running 14.4-STABLE for some time, and today a weird issue popped
up. All of a sudden, our multi-threaded network app deadlocked on some threads,
but not on others. We could neither attach to it with GDB nor kill it with
`kill -9`, so it looks like a hard lock inside the kernel. We managed to
collect a few samples of the kernel backtraces for this process with
`procstat -kk`. They were all essentially identical:
PID TID COMM TDNAME KSTACK
91545 101569 <redacted> - mi_switch+0xbd
_sx_xlock_hard+0x4ef kern_close+0x179 amd64_syscall+0x117
fast_syscall_common+0xf8
91545 102281 <redacted> - mi_switch+0xbd
_sx_xlock_hard+0x4ef kern_close+0x179 amd64_syscall+0x117
fast_syscall_common+0xf8
91545 102282 <redacted> <redacted-1> mi_switch+0xbd
sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1
umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54
sys__umtx_op+0x7e amd64_syscall+0x117 fast_syscall_common+0xf8
91545 102283 <redacted> <redacted-2> mi_switch+0xbd
sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1
kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b
kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117
fast_syscall_common+0xf8
91545 102284 <redacted> <redacted-3> mi_switch+0xbd
_sleep+0x1f3 knote_fdclose+0xac closefp_impl+0xd0 amd64_syscall+0x117
fast_syscall_common+0xf8
91545 102285 <redacted> <redacted-4> mi_switch+0xbd
sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1
kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b
kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117
fast_syscall_common+0xf8
91545 102286 <redacted> <redacted-5> mi_switch+0xbd
sleepq_catch_signals+0x2a2 sleepq_timedwait_sig+0x12 _sleep+0x1c1
kqueue_scan+0xa11 kqueue_kevent+0x13b kern_kevent_fp+0x4b
kern_kevent_generic+0xdf sys_kevent+0x61 amd64_syscall+0x117
fast_syscall_common+0xf8
Apparently, three threads were deadlocked: the first two, which are unnamed,
and `<redacted-3>`. The latter is the thread that handles inbound socket
connections, hundreds of thousands of them, mostly WebSocket. Two other threads
also use sockets, but for outbound connections. During normal operation,
sockets are opened and closed as needed. It seems that in some cases this can
lead to a deadlock, where one thread enters a state in the kernel in which it
hangs while holding a lock, preventing other threads from closing (or, more
generally, modifying) descriptors. The app is asynchronous and uses kqueue
extensively for its network sockets. We suspect `<redacted-3>` to be the
culprit, specifically its backtrace, where `closefp_impl` is involved.
And here's why. When this happened and the traffic was switched to a redundant
server, that server almost immediately panicked and went into reboot.
Fortunately, we got the core dump and were able to analyze it to some extent.
There we saw `closefp_impl` in the same thread, `<redacted-3>` (the same thread
by role; it ran on a different physical server):
Fatal trap 12: page fault while in kernel mode
cpuid = 22; apic id = 52
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80572e28
stack pointer = 0x28:0xfffffe071c126d70
frame pointer = 0x28:0xfffffe071c126dc0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 58518 (<redacted-3>)
rdi: fffff83402622be0 rsi: 0000000000000000 rdx: 0000000000000000
rcx: 0000000000000000 r8: fffff80160b9c520 r9: fffffe071c127000
rax: 0000000000000000 rbx: 0000000000031361 rbp: fffffe071c126dc0
r10: 0000000000000001 r11: 0000000000002af8 r12: fffff80160b9c000
r13: fffff87af7163e18 r14: fffff83402622be0 r15: fffff87af7163e00
trap number = 12
panic: page fault
cpuid = 22
time = 1771839128
KDB: stack backtrace:
#0 0xffffffff8061303d at kdb_backtrace+0x5d
#1 0xffffffff805c8091 at vpanic+0x161
#2 0xffffffff805c7f23 at panic+0x43
#3 0xffffffff80972f00 at trap_pfault+0x3e0
#4 0xffffffff8094af68 at calltrap+0x8
#5 0xffffffff8056b750 at closefp_impl+0xd0
#6 0xffffffff80973847 at amd64_syscall+0x117
#7 0xffffffff8094b85b at fast_syscall_common+0xf8
Inspecting its kernel stack:
(kgdb) bt
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1 doadump (textdump=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:405
#2 0xffffffff805c7beb in kern_reboot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:523
#3 0xffffffff805c80e9 in vpanic (fmt=0xffffffff809d2ae7 "%s",
ap=ap@entry=0xfffffe071c126c30) at /usr/src/sys/kern/kern_shutdown.c:967
#4 0xffffffff805c7f23 in panic (fmt=<unavailable>) at
/usr/src/sys/kern/kern_shutdown.c:891
#5 0xffffffff80972f00 in trap_fatal (frame=<optimized out>, eva=<optimized
out>) at /usr/src/sys/amd64/amd64/trap.c:1000
#6 0xffffffff80972f00 in trap_pfault (frame=0xfffffe071c126cb0,
usermode=false, signo=<optimized out>, ucode=<optimized out>)
#7 <signal handler called>
#8 0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0,
td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730
#9 knote_fdclose (td=0xfffff80160b9c000, fd=201569) at
/usr/src/sys/kern/kern_event.c:2695
#10 0xffffffff8056b750 in closefp_impl (fdp=0xfffffe0d1582a920, fd=0,
fp=0xfffff81090d2c5a0, td=0xfffff80160b9c000, audit=true) at
/usr/src/sys/kern/kern_descrip.c:1320
#11 0xffffffff80973847 in syscallenter (td=0xfffff80160b9c000) at
/usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:193
#12 amd64_syscall (td=0xfffff80160b9c000, traced=0) at
/usr/src/sys/amd64/amd64/trap.c:1241
#13 <signal handler called>
#14 0x000000082deed32a in ?? ()
Backtrace stopped: Cannot access memory at address 0x85d08dbc8
Within `knote_drop` we observe a NULL pointer dereference:
(kgdb) fr 8
#8 0xffffffff80572e28 in knote_drop (kn=0xfffff83402622be0,
td=0xfffff80160b9c000) at /usr/src/sys/kern/kern_event.c:2730
2730 kn->kn_fop->f_detach(kn);
(kgdb) l
2725 static void
2726 knote_drop(struct knote *kn, struct thread *td)
2727 {
2728
2729 if ((kn->kn_status & KN_DETACHED) == 0)
2730 kn->kn_fop->f_detach(kn);
2731 knote_drop_detached(kn, td);
2732 }
2733
2734 static void
(kgdb) p kn->kn_fop
$2 = (const struct filterops *) 0x0
If you need more info, please ask. We will be glad to provide it.
---------------
System info:
FreeBSD frv21.ukr.net 14.4-STABLE FreeBSD 14.4-STABLE
stable/14-n273658-2f91ff89c56e FRV21 amd64 1404500 1404500