On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote: > I've been finding egdb and gdb rather easily get stuck in an > uninterruptible wait, e.g. when running the 'next' command after > hitting a breakpoint. > > So it's not possible to kill the debuggee or gdb and the only way to > kill the debuggee process and free up its listening sockets seems to be > to reboot the entire system. > > Perhaps unsurprisingly one cannot attach a second invocation of gdb to > the uninterruptible gdb, so i don't know for sure what syscall is being > run that is getting stuck. > > The debuggee is a local build of the flightgear flight simulator. > > Here's the output of ps for the debugger and debuggee: > > 12419 p0 D 0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set > print thread-events off -ex set print pretty on -ex run --args > build-walk/fgfs,clang,debug,opt,co > 63921 p0 TX+ 0:42.45 > /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe > --airport=egtk (fgfs,clang,debug) > > I've tried using ktrace on egdb, and the kdump output ends like this: > > 53950 egdb CALL wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0) > 53950 egdb RET wait4 97562/0x17d1a > 53950 egdb CALL ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12) > 53950 egdb RET ptrace 0 > 53950 egdb CALL ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0) > 53950 egdb RET ptrace 0 > 53950 egdb CALL > futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0) > 53950 egdb RET futex 0 > 53950 egdb CALL > futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0) > 53950 egdb RET futex 0 > 53950 egdb CALL ptrace(PT_IO,97562,0x7f7ffffe8a30,0) > 53950 egdb RET ptrace 0 > 53950 egdb CALL ptrace(PT_IO,97562,0x7f7ffffe8a30,0) > 53950 egdb RET ptrace 0 > 53950 egdb CALL ptrace(PT_STEP,97562,0x1,0) > 53950 egdb RET ptrace 0 > 53950 egdb CALL read(6,0x7f7ffffe9187,0x1) > 53950 egdb RET read -1 errno 35 Resource temporarily unavailable > 53950 egdb CALL poll(0x6441581e720,3,0) > 53950 egdb STRU struct pollfd [3] { fd=4, events=0x1<POLLIN>, > revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10, > events=0x1<POLLIN>, revents=0<> } > 53950 egdb RET poll 0 > 53950 egdb CALL wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0) > > Assuming that this is the actual end of the ktrace output and there > isn't some missing ktrace output in a buffer somewhere, this looks > like egdb is simply blocked in wait4(), which should be harmless and > certainly not uninterruptable?
The single-thread check done by wait4() is non-interruptible. When the debugger gets stuck, is it blocked in "suspend" state? However, I think there is a bug in the single-thread switch code. It looks that ps_singlecount can be decremented too much. This probably is a regression of making ps_singlecount unsigned and letting single_thread_check() run without the kernel lock. The bug might go away if single_thread_check() made sure that P_SUSPSINGLE is set before the thread suspends. Does the following patch help? Even if it does, it probably needs some refining. Index: kern/kern_sig.c =================================================================== RCS file: src/sys/kern/kern_sig.c,v retrieving revision 1.258 diff -u -p -r1.258 kern_sig.c --- kern/kern_sig.c 15 Jun 2020 13:18:33 -0000 1.258 +++ kern/kern_sig.c 20 Jul 2020 04:27:30 -0000 @@ -1915,16 +1915,23 @@ single_thread_check(struct proc *p, int return (EINTR); } - if (atomic_dec_int_nv(&pr->ps_singlecount) == 0) - wakeup(&pr->ps_singlecount); + SCHED_LOCK(s); + if (p->p_flag & P_SUSPSINGLE) { + if (atomic_dec_int_nv(&pr->ps_singlecount) == 0) + wakeup(&pr->ps_singlecount); + } else if ((p->p_flag & P_WEXIT) == 0) { + SCHED_UNLOCK(s); + CPU_BUSY_CYCLE(); + continue; + } if (pr->ps_flags & PS_SINGLEEXIT) { + SCHED_UNLOCK(s); KERNEL_LOCK(); exit1(p, 0, 0, EXIT_THREAD_NOCHECK); - KERNEL_UNLOCK(); + /* NOTREACHED */ } /* not exiting and don't need to unwind, so suspend */ - SCHED_LOCK(s); p->p_stat = SSTOP; mi_switch(); SCHED_UNLOCK(s);