On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> I've been finding egdb and gdb rather easily get stuck in an
> uninterruptible wait, e.g. when running the 'next' command after
> hitting a breakpoint.
> 
> So it's not possible to kill the debuggee or gdb and the only way to
> kill the debuggee process and free up its listening sockets seems to be
> to reboot the entire system.
> 
> Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> the uninterruptible gdb, so i don't know for sure what syscall is being
> run that is getting stuck.
> 
> The debuggee is a local build of the flightgear flight simulator.
> 
> Here's the output of ps for the debugger and debuggee:
> 
> 12419 p0  D        0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> 63921 p0  TX+      0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
> 
> I've tried using ktrace on egdb, and the kdump output ends like this:
> 
>  53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
>  53950 egdb     RET   wait4 97562/0x17d1a
>  53950 egdb     CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
>  53950 egdb     RET   futex 0
>  53950 egdb     CALL  futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
>  53950 egdb     RET   futex 0
>  53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  ptrace(PT_STEP,97562,0x1,0)
>  53950 egdb     RET   ptrace 0
>  53950 egdb     CALL  read(6,0x7f7ffffe9187,0x1)
>  53950 egdb     RET   read -1 errno 35 Resource temporarily unavailable
>  53950 egdb     CALL  poll(0x6441581e720,3,0)
>  53950 egdb     STRU  struct pollfd [3] { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10, events=0x1<POLLIN>, revents=0<> }
>  53950 egdb     RET   poll 0
>  53950 egdb     CALL  wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
> 
> Assuming that this is the actual end of the ktrace output and there
> isn't some missing ktrace output in a buffer somewhere, this looks
> like egdb is simply blocked in wait4(), which should be harmless and
> certainly not uninterruptable?

The single-thread check done by wait4() is non-interruptible.
When the debugger gets stuck, is it blocked in "suspend" state?

However, I think there is a bug in the single-thread switch code.
It looks like ps_singlecount can be decremented too many times. This is
probably a regression from making ps_singlecount unsigned and letting
single_thread_check() run without the kernel lock.
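
To illustrate why an extra decrement would matter, here is a minimal
userland sketch. It is not kernel code: struct fake_process and the
fake_*() helpers are hypothetical names made up for this example, the
decrement is not atomic, and the waiter's "sleep while the count is
greater than zero" condition is an assumption about how the count is
consumed. It only shows that an unsigned counter decremented once too
often wraps to UINT_MAX instead of stopping at zero.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the process state; not the kernel's struct. */
struct fake_process {
	unsigned int singlecount;	/* threads still expected to park */
	int wakeups;			/* times the waiter would be woken */
};

static void
fake_init(struct fake_process *pr, unsigned int nthreads)
{
	assert(nthreads > 0);
	pr->singlecount = nthreads;
	pr->wakeups = 0;
}

/*
 * Same shape as "if (atomic_dec_int_nv(&count) == 0) wakeup(&count)",
 * but with a plain decrement since this sketch is single-threaded.
 */
static void
fake_thread_parks(struct fake_process *pr)
{
	if (--pr->singlecount == 0)
		pr->wakeups++;
}

/* Assumed waiter condition: keep sleeping while threads remain. */
static int
fake_waiter_must_sleep(const struct fake_process *pr)
{
	return (pr->singlecount > 0);
}
```

With a count of 1, the first fake_thread_parks() call brings the count
to zero and records one wakeup. A second, spurious call wraps the count
to UINT_MAX, so fake_waiter_must_sleep() is true again and no further
wakeup is recorded.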

The bug might go away if single_thread_check() made sure that
P_SUSPSINGLE is set before the thread suspends. 

Does the following patch help? Even if it does, it probably needs
some refining.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c     15 Jun 2020 13:18:33 -0000      1.258
+++ kern/kern_sig.c     20 Jul 2020 04:27:30 -0000
@@ -1915,16 +1915,23 @@ single_thread_check(struct proc *p, int 
                                        return (EINTR);
                        }
 
-                       if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
-                               wakeup(&pr->ps_singlecount);
+                       SCHED_LOCK(s);
+                       if (p->p_flag & P_SUSPSINGLE) {
+                               if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
+                                       wakeup(&pr->ps_singlecount);
+                       } else if ((p->p_flag & P_WEXIT) == 0) {
+                               SCHED_UNLOCK(s);
+                               CPU_BUSY_CYCLE();
+                               continue;
+                       }
                        if (pr->ps_flags & PS_SINGLEEXIT) {
+                               SCHED_UNLOCK(s);
                                KERNEL_LOCK();
                                exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
-                               KERNEL_UNLOCK();
+                               /* NOTREACHED */
                        }
 
                        /* not exiting and don't need to unwind, so suspend */
-                       SCHED_LOCK(s);
                        p->p_stat = SSTOP;
                        mi_switch();
                        SCHED_UNLOCK(s);
