2008/7/11 Simon 'corecode' Schubert <[EMAIL PROTECTED]>:
> Nicolas Thery wrote:
>>
>> I'm looking into this.  There is a deadlock involving the gdb lwp and
>> 2 vkernel lwps.  I hope to have a clearer understanding and a fix this
>> week-end.
>
> Great!  When you're saying you are looking into this, I don't have a doubt
> that you will find the cause :)

Thanks for your trust ;-)

There is indeed a deadlock:

- The initial vkernel thread is sleeping on the user mutex associated
  with the vkd cothread.

- The vkd cothread sends a SIGIO (lwp_kill(2)) to the initial thread to
  simulate an interrupt.  The initial thread's sleep is interrupted and
  it is made runnable.

- When the cothread is about to return to userland from lwp_kill(2), it
  is preempted in userexit() and the initial thread runs.

- The initial thread handles the signal (issignal() called from
  tsleep()).  As the process is being debugged, proc_stop() is called,
  the process moves to SSTOP and the initial thread is stopped
  (tstop()).

- The cothread is then awakened and goes back to userland (that's a
  bug, it should stop too).

- The cothread eventually waits on its condition variable.

- Meanwhile GDB blocks on wait(2) forever because only one lwp out of
  two is stopped (p_nstopped < p_nthreads).

The kernel tests whether the lwp should be stopped in userret():

        if (p->p_stat == SSTOP) {
                get_mplock();
                tstop();
                rel_mplock();
                goto recheck;
        }

However, userret() is called *before* userexit(), so a cothread that is
preempted in userexit() has already passed this test and is not stopped.

To confirm this hypothesis, I added the above code (with the if turned
into a while) in userexit() after the preemption points, and the vkernel
booted fine.  I observed another hang during shutdown though.

I'm not sure this is a correct fix.  I'll study this in more detail
tomorrow.

Another mystery remains: what change caused this regression?  Running
cvs annotate on various files didn't point to the culprit.
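For anyone who wants to follow the ordering without a kernel tree at hand, the sequence above can be replayed in a small userland toy model. This is only a sketch, not kernel code: every toy_* / TOY_* name is made up for illustration, toy_tstop() merely counts the lwp as stopped instead of blocking it (which is why the re-check is a plain if here rather than the while used in the experiment), and only the p_stat / p_nstopped / p_nthreads bookkeeping mirrors what is described in this mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the process states discussed above. */
enum toy_stat { TOY_SACTIVE, TOY_SSTOP };

/* Toy process: only the three fields the analysis relies on. */
struct toy_proc {
    enum toy_stat p_stat;
    int p_nthreads;   /* lwps in the process */
    int p_nstopped;   /* lwps currently stopped */
};

/* Toy tstop(): count the calling lwp as stopped (the real one blocks). */
void toy_tstop(struct toy_proc *p)
{
    assert(p->p_nstopped < p->p_nthreads);
    p->p_nstopped++;
}

/* The "should this lwp stop?" test, as performed in userret(). */
void toy_stop_check(struct toy_proc *p)
{
    if (p->p_stat == TOY_SSTOP)
        toy_tstop(p);
}

/*
 * Replay the event order from the analysis.  Returns true when every
 * lwp ends up stopped, i.e. the debugger's wait(2) could return.
 */
bool toy_replay(bool recheck_in_userexit)
{
    struct toy_proc p = { TOY_SACTIVE, 2, 0 };

    /* Cothread returns from lwp_kill(2): the userret() test runs
     * first, while the process is not yet in the stopped state. */
    toy_stop_check(&p);

    /* Cothread is preempted in userexit(); the initial thread handles
     * SIGIO, the process moves to the stopped state and the initial
     * thread stops. */
    p.p_stat = TOY_SSTOP;
    toy_tstop(&p);

    /* Cothread resumes in userexit().  Without a second test it goes
     * straight back to userland. */
    if (recheck_in_userexit)
        toy_stop_check(&p);

    return p.p_nstopped == p.p_nthreads;
}
```

Calling toy_replay(false) leaves one lwp out of two stopped (p_nstopped < p_nthreads, the hang seen from GDB), while toy_replay(true) stops both. The model says nothing about the shutdown hang or about whether re-checking in userexit() is the right place for a real fix.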