On Wed, 19 Jan 2011, Derrick Brashear wrote:

> On Wed, Jan 19, 2011 at 10:30 PM, Toby Burress wrote:
>> I was wondering if I could trouble someone to double-check my
>> diagnosis.
>>
>> When unmounting /afs, the master branch hangs.  It looks like
>> this is happening because osi_StopListener() in src/rx/FBSD/rx_knet.c
>> calls osi_NetSend() to tell the listener to go away, and then calls
>> afs_osi_Sleep().
>>
>> Then, in src/rx/rx_kcommon.c, rxk_ListenerProc() gets the signal and
>> calls osi_rxWakeup(), which should allow osi_StopListener() to return
>> and umount to exit.
>>
>> However, afs_osi_Sleep() is being called with rxk_ListenerPid as its
>> argument, while osi_rxWakeup() is called with afs_termState. Because
>> the two arguments differ, afs_getevent returns the wrong event to
>> osi_rxWakeup; as a result, wakeup() is never called and umount hangs.
>>
>> Editing rx_kcommon.c to use rxk_ListenerPid instead of afs_termState
>> allows umount to exit cleanly (although afsd isn't able to restart
>> after that;
>
> It's not supposed to. Unload the module, then reload it.
>
>> it looks like after the restart afs_getevent is being called with
>> something that just points to zeroed memory).
>>
>> Is this all wrong?  I spend most of my time in Python-land, so kernel
>> debugging is, uh, new to me.
>
> I'd have to look, but that sounds correct.

Same from me, with the addition that the shutdown code is known to be buggy in its present state and I haven't had much time to look at it. IIRC, there is a lock order reversal involved in the psignal() call, between the vnode lock for the UFS vnode of the /afs directory and the allproc lock, which has not been closely examined for deadlock potential. It should be in the Jabber logs.

Shutdown sometimes works by chance, when that codepath doesn't need to run. If you can reliably get it to (1) use that codepath and (2) shut down cleanly, please submit it to Gerrit.
I might also recommend using the rc script found in
http://web.mit.edu/freebsd/openafs/openafs.shar .

-Ben Kaduk
