On Thu, 12 Mar 2009 07:50:14 +0100, Dai Ngo <Dai.Ngo at sun.com> wrote:
> I'd like to have a code review for the change to fix CR 6768607. > > The problem was caused by an infinite loop in rfs4_cbinfo_hold(). > This thread put a hold on the DBE of a rfs4_deleg_state causing the reaper > thread to be delayed (forever). Since the deleg_state table was not cleaned, > this caused the reaper threads of the file and client table to also be > delayed, > due to the hold on their DBEs from the deleg_state entries. I added detailed > analysis in the evaluation section of the CR. > > The fix is to limit the number of retries to 5 (5 secs). > > webrev: http://cr.opensolaris.org/~dain/6768607/ > CR: http://monaco.sfbay/detail.jsf?cr=6768607 Dai, currently rfs4_cbinfo_hold() only returns NULL to its caller in the case of the callback state != CB_OK, the changes will introduce a new 2nd error return path for returning NULL without knowing about the callback state itself from the first blush, are we sure that it is impossible to return here with a cb_state of CB_OK ? ie. somehow leave a dangling rfs4_cbinfo_t with CB_OK around although we failed here for the caller. from the first blush it appears that there shouldn't be any harm as we'll immediately fail with RPC_FAILED in rfs4_do_callback(). fix looks technically correct to me, although the dance around with the cb_lock looks somewhat ... hmm...blinding. what I'd definitively request to see is extending the comments preceeding rfs4_cbinfo_hold() to give hints about why our while loop goes: 355 while (cbp->cb_newer.cb_new == TRUE && cbp->cb_nullcaller == FALSE) { and the new bailout happens the other way around with: 365 if (cbp->cb_newer.cb_new == FALSE || cbp->cb_nullcaller == TRUE) I'm pretty sure, after this has been putbacked, and 3 or 4 months went over, nobody can explain that code immediately anymore... ;-) thanks frankB