Hi Frank, Please see reply below: Frank Batschulat (Home) wrote: > On Thu, 12 Mar 2009 07:50:14 +0100, Dai Ngo <Dai.Ngo at sun.com> wrote: > > >> I'd like to have a code review for the change to fix CR 6768607. >> >> The problem was caused by an infinite loop in rfs4_cbinfo_hold(). >> This thread put a hold on the DBE of a rfs4_deleg_state causing the reaper >> thread to be delayed (forever). Since the deleg_state table was not cleaned, >> this caused the reaper threads of the file and client table to also be >> delayed, >> due to the hold on their DBEs from the deleg_state entries. I added detailed >> analysis in the evaluation section of the CR. >> >> The fix is to limit the number of retries to 5 (5 secs). >> >> webrev: http://cr.opensolaris.org/~dain/6768607/ >> CR: http://monaco.sfbay/detail.jsf?cr=6768607 >> > > Dai, currently rfs4_cbinfo_hold() only returns NULL to its > caller in the case of the callback state != CB_OK, the changes > will introduce a new 2nd error return path for returning NULL without > knowing about the callback state itself from the first blush, > are we sure that it is impossible to return here with a cb_state of CB_OK ? > This change takes a conservative approach, assuming that if the set_clientid_confirm does not arrive within 5 secs (configurable), after the set_clientid, then just actively revokes the delegation and leaves the cbinfo_t as is. The change does not attempt to fix the problem with the callback path that depends on set_clientid and set_clientid_confirm requests, but rather the fix is not to get stuck in this loop causing the problem reported in the CR. > ie. somehow leave a dangling rfs4_cbinfo_t with CB_OK around although we > failed here for the caller. from the first blush it appears that there > shouldn't > be any harm as we'll immediately fail with RPC_FAILED > in rfs4_do_callback(). > Yes, the call was returned with RPC_FAILED by rfs4_do_callback() and the delegation is revoked by rfs4_do_cb_recall(). So I think we should be OK.
> fix looks technically correct to me, although the dance around with the > cb_lock > looks somewhat ... hmm...blinding. > Agree, but it makes the code looks consistent; protecting access to rfs4_cbinfo_t with cb_lock. Since this is a fail path, performance should not be effected under normal condition. > what I'd definitively request to see is extending the comments > preceeding rfs4_cbinfo_hold() to give hints about why our while loop > goes: > > 355 while (cbp->cb_newer.cb_new == TRUE && cbp->cb_nullcaller == > FALSE) { > > and the new bailout happens the other way around with: > > 365 if (cbp->cb_newer.cb_new == FALSE || cbp->cb_nullcaller > == TRUE) > > I'm pretty sure, after this has been putbacked, and 3 or 4 months went over, > nobody can explain that code immediately anymore... ;-) > I will add some comment here, and will update the webrev. Thank you for reviewing the change, -Dai > thanks > frankB > > > > > >