On Thu, 12 Mar 2009 07:50:14 +0100, Dai Ngo <Dai.Ngo at sun.com> wrote:

> I'd like to have a code review for the change to fix CR 6768607.
>
> The problem was caused by an infinite loop in rfs4_cbinfo_hold().
> This thread put a hold on the DBE of a rfs4_deleg_state causing the reaper
> thread to be delayed (forever). Since the deleg_state table was not cleaned,
> this caused the reaper threads of the file and client table to also be
> delayed,
> due to the hold on their DBEs from the deleg_state entries. I added detailed
> analysis in the evaluation section of the CR.
>
> The fix is to limit the number of retries to 5 (5 secs).
>
> webrev:  http://cr.opensolaris.org/~dain/6768607/
> CR: http://monaco.sfbay/detail.jsf?cr=6768607

Dai, currently rfs4_cbinfo_hold() only returns NULL to its
caller in the case of the callback state != CB_OK, the changes
will introduce a new 2nd error return path for returning NULL without
knowing about the callback state itself from the first blush,
are we sure that it is impossible to return here with a cb_state of CB_OK ?

ie. somehow leave a dangling rfs4_cbinfo_t with CB_OK around although we
failed here for the caller. from the first blush it appears that there shouldn't
be any harm as we'll immediately fail with RPC_FAILED in rfs4_do_callback().

fix looks technically correct to me, although the dance around with the cb_lock
looks somewhat ... hmm...blinding.

what I'd definitively request to see is extending the comments 
preceeding rfs4_cbinfo_hold() to give hints about why our while loop
goes:

 355         while (cbp->cb_newer.cb_new == TRUE && cbp->cb_nullcaller == 
FALSE) {

and the new bailout happens the other way around with:

 365                 if (cbp->cb_newer.cb_new == FALSE || cbp->cb_nullcaller == 
TRUE)

I'm pretty sure, after this has been putbacked, and 3 or 4 months went over,
nobody can explain that code immediately anymore... ;-)

thanks
frankB






Reply via email to