[nfs-discuss] webrev for CR 6768607

Dai Ngo Thu, 12 Mar 2009 11:46:54 -0700

Hi Frank,

Please see reply below:
Frank Batschulat (Home) wrote:
> On Thu, 12 Mar 2009 07:50:14 +0100, Dai Ngo <Dai.Ngo at sun.com> wrote:
>
>   
>> I'd like to have a code review for the change to fix CR 6768607.
>>
>> The problem was caused by an infinite loop in rfs4_cbinfo_hold().
>> This thread put a hold on the DBE of a rfs4_deleg_state causing the reaper
>> thread to be delayed (forever). Since the deleg_state table was not cleaned,
>> this caused the reaper threads of the file and client table to also be
>> delayed,
>> due to the hold on their DBEs from the deleg_state entries. I added detailed
>> analysis in the evaluation section of the CR.
>>
>> The fix is to limit the number of retries to 5 (5 secs).
>>
>> webrev:  http://cr.opensolaris.org/~dain/6768607/
>> CR: http://monaco.sfbay/detail.jsf?cr=6768607
>>     
>
> Dai, currently rfs4_cbinfo_hold() only returns NULL to its
> caller in the case of the callback state != CB_OK, the changes
> will introduce a new 2nd error return path for returning NULL without
> knowing about the callback state itself from the first blush,
> are we sure that it is impossible to return here with a cb_state of CB_OK ?
>   
This change takes a conservative approach, assuming that if the 
set_clientid_confirm does
not arrive within 5 secs  (configurable), after the set_clientid, then 
just actively revokes the
delegation and leaves the cbinfo_t as is. The change does not attempt to 
fix the problem with
the callback path that depends on set_clientid and set_clientid_confirm 
requests, but rather
the fix is not to get stuck in this loop causing the problem reported in 
the CR.
> ie. somehow leave a dangling rfs4_cbinfo_t with CB_OK around although we
> failed here for the caller. from the first blush it appears that there 
> shouldn't
> be any harm as we'll immediately fail with RPC_FAILED
>  in rfs4_do_callback().
>   
Yes, the call was returned with RPC_FAILED by rfs4_do_callback() and the 
delegation
is revoked by rfs4_do_cb_recall(). So I think we should be OK.


> fix looks technically correct to me, although the dance around with the 
> cb_lock
> looks somewhat ... hmm...blinding.
>   
Agree, but it makes the code looks consistent; protecting access to 
rfs4_cbinfo_t with cb_lock.
Since this is a fail path, performance should not be effected under 
normal condition.
> what I'd definitively request to see is extending the comments 
> preceeding rfs4_cbinfo_hold() to give hints about why our while loop
> goes:
>
>  355         while (cbp->cb_newer.cb_new == TRUE && cbp->cb_nullcaller == 
> FALSE) {
>
> and the new bailout happens the other way around with:
>
>  365                 if (cbp->cb_newer.cb_new == FALSE || cbp->cb_nullcaller 
> == TRUE)
>
> I'm pretty sure, after this has been putbacked, and 3 or 4 months went over,
> nobody can explain that code immediately anymore... ;-)
>   
I will add some comment here, and will update the webrev.

Thank you for reviewing the change,

-Dai
> thanks
> frankB
>
>
>
>
>
>

[nfs-discuss] webrev for CR 6768607

Reply via email to