On 17 Apr 2013, at 20:40, Benjamin Kaduk wrote:

> On Sat, 13 Apr 2013, Andrew Deason wrote:
>> Can't the race below still happen with this old behavior? Say you retry
>> the call 6 times before timing out (pulling numbers out of the air; I
>> don't remember how many it typically takes). The first 5 result in a
>> BUSY response. On the 6th, the server receives the packet and the call
>> channel is clear, but before it gets an ACK to the client, the client
>> times out the call. And the same thing you describe happens; the server
>> processes the request but the client thinks it failed.
> 
> My quick thought experiment agrees with Andrew that the race is still 
> possible with the old behavior.

Yes, the race is completely possible with the old behaviour. However, with the 
old behaviour the client only cancels a call when it times out. That takes on 
the order of 50 seconds, and (with an RTT of 20ms) requires 10 retries. With 
idle dead disabled, or a longer idle dead timeout, it will take much longer.

With the new behaviour, a single packet call on a fresh channel will be 
cancelled after a single retry (requiring about 60ms).
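
To put rough numbers on the difference, here is a back-of-the-envelope sketch 
using only the figures quoted above (the constants are illustrative, not values 
read out of the Rx sources):

/* Rough comparison of how long the client keeps a call alive before
 * cancelling it, using the figures above. The constants here are
 * illustrative, not values taken from the Rx code. */
#include <stdio.h>

int
main(void)
{
    double rtt_ms = 20.0;                 /* assumed round trip time */

    /* Old behaviour: the call is only cancelled when it times out,
     * roughly 10 retries and ~50 seconds later. */
    double old_window_ms = 50.0 * 1000.0;

    /* New behaviour: a single packet call on a fresh channel is
     * cancelled after one retry, i.e. a few RTTs (~60ms). */
    double new_window_ms = 3.0 * rtt_ms;

    printf("old cancel window: %8.0f ms\n", old_window_ms);
    printf("new cancel window: %8.0f ms\n", new_window_ms);
    printf("window shrinks by a factor of roughly %.0f\n",
           old_window_ms / new_window_ms);
    return 0;
}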

When the server receives a packet on a busy channel, it tries to tear down the 
existing call by setting an RX call error. With the old behaviour we had 50 or 
so seconds for the application thread to notice this error and terminate the 
call. With the new behaviour, the server has 40ms.

So, a race that was extremely unlikely becomes more probable.
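
To make that window concrete on the server side, here is a minimal sketch 
(hypothetical names throughout, not the Rx API) of an application thread that 
only notices a call error between chunks of work:

/* Illustration only (none of these names are Rx interfaces): an
 * application thread services the existing call in chunks and only
 * notices a pending call error between chunks. If the client now gives
 * up ~40ms after the error is set, any chunk longer than that means the
 * server completes work the client has already written off. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int call_error;      /* stands in for the Rx call error */

static void *
busy_channel_packet(void *arg)
{
    (void)arg;
    usleep(10 * 1000);             /* new packet arrives on the busy channel */
    atomic_store(&call_error, -1); /* server asks the old call to tear down */
    return NULL;
}

int
main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, busy_channel_packet, NULL);

    for (int chunk = 0; chunk < 5; chunk++) {
        if (atomic_load(&call_error)) {   /* only checked between chunks */
            printf("noticed error before chunk %d, aborting call\n", chunk);
            break;
        }
        usleep(100 * 1000);        /* 100ms of work: already past a 40ms window */
        printf("finished chunk %d\n", chunk);
    }

    pthread_join(t, NULL);
    return 0;
}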

> It seems rather like we should be invalidating the cache "whenever the client 
> thinks an RPC failed" (but we can be clever about it for RPCs that don't 
> change anything, etc.).


The critical thing is that we should be invalidating the cache whenever an RPC 
that changes server state fails due to a timeout. We've already discussed 
adding this behaviour to deal with the idle dead mess - we just need to add it 
elsewhere.
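
Something roughly this shape on the client, as a sketch only (the names are 
hypothetical, not cache manager interfaces):

/* Hypothetical sketch of the invalidation rule discussed above: if an RPC
 * that can change server state fails with a timeout, we no longer know
 * whether the server applied it, so the cached status for the object must
 * be dropped and refetched. Names are illustrative, not OpenAFS ones. */
#include <stdio.h>

enum rpc_result { RPC_OK, RPC_TIMEOUT, RPC_ERROR };

struct cached_vnode {
    int have_status;        /* do we trust our cached copy of the status? */
};

/* 'mutating' says whether the RPC changes server state (a store, rename,
 * ...) as opposed to a pure fetch. */
static enum rpc_result
do_rpc(struct cached_vnode *vp, int mutating, enum rpc_result simulated)
{
    enum rpc_result code = simulated;   /* stand-in for the real wire call */

    if (code == RPC_TIMEOUT && mutating) {
        /* The server may or may not have processed the call; stop
         * trusting the cache so the next access refetches the status. */
        vp->have_status = 0;
    }
    return code;
}

int
main(void)
{
    struct cached_vnode v = { .have_status = 1 };

    do_rpc(&v, 1, RPC_TIMEOUT);
    printf("after timed-out store, have_status = %d (expect 0)\n",
           v.have_status);
    return 0;
}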

Then there is an additional question - is the hit you take from cache 
invalidation worth terminating a call for? That's going to be a trade-off, but 
I suspect that the current behaviour just doesn't allow a server enough time to 
tidy up.

Cheers,

Simon
