Hi,
I've been debugging a problem with the new RX packet busy code that I thought
was worth discussing more widely.
Various things can cause a client and server to have differing views on the
available call channels. When the client attempts to use a call channel that
the server thinks is in use, the server responds with a BUSY packet.
Originally, the client would just ignore this. It would then look like the
server wasn't responding, and the client would keep retrying on that channel
until either the call timed out, or the channel on the server was freed.
Commit 54fb96d2b6517ae491fd7a7c03246850d29156d5 changed this behaviour so that if a
call is stuck on a busy channel, and there are other free channels on the
connection, we time out the busy call, and allow the client to try again on a
different channel. However, this opens up a potential race which can lead to
cache corruption. The race permits the first attempt at the call to succeed on
the server, but to appear to fail to the client. The client then retries the
call, believing it failed. With operations that modify server state, this
second attempt can fail because the change has already happened. The failure is
returned to the user, and the client's cache isn't updated. So, the AFS cache
no longer reflects the state on the server.
The race is as follows:
Client                                  Server
------                                  ------
Sends 1st pkt of call to server
                                        Receives 1st packet, but channel busy
                                        sets error on old call
                                        sends BUSY packet to client
RTT expires
resends 1st pkt
                                        Old call terminates
Receives BUSY packet
sets call busy flag
                                        Receives resent packet
                                        starts to process call
                                        sends ACK for resent packet
RTT expires
checks busy flag
sets RX_CALL_BUSY error
                                        Sends response packet
Receives ACK packet
discards it as call destroyed
Receives response packet
discards it as call destroyed
This sequence of events gives a call that succeeds on the server, but fails on
the client. The retry then puts the cache into an inconsistent state.
This is exactly the same problem as we encountered with idle dead processing. A
client can't unilaterally terminate an RPC to a server, because it has no way
of knowing whether the RPC succeeded or not. If the client does terminate an
RPC, it needs to invalidate all of the cached state that the RPC may have
touched.
The question is whether just adding more cases where we invalidate the cache is
the right approach, or whether we should reconsider the BUSY behaviour.
Cheers,
Simon
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel