Hi,
I've been debugging a problem with the new RX packet busy code that I thought
was worth discussing more widely.
Various things can cause a client and server to have differing views on the
available call channels. When the client attempts to use a call channel that
the server thinks is in use, the server responds with a BUSY packet.
Originally, the client would just ignore this. It would then look like the
server wasn't responding, and the client would keep retrying on that channel
until either the call timed out, or the channel on the server was freed.
Commit 54fb96d2b6517ae491fd7a7c03246850d29156d5 changed this behaviour so that if a
call is stuck on a busy channel, and there are other free channels on the
connection, we time out the busy call, and allow the client to try again on a
different channel. However, this opens up a potential race which can lead to
cache corruption. The race permits the first attempt at the call to succeed on
the server, but to appear to fail to the client. The client then retries the
call, believing it failed. With operations that modify server state, this
second attempt can fail because the change has already happened. The failure is
returned to the user, and the client's cache isn't updated. So, the AFS cache
no longer reflects the state on the server.
The race is as follows:
Client                                  Server
------                                  ------
Sends 1st pkt of call to server
                                        Receives 1st packet, but channel busy
                                        sets error on old call
                                        sends BUSY packet to client
RTT expires
resends 1st pkt
                                        Old call terminates
Receives BUSY packet
sets call busy flag
                                        Receives resent packet
                                        starts to process call
                                        sends ACK for resent packet
RTT expires
checks busy flag
sets RX_CALL_BUSY error
                                        Sends response packet
Receives ACK packet
discards it as call destroyed
Receives response packet
discards it as call destroyed
This sequence of events gives a call that succeeds on the server, but fails on
the client. The retry then puts the cache into an inconsistent state.
This is exactly the same problem as we encountered with idle dead processing. A
client can't unilaterally terminate an RPC to a server, because it has no way
of knowing whether the RPC succeeded or not. If the client does terminate an
RPC, it needs to invalidate all of the cached state that the RPC may have
touched.
The question is whether just adding more cases where we invalidate the cache is
the right approach, or whether we should reconsider the BUSY behaviour.
Cheers,
Simon
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel