On Thu, Feb 1, 2018 at 2:48 PM, Shyam Ranganathan <srang...@redhat.com> wrote:
> On 02/01/2018 08:25 AM, Xavi Hernandez wrote:
> > After having tried several things, it seems that it will be complex
> > to solve these races. All attempts to fix them have caused failures
> > in other connections. Since I have other work to do and it doesn't
> > seem to be causing serious failures in production, I'll leave this
> > for now. I'll retake it when I have more time.
>
> Xavi, can you convert the findings into a bug and post the details
> there, so that it can be followed up? (if not already done)

I've just created this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1541032

> > Xavi
> >
> > On Mon, Jan 29, 2018 at 11:07 PM, Xavi Hernandez <jaher...@redhat.com
> > <mailto:jaher...@redhat.com>> wrote:
> >
> >     Hi all,
> >
> >     I've identified a race in the RPC layer that caused some spurious
> >     disconnections and CHILD_DOWN notifications.
> >
> >     The problem happens when protocol/client reconfigures a connection
> >     to move from glusterd to glusterfsd. This is done by calling
> >     rpc_clnt_reconfig() followed by rpc_transport_disconnect().
> >
> >     This seems fine because client_rpc_notify() will call
> >     rpc_clnt_cleanup_and_start() when the disconnect notification is
> >     received. However, there's a problem.
> >
> >     Suppose that the disconnection notification has been processed and
> >     we are just about to call rpc_clnt_cleanup_and_start(). If at this
> >     point the reconnection timer fires, rpc_clnt_reconnect() will be
> >     executed. This will reconnect the socket, a connection
> >     notification will be processed, and then a handshake request will
> >     be sent to the server.
> >
> >     However, when rpc_clnt_cleanup_and_start() continues, all sent
> >     XIDs are deleted. When we receive the answer to the handshake, we
> >     are unable to map its XID, so the request fails. The handshake
> >     therefore fails and the client is considered down, sending a
> >     CHILD_DOWN notification to the upper xlators.
> >
> >     In some tests this causes processing to start while a brick is
> >     unexpectedly down, producing spurious failures in those tests.
> >
> >     To solve the problem I've forced the rpc_clnt_reconfig() function
> >     to disable the RPC connection, using code similar to
> >     rpc_clnt_disable(). This prevents the background
> >     rpc_clnt_reconnect() timer from being executed, avoiding the
> >     problem.
> >
> >     This seems to work fine for many tests, but it seems to be causing
> >     some issue in gfapi based tests. I'm still investigating this.
> >
> >     Xavi
> >
> >
> >     _______________________________________________
> >     Gluster-devel mailing list
> >     Gluster-devel@gluster.org
> >     http://lists.gluster.org/mailman/listinfo/gluster-devel