Re: [Gluster-devel] Mount hangs because of connection delays
On 07/02/2015 07:04 PM, Pranith Kumar Karampuri wrote:

> Hi,
>
> When the glusterfs mount process is coming up, all cluster xlators wait for at least one event from all the children before propagating the status upwards. Sometimes the client xlator takes up to 2 minutes to propagate this event (https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Due to this, Xavi implemented a timer in ec notify where we treat a child as down if it doesn't come up in 10 seconds. A similar patch went up for review at http://review.gluster.org/#/c/3 for afr. Kritika raised an interesting point in the review: all cluster xlators need to have this logic for the mount to not hang, and the correct place to fix it would be the client xlator itself, i.e. add the timer logic in the client xlator. That seems like a better approach.

I think it makes sense to handle the change only in the relevant cluster xlators like AFR/EC, because of the notion of high availability associated with them. In my limited understanding, protocol-client is the originator (?) of the child up/down events. While it looks okay to allow cluster xlators to take certain decisions because the 'originator' did not respond within a specific time, altering the originator itself without giving the upper xlators a chance to make choices seems incorrect to me. Perhaps I'm wrong, but setting an unconditional 10 second timer on protocol/client seems to defeat the purpose of having a configurable `network.ping-timeout` volume set option. Just my two cents. :)

> I just want to take inputs from everyone before we go ahead in that direction, i.e. on PARENT_UP the client xlator will start a timer, and if no rpc notification is received within that timeout it treats the client xlator as down.
>
> Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
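To make the cluster-xlator-side approach concrete (the timer in ec notify, where a child that stays silent for 10 seconds is treated as down), here is a minimal Python sketch. It is only a model of the idea, not GlusterFS code: the class `ClusterStartup`, its methods, and the injected `clock` are all hypothetical names, and the "up if any child is up" rule is a simplification that ignores the quorum logic real AFR/EC would apply.

```python
import time

CHILD_UP, CHILD_DOWN = "CHILD_UP", "CHILD_DOWN"


class ClusterStartup:
    """Hypothetical model of a cluster xlator waiting for its children at mount time."""

    def __init__(self, children, timeout=10.0, clock=time.monotonic):
        self.clock = clock
        self.deadline = clock() + timeout
        # None means "no event received from this child yet".
        self.status = {child: None for child in children}

    def notify(self, child, event):
        # Record a real event reported by a child.
        self.status[child] = event

    def poll(self):
        """Return the status to propagate upwards, or None to keep waiting."""
        if self.clock() >= self.deadline:
            # Children that stayed silent past the deadline are treated as down.
            for child in list(self.status):
                if self.status[child] is None:
                    self.status[child] = CHILD_DOWN
        if any(event is None for event in self.status.values()):
            return None
        # Simplification: report up if at least one child is up.
        any_up = any(event == CHILD_UP for event in self.status.values())
        return CHILD_UP if any_up else CHILD_DOWN
```

Because the timeout is a constructor argument here, each cluster xlator could in principle pick its own value, which is the kind of choice the reply above argues should stay with the upper xlators.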
Re: [Gluster-devel] Mount hangs because of connection delays
Pranith,

I understand the bug, and a solution in a more generic layer would be desirable and apt, rather than repeating the same logic in each xlator. However, I am always confused about notifications and their processing, so I cannot state with conviction that this is fine and will work elegantly. I will leave it to others to chime in on that.

Shyam

In reply to Pranith Kumar Karampuri's message of 07/02/2015 09:34 AM.
Re: [Gluster-devel] Mount hangs because of connection delays
I agree that a generic solution for all cluster xlators would be good. The only question I have is whether parallel notifications are specially handled somewhere.

For example, if the client xlator sends EC_CHILD_DOWN after a timeout, it's possible that an EC_CHILD_UP is sent immediately afterwards if the brick gets connected. In this case, the cluster xlator could receive both notifications in any order (we have multi-threading), which is dangerous if EC_CHILD_DOWN is processed after EC_CHILD_UP.

I've seen that protocol/client doesn't send a notification until the previous one has completed. However, this assumes that no xlator delays the notification (i.e. sends it in the background at a later moment). Is that a requirement for processing notifications? Otherwise, the concurrent notifications problem could appear even if protocol/client serializes them.

Xavi

In reply to Pranith Kumar Karampuri's message of 07/02/2015 03:34 PM.
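One possible way to guard against the reordering described in Xavi's message (a timeout-generated down event being processed after a real up event) is sketched below in Python. It assumes, hypothetically, that the sender stamps every notification with a monotonically increasing sequence number; nothing in this thread says GlusterFS notifications carry such a field, and `ChildTracker` and `seq` are illustrative names only.

```python
import threading

CHILD_UP, CHILD_DOWN = "CHILD_UP", "CHILD_DOWN"


class ChildTracker:
    """Hypothetical receiver-side guard that drops stale, reordered events."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_seq = -1   # highest sequence number applied so far
        self.state = None     # last applied event, None until the first one

    def notify(self, seq, event):
        """Apply the event unless a newer one was already applied.

        Returns True if the event was applied, False if it was stale.
        """
        with self._lock:
            if seq <= self._last_seq:
                # An event sent later has already been processed; ignore this one.
                return False
            self._last_seq = seq
            self.state = event
            return True
```

With this guard, a down event sent first but delivered second is ignored, so the child stays up even if an intermediate xlator delays the earlier notification.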
[Gluster-devel] Mount hangs because of connection delays
Hi,

When the glusterfs mount process is coming up, all cluster xlators wait for at least one event from all the children before propagating the status upwards. Sometimes the client xlator takes up to 2 minutes to propagate this event (https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Due to this, Xavi implemented a timer in ec notify where we treat a child as down if it doesn't come up in 10 seconds. A similar patch went up for review at http://review.gluster.org/#/c/3 for afr.

Kritika raised an interesting point in the review: all cluster xlators need to have this logic for the mount to not hang, and the correct place to fix it would be the client xlator itself, i.e. add the timer logic in the client xlator. That seems like a better approach.

I just want to take inputs from everyone before we go ahead in that direction, i.e. on PARENT_UP the client xlator will start a timer, and if no rpc notification is received within that timeout it treats the client xlator as down.

Pranith
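The client-side proposal (start a timer on PARENT_UP and report the client as down if no rpc notification arrives in time) could look roughly like the Python sketch below. This is a model under stated assumptions, not the protocol/client implementation: `ClientStartupTimer`, `notify_parent`, and the method names are hypothetical, and the 10 second default simply mirrors the value mentioned for the ec timer.

```python
import threading

CHILD_UP, CHILD_DOWN = "CHILD_UP", "CHILD_DOWN"


class ClientStartupTimer:
    """Hypothetical model of a client xlator with a startup timer."""

    def __init__(self, notify_parent, timeout=10.0):
        self._notify_parent = notify_parent  # callback towards the parent xlator
        self._timeout = timeout
        self._lock = threading.Lock()
        self._reported = False               # True once any status has gone upwards
        self._timer = None

    def parent_up(self):
        # On PARENT_UP, start the timer that bounds how long the parent waits.
        self._timer = threading.Timer(self._timeout, self._on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def rpc_notification(self, event):
        # A real connection event arrived: cancel the timer and forward it.
        with self._lock:
            self._reported = True
            if self._timer is not None:
                self._timer.cancel()
            self._notify_parent(event)

    def _on_timeout(self):
        # No rpc notification within the timeout: report this child as down,
        # unless a real event has already been forwarded.
        with self._lock:
            if self._reported:
                return
            self._reported = True
            self._notify_parent(CHILD_DOWN)
```

Both paths notify the parent while holding the same lock, so a timeout-generated down event and a real event cannot leave this object concurrently. The sketch does not address reordering by xlators further up the graph.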