Re: [Gluster-devel] Mount hangs because of connection delays

2015-07-02 Thread Ravishankar N



On 07/02/2015 07:04 PM, Pranith Kumar Karampuri wrote:

hi,
When the glusterfs mount process is coming up, all cluster xlators wait 
for at least one event from each of their children before propagating the 
status upwards. Sometimes the client xlator takes up to 2 minutes to 
propagate this event 
(https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Because of 
this, Xavi implemented a timer in ec notify where we treat a child as 
down if it doesn't come up in 10 seconds. A similar patch went up for 
review at http://review.gluster.org/#/c/3 for afr. Kritika raised an 
interesting point in the review: all cluster xlators need to have 
this logic for the mount not to hang, so the correct place to fix it 
would be the client xlator itself, i.e. add the timer logic in the 
client xlator. That seems like a better approach.


I think it makes sense to handle the change only in relevant cluster 
xlators like AFR/EC because of the notion of high availability 
associated with them. In my limited understanding, protocol-client is 
the originator (?) of the child up/down events. While it looks okay to 
allow cluster xlators to take certain decisions because the 'originator' 
did not respond within a specific time, altering the originator itself 
without giving a chance to the upper xlators to make choices seems 
incorrect to me. Perhaps I'm wrong, but setting an unconditional 10 
second timer on protocol/client seems to defeat the purpose of having a 
configurable `network.ping-timeout` volume set option.


Just my two cents. :)


I just want to take inputs from everyone before we go ahead in that 
direction.
The idea is that on PARENT_UP the client xlator starts a timer, and if 
no rpc notification is received within that timeout, it treats the 
client xlator as down.


Pranith


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Mount hangs because of connection delays

2015-07-02 Thread Shyam

Pranith,

I understand the bug and a more generic layer solution would be 
desirable and apt, rather than repeating things at each xlator.


However, I am always confused about notifications and their processing, 
so I cannot state with conviction that this is fine and will work 
elegantly. I will leave it to others to chime in on that.


Shyam

On 07/02/2015 09:34 AM, Pranith Kumar Karampuri wrote:

hi,
 When the glusterfs mount process is coming up, all cluster xlators wait
for at least one event from each of their children before propagating the
status upwards. Sometimes the client xlator takes up to 2 minutes to
propagate this event
(https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Because of
this, Xavi implemented a timer in ec notify where we treat a child as down
if it doesn't come up in 10 seconds. A similar patch went up for review
at http://review.gluster.org/#/c/3 for afr. Kritika raised an
interesting point in the review: all cluster xlators need to have
this logic for the mount not to hang, so the correct place to fix it
would be the client xlator itself, i.e. add the timer logic in the client
xlator. That seems like a better approach. I just want to take inputs
from everyone before we go ahead in that direction.
The idea is that on PARENT_UP the client xlator starts a timer, and if no
rpc notification is received within that timeout, it treats the client
xlator as down.

Pranith



Re: [Gluster-devel] Mount hangs because of connection delays

2015-07-02 Thread Xavier Hernandez

I agree that a generic solution for all cluster xlators would be good.

The only question I have is whether parallel notifications are handled 
specially somewhere.


For example, if client xlator sends EC_CHILD_DOWN after a timeout, it's 
possible that an immediate EC_CHILD_UP is sent if the brick is 
connected. In this case, the cluster xlator could receive both 
notifications in any order (we have multi-threading), which is dangerous 
if EC_CHILD_DOWN is processed after EC_CHILD_UP.


I've seen that protocol/client doesn't send one notification until the 
previous one has been completed. However, this assumes that there won't 
be any xlator that delays the notification (i.e. sends it in the 
background at another moment). Is that a requirement for processing 
notifications? Otherwise, the concurrent notifications problem could 
appear even if protocol/client serializes them.
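
To make the ordering concern concrete, here is a minimal sketch in Python 
(not GlusterFS code) of one possible guard: the sender tags every 
notification with an increasing sequence number and the receiver drops 
anything older than what it has already applied. The class name, the 
sequence number and the "CHILD_UP"/"CHILD_DOWN" strings are all 
assumptions made for illustration; nothing in this thread says the 
xlators work this way.

```python
import threading


class ChildState:
    """Hypothetical receiver that ignores stale notifications."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_seq = -1   # sequence number of the last applied event
        self.up = None        # None means no event has been seen yet

    def notify(self, seq, event):
        """Apply the event unless a newer one was already applied.

        Returns True if the event was applied, False if it was dropped.
        """
        with self._lock:
            if seq <= self._last_seq:
                return False  # arrived late, a newer event already won
            self._last_seq = seq
            self.up = (event == "CHILD_UP")
            return True


# The timeout-driven DOWN (seq 1) is delivered after the real UP (seq 2).
state = ChildState()
state.notify(2, "CHILD_UP")
state.notify(1, "CHILD_DOWN")   # dropped, the child stays up
```

With such a guard the delivery order no longer matters, but every xlator 
that forwards notifications would have to preserve the sequence number, 
which is exactly the kind of requirement I am asking about.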


Xavi

On 07/02/2015 03:34 PM, Pranith Kumar Karampuri wrote:

hi,
 When the glusterfs mount process is coming up, all cluster xlators wait
for at least one event from each of their children before propagating the
status upwards. Sometimes the client xlator takes up to 2 minutes to
propagate this event
(https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Because of
this, Xavi implemented a timer in ec notify where we treat a child as down
if it doesn't come up in 10 seconds. A similar patch went up for review
at http://review.gluster.org/#/c/3 for afr. Kritika raised an
interesting point in the review: all cluster xlators need to have
this logic for the mount not to hang, so the correct place to fix it
would be the client xlator itself, i.e. add the timer logic in the client
xlator. That seems like a better approach. I just want to take inputs
from everyone before we go ahead in that direction.
The idea is that on PARENT_UP the client xlator starts a timer, and if no
rpc notification is received within that timeout, it treats the client
xlator as down.

Pranith



[Gluster-devel] Mount hangs because of connection delays

2015-07-02 Thread Pranith Kumar Karampuri

hi,
When the glusterfs mount process is coming up, all cluster xlators wait 
for at least one event from each of their children before propagating the 
status upwards. Sometimes the client xlator takes up to 2 minutes to 
propagate this event 
(https://bugzilla.redhat.com/show_bug.cgi?id=1054694#c0). Because of 
this, Xavi implemented a timer in ec notify where we treat a child as down 
if it doesn't come up in 10 seconds. A similar patch went up for review 
at http://review.gluster.org/#/c/3 for afr. Kritika raised an 
interesting point in the review: all cluster xlators need to have 
this logic for the mount not to hang, so the correct place to fix it 
would be the client xlator itself, i.e. add the timer logic in the client 
xlator. That seems like a better approach. I just want to take inputs 
from everyone before we go ahead in that direction.
The idea is that on PARENT_UP the client xlator starts a timer, and if no 
rpc notification is received within that timeout, it treats the client 
xlator as down.
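
As a rough illustration of the proposal, here is a small Python model 
(not the actual C xlator code). The class and method names, the 
`notify_parent` callback and the "CHILD_UP"/"CHILD_DOWN" strings are 
placeholders I made up for the sketch; only the 10 second value comes 
from the ec patch mentioned above.

```python
import threading


class ClientStartupTimer:
    """Hypothetical model: report the child as down if no rpc event
    arrives within `timeout` seconds of PARENT_UP."""

    def __init__(self, timeout, notify_parent):
        self.timeout = timeout              # e.g. 10 seconds, as in ec
        self.notify_parent = notify_parent  # callback towards the parent
        self._lock = threading.Lock()
        self._reported = False              # has any event gone upwards?
        self._timer = None

    def parent_up(self):
        # Start the countdown when the graph is activated.
        self._timer = threading.Timer(self.timeout, self._on_timeout)
        self._timer.daemon = True
        self._timer.start()

    def rpc_event(self, event):
        # A real connect/disconnect notification from the rpc layer.
        with self._lock:
            self._reported = True
            if self._timer is not None:
                self._timer.cancel()
            self.notify_parent(event)

    def _on_timeout(self):
        with self._lock:
            if self._reported:
                return                      # a real event arrived first
            self._reported = True
            self.notify_parent("CHILD_DOWN")
```

In this model the lock is held while the parent is notified, so the 
timeout and a real rpc event cannot be delivered concurrently, and a real 
event that arrives after the timeout is still forwarded so the child can 
come up later.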


Pranith