Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-11 Thread James
On Tue, Feb 11, 2014 at 9:43 AM, David Bierce wrote:
> When the timeout is reached for the failed brick, it does have to recreate handles for all the files in the volume, which is apparently quite an expensive operation. In our environment, with only 100s of files, this has been li

Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-11 Thread James
Thanks to everyone for their replies...
On Tue, Feb 11, 2014 at 2:37 AM, Kaushal M wrote:
> The 42 second hang is most likely the ping timeout of the client translator.
Indeed I think it is...
> What most likely happened was that, the brick on annex3 was being used for the read when you pull

Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-11 Thread David Bierce
Isn’t that working as intended? I’ve had the FUSE client fail over just fine in testing and in production when a memory error caused a kernel panic. That timeout is tunable, but when a brick in the cluster goes down, writes by the client are suspended until the timeout is reached. In our environ
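For readers wondering how the timeout mentioned above is tuned: the thread does not name the setting, but the 42 second default matches the GlusterFS volume option network.ping-timeout. A minimal sketch, assuming that is the option in play; the helper name ping_timeout_cmd and the volume name myvol are hypothetical, and the function only prints the command rather than running it:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): prints the gluster CLI command
# that would change the client ping timeout on a volume. It does not run it.
# Assumption: the 42 s hang corresponds to the network.ping-timeout option.
ping_timeout_cmd() {
    volume="$1"
    seconds="$2"
    # Reject empty or non-numeric values before building the command.
    case "$seconds" in
        ''|*[!0-9]*)
            echo "seconds must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'gluster volume set %s network.ping-timeout %s\n' "$volume" "$seconds"
}

# "myvol" is a placeholder volume name; 42 is the default discussed in the thread.
ping_timeout_cmd myvol 42
```

Printing the command instead of executing it keeps the sketch safe to try on a machine without a cluster. As noted elsewhere in the thread, a very low value may cause trouble under heavy I/O.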

Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-11 Thread Sharuzzaman Ahmat Raslan
Hi all,
Is the 42s timeout tunable? Should the default be made lower, e.g. 3 seconds?
Thanks.
On Tue, Feb 11, 2014 at 3:37 PM, Kaushal M wrote:
> The 42 second hang is most likely the ping timeout of the client translator.
> What most likely happened was that, the brick on annex3 was bei

Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-11 Thread haiwei.xie-soulinfo
It's an interesting problem. After 42s, your client will be aware that some bricks are offline, and I/O will continue. If your app's timeout is too short, errors will occur. If the ping timeout is too low, it may cause trouble in a heavy I/O environment.
On Tue, 11 Feb 2014 13:07:36 +0530 Kaushal M wrote:
> The

Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-10 Thread Kaushal M
The 42 second hang is most likely the ping timeout of the client translator. What most likely happened was that, the brick on annex3 was being used for the read when you pulled its plug. When you pulled the plug, the connection between the client and annex3 isn't gracefully terminated and the clie

Re: [Gluster-users] [Gluster-devel] Testing replication and HA

2014-02-10 Thread Krishnan Parthasarathi
James, Could you provide the logs of the mount process, where you see the hang for 42s? My initial guess, seeing 42s, is that the client translator's ping timeout is in play. I would encourage you to report a bug and attach relevant logs. If the issue (observed) turns out to be an acceptable/exp