Hi all,

Is the 42s timeout tunable? Should the default be made lower, e.g. 3
seconds?

Thanks.
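If I'm reading Kaushal's reply below right, the 42s is the client
translator's ping timeout, which appears to be settable per volume. A
minimal, untested sketch, assuming the 'puppet' volume from James's
test further down:

# gluster volume set puppet network.ping-timeout 10
# gluster volume info puppet

The second command should show the changed value under "Options
Reconfigured".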
On Tue, Feb 11, 2014 at 3:37 PM, Kaushal M <kshlms...@gmail.com> wrote:
> The 42 second hang is most likely the ping timeout of the client
> translator.
>
> What most likely happened was that the brick on annex3 was being used
> for the read when you pulled its plug. When you pulled the plug, the
> connection between the client and annex3 wasn't gracefully terminated,
> and the client translator still sees the connection as alive. Because
> of this the next fop is also sent to annex3, but it will time out, as
> annex3 is dead. After the timeout happens, the connection is marked as
> dead, and the associated client xlator is marked as down. Since afr
> now knows annex3 is dead, it sends the next fop to annex4, which is
> still alive.
>
> These kinds of unclean connection terminations are only handled by
> request/ping timeouts currently. You could set the ping timeout values
> to be lower, to reduce the detection time.
>
> ~kaushal
>
> On Tue, Feb 11, 2014 at 11:57 AM, Krishnan Parthasarathi
> <kpart...@redhat.com> wrote:
> > James,
> >
> > Could you provide the logs of the mount process, where you see the
> > hang for 42s? My initial guess, seeing 42s, is that the client
> > translator's ping timeout is in play.
> >
> > I would encourage you to report a bug and attach relevant logs.
> > If the issue (observed) turns out to be an acceptable/explicable
> > behavioural quirk of glusterfs, then we could close the bug :-)
> >
> > cheers,
> > Krish
> >
> > ----- Original Message -----
> >> It's been a while since I did some gluster replication testing, so I
> >> spun up a quick cluster *cough, plug* using puppet-gluster+vagrant
> >> (of course) and here are my results.
> >>
> >> * Setup is a 2x2 distributed-replicated cluster
> >> * Hosts are named: annex{1..4}
> >> * Volume name is 'puppet'
> >> * Client VMs mount (FUSE) the volume.
> >>
> >> * On the client:
> >>
> >> # cd /mnt/gluster/puppet/
> >> # dd if=/dev/urandom of=random.51200 count=51200
> >> # sha1sum random.51200
> >> # rsync -v --bwlimit=10 --progress random.51200 root@localhost:/tmp
> >>
> >> * This gives me about an hour to mess with the bricks...
> >> * By looking on the hosts directly, I see that the random.51200 file
> >> is on annex3 and annex4...
> >>
> >> * On annex3:
> >> # poweroff
> >> [host shuts down...]
> >>
> >> * On client1:
> >> # time ls
> >> random.51200
> >>
> >> real 0m42.705s
> >> user 0m0.001s
> >> sys 0m0.002s
> >>
> >> [hangs for about 42 seconds, and then returns successfully...]
> >>
> >> * I then power up annex3, and then pull the plug on annex4. The same
> >> sort of thing happens... It hangs for 42 seconds, but then everything
> >> works as normal. This is of course the cluster timeout value and the
> >> answer to life, the universe and everything.
> >>
> >> Question: Why doesn't glusterfs automatically flip over to using the
> >> other available host right away? If you agree, I'll report this as a
> >> bug. If there's a way to do this, let me know.
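A quick aside on the "looking on the hosts directly" step above: with
a 2x2 distribute-replicate volume the file hashes to exactly one
replica pair, so listing the brick directories on each host shows
which pair to unplug. A rough, untested sketch (the brick path below
is a made-up placeholder; the real paths are listed by 'gluster
volume info puppet'):

# for h in annex{1..4}; do ssh root@$h ls /bricks/puppet/random.51200; done  # placeholder brick path

The two hosts where the ls succeeds hold the replicas.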
> >> Apart from the delay, glad that this is of course still HA ;)
> >>
> >> Cheers,
> >> James
> >> @purpleidea (twitter/irc)
> >> https://ttboj.wordpress.com/

--
Sharuzzaman Ahmat Raslan
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users