No takers?   I am running gluster 3.4beta3 that came with Fedora 19.   Is my 
issue a consequence of some kind of quorum split-brain thing?

thanks


- Greg Scott

From: gluster-users-boun...@gluster.org 
[mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
Sent: Monday, July 08, 2013 8:17 PM
To: 'gluster-users@gluster.org'
Subject: [Gluster-users] One node goes offline, the other node can't see the 
replicated volume anymore

I don't get this.  I have a replicated volume and 2 nodes.  My challenge is, 
when I take one node offline, the other node can no longer access the volume 
until both nodes are back online again.

Details:

I have 2 nodes, fw1 and fw2.  Each node has an XFS file system, /gluster-fw1 
on node fw1 and /gluster-fw2 on node fw2.  Node fw1 is at IP address 
192.168.253.1.  Node fw2 is at 192.168.253.2.

I create a gluster volume named firewall-scripts which is a replica of those 
two XFS file systems.  The volume holds a bunch of config files common to both 
fw1 and fw2.  The application is an active/standby pair of firewalls and the 
idea is to keep config files in a gluster volume.

When both nodes are online, everything works as expected.  But when I take 
either node offline, node fw2 behaves badly:

[root@chicago-fw2 ~]# ls /firewall-scripts
ls: cannot access /firewall-scripts: Transport endpoint is not connected

And when I bring the offline node back online, node fw2 eventually behaves 
normally again.

What's up with that?  Gluster is supposed to be resilient and self-healing and 
able to stand up to this sort of abuse.  So I must be doing something wrong.

Here is how I set up everything - it doesn't get much simpler than this and my 
setup is right out of the Getting Started Guide but using my own names.

Here are the steps I followed, all from fw1:

gluster peer probe 192.168.253.2
gluster peer status

Create and start the volume:

gluster volume create firewall-scripts replica 2 transport tcp \
    192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
gluster volume start firewall-scripts

On fw1:

mkdir /firewall-scripts
mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts

and add this line to /etc/fstab:
192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

on fw2:

mkdir /firewall-scripts
mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts

and add this line to /etc/fstab:
192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 0 0

That's it.  That's the whole setup.  When both nodes are online, everything 
replicates beautifully.  But take one node offline and it all falls apart.

Here is the output from gluster volume info, identical on both nodes:

[root@chicago-fw1 etc]# gluster volume info

Volume Name: firewall-scripts
Type: Replicate
Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.253.1:/gluster-fw1
Brick2: 192.168.253.2:/gluster-fw2
[root@chicago-fw1 etc]#
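
While one node is offline, it may help to ask glusterd on the surviving node what it can still see. This is a minimal sketch, assuming the volume name above. The helper name check_replica is hypothetical, and it assumes `gluster volume status` is available in this release:

```shell
#!/bin/sh
# check_replica is a hypothetical helper name, not part of gluster.
# It prints the peer state and the brick/process state for one volume
# as seen from the node it runs on.
check_replica() {
    vol="${1:-firewall-scripts}"   # volume name from this post
    gluster peer status            # is the other node connected?
    gluster volume status "$vol"   # which bricks are online?
}
```

Running `check_replica firewall-scripts` on fw2 with fw1 down would show whether fw2's own brick is still listed as online.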

Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see errors like 
this every couple of seconds:

[2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 0-firewall-scripts-replicate-0: no subvolumes up
[2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not connected)

And then when I bring fw1 back online, I see these messages on fw2:

[2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-firewall-scripts-client-0: changing port to 49152 (from 0)
[2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv] 0-firewall-scripts-client-0: readv failed (No data available)
[2013-07-09 01:01:35.018546] I [client-handshake.c:1658:select_server_supported_programs] 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-07-09 01:01:35.019273] I [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
[2013-07-09 01:01:35.019356] I [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2013-07-09 01:01:35.019441] I [client-handshake.c:1308:client_post_handshake] 0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they are re-opened
[2013-07-09 01:01:35.020070] I [client-handshake.c:930:client_child_up_reopen_done] 0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify] 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came back up; going online.
[2013-07-09 01:01:35.020616] I [client-handshake.c:450:client_set_lk_version_cbk] 0-firewall-scripts-client-0: Server lk version = 1
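
To line up the outage with the moment a node went down, the replicate up/down messages can be pulled out of the client log. This is a rough sketch, assuming the log path mentioned above and that each entry sits on a single line in the file on disk. The helper name afr_events is hypothetical, and it matches only the two message texts that appear in this post:

```shell
#!/bin/sh
# afr_events is a hypothetical helper name, not part of gluster.
# It prints the timestamp of each change between "no subvolumes up" (DOWN)
# and "came back up" (UP), collapsing repeats of the same state.
afr_events() {
    log="${1:-/var/log/glusterfs/firewall-scripts.log}"
    grep -E 'no subvolumes up|came back up' "$log" |
    awk '{
        state = ($0 ~ /no subvolumes up/) ? "DOWN" : "UP"
        if (state != last) {
            # fields 1 and 2 hold the bracketed date and time
            print $1, $2, state
            last = state
        }
    }'
}
```

Calling `afr_events /var/log/glusterfs/firewall-scripts.log` on fw2 prints one line per state change, which can then be compared against the time the other node was taken offline.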

So how do I make glusterfs survive a node failure, which is the whole point of 
all this?

thanks

- Greg Scott


