On 10/07/2013, at 7:59 PM, Rejy M Cyriac wrote:

> On 07/10/2013 11:38 AM, Frank Sonntag wrote:
>> Hi Greg,
>> 
>> Try using the same server on both machines when mounting, instead of 
>> mounting off the local gluster server on both.
>> I've used the same approach as you in the past and ran into all kinds of 
>> split-brain problems.
>> The drawback of course is that mounts will fail if the machine you chose is 
>> not available at mount time. It's one of my gripes with gluster that you 
>> cannot list more than one server in your mount command.
>> 
>> Frank
> 
> Would not the mount option 'backupvolfile-server=<secondary server>' help
> at mount time, in the case of the primary server not being available?
> 
> - rejy (rmc)
I am still on 3.2, which does not have that option (as far as I know).
But thanks for bringing this up. Useful to know.
And the OP can make use of it, of course.
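For reference, with that option the mount would look something like this 
(a sketch only; I have not tested it myself, being on 3.2):

mount -t glusterfs -o backupvolfile-server=192.168.253.2 192.168.253.1:/firewall-scripts /firewall-scripts

or the equivalent /etc/fstab line:

192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev,backupvolfile-server=192.168.253.2 0 0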


Frank




> 
>> 
>> 
>> 
>> On 10/07/2013, at 5:26 PM, Greg Scott wrote:
>> 
>>> Bummer.   Looks like I’m on my own with this one.
>>> 
>>> -          Greg
>>> 
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Greg Scott
>>> Sent: Tuesday, July 09, 2013 12:37 PM
>>> To: '[email protected]'
>>> Subject: Re: [Gluster-users] One node goes offline, the other node can't 
>>> see the replicated volume anymore
>>> 
>>> No takers?   I am running gluster 3.4beta3 that came with Fedora 19.   Is 
>>> my issue a consequence of some kind of quorum split-brain thing?
>>> 
>>> thanks
>>> 
>>> -          Greg Scott
>>> 
>>> From: [email protected] 
>>> [mailto:[email protected]] On Behalf Of Greg Scott
>>> Sent: Monday, July 08, 2013 8:17 PM
>>> To: '[email protected]'
>>> Subject: [Gluster-users] One node goes offline, the other node can't see 
>>> the replicated volume anymore
>>> 
>>> I don’t get this.  I have a replicated volume and 2 nodes.  My challenge 
>>> is, when I take one node offline, the other node can no longer access the 
>>> volume until both nodes are back online again.
>>> 
>>> Details:
>>> 
>>> I have 2 nodes, fw1 and fw2.   Each node has an XFS file system, 
>>> /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2.   Node fw1 is at IP 
>>> Address 192.168.253.1.  Node fw2 is at 192.168.253.2. 
>>> 
>>> I create a gluster volume named firewall-scripts which is a replica of 
>>> those two XFS file systems.  The volume holds a bunch of config files 
>>> common to both fw1 and fw2.  The application is an active/standby pair of 
>>> firewalls and the idea is to keep config files in a gluster volume.
>>> 
>>> When both nodes are online, everything works as expected.  But when I take 
>>> either node offline, node fw2 behaves badly:
>>> 
>>> [root@chicago-fw2 ~]# ls /firewall-scripts
>>> ls: cannot access /firewall-scripts: Transport endpoint is not connected
>>> 
>>> And when I bring the offline node back online, node fw2 eventually behaves 
>>> normally again. 
>>> 
>>> What’s up with that?  Gluster is supposed to be resilient and self-healing 
>>> and able to stand up to this sort of abuse.  So I must be doing something 
>>> wrong. 
>>> 
>>> Here is how I set up everything – it doesn’t get much simpler than this, and 
>>> my setup is right out of the Getting Started Guide but using my own names. 
>>> 
>>> Here are the steps I followed, all from fw1:
>>> 
>>> gluster peer probe 192.168.253.2
>>> gluster peer status
>>> 
>>> Create and start the volume:
>>> 
>>> gluster volume create firewall-scripts replica 2 transport tcp 
>>> 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
>>> gluster volume start firewall-scripts
>>> 
>>> On fw1:
>>> 
>>> mkdir /firewall-scripts
>>> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
>>> 
>>> and add this line to /etc/fstab:
>>> 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs 
>>> defaults,_netdev 0 0
>>> 
>>> on fw2:
>>> 
>>> mkdir /firewall-scripts
>>> mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
>>> 
>>> and add this line to /etc/fstab:
>>> 192.168.253.2:/firewall-scripts /firewall-scripts glusterfs 
>>> defaults,_netdev 0 0
>>> 
>>> That’s it.  That’s the whole setup.  When both nodes are online, everything 
>>> replicates beautifully.  But take one node offline and it all falls apart. 
>>> 
>>> Here is the output from gluster volume info, identical on both nodes:
>>> 
>>> [root@chicago-fw1 etc]# gluster volume info
>>> 
>>> Volume Name: firewall-scripts
>>> Type: Replicate
>>> Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
>>> Status: Started
>>> Number of Bricks: 1 x 2 = 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: 192.168.253.1:/gluster-fw1
>>> Brick2: 192.168.253.2:/gluster-fw2
>>> [root@chicago-fw1 etc]#
>>> 
>>> Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see errors 
>>> like this every couple of seconds:
>>> 
>>> [2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
>>> 0-firewall-scripts-replicate-0: no subvolumes up
>>> [2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 
>>> 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not 
>>> connected)
>>> 
>>> And then when I bring fw1 back online, I see these messages on fw2:
>>> 
>>> [2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 
>>> 0-firewall-scripts-client-0: changing port to 49152 (from 0)
>>> [2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv] 
>>> 0-firewall-scripts-client-0: readv failed (No data available)
>>> [2013-07-09 01:01:35.018546] I 
>>> [client-handshake.c:1658:select_server_supported_programs] 
>>> 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), 
>>> Version (330)
>>> [2013-07-09 01:01:35.019273] I 
>>> [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: 
>>> Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
>>> [2013-07-09 01:01:35.019356] I 
>>> [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: 
>>> Server and Client lk-version numbers are not same, reopening the fds
>>> [2013-07-09 01:01:35.019441] I 
>>> [client-handshake.c:1308:client_post_handshake] 
>>> 0-firewall-scripts-client-0: 1 fds open - Delaying child_up until they are 
>>> re-opened
>>> [2013-07-09 01:01:35.020070] I 
>>> [client-handshake.c:930:client_child_up_reopen_done] 
>>> 0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd - notifying 
>>> CHILD-UP
>>> [2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify] 
>>> 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came 
>>> back up; going online.
>>> [2013-07-09 01:01:35.020616] I 
>>> [client-handshake.c:450:client_set_lk_version_cbk] 
>>> 0-firewall-scripts-client-0: Server lk version = 1
>>> 
>>> So how do I make glusterfs survive a node failure, which is the whole point 
>>> of all this?
>>> 
>>> thanks
>>> 
>>> -          Greg Scott
>>> 
>>> 
>>> _______________________________________________
>>> Gluster-users mailing list
>>> [email protected]
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>> 
