On 07/10/2013 11:38 AM, Frank Sonntag wrote:
> Hi Greg,
> 
> Try using the same server on both machines when mounting, instead of mounting 
> off the local gluster server on both.
> I've used the same approach as you in the past and ran into all kinds of 
> split-brain problems.
> The drawback of course is that mounts will fail if the machine you chose is 
> not available at mount time. It's one of my gripes with gluster that you 
> cannot list more than one server in your mount command.
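> 
> For example, on your setup that would mean both fw1 and fw2 mounting off the 
> same server (192.168.253.1 here, just as an illustration):
> 
> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts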
> 
> Frank

Would not the mount option 'backupvolfile-server=<secondary server>' help
at mount time, in the case of the primary server not being available?
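
Something like this, reusing Greg's addresses purely as an illustration:

mount -t glusterfs -o backupvolfile-server=192.168.253.2 \
    192.168.253.1:/firewall-scripts /firewall-scripts

or the /etc/fstab equivalent:

192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev,backupvolfile-server=192.168.253.2 0 0

Note that this only covers fetching the volume file at mount time; once
mounted, the client connects to all bricks directly.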

- rejy (rmc)


> 
> 
> 
> On 10/07/2013, at 5:26 PM, Greg Scott wrote:
> 
>> Bummer.   Looks like I’m on my own with this one.
>>  
>> -          Greg
>>  
>> From: gluster-users-boun...@gluster.org 
>> [mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
>> Sent: Tuesday, July 09, 2013 12:37 PM
>> To: 'gluster-users@gluster.org'
>> Subject: Re: [Gluster-users] One node goes offline, the other node can't see 
>> the replicated volume anymore
>>  
>> No takers?   I am running the gluster 3.4beta3 that came with Fedora 19.   Is 
>> my issue a consequence of some kind of quorum or split-brain behavior?
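>>  
>> If it is a quorum thing, would something like this turn it off for a test? 
>> I'm guessing at the 3.4 option names here, and I understand setting these to 
>> none gives up split-brain protection, so it would only be a diagnostic step:
>>  
>> gluster volume set firewall-scripts cluster.quorum-type none
>> gluster volume set firewall-scripts cluster.server-quorum-type none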
>>  
>> thanks
>>  
>> -          Greg Scott
>>  
>> From: gluster-users-boun...@gluster.org 
>> [mailto:gluster-users-boun...@gluster.org] On Behalf Of Greg Scott
>> Sent: Monday, July 08, 2013 8:17 PM
>> To: 'gluster-users@gluster.org'
>> Subject: [Gluster-users] One node goes offline, the other node can't see the 
>> replicated volume anymore
>>  
>> I don’t get this.  I have a replicated volume and 2 nodes.  My challenge is, 
>> when I take one node offline, the other node can no longer access the volume 
>> until both nodes are back online again.
>>  
>> Details:
>>  
>> I have 2 nodes, fw1 and fw2.   Each node has an XFS file system, 
>> /gluster-fw1 on node fw1 and /gluster-fw2 on node fw2.   Node fw1 is at IP 
>> address 192.168.253.1.  Node fw2 is at 192.168.253.2. 
>>  
>> I create a gluster volume named firewall-scripts which is a replica of those 
>> two XFS file systems.  The volume holds a bunch of config files common to 
>> both fw1 and fw2.  The application is an active/standby pair of firewalls 
>> and the idea is to keep config files in a gluster volume.
>>  
>> When both nodes are online, everything works as expected.  But when I take 
>> one node offline, the surviving node behaves badly.  For example, with fw1 
>> offline, node fw2 shows:
>>  
>> [root@chicago-fw2 ~]# ls /firewall-scripts
>> ls: cannot access /firewall-scripts: Transport endpoint is not connected
>>  
>> And when I bring the offline node back online, node fw2 eventually behaves 
>> normally again. 
>>  
>> What’s up with that?  Gluster is supposed to be resilient and self-healing 
>> and able to stand up to this sort of abuse.  So I must be doing something 
>> wrong. 
>>  
>> Here is how I set up everything – it doesn't get much simpler than this, and 
>> my setup is right out of the Getting Started Guide, but using my own names. 
>>  
>> Here are the steps I followed, all from fw1:
>>  
>> gluster peer probe 192.168.253.2
>> gluster peer status
>>  
>> Create and start the volume:
>>  
>> gluster volume create firewall-scripts replica 2 transport tcp 
>> 192.168.253.1:/gluster-fw1 192.168.253.2:/gluster-fw2
>> gluster volume start firewall-scripts
>>  
>> On fw1:
>>  
>> mkdir /firewall-scripts
>> mount -t glusterfs 192.168.253.1:/firewall-scripts /firewall-scripts
>>  
>> and add this line to /etc/fstab:
>> 192.168.253.1:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 
>> 0 0
>>  
>> on fw2:
>>  
>> mkdir /firewall-scripts
>> mount -t glusterfs 192.168.253.2:/firewall-scripts /firewall-scripts
>>  
>> and add this line to /etc/fstab:
>> 192.168.253.2:/firewall-scripts /firewall-scripts glusterfs defaults,_netdev 
>> 0 0
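>>  
>> To sanity-check the mount on each node:
>>  
>> df -hT /firewall-scripts
>>  
>> which should show a fuse.glusterfs file system.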
>>  
>> That’s it.  That’s the whole setup.  When both nodes are online, everything 
>> replicates beautifully.  But take one node offline and it all falls apart. 
>>  
>> Here is the output from gluster volume info, identical on both nodes:
>>  
>> [root@chicago-fw1 etc]# gluster volume info
>>  
>> Volume Name: firewall-scripts
>> Type: Replicate
>> Volume ID: 239b6401-e873-449d-a2d3-1eb2f65a1d4c
>> Status: Started
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 192.168.253.1:/gluster-fw1
>> Brick2: 192.168.253.2:/gluster-fw2
>> [root@chicago-fw1 etc]#
>>  
>> Looking at /var/log/glusterfs/firewall-scripts.log on fw2, I see errors like 
>> this every couple of seconds:
>>  
>> [2013-07-09 00:59:04.706390] I [afr-common.c:3856:afr_local_init] 
>> 0-firewall-scripts-replicate-0: no subvolumes up
>> [2013-07-09 00:59:04.706515] W [fuse-bridge.c:1132:fuse_err_cbk] 
>> 0-glusterfs-fuse: 3160: FLUSH() ERR => -1 (Transport endpoint is not 
>> connected)
>>  
>> And then when I bring fw1 back online, I see these messages on fw2:
>>  
>> [2013-07-09 01:01:35.006782] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 
>> 0-firewall-scripts-client-0: changing port to 49152 (from 0)
>> [2013-07-09 01:01:35.006932] W [socket.c:514:__socket_rwv] 
>> 0-firewall-scripts-client-0: readv failed (No data available)
>> [2013-07-09 01:01:35.018546] I 
>> [client-handshake.c:1658:select_server_supported_programs] 
>> 0-firewall-scripts-client-0: Using Program GlusterFS 3.3, Num (1298437), 
>> Version (330)
>> [2013-07-09 01:01:35.019273] I 
>> [client-handshake.c:1456:client_setvolume_cbk] 0-firewall-scripts-client-0: 
>> Connected to 192.168.253.1:49152, attached to remote volume '/gluster-fw1'.
>> [2013-07-09 01:01:35.019356] I 
>> [client-handshake.c:1468:client_setvolume_cbk] 0-firewall-scripts-client-0: 
>> Server and Client lk-version numbers are not same, reopening the fds
>> [2013-07-09 01:01:35.019441] I 
>> [client-handshake.c:1308:client_post_handshake] 0-firewall-scripts-client-0: 
>> 1 fds open - Delaying child_up until they are re-opened
>> [2013-07-09 01:01:35.020070] I 
>> [client-handshake.c:930:client_child_up_reopen_done] 
>> 0-firewall-scripts-client-0: last fd open'd/lock-self-heal'd - notifying 
>> CHILD-UP
>> [2013-07-09 01:01:35.020282] I [afr-common.c:3698:afr_notify] 
>> 0-firewall-scripts-replicate-0: Subvolume 'firewall-scripts-client-0' came 
>> back up; going online.
>> [2013-07-09 01:01:35.020616] I 
>> [client-handshake.c:450:client_set_lk_version_cbk] 
>> 0-firewall-scripts-client-0: Server lk version = 1
>>  
>> So how do I make glusterfs survive a node failure, which is the whole point 
>> of all this?
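>>  
>> In case it helps, here are commands I can run and post output from while one 
>> node is down:
>>  
>> gluster volume status firewall-scripts
>> gluster volume heal firewall-scripts info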
>>  
>> thanks
>>  
>> -          Greg Scott
>>  
>>  

_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users
