[Gluster-users] HA problems

Christopher Hawkins Fri, 07 May 2010 10:03:38 -0700

Hello, I have a problem that was previously solved. In a simple setup with 
two servers and one client, I had the client connect to a virtual IP that 
could fail back and forth to whichever server was available. This used to 
work, but I had not tested it since 2.0.9 until today. Now, instead of 
recovering after a brief timeout, the client never recovers and reports 
endless "Stale NFS File handle" errors in its log (though there is no NFS 
involved, just the native gluster client).


So I tried the HA translator from testing. This also does not work. After I 
kill the primary server (listed first in the config file), an ls of the mount 
point hangs for a moment and then reports:
 
[r...@server2 glusterfs]# ls /mnt/test
ls: /mnt/test: Input/output error

Each attempted ls also produces two errors in the client log: a "Transport 
endpoint is not connected" error followed by the "Input/output error". 


The client log shows this:

[2010-05-07 12:03:44] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] 
master2_root: Connected to 192.168.1.92:3399, attached to remote volume 
'threads1'.
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] 
master_root: Connected to 192.168.1.91:3399, attached to remote volume 
'threads1'.
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] 
master2_root: Connected to 192.168.1.92:3399, attached to remote volume 
'threads1'.
[2010-05-07 12:03:44] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse: FUSE 
inited with protocol versions: glusterfs 7.13 kernel 7.10
[2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] 
master_root: Connected to 192.168.1.91:3399, attached to remote volume 
'threads1'.


[....here I killed the primary server....]

[2010-05-07 12:06:17] E [client-protocol.c:415:client_ping_timer_expired] 
master_root: Server 192.168.1.91:3399 has not responded in the last 42 seconds, 
disconnecting.
[2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: 
forced unwinding frame type(1) op(LOOKUP)
[2010-05-07 12:06:17] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) 
(op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:17] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 10: 
LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: 
forced unwinding frame type(2) op(PING)
[2010-05-07 12:06:17] N [client-protocol.c:6994:notify] master_root: 
disconnected
[2010-05-07 12:06:17] E [socket.c:762:socket_connect_finish] master_root: 
connection to 192.168.1.91:3399 failed (No route to host)
[2010-05-07 12:06:21] E [socket.c:762:socket_connect_finish] master_root: 
connection to 192.168.1.91:3399 failed (No route to host)
[2010-05-07 12:06:21] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) 
(op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:21] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 11: 
LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:24] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) 
(op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:24] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 12: 
LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:26] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) 
(op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:26] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 13: 
LOOKUP() / => -1 (Input/output error)
[2010-05-07 12:06:39] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) 
(op_ret=-1 op_errno=Transport endpoint is not connected)
[2010-05-07 12:06:39] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 14: 
LOOKUP() / => -1 (Input/output error)

[.... here I powered the primary server back on....] 

[2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] 
master_root: Connected to 192.168.1.91:3399, attached to remote volume 
'threads1'.
[2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] 
master_root: Connected to 192.168.1.91:3399, attached to remote volume 
'threads1'.
--------- end log ------------

After it came back, the client recovered and everything picked back up. But 
it seems I cannot get the client to consider any server other than the first 
one it connects to. I assume that if failing the primary server's IP address 
over to another box doesn't work, then round-robin DNS will not work either, 
since they are essentially the same method (a different server with the same 
address). And since this used to work, this seems to be an unintended result. 

The server vol file has a single export and io-threads, and the client has just 
the two remote-subvolumes and the ha declaration like so:

volume ha
   type cluster/ha
   subvolumes master_root master2_root
end-volume 
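
For reference, the two client volumes named in the subvolumes line would look 
roughly like the sketch below. This is reconstructed from the log above, not 
copied from the actual file: the volume names, addresses, port, and remote 
volume name are the ones the log reports, while the transport-type line is an 
assumption. In the vol file these definitions have to appear before the ha 
volume that references them.

```
# Sketch only: reconstructed from the client log, not the actual file.

# First child of ha (the "primary" server that was killed in the test).
volume master_root
   type protocol/client
   option transport-type tcp          # assumption, not shown in the log
   option remote-host 192.168.1.91
   option remote-port 3399
   option remote-subvolume threads1
end-volume

# Second child of ha, expected to take over when master_root is down.
volume master2_root
   type protocol/client
   option transport-type tcp          # assumption, not shown in the log
   option remote-host 192.168.1.92
   option remote-port 3399
   option remote-subvolume threads1
end-volume
```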

Code base is GlusterFS version 3.0.4, compiled from source this morning. How 
can I troubleshoot this?

Christopher Hawkins
