Re: [Gluster-users] Fwd: VM freeze issue on simple gluster setup.

2019-12-12 Thread WK


On 12/12/2019 4:34 AM, Ravishankar N wrote:



On 12/12/19 4:01 am, WK wrote:


 so I can get some sort of resolution on the issue (i.e. is it 
hardware, Gluster, etc.)


I guess what I really need to know is:

1) Node 2 complains that it can't reach node 1 and node 3. If this 
was an OS/hardware networking issue and not internal to Gluster, 
then why didn't node 1 and node 3 have error messages complaining about 
not reaching node 2?
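
One thing I can still check, assuming the standard log locations: grep the 
brick logs on nodes 1 and 3 for any disconnect messages around 22:00, e.g.

  grep -i disconnect /var/log/glusterfs/bricks/*.log

If the bricks there never logged a disconnect from node 2's client, that 
would fit the idea that the problem was local to the node 2 mount.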


[2019-12-05 22:00:43.739804] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: 
server 10.255.1.1:49153 has not responded in the last 21 seconds, 
disconnecting.
[2019-12-05 22:00:43.757095] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: 
server 10.255.1.3:49152 has not responded in the last 21 seconds, 
disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected 
from GL1image-client-2. Client process will keep trying to connect to 
glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected 
from GL1image-client-1. Client process will keep trying to connect to 
glusterd until brick's port is available


[2019-12-05 22:00:43.757266] W [MSGID: 108001] 
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum 
is not met


This seems to indicate the mount on node 2 cannot reach 2 bricks. If 
quorum is not met, you will get ENOTCONN on the mount. Maybe check if 
the mount is still disconnected from the bricks (either statedump or 
looking at the .meta folder)?


OK, in case this is useful information: it is a localhost FUSE mount. 
Should we be mounting on the actual IP of the Gluster network instead?
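
For reference, if we did switch to mounting via the Gluster network IP, I 
assume it would be something along these lines (mount point and choice of 
IPs are illustrative here; backup-volfile-servers is the standard FUSE 
mount option for listing fallback volfile servers):

  mount -t glusterfs -o backup-volfile-servers=10.255.1.1:10.255.1.3 \
      10.255.1.2:/GL1image /mnt/GL1image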


The Gluster setup on node 2 returned to normal 11 seconds later, with the 
mount reconnecting, and everything was fine by the time we were finally 
notified of a problem and investigated (the VM lockup had already occurred).


I'm not sure about the client port change from 0 back to 49153. Is that 
a clue? Where did port 0 come from?


So is this an OS/FUSE problem, with just the node 2 mount locally 
becoming "confused" and then recovering?


Again, node 1 and node 3 were happily reaching node 2 while this was 
occurring; from their perspective, they never lost their connection 
to node 2.



[2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 
0-GL1image-client-2: changing port to 49153 (from 0)
[2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig] 
0-GL1image-client-1: changing port to 49152 (from 0)
[2019-12-05 22:00:54.820394] I [MSGID: 114046] 
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1: 
Connected to GL1image-client-1, attached to remote volume 
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820447] I [MSGID: 114042] 
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-1: 10 
fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.820549] I [MSGID: 114046] 
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2: 
Connected to GL1image-client-2, attached to remote volume 
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820568] I [MSGID: 114042] 
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-2: 10 
fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.821381] I [MSGID: 114041] 
[client-handshake.c:318:client_child_up_reopen_done] 
0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-05 22:00:54.821406] I [MSGID: 108002] 
[afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum is met
[2019-12-05 22:00:54.821446] I [MSGID: 114041] 
[client-handshake.c:318:client_child_up_reopen_done] 
0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP



In the meantime, we bumped the ping timeout back up to the default of 42 
seconds, which would have prevented the VM freeze. I suspect there is a 
reason that is the default.
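
For reference, the setting in question is network.ping-timeout; it is 
changed per volume with something like the following (volume name taken 
from the logs above):

  gluster volume set GL1image network.ping-timeout 42

The current value can be confirmed with 
"gluster volume get GL1image network.ping-timeout".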


-wk







 Forwarded Message 
Subject: VM freeze issue on simple gluster setup.
Date: Thu, 5 Dec 2019 16:23:35 -0800
From: WK
To: Gluster Users



I have a replica 2 + arbiter setup that is used for VMs.

IP #.1 is the arbiter.

IP #.2 and #.3 are the KVM hosts.

Two volumes are involved, and it's Gluster 6.5 / Ubuntu 18.04 / FUSE. The 
Gluster networking uses a two-Ethernet-card teamd/round-robin setup, 
which *should* have stayed up if one of the ports had failed.
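
For what it's worth, a teamd round-robin pairing of two NICs is typically 
described by a small JSON config roughly like the one below; the device 
and port names are illustrative rather than our exact config:

  {
    "device": "team0",
    "runner": { "name": "roundrobin" },
    "link_watch": { "name": "ethtool" },
    "ports": { "eno1": {}, "eno2": {} }
  }

The idea is that, with link monitoring, traffic keeps flowing over the 
surviving port if one NIC or link drops.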


I just had a number of VMs go read-only due to the communication failure 
shown below at 22:00, but only on KVM host #2.


VMs on the same Gluster volumes on KVM host #3 were unaffected.

The logs on host #2 show the following:

[2019-12-05 22:00:43.739804] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 
0-GL1image-client-2: server 10.255.1.1:49153 has not responded in the 
last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757095] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 

Re: [Gluster-users] Fwd: VM freeze issue on simple gluster setup.

2019-12-12 Thread Ravishankar N


On 12/12/19 4:01 am, WK wrote:


 so I can get some sort of resolution on the issue (i.e. is it 
hardware, Gluster, etc.)


I guess what I really need to know is:

1) Node 2 complains that it can't reach node 1 and node 3. If this was 
an OS/hardware networking issue and not internal to Gluster, then why 
didn't node 1 and node 3 have error messages complaining about not 
reaching node 2?


[2019-12-05 22:00:43.739804] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: 
server 10.255.1.1:49153 has not responded in the last 21 seconds, 
disconnecting.
[2019-12-05 22:00:43.757095] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: 
server 10.255.1.3:49152 has not responded in the last 21 seconds, 
disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected from 
GL1image-client-2. Client process will keep trying to connect to 
glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected from 
GL1image-client-1. Client process will keep trying to connect to 
glusterd until brick's port is available


[2019-12-05 22:00:43.757266] W [MSGID: 108001] 
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum is 
not met


This seems to indicate the mount on node 2 cannot reach 2 bricks. If 
quorum is not met, you will get ENOTCONN on the mount. Maybe check if 
the mount is still disconnected from the bricks (either statedump or 
looking at the .meta folder)?
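
To spell out the .meta check: the FUSE mount exposes a virtual .meta 
directory whose client-xlator "private" files include a connected flag, so 
something like the following should show the current state (mount path and 
volume name assumed; adjust to yours):

  grep -i connected /mnt/GL1image/.meta/graphs/active/GL1image-client-*/private

For a statedump of the FUSE client, sending SIGUSR1 to the glusterfs mount 
process (kill -USR1 <pid>) writes a dump, usually under /var/run/gluster/.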


2) How significant is it that the node was running 6.5 while node 1 and 
node 3 were running 6.4?


Minor version differences should be fine, but it is always a good idea to 
have all nodes on the same version.
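
If in doubt, a quick way to confirm is to compare versions directly on each 
node, and to check the op-version the cluster is operating at, e.g.:

  gluster --version | head -1
  gluster volume get all cluster.op-version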


HTH,
Ravi


-wk

 Forwarded Message 
Subject: VM freeze issue on simple gluster setup.
Date: Thu, 5 Dec 2019 16:23:35 -0800
From: WK
To: Gluster Users



I have a replica 2 + arbiter setup that is used for VMs.

IP #.1 is the arbiter.

IP #.2 and #.3 are the KVM hosts.

Two volumes are involved, and it's Gluster 6.5 / Ubuntu 18.04 / FUSE. The 
Gluster networking uses a two-Ethernet-card teamd/round-robin setup, 
which *should* have stayed up if one of the ports had failed.


I just had a number of VMs go read-only due to the communication failure 
shown below at 22:00, but only on KVM host #2.


VMs on the same Gluster volumes on KVM host #3 were unaffected.

The logs on host #2 show the following:

[2019-12-05 22:00:43.739804] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2: 
server 10.255.1.1:49153 has not responded in the last 21 seconds, 
disconnecting.
[2019-12-05 22:00:43.757095] C 
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1: 
server 10.255.1.3:49152 has not responded in the last 21 seconds, 
disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected 
from GL1image-client-2. Client process will keep trying to connect to 
glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected 
from GL1image-client-1. Client process will keep trying to connect to 
glusterd until brick's port is available
[2019-12-05 22:00:43.757266] W [MSGID: 108001] 
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum 
is not met
[2019-12-05 22:00:43.790639] E [rpc-clnt.c:346:saved_frames_unwind] 
(--> 
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] 
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] 
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] 
(--> 
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] 
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] 
) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x 
v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736456 (xid=0x825bffb)
[2019-12-05 22:00:43.790655] W [MSGID: 114031] 
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 
0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790686] E [rpc-clnt.c:346:saved_frames_unwind] 
(--> 
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59] 
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0] 
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce] 
(--> 
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45] 
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890] 
) 0-GL1image-client-1: forced unwinding frame type(GlusterFS 4.x 
v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736428 (xid=0x89fee01)
[2019-12-05 22:00:43.790703] W [MSGID: 114031] 
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk] 
0-GL1image-client-1: remote operation failed