On 12/12/2019 4:34 AM, Ravishankar N wrote:
On 12/12/19 4:01 am, WK wrote:
<BUMP> so I can get some sort of resolution on the issue (i.e. is it
hardware, Gluster, etc.).
I guess what I really need to know is:
1) Node 2 complains that it can't reach node 1 and node 3. If this
were an OS/hardware networking issue and not something internal to Gluster,
then why didn't node 1 and node 3 have error messages complaining about
not reaching node 2?
[2019-12-05 22:00:43.739804] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-2:
server 10.255.1.1:49153 has not responded in the last 21 seconds,
disconnecting.
[2019-12-05 22:00:43.757095] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired] 0-GL1image-client-1:
server 10.255.1.3:49152 has not responded in the last 21 seconds,
disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected
from GL1image-client-2. Client process will keep trying to connect to
glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected
from GL1image-client-1. Client process will keep trying to connect to
glusterd until brick's port is available
[2019-12-05 22:00:43.757266] W [MSGID: 108001]
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum
is not met
This seems to indicate the mount on node 2 cannot reach 2 bricks. If
quorum is not met, you will get ENOTCONN on the mount. Maybe check if
the mount is still disconnected from the bricks (either via a statedump
or by looking at the .meta folder)?
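For example, something along these lines should show the per-brick
connection state from the client side (the mount point and the pgrep
pattern below are only placeholders, adjust to your setup):

    # per-brick connection state via the meta xlator on the fuse mount
    grep -i connected /mnt/GL1image/.meta/graphs/active/GL1image-client-*/private

    # or trigger a client statedump (written under /var/run/gluster by default)
    kill -SIGUSR1 $(pgrep -f 'glusterfs.*GL1image')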
OK, it is a localhost fuse mount, if that is useful information.
Should we be mounting on the actual IP of the Gluster network instead?
The gluster setup on Node 2 returned to normal 11 seconds later, with the
mount reconnecting, and everything was fine by the time we were finally
notified of a problem and investigated (the VM lockup had already occurred).
I'm not sure about the client port change from 0 back to 49153. Is that
a clue? Where did port 0 come from?
So is this an OS/FUSE problem, with just the Node 2 mount locally
becoming "confused" and then recovering?
Again, from their perspective Node 1 and Node 3 were happily reaching
Node 2 while this was occurring; they never lost their connection to Node 2.
[2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig]
0-GL1image-client-2: changing port to 49153 (from 0)
[2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig]
0-GL1image-client-1: changing port to 49152 (from 0)
[2019-12-05 22:00:54.820394] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1:
Connected to GL1image-client-1, attached to remote volume
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820447] I [MSGID: 114042]
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-1: 10
fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.820549] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2:
Connected to GL1image-client-2, attached to remote volume
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820568] I [MSGID: 114042]
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-2: 10
fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.821381] I [MSGID: 114041]
[client-handshake.c:318:client_child_up_reopen_done]
0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-05 22:00:54.821406] I [MSGID: 108002]
[afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum is met
[2019-12-05 22:00:54.821446] I [MSGID: 114041]
[client-handshake.c:318:client_child_up_reopen_done]
0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP
In the meantime, we raised the timeout back to the default of 42 seconds,
which would have prevented the VM freeze. I suspect there is a reason
that is the default.
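(For the record, this is roughly how we set it back, assuming the setting
involved is network.ping-timeout, whose default is 42 seconds; GL1image is
the volume name taken from the logs above:)

    gluster volume set GL1image network.ping-timeout 42
    gluster volume get GL1image network.ping-timeout    # confirm the new value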
-wk
-------- Forwarded Message --------
Subject: VM freeze issue on simple gluster setup.
Date: Thu, 5 Dec 2019 16:23:35 -0800
From: WK <wkm...@bneit.com>
To: Gluster Users <gluster-users@gluster.org>
I have a replica 2 + arbiter setup that is used for VMs.
IP #.1 is the arbiter.
IP #.2 and #.3 are the kvm hosts.
Two volumes are involved, and it's gluster 6.5 / Ubuntu 18.04 / fuse. The
Gluster networking uses a two-Ethernet-card teamd/round-robin setup
which *should* have stayed up if one of the ports had failed.
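(For completeness, this is roughly how we check the team on each host;
"team0" is just an example device name, substitute the real one:)

    teamdctl team0 state    # runner and per-port link state
    teamnl team0 ports      # ports currently enslaved to the team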
I just had a number of VMs go read-only due to the communication failure
below, at 22:00, but only on kvm host #2.
VMs on the same gluster volumes on kvm host #3 were unaffected.
The logs on host #2 show the following:
[2019-12-05 22:00:43.739804] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
0-GL1image-client-2: server 10.255.1.1:49153 has not responded in the
last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757095] C
[rpc-clnt-ping.c:155:rpc_clnt_ping_timer_expired]
0-GL1image-client-1: server 10.255.1.3:49152 has not responded in the
last 21 seconds, disconnecting.
[2019-12-05 22:00:43.757191] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-2: disconnected
from GL1image-client-2. Client process will keep trying to connect to
glusterd until brick's port is available
[2019-12-05 22:00:43.757246] I [MSGID: 114018]
[client.c:2323:client_rpc_notify] 0-GL1image-client-1: disconnected
from GL1image-client-1. Client process will keep trying to connect to
glusterd until brick's port is available
[2019-12-05 22:00:43.757266] W [MSGID: 108001]
[afr-common.c:5608:afr_notify] 0-GL1image-replicate-0: Client-quorum
is not met
[2019-12-05 22:00:43.790639] E [rpc-clnt.c:346:saved_frames_unwind]
(-->
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]
))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x
v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736456 (xid=0x825bffb)
[2019-12-05 22:00:43.790655] W [MSGID: 114031]
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]
0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790686] E [rpc-clnt.c:346:saved_frames_unwind]
(-->
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]
))))) 0-GL1image-client-1: forced unwinding frame type(GlusterFS 4.x
v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736428 (xid=0x89fee01)
[2019-12-05 22:00:43.790703] W [MSGID: 114031]
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]
0-GL1image-client-1: remote operation failed
[2019-12-05 22:00:43.790774] E [MSGID: 114031]
[client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk]
0-GL1image-client-1: remote operation failed [Transport endpoint is
not connected]
[2019-12-05 22:00:43.790777] E [rpc-clnt.c:346:saved_frames_unwind]
(-->
/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x139)[0x7f030d045f59]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xcbb0)[0x7f030cdf0bb0]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xccce)[0x7f030cdf0cce]
(-->
/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x95)[0x7f030cdf1c45]
(--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xe890)[0x7f030cdf2890]
))))) 0-GL1image-client-2: forced unwinding frame type(GlusterFS 4.x
v1) op(FXATTROP(34)) called at 2019-12-05 22:00:19.736542 (xid=0x825bffc)
[2019-12-05 22:00:43.790794] W [MSGID: 114029]
[client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-1:
failed to send the fop
[2019-12-05 22:00:43.790806] W [MSGID: 114031]
[client-rpc-fops_v2.c:1614:client4_0_fxattrop_cbk]
0-GL1image-client-2: remote operation failed
[2019-12-05 22:00:43.790825] E [MSGID: 114031]
[client-rpc-fops_v2.c:1393:client4_0_finodelk_cbk]
0-GL1image-client-2: remote operation failed [Transport endpoint is
not connected]
[2019-12-05 22:00:43.790842] W [MSGID: 114029]
[client-rpc-fops_v2.c:4873:client4_0_finodelk] 0-GL1image-client-2:
failed to send the fop
The fop / transport-not-connected errors just repeat for another 50
lines or so until 22:00:46, at which point the volumes
appear to be fine (though the VMs were still read-only until I rebooted them).
[2019-12-05 22:00:46.987242] W [fuse-bridge.c:2827:fuse_readv_cbk]
0-glusterfs-fuse: 91701328: READ => -1
gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708
(Transport endpoint is not connected)
[2019-12-05 22:00:47.029947] W [fuse-bridge.c:2827:fuse_readv_cbk]
0-glusterfs-fuse: 91701329: READ => -1
gfid=d883b7c4-97f5-4f12-9373-7987cfc7dee4 fd=0x7f02f005b708
(Transport endpoint is not connected)
[2019-12-05 22:00:49.901075] W [fuse-bridge.c:2827:fuse_readv_cbk]
0-glusterfs-fuse: 91701330: READ => -1
gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8
(Transport endpoint is not connected)
[2019-12-05 22:00:49.923525] W [fuse-bridge.c:2827:fuse_readv_cbk]
0-glusterfs-fuse: 91701331: READ => -1
gfid=c342dba6-a2a2-49a8-be3f-cd320e90c956 fd=0x7f02f002bee8
(Transport endpoint is not connected)
[2019-12-05 22:00:49.970219] W [fuse-bridge.c:2827:fuse_readv_cbk]
0-glusterfs-fuse: 91701332: READ => -1
gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58
(Transport endpoint is not connected)
[2019-12-05 22:00:50.023932] W [fuse-bridge.c:2827:fuse_readv_cbk]
0-glusterfs-fuse: 91701333: READ => -1
gfid=fcec6b7a-ad23-4449-aa09-107e113877a1 fd=0x7f02f008dd58
(Transport endpoint is not connected)
[2019-12-05 22:00:54.807833] I [rpc-clnt.c:2028:rpc_clnt_reconfig]
0-GL1image-client-2: changing port to 49153 (from 0)
[2019-12-05 22:00:54.808043] I [rpc-clnt.c:2028:rpc_clnt_reconfig]
0-GL1image-client-1: changing port to 49152 (from 0)
[2019-12-05 22:00:46.115076] E [MSGID: 133014]
[shard.c:1799:shard_common_stat_cbk] 0-GL1image-shard: stat failed:
7a5959d6-75fc-411d-8831-57a744776ed3 [Transport endpoint is not
connected]
[2019-12-05 22:00:54.820394] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-1:
Connected to GL1image-client-1, attached to remote volume
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820447] I [MSGID: 114042]
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-1:
10 fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.820549] I [MSGID: 114046]
[client-handshake.c:1106:client_setvolume_cbk] 0-GL1image-client-2:
Connected to GL1image-client-2, attached to remote volume
'/GLUSTER/GL1image'.
[2019-12-05 22:00:54.820568] I [MSGID: 114042]
[client-handshake.c:930:client_post_handshake] 0-GL1image-client-2:
10 fds open - Delaying child_up until they are re-opened
[2019-12-05 22:00:54.821381] I [MSGID: 114041]
[client-handshake.c:318:client_child_up_reopen_done]
0-GL1image-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2019-12-05 22:00:54.821406] I [MSGID: 108002]
[afr-common.c:5602:afr_notify] 0-GL1image-replicate-0: Client-quorum
is met
[2019-12-05 22:00:54.821446] I [MSGID: 114041]
[client-handshake.c:318:client_child_up_reopen_done]
0-GL1image-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP
What is odd is that the gluster logs on nodes #3 and #1 show absolutely
ZERO gluster errors around that time, nor do I see any network/teamd
errors on any of the 3 nodes (including the problem node #2).
I've checked dmesg/syslog and every other log file on the box.
According to a staff member, this same kvm host had the same problem
about 3 weeks ago. It was written up as a fluke, possibly due to excess
disk I/O, since we have been using gluster for years and have rarely
seen issues, especially with very basic gluster usage.
In this case the VMs weren't overly busy, and now we have a repeat
problem.
So I am wondering where else I can look to diagnose the problem, or
should I abandon the hardware/setup?
I assume it's a networking issue and not Gluster, but I am confused
why gluster nodes #1 and #3 didn't complain about not seeing #2. If
the networking did drop out, should they have noticed?
There also don't appear to be any visible hard disk issues (smartd
is running).
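(We spot-check the drives with something like the following; the device
names are examples:)

    smartctl -H /dev/sda
    smartctl -A /dev/sda | grep -i -E 'realloc|pending|uncorrect'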
Side note: I have reset the tcp-timeout back to 42 seconds and will
look at upgrading to 6.6. I also see that the arbiter and the unaffected
Gluster node were running Gluster 6.4 (I don't know why #2 is on 6.5,
but I am checking on that as well; we turn off auto-upgrade).
Maybe the mismatched versions are the culprit?
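(To compare what each node is actually running, something like this on
every node should be enough:)

    glusterd --version
    gluster volume get all cluster.op-version    # the op-version the cluster has agreed on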
Also, we have a large number of these replica 2+1 gluster setups running
gluster versions from 5.x up, and none of the others have had this issue.
Any advice would be appreciated.
Sincerely,
Wk