Hi all,

We are seeing a sporadic issue with our Gluster mounts that affects several of 
our Kubernetes environments. We have had trouble pinning down the cause, and we 
could use some guidance from the pros!

Scenario
We have an environment running single-node Kubernetes with Heketi and several 
pods using Gluster mounts. The environment runs fine and the mounts appear 
healthy for up to several days. Then suddenly one or more (sometimes all) of the 
Gluster mounts report a stale mount and the corresponding brick is shut down. 
The affected containers enter a crash loop that continues until someone 
intervenes. To work around the crash loop, a user has to get the bricks started 
again, either by starting them manually, restarting the Gluster pod, or 
restarting the entire node (a rough sketch of the commands is below).

Diagnostics
Looking at the glusterd.log file, the error message at the time the problem 
starts looks something like this:

got disconnect from stale rpc on 
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick

This message occurs once for each brick that stops responding. The brick does 
not recover on its own. Here is that same message again, with surrounding 
context included.

[2019-05-07 11:53:38.663362] I [run.c:241:runner_log] 
(-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x3a7a5) 
[0x7f795f0d77a5] 
-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe2765) 
[0x7f795f17f765] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] 
) 0-management: Ran script: 
/var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh 
--volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --last=no
[2019-05-07 11:53:38.905338] E [run.c:241:runner_log] 
(-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x3a7a5) 
[0x7f795f0d77a5] 
-->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0xe26c3) 
[0x7f795f17f6c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7f79643180f5] 
) 0-management: Failed to execute script: 
/var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh 
--volname=vol_d0a0dcf9903e236f68a3933c3060ec5a --last=no
[2019-05-07 11:53:38.982785] I [MSGID: 106542] 
[glusterd-utils.c:8253:glusterd_brick_signal] 0-glusterd: sending signal 15 to 
brick with pid 8951
[2019-05-07 11:53:39.983244] I [MSGID: 106143] 
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick 
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick
 on port 49169
[2019-05-07 11:53:39.984656] W 
[glusterd-handler.c:6124:__glusterd_brick_rpc_notify] 0-management: got 
disconnect from stale rpc on 
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_d0456279568a623a16a5508daa89b4d5/brick
[2019-05-07 11:53:40.316466] I [MSGID: 106131] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2019-05-07 11:53:40.316601] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-05-07 11:53:40.316644] I [MSGID: 106599] 
[glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so 
xlator is not installed
[2019-05-07 11:53:40.319650] I [MSGID: 106131] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2019-05-07 11:53:40.319708] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: bitd service is 
stopped
[2019-05-07 11:53:40.321091] I [MSGID: 106131] 
[glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: scrub already stopped
[2019-05-07 11:53:40.321132] I [MSGID: 106568] 
[glusterd-svc-mgmt.c:235:glusterd_svc_stop] 0-management: scrub service is 
stopped
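
If it helps with diagnosis: the affected brick can be confirmed as down from 
inside the Gluster pod with the standard CLI (volume name taken from the log 
above); it stays that way until one of the restarts described earlier:

$ gluster volume status vol_d0a0dcf9903e236f68a3933c3060ec5a
# The brick that hit the "stale rpc" disconnect is reported as not online
# (no PID listed) until it is started again.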

The version of gluster we are using (running in a container, using the 
gluster/gluster-centos image from dockerhub):

# rpm -qa | grep gluster
glusterfs-rdma-4.1.7-1.el7.x86_64
gluster-block-0.3-2.el7.x86_64
python2-gluster-4.1.7-1.el7.x86_64
centos-release-gluster41-1.0-3.el7.centos.noarch
glusterfs-4.1.7-1.el7.x86_64
glusterfs-api-4.1.7-1.el7.x86_64
glusterfs-cli-4.1.7-1.el7.x86_64
glusterfs-geo-replication-4.1.7-1.el7.x86_64
glusterfs-libs-4.1.7-1.el7.x86_64
glusterfs-client-xlators-4.1.7-1.el7.x86_64
glusterfs-fuse-4.1.7-1.el7.x86_64
glusterfs-server-4.1.7-1.el7.x86_64

The version of gluster running on our Kubernetes node (a CentOS system):

$ rpm -qa | grep gluster
glusterfs-libs-3.12.2-18.el7.x86_64
glusterfs-3.12.2-18.el7.x86_64
glusterfs-fuse-3.12.2-18.el7.x86_64
glusterfs-client-xlators-3.12.2-18.el7.x86_64

The Kubernetes version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
Platform:"linux/amd64"}

Full Gluster logs are available if needed; just let me know how best to provide 
them.
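
For example, we could pack up the log directory from inside the Gluster pod and 
attach the archive (a sketch, with placeholder pod and namespace names):

# Archive /var/log/glusterfs inside the pod, then copy it to the local machine:
$ kubectl exec -n <gluster-namespace> <gluster-pod> -- \
    tar czf /tmp/gluster-logs.tar.gz /var/log/glusterfs
$ kubectl cp <gluster-namespace>/<gluster-pod>:/tmp/gluster-logs.tar.gz gluster-logs.tar.gz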

Thanks in advance for any help or suggestions on this!

Best,

Jeff Bischoff
Turbonomic