Re: [Gluster-users] heal: Not able to fetch volfile from glusterd

Ravishankar N Mon, 06 May 2019 23:25:29 -0700


On 06/05/19 6:43 PM, Łukasz Michalski wrote:

Hi,
I have a problem resolving a split-brain in one of my installations.

CentOS 7, glusterfs 3.10.12, replica on two nodes:

[root@ixmed1 iscsi]# gluster volume status cluster
Status of volume: cluster
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ixmed2:/glusterfs-bricks/cluster/clus
ter                                         49153     0 Y 3028
Brick ixmed1:/glusterfs-bricks/cluster/clus
ter                                         49153     0 Y 2917
Self-heal Daemon on localhost               N/A       N/A Y 112929
Self-heal Daemon on ixmed2                  N/A       N/A Y 57774

Task Status of Volume cluster
------------------------------------------------------------------------------
There are no active volume tasks

When I try to access one file glusterd reports split brain:
[2019-05-06 12:36:43.785098] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-cluster-replicate-0: Failing READ on gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed. [Input/output error]
[2019-05-06 12:36:43.787952] E [MSGID: 108008] [afr-read-txn.c:90:afr_read_txn_refresh_done] 0-cluster-replicate-0: Failing FGETXATTR on gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635: split-brain observed. [Input/output error]
[2019-05-06 12:36:43.788778] W [MSGID: 108027] [afr-common.c:2722:afr_discover_done] 0-cluster-replicate-0: no read subvols for (null)
[2019-05-06 12:36:43.790123] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352501: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde0803f390 (Input/output error)
[2019-05-06 12:36:43.794979] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352506: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0 (Input/output error)
[2019-05-06 12:36:43.800468] W [fuse-bridge.c:2254:fuse_readv_cbk] 0-glusterfs-fuse: 3352508: READ => -1 gfid=2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 fd=0x7fde08215ed0 (Input/output error)

The problem is that "gluster volume heal info" hangs for 10 seconds and returns:
    Not able to fetch volfile from glusterd
    Volume heal failed

glfsheal.log contains:
[2019-05-06 12:40:25.589879] I [afr.c:94:fix_quorum_options] 0-cluster-replicate-0: reindeer: incoming qtype = none
[2019-05-06 12:40:25.589967] I [afr.c:116:fix_quorum_options] 0-cluster-replicate-0: reindeer: quorum_count = 0
[2019-05-06 12:40:25.593294] W [MSGID: 101174] [graph.c:361:_log_if_unknown_option] 0-cluster-readdir-ahead: option 'parallel-readdir' is not recognized
[2019-05-06 12:40:25.593895] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 69786d65-6431-2d32-3037-3739322d3230 (0) coming up
[2019-05-06 12:40:25.593972] I [MSGID: 114020] [client.c:2352:notify] 0-cluster-client-0: parent translators are ready, attempting connect on transport
[2019-05-06 12:40:25.607836] I [MSGID: 114020] [client.c:2352:notify] 0-cluster-client-1: parent translators are ready, attempting connect on transport
[2019-05-06 12:40:25.608556] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-cluster-client-0: changing port to 49153 (from 0)
[2019-05-06 12:40:25.618167] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-cluster-client-1: changing port to 49153 (from 0)
[2019-05-06 12:40:25.629595] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-cluster-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-06 12:40:25.632031] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-0: Connected to cluster-client-0, attached to remote volume '/glusterfs-bricks/cluster/cluster'.
[2019-05-06 12:40:25.632100] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-06 12:40:25.632263] I [MSGID: 108005] [afr-common.c:4817:afr_notify] 0-cluster-replicate-0: Subvolume 'cluster-client-0' came back up; going online.
[2019-05-06 12:40:25.637707] I [MSGID: 114057] [client-handshake.c:1451:select_server_supported_programs] 0-cluster-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-05-06 12:40:25.639285] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-cluster-client-1: Connected to cluster-client-1, attached to remote volume '/glusterfs-bricks/cluster/cluster'.
[2019-05-06 12:40:25.639341] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-cluster-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-05-06 12:40:31.564407] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-0: server 10.0.104.26:49153 has not responded in the last 5 seconds, disconnecting.
[2019-05-06 12:40:31.565764] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-cluster-client-1: server 10.0.7.26:49153 has not responded in the last 5 seconds, disconnecting.

This seems to be the problem: both bricks are disconnected after not responding for only 5 seconds. Have you changed the value of ping-timeout? Could you share the output of `gluster volume info`?
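
For anyone checking their own glfsheal.log for the same symptom, here is a minimal sketch that pulls the `rpc_clnt_ping_timer_expired` entries out of a log and reports the timeout that was in effect. The regular expression is written against the two lines quoted above, so the exact message layout in other GlusterFS versions is an assumption. For comparison, the documented default for `network.ping-timeout` is 42 seconds.

```python
import re

# Pattern modelled on the two "ping timer expired" lines quoted in this thread.
# Assumption: other GlusterFS releases word the message the same way.
PING_EXPIRED = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s*C\s*\[rpc-clnt-ping\.c:\d+:rpc_clnt_ping_timer_expired\]\s*"
    r"(?P<xlator>[^:\s]+):\s*server (?P<server>\S+) has not responded "
    r"in the last (?P<seconds>\d+) seconds"
)


def find_ping_expiries(log_text):
    """Return (timestamp, translator, server, timeout_seconds) per expiry entry."""
    return [
        (m.group("ts"), m.group("xlator"), m.group("server"), int(m.group("seconds")))
        for m in PING_EXPIRED.finditer(log_text)
    ]


# The two entries from the glfsheal.log excerpt in this thread.
SAMPLE = (
    "[2019-05-06 12:40:31.564407] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] "
    "0-cluster-client-0: server 10.0.104.26:49153 has not responded in the last 5 seconds, disconnecting.\n"
    "[2019-05-06 12:40:31.565764] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] "
    "0-cluster-client-1: server 10.0.7.26:49153 has not responded in the last 5 seconds, disconnecting.\n"
)

for ts, xlator, server, seconds in find_ping_expiries(SAMPLE):
    print(f"{ts}  {xlator}  {server}  timeout={seconds}s")
```

If every expiry reports the same small value, as here, the configured ping-timeout is the first thing to look at.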

Does the same issue occur if you try to resolve the split-brain on the gfid 2584a0e2-c0fa-4fde-8537-5d5b6a5a4635 using the `gluster volume heal <VOLNAME> split-brain` CLI?
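
To make the commands in question concrete, here is a small sketch that only assembles and prints the gluster invocations for this volume and gfid; it does not run anything. The helper name `split_brain_commands` is hypothetical, and the choice of `latest-mtime` as the resolution policy is an assumption for illustration. Which copy to keep, or which `source-brick` to name, is a decision for the administrator.

```python
import shlex


def split_brain_commands(volume, gfid, source_brick=None):
    """Build gluster CLI argument lists for inspecting and resolving one file.

    Hypothetical helper: it returns the commands instead of executing them.
    """
    target = "gfid:" + gfid
    cmds = [
        # Volume options, including any non-default ping-timeout.
        ["gluster", "volume", "info", volume],
        ["gluster", "volume", "get", volume, "network.ping-timeout"],
        # Files currently reported as being in split-brain.
        ["gluster", "volume", "heal", volume, "info", "split-brain"],
    ]
    if source_brick is None:
        # Assumed policy: keep the copy with the most recent modification time.
        cmds.append(["gluster", "volume", "heal", volume,
                     "split-brain", "latest-mtime", target])
    else:
        # Alternative policy: treat one brick's copy as the good one.
        cmds.append(["gluster", "volume", "heal", volume,
                     "split-brain", "source-brick", source_brick, target])
    return cmds


for argv in split_brain_commands("cluster", "2584a0e2-c0fa-4fde-8537-5d5b6a5a4635"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing `source_brick="ixmed1:/glusterfs-bricks/cluster/cluster"` (one of the bricks from the status output) switches the last command to the `source-brick` form.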


-Ravi

[2019-05-06 12:40:35.645545] I [MSGID: 114018] [client.c:2276:client_rpc_notify] 0-cluster-client-0: disconnected from cluster-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-06 12:40:35.645683] I [socket.c:3534:socket_submit_request] 0-cluster-client-0: not connected (priv->connected = -1)
[2019-05-06 12:40:35.645755] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-cluster-client-0: failed to submit rpc-request (XID: 0x7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport (cluster-client-0)
[2019-05-06 12:40:35.645807] W [MSGID: 114031] [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-0: remote operation failed [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.645887] I [socket.c:3534:socket_submit_request] 0-cluster-client-1: not connected (priv->connected = -1)
[2019-05-06 12:40:35.645918] W [rpc-clnt.c:1693:rpc_clnt_submit] 0-cluster-client-1: failed to submit rpc-request (XID: 0x7 Program: GlusterFS 3.3, ProgVers: 330, Proc: 14) to rpc-transport (cluster-client-1)
[2019-05-06 12:40:35.645955] W [MSGID: 114031] [client-rpc-fops.c:797:client3_3_statfs_cbk] 0-cluster-client-1: remote operation failed [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.646008] W [MSGID: 109075] [dht-diskusage.c:44:dht_du_info_cbk] 0-cluster-dht: failed to get disk info from cluster-replicate-0 [Drugi koniec nie jest połączony]
[2019-05-06 12:40:35.647846] I [MSGID: 114018] [client.c:2276:client_rpc_notify] 0-cluster-client-1: disconnected from cluster-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-06 12:40:35.647895] E [MSGID: 108006] [afr-common.c:4842:afr_notify] 0-cluster-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2019-05-06 12:40:35.647989] I [MSGID: 108006] [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no subvolumes up
[2019-05-06 12:40:35.648051] I [MSGID: 108006] [afr-common.c:4984:afr_local_init] 0-cluster-replicate-0: no subvolumes up
[2019-05-06 12:40:35.648122] I [MSGID: 104039] [glfs-resolve.c:902:__glfs_active_subvol] 0-cluster: first lookup on graph 69786d65-6431-2d32-3037-3739322d3230 (0) failed (Drugi koniec nie jest połączony) [Drugi koniec nie jest połączony]

"Drugi koniec nie jest połączony" -> Transport endpoint not connected

On the brick process side there is a connection attempt:
[2019-05-06 12:40:25.638032] I [addr.c:182:gf_auth] 0-/glusterfs-bricks/cluster/cluster: allowed = "*", received addr = "10.0.7.26"
[2019-05-06 12:40:25.638080] I [login.c:111:gf_auth] 0-auth/login: allowed user names: e2f4c8f4-d040-4856-b6e3-62611fbab0ea
[2019-05-06 12:40:25.638109] I [MSGID: 115029] [server-handshake.c:695:server_setvolume] 0-cluster-server: accepted client from ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0 (version: 3.10.12)
[2019-05-06 12:40:31.565931] I [MSGID: 115036] [server.c:559:server_rpc_notify] 0-cluster-server: disconnecting connection from ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0
[2019-05-06 12:40:31.566420] I [MSGID: 101055] [client_t.c:436:gf_client_unref] 0-cluster-server: Shutting down connection ixmed1-207792-2019/05/06-12:40:25:562982-cluster-client-1-0-0

I am not able to use any heal command because of this problem.
I have three volumes configured on those nodes. The configuration is identical and the "gluster volume heal" command fails for all of them.
Can anyone help?

Thanks,
Łukasz


_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
