Re: [Gluster-users] Remove Brick Rebalance Hangs With No Activity

Timothy Orme Fri, 25 Oct 2019 20:26:26 -0700

It looks like this does eventually fail.  I'm at a bit of a loss as to what to 
do here.  At this point I'm unable to remove any nodes from the cluster.  Any 
help is greatly appreciated!


Here's the log from one of the nodes:

[2019-10-26 01:54:35.912284] E [rpc-clnt.c:183:call_bail] 0-scratch-client-4: 
bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x38, unique = 
0, sent = 2019-10-26 01:24:35.787361, timeout = 1800 for 10.158.10.2:49152
[2019-10-26 01:54:35.912304] E [MSGID: 114031] 
[client-rpc-fops_v2.c:1345:client4_0_inodelk_cbk] 0-scratch-client-4: remote 
operation failed [Transport endpoint is not connected]
[2019-10-26 02:04:35.000560] I [MSGID: 0] 
[dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: 
scratch-replicate-0,cnt = 1076350152704
[2019-10-26 02:04:35.000589] I [MSGID: 0] 
[dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size 
files = 1076350152704
[2019-10-26 02:04:35.000595] I [dht-rebalance.c:4355:dht_file_counter_thread] 
0-dht: tmp data size =1076350152704
[2019-10-26 02:14:35.000669] I [MSGID: 0] 
[dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: 
scratch-replicate-0,cnt = 1076350152704
[2019-10-26 02:14:35.000697] I [MSGID: 0] 
[dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size 
files = 1076350152704
[2019-10-26 02:14:35.000703] I [dht-rebalance.c:4355:dht_file_counter_thread] 
0-dht: tmp data size =1076350152704
[2019-10-26 02:24:35.000682] I [MSGID: 0] 
[dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: 
scratch-replicate-0,cnt = 1076350152704
[2019-10-26 02:24:35.000712] I [MSGID: 0] 
[dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size 
files = 1076350152704
[2019-10-26 02:24:35.000718] I [dht-rebalance.c:4355:dht_file_counter_thread] 
0-dht: tmp data size =1076350152704
[2019-10-26 02:24:35.867168] C [rpc-clnt.c:437:rpc_clnt_fill_request_info] 
0-scratch-client-3: cannot lookup the saved frame corresponding to xid (55)
[2019-10-26 02:24:35.867505] W [socket.c:2183:__socket_read_reply] 
0-scratch-client-3: notify for event MAP_XID failed for 10.158.10.1:49152
[2019-10-26 02:24:35.867530] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-scratch-client-3: disconnected from 
scratch-client-3. Client process will keep trying to connect to glusterd until 
brick's port is available
[2019-10-26 02:24:35.867641] C [rpc-clnt.c:437:rpc_clnt_fill_request_info] 
0-scratch-client-4: cannot lookup the saved frame corresponding to xid (56)
[2019-10-26 02:24:35.867657] W [socket.c:2183:__socket_read_reply] 
0-scratch-client-4: notify for event MAP_XID failed for 10.158.10.2:49152
[2019-10-26 02:24:35.867670] I [MSGID: 114018] 
[client.c:2323:client_rpc_notify] 0-scratch-client-4: disconnected from 
scratch-client-4. Client process will keep trying to connect to glusterd until 
brick's port is available
[2019-10-26 02:24:35.867679] W [MSGID: 108001] [afr-common.c:5608:afr_notify] 
0-scratch-replicate-0: Client-quorum is not met
[2019-10-26 02:24:35.868083] E [MSGID: 109119] 
[dht-lock.c:1084:dht_blocking_inodelk_cbk] 0-scratch-dht: inodelk failed on 
subvol scratch-replicate-0, gfid:be318638-e8a0-4c6d-977d-7a937aa84806 
[Transport endpoint is not connected]
[2019-10-26 02:24:35.868151] E [MSGID: 109016] 
[dht-rebalance.c:3932:gf_defrag_fix_layout] 0-scratch-dht: Setxattr failed for 
/.shard [Transport endpoint is not connected]
[2019-10-26 02:24:35.868904] E [MSGID: 109016] 
[dht-rebalance.c:3898:gf_defrag_fix_layout] 0-scratch-dht: Fix layout failed 
for /.shard
[2019-10-26 02:24:35.870516] I [MSGID: 109028] 
[dht-rebalance.c:5047:gf_defrag_status_get] 0-scratch-dht: Rebalance is failed. 
Time taken is 5401.00 secs
[2019-10-26 02:24:35.870531] I [MSGID: 109028] 
[dht-rebalance.c:5053:gf_defrag_status_get] 0-scratch-dht: Files migrated: 0, 
size: 0, lookups: 0, failures: 3, skipped: 0
[2019-10-26 02:24:35.871330] W [glusterfsd.c:1570:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x754b) [0x7febd4c9154b] 
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xfd) [0x55ec1a066b9d] 
-->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55ec1a0669e4] ) 0-: received 
signum (15), shutting down
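
For anyone trying to read a log like the one above after it has been wrapped by a mail client, here is a minimal Python sketch that re-joins the wrapped records and keeps only the error (E), critical (C) and warning (W) entries. It assumes every record starts with a bracketed timestamp followed by a one-letter severity, as in the excerpt above. The names join_records and problems are hypothetical helpers for illustration, not part of Gluster.

```python
import re
from datetime import datetime

# Assumption: a record starts with "[YYYY-MM-DD HH:MM:SS.ffffff] <LEVEL> ".
# Any other non-blank line is treated as a wrapped continuation.
RECORD_START = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] ([A-Z]) "
)


def join_records(lines):
    """Re-join log records that were wrapped across several lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            records.append(line)
        elif records:
            # Continuation of the previous record; wraps happen at spaces.
            records[-1] += " " + line
    return records


def problems(records, levels=("E", "C", "W")):
    """Return (timestamp, level, message) for records at the given levels."""
    found = []
    for rec in records:
        m = RECORD_START.match(rec)
        if m and m.group(2) in levels:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            found.append((ts, m.group(2), rec[m.end():]))
    return found


# Two records copied from the log above, still wrapped as in the mail.
sample = """\
[2019-10-26 01:54:35.912284] E [rpc-clnt.c:183:call_bail] 0-scratch-client-4:
bailing out frame type(GlusterFS 4.x v1), op(INODELK(29)), xid = 0x38, unique =
0, sent = 2019-10-26 01:24:35.787361, timeout = 1800 for 10.158.10.2:49152
[2019-10-26 02:04:35.000589] I [MSGID: 0]
[dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size
files = 1076350152704
""".splitlines()

for ts, level, message in problems(join_records(sample)):
    print(ts.isoformat(), level, message)
```

Run against the full log, this should leave only the call_bail, inodelk, quorum and fix-layout entries, which makes the sequence of failures easier to follow than the raw wrapped text.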

Thanks!
Tim


________________________________
From: Timothy Orme
Sent: Friday, October 25, 2019 11:51 AM
To: gluster-users <gluster-users@gluster.org>
Subject: Remove Brick Rebalance Hangs With No Activity

Hello All,

I'm trying to remove a set of bricks from our cluster.  I've done this 
operation a few times now with success, but on one set of bricks, the operation 
starts and seems to never progress.  It just sits here:

                       Node  Rebalanced-files    size  scanned  failures  skipped       status  run time in h:m:s
---------------------------  ----------------  ------  -------  --------  -------  -----------  -----------------
ip-10-158-10-1.ec2.internal                 0  0Bytes        0         0        0  in progress            0:22:35
ip-10-158-10-2.ec2.internal                 0  0Bytes        0         0        0  in progress            0:22:35
ip-10-158-10-3.ec2.internal                 0  0Bytes        0         0        0  in progress            0:22:35
Rebalance estimated time unavailable. Please try again later.

The rebalance logs on the server don't seem to indicate any issues; I see no 
error messages at all.  The servers themselves also seem very idle.  CPU 
and network activity are stuck near 0, whereas during other removals they 
would spike almost immediately.

There's almost no activity in the log either.  The only thing that I've seen is 
a message like:

[2019-10-25 18:42:21.000753] I [MSGID: 0] 
[dht-rebalance.c:4309:gf_defrag_total_file_size] 0-scratch-dht: local subvol: 
scratch-replicate-2,cnt = 596361801728
[2019-10-25 18:42:21.000799] I [MSGID: 0] 
[dht-rebalance.c:4313:gf_defrag_total_file_size] 0-scratch-dht: Total size 
files = 596361801728
[2019-10-25 18:42:21.000808] I [dht-rebalance.c:4355:dht_file_counter_thread] 
0-dht: tmp data size =596361801728

Any idea what might be happening?

Thanks,
Tim

