Re: [Gluster-users] Is rebalance in progress or not?
On 3/15/20 5:17 PM, Strahil Nikolov wrote:
On March 15, 2020 12:16:51 PM GMT+02:00, Alexander Iliev wrote:
On 3/15/20 11:07 AM, Strahil Nikolov wrote:
On March 15, 2020 11:50:32 AM GMT+02:00, Alexander Iliev wrote:

Hi list,

I was having some issues with one of my Gluster nodes, so I ended up re-installing it. Now I want to re-add the bricks for my main volume, but when I try to add them I get:

> # gluster volume add-brick store1 replica 3
> volume add-brick: failed: Pre Validation failed on 172.31.35.132. Volume name store1 rebalance is in progress. Please retry after completion

But if I then check the rebalance status I get:

> # gluster volume rebalance store1 status
> volume rebalance: store1: failed: Rebalance not started for volume store1.

And if I try to start the rebalance I get:

> # gluster volume rebalance store1 start
> volume rebalance: store1: failed: Rebalance on store1 is already started

Looking at the logs of the first node when I try to start the rebalance operation, I see this:

> [2020-03-15 09:41:31.883651] E [MSGID: 106276] [glusterd-rpc-ops.c:1200:__glusterd_stage_op_cbk] 0-management: Received stage RJT from uuid: 9476b8bb-d7ee-489a-b083-875805343e67

On the second node, the logs indicate that a rebalance operation is indeed in progress:

> [2020-03-15 09:47:34.190042] I [MSGID: 109081] [dht-common.c:5868:dht_setxattr] 0-store1-dht: fixing the layout of /redacted
> [2020-03-15 09:47:34.775691] I [dht-rebalance.c:3285:gf_defrag_process_dir] 0-store1-dht: migrate data called on /redacted
> [2020-03-15 09:47:36.019403] I [dht-rebalance.c:3480:gf_defrag_process_dir] 0-store1-dht: Migration operation on dir /redacted took 1.24 secs

Some background on what led to this situation: the volume was originally a replica 3 distributed-replicated volume on three nodes. In order to detach the faulty node, I lowered the replica count to 2 and removed that node's bricks from the volume.
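(For readers following along, the detach described above would have looked roughly like the following. This is a sketch only; `node3` and the brick path are hypothetical placeholders, not the poster's actual layout.)

```shell
# Lower the replica count and drop the faulty node's bricks in one step.
# node3:/data/brick1/store1 is a hypothetical placeholder brick.
gluster volume remove-brick store1 replica 2 \
    node3:/data/brick1/store1 force

# Then remove the node from the trusted pool.
gluster peer detach node3
```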
I cleaned up the storage (formatted the bricks and removed the trusted.gfid and trusted.glusterfs.volume-id extended attributes), purged the gluster packages from the system, then re-installed the gluster packages and did a `gluster peer probe` from another node.

I'm running Gluster 6.6 on CentOS 7.7 on all nodes.

I feel stuck at this point, so any guidance will be greatly appreciated. Thanks!

Best regards,

Hey Alex,

Did you try to go to the second node (the one that thinks a rebalance is running) and stop the rebalance?

gluster volume rebalance VOLNAME stop

Then add the new brick (and increase the replica count), and after the heal is over, rebalance again.

Hey Strahil,

Thanks for the suggestion. I just tried it, but unfortunately the result is pretty much the same - when I try to stop the rebalance on the second node, it reports that no rebalance is in progress:

> # gluster volume rebalance store1 stop
> volume rebalance: store1: failed: Rebalance not started for volume store1.

Best Regards,
Strahil Nikolov

Best regards,
--
alexander iliev

Hey Alex,

I'm not sure if the command has a 'force' flag, but if it does, it is worth trying:

gluster volume rebalance store1 stop force

Hey Strahil,

Thanks again for your suggestions! According to the `gluster volume rebalance help` output, only the `start` subcommand supports a force flag. I tried that already; unfortunately it doesn't help:

> # gluster volume rebalance store1 start force
> volume rebalance: store1: failed: Rebalance on store1 is already started
> # gluster volume rebalance store1 stop
> volume rebalance: store1: failed: Rebalance not started for volume store1.

Sadly, as the second node thinks a rebalance is running, I'm not sure whether a 'start force' (to convince both nodes that a rebalance is running) followed by a 'stop' will have the expected effect.

The rebalance is indeed running on the second node, judging from the contents of /var/log/glusterfs/store1-rebalance.log.

Sadly, this situation is hard to reproduce.
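(Since the rebalance log was the only reliable signal here, a small helper that extracts the timestamp of the last log entry can show whether the rebalance is still making progress. This is a sketch assuming the bracketed-UTC-timestamp format seen in the log excerpts above; the log path is the one mentioned in the thread.)

```shell
#!/bin/sh
# Print the timestamp of the last entry in a gluster log file.
# Gluster log lines start with a bracketed UTC timestamp, e.g.
#   [2020-03-15 09:47:36.019403] I [dht-rebalance.c:...] ...
last_ts() {
    tail -n 1 "$1" | sed -n 's/^\[\([^]]*\)\].*/\1/p'
}
```

Running `last_ts /var/log/glusterfs/store1-rebalance.log` twice, a few minutes apart, on the node that believes a rebalance is running shows whether the daemon is still actively working or has stalled.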
In any way, a bug report should be opened.

The thing is, I'm not sure I can provide meaningful steps to reproduce at this point. I didn't keep proper track of all the things I attempted, so I'm not sure the bug report I could file would be of much value. :(

Keep in mind that I do not have a distributed volume, so everything above is pure speculation. Based on my experience, a gluster upgrade can fix odd situations like that, but it could also make things worse. So for now avoid any upgrades, until a dev confirms it is safe to do so.

Yeah, I'd rather wait for the rebalance to finish before I make any further attempts at it. Sadly the storage is backed by rather slow (spinning) drives, so it might take a while, but even so I prefer being safe rather than sorry. :)

Best Regards,
Strahil Nikolov

Best regards,
--
alexander iliev

Community Meeting Calendar:
Schedule - Every Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
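(For anyone debugging a similar disagreement between nodes: glusterd persists its view of a rebalance on disk, in Gluster 6 as simple key=value lines in /var/lib/glusterd/vols/<volname>/node_state.info. Comparing that file across nodes can show which daemon's recorded state is out of sync. The field names below are assumptions based on that file's layout and should be verified on your version; a sketch of extracting them:)

```shell
#!/bin/sh
# Print the rebalance-related fields from a glusterd node_state.info
# file (key=value lines). Pass the file path, e.g.
#   rebalance_state /var/lib/glusterd/vols/store1/node_state.info
# Field names are from Gluster 6; verify against your installation.
rebalance_state() {
    grep -E '^(rebalance_status|rebalance_op|rebalance-id)=' "$1"
}
```

Running this on each peer and diffing the output would pinpoint which node still records a rebalance as in progress; do not edit the file without guidance from the developers.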