Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
If you use the same block device for the arbiter, I would recommend you to run 'mkfs' again. For example, an XFS brick would be created via 'mkfs.xfs -f -i size=512 /dev/DEVICE'. Reusing a brick without recreating the filesystem is error-prone. Also, don't forget to create your brick directory once the device is mounted.

Best Regards,
Strahil Nikolov

On Tuesday, 27 October 2020 at 08:41:11 GMT+2, mabi wrote:

First, to answer your question how this first happened: I reached that issue simply by rebooting my arbiter node yesterday morning in order to do some maintenance, which I do on a regular basis and which was never a problem before GlusterFS 7.8.

I have now removed the arbiter brick from all of my volumes (I have 3 volumes and only one volume uses quota). I was then able to do a "detach" and then a "probe" of my arbiter node. So far so good, so I decided to add back an arbiter brick to one of my smallest volumes, which does not have quota, but I get the following error message:

$ gluster volume add-brick othervol replica 3 arbiter 1 arbiternode.domain.tld:/srv/glusterfs/othervol/brick
volume add-brick: failed: Commit failed on arbiternode.domain.tld. Please check log file for details.
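Strahil's brick-recreation advice, as a rough shell sketch. This is destructive and the device name is a placeholder; the mount point and brick path are taken from mabi's add-brick command, so adjust for your own layout:

```shell
# DESTRUCTIVE: wipes the old arbiter brick contents.
# /dev/DEVICE is a placeholder for the actual block device.
mkfs.xfs -f -i size=512 /dev/DEVICE           # recreate the XFS filesystem
mount /dev/DEVICE /srv/glusterfs/othervol     # mount it (or add to fstab)
mkdir -p /srv/glusterfs/othervol/brick        # recreate the brick dir itself
gluster volume add-brick othervol replica 3 arbiter 1 \
    arbiternode.domain.tld:/srv/glusterfs/othervol/brick
```

Creating the brick directory *inside* the mounted filesystem (rather than using the mount point itself) is deliberate: if the device ever fails to mount, glusterd then refuses to start the brick instead of silently writing to the root filesystem.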
Checking the glusterd.log file of the arbiter node shows the following:

[2020-10-27 06:25:36.011955] I [MSGID: 106578] [glusterd-brick-ops.c:1024:glusterd_op_perform_add_bricks] 0-management: replica-count is set 3
[2020-10-27 06:25:36.011988] I [MSGID: 106578] [glusterd-brick-ops.c:1029:glusterd_op_perform_add_bricks] 0-management: arbiter-count is set 1
[2020-10-27 06:25:36.012017] I [MSGID: 106578] [glusterd-brick-ops.c:1033:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2020-10-27 06:25:36.093551] E [MSGID: 106053] [glusterd-utils.c:13790:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
[2020-10-27 06:25:36.104897] E [MSGID: 101042] [compat.c:605:gf_umount_lazy] 0-management: Lazy unmount of /tmp/mntQQVzyD [Transport endpoint is not connected]
[2020-10-27 06:25:36.104973] E [MSGID: 106073] [glusterd-brick-ops.c:2051:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
[2020-10-27 06:25:36.105001] E [MSGID: 106122] [glusterd-mgmt.c:317:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.
[2020-10-27 06:25:36.105023] E [MSGID: 106122] [glusterd-mgmt-handler.c:594:glusterd_handle_commit_fn] 0-management: commit failed on operation Add brick

After that I tried to restart the glusterd service on my arbiter node, and now it is again rejected by the other nodes with exactly the same error message as yesterday regarding the quota checksum being different, as you can see here:

[2020-10-27 06:30:21.729577] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node2.domain.tld
[2020-10-27 06:30:21.731966] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain.tld

This is really weird because at this stage I did not even try yet to add the arbiter brick back to the volume which has quota enabled... After detaching the arbiter node, am I supposed to delete something on the arbiter node? Something is really wrong here and I am stuck in a loop somehow... any help would be greatly appreciated.

‐‐‐ Original Message ‐‐‐
On Tuesday, October 27, 2020 1:26 AM, Strahil Nikolov wrote:

> You need to fix that "reject" issue before trying anything else.
> Have you tried to "detach" the arbiter and then "probe" it again?
>
> I have no idea what you did to reach that state - can you provide the details?
>
> Best Regards,
> Strahil Nikolov
>
> On Monday, 26 October 2020 at 20:38:38 GMT+2, mabi m...@protonmail.ch wrote:
>
> Ok, I see, I won't go down that path of disabling quota.
>
> I could now remove the arbiter brick of my volume which has the quota issue, so it is now a simple 2-node replica with 1 brick per node.
>
> Now I would like to add the brick back but I get the following error:
>
> volume add-brick: failed: Host arbiternode.domain.tld is not in 'Peer in Cluster' state
>
> In fact I checked and the arbiter node is still rejected, as you can see here:
>
> State: Peer Rejected (Connected)
>
> In the arbiter node's glusterd.log file I see the following errors:
>
> [2020-10-26 18:35:05.605124] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume woelkli-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain.tld
> [2020-10-26 18:35:05.617009] E [MSGID: 106012] [glusterd-utils
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
First, to answer your question how this first happened: I reached that issue simply by rebooting my arbiter node yesterday morning in order to do some maintenance, which I do on a regular basis and which was never a problem before GlusterFS 7.8.

I have now removed the arbiter brick from all of my volumes (I have 3 volumes and only one volume uses quota). I was then able to do a "detach" and then a "probe" of my arbiter node. So far so good, so I decided to add back an arbiter brick to one of my smallest volumes, which does not have quota, but I get the following error message:

$ gluster volume add-brick othervol replica 3 arbiter 1 arbiternode.domain.tld:/srv/glusterfs/othervol/brick
volume add-brick: failed: Commit failed on arbiternode.domain.tld. Please check log file for details.

Checking the glusterd.log file of the arbiter node shows the following:

[2020-10-27 06:25:36.011955] I [MSGID: 106578] [glusterd-brick-ops.c:1024:glusterd_op_perform_add_bricks] 0-management: replica-count is set 3
[2020-10-27 06:25:36.011988] I [MSGID: 106578] [glusterd-brick-ops.c:1029:glusterd_op_perform_add_bricks] 0-management: arbiter-count is set 1
[2020-10-27 06:25:36.012017] I [MSGID: 106578] [glusterd-brick-ops.c:1033:glusterd_op_perform_add_bricks] 0-management: type is set 0, need to change it
[2020-10-27 06:25:36.093551] E [MSGID: 106053] [glusterd-utils.c:13790:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
[2020-10-27 06:25:36.104897] E [MSGID: 101042] [compat.c:605:gf_umount_lazy] 0-management: Lazy unmount of /tmp/mntQQVzyD [Transport endpoint is not connected]
[2020-10-27 06:25:36.104973] E [MSGID: 106073] [glusterd-brick-ops.c:2051:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
[2020-10-27 06:25:36.105001] E [MSGID: 106122] [glusterd-mgmt.c:317:gd_mgmt_v3_commit_fn] 0-management: Add-brick commit failed.
[2020-10-27 06:25:36.105023] E [MSGID: 106122] [glusterd-mgmt-handler.c:594:glusterd_handle_commit_fn] 0-management: commit failed on operation Add brick

After that I tried to restart the glusterd service on my arbiter node, and now it is again rejected by the other nodes with exactly the same error message as yesterday regarding the quota checksum being different, as you can see here:

[2020-10-27 06:30:21.729577] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node2.domain.tld
[2020-10-27 06:30:21.731966] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain.tld

This is really weird because at this stage I did not even try yet to add the arbiter brick back to the volume which has quota enabled... After detaching the arbiter node, am I supposed to delete something on the arbiter node? Something is really wrong here and I am stuck in a loop somehow... any help would be greatly appreciated.

‐‐‐ Original Message ‐‐‐
On Tuesday, October 27, 2020 1:26 AM, Strahil Nikolov wrote:

> You need to fix that "reject" issue before trying anything else.
> Have you tried to "detach" the arbiter and then "probe" it again?
>
> I have no idea what you did to reach that state - can you provide the details?
>
> Best Regards,
> Strahil Nikolov
>
> On Monday, 26 October 2020 at 20:38:38 GMT+2, mabi m...@protonmail.ch wrote:
>
> Ok, I see, I won't go down that path of disabling quota.
>
> I could now remove the arbiter brick of my volume which has the quota issue, so it is now a simple 2-node replica with 1 brick per node.
>
> Now I would like to add the brick back but I get the following error:
>
> volume add-brick: failed: Host arbiternode.domain.tld is not in 'Peer in Cluster' state
>
> In fact I checked and the arbiter node is still rejected, as you can see here:
>
> State: Peer Rejected (Connected)
>
> In the arbiter node's glusterd.log file I see the following errors:
>
> [2020-10-26 18:35:05.605124] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume woelkli-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain.tld
> [2020-10-26 18:35:05.617009] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node2.domain.tld
>
> So although I have removed the arbiter brick from my volume, it still complains about that checksum of the quota configuration. I also tried to restart glusterd on my arbiter node but it does not help. The peer is still rejected.
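As a side note on mabi's question about whether something must be deleted on the arbiter: the checksum glusterd compares is kept in flat files per volume under /var/lib/glusterd. A hedged inspection sketch (the file names are assumed from glusterd's usual on-disk layout, not confirmed in this thread):

```shell
# Assumed layout: glusterd stores the quota configuration and its
# checksum per volume under /var/lib/glusterd/vols/<volname>/.
VOL=myvol-private   # volume name taken from the log messages above

# Run on the rejected node AND on a healthy peer, then compare outputs:
cat /var/lib/glusterd/vols/$VOL/quota.cksum
md5sum /var/lib/glusterd/vols/$VOL/quota.conf
```

If the rejected node's quota files differ from the healthy peers' (as "local cksum = 0" suggests), that mismatch is what makes glusterd refuse the friend request.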
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
On 27/10/20 07:40, mabi wrote:
> First to answer your question how this first happened, I reached that issue first by simply rebooting my arbiter node yesterday morning in order to do some maintenance which I do on a regular basis and was never a problem before GlusterFS 7.8.

In my case the problem originated from the daemon being reaped by the OOM killer, but the result was the same. You're in the same rat hole I've been into... IIRC you have to probe *a working node from the detached node*. I followed these instructions:

https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Administrator%20Guide/Resolving%20Peer%20Rejected/

Yes, they're for an ancient version, but it worked...

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
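For reference, the procedure in the linked "Resolving Peer Rejected" guide boils down to roughly the following, run on the rejected node. This is a sketch under the assumption of a default install (config in /var/lib/glusterd, systemd service named glusterd); the node name is the good peer from this thread:

```shell
# On the REJECTED node. This wipes the local cluster configuration,
# which glusterd then re-syncs from the healthy peers.
systemctl stop glusterd

# Keep only this node's own identity file (glusterd.info);
# remove everything else under /var/lib/glusterd.
cd /var/lib/glusterd
find . -mindepth 1 ! -name 'glusterd.info' -delete

systemctl start glusterd

# Probe a known-good node FROM the detached node, then restart once more
# and check that the state becomes "Peer in Cluster (Connected)".
gluster peer probe node1.domain.tld
systemctl restart glusterd
gluster peer status
```

The key detail Diego highlights is the direction of the probe: it is issued from the detached/rejected node towards a working node, not the other way around.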
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
You need to fix that "reject" issue before trying anything else.
Have you tried to "detach" the arbiter and then "probe" it again?

I have no idea what you did to reach that state - can you provide the details?

Best Regards,
Strahil Nikolov

On Monday, 26 October 2020 at 20:38:38 GMT+2, mabi wrote:

Ok, I see, I won't go down that path of disabling quota.

I could now remove the arbiter brick of my volume which has the quota issue, so it is now a simple 2-node replica with 1 brick per node.

Now I would like to add the brick back but I get the following error:

volume add-brick: failed: Host arbiternode.domain.tld is not in 'Peer in Cluster' state

In fact I checked and the arbiter node is still rejected, as you can see here:

State: Peer Rejected (Connected)

In the arbiter node's glusterd.log file I see the following errors:

[2020-10-26 18:35:05.605124] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume woelkli-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain.tld
[2020-10-26 18:35:05.617009] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node2.domain.tld

So although I have removed the arbiter brick from my volume, it still complains about that checksum of the quota configuration. I also tried to restart glusterd on my arbiter node but it does not help. The peer is still rejected. What should I do at this stage?

‐‐‐ Original Message ‐‐‐
On Monday, October 26, 2020 6:06 PM, Strahil Nikolov wrote:

> Detaching the arbiter is pointless...
> Quota is an extended file attribute, and thus disabling and re-enabling quota on a volume with millions of files will take a lot of time and lots of IOPS. I would leave it as a last resort.
>
> Also, the following script was mentioned on the list and might help you:
> https://github.com/gluster/glusterfs/blob/devel/extras/quota/quota_fsck.py
>
> You can take a look in the mailing list for usage and more details.
>
> Best Regards,
> Strahil Nikolov
>
> On Monday, 26 October 2020 at 16:40:06 GMT+2, Diego Zuccato diego.zucc...@unibo.it wrote:
>
> On 26/10/20 15:09, mabi wrote:
>
> > Right, seen like that this sounds reasonable. Do you actually remember the exact command you ran in order to remove the brick? I was thinking this should be it:
> > gluster volume remove-brick force
> > but should I use "force" or "start"?
>
> Memory does not serve me well (there are 28 disks, not 26!), but bash history does :)
>
> # gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force
> # gluster peer detach str957-biostq
> # gluster peer probe str957-biostq
> # gluster volume add-brick BigVol replica 3 arbiter 1 str957-biostq:/srv/arbiters/{00..27}/BigVol
>
> You obviously have to wait for remove-brick to complete before detaching the arbiter.
>
> > > IIRC it took about 3 days, but the arbiters are on a VM (8 CPUs, 8GB RAM) that uses an iSCSI disk. More than 80% continuous load on both CPU and RAM.
> >
> > That's quite long I must say, and I am in the same case as you: my arbiter is a VM.
>
> Give all the CPU and RAM you can. Less than 8GB RAM is asking for trouble (in my case).
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Ok, I see, I won't go down that path of disabling quota.

I could now remove the arbiter brick of my volume which has the quota issue, so it is now a simple 2-node replica with 1 brick per node.

Now I would like to add the brick back but I get the following error:

volume add-brick: failed: Host arbiternode.domain.tld is not in 'Peer in Cluster' state

In fact I checked and the arbiter node is still rejected, as you can see here:

State: Peer Rejected (Connected)

In the arbiter node's glusterd.log file I see the following errors:

[2020-10-26 18:35:05.605124] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume woelkli-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain.tld
[2020-10-26 18:35:05.617009] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node2.domain.tld

So although I have removed the arbiter brick from my volume, it still complains about that checksum of the quota configuration. I also tried to restart glusterd on my arbiter node but it does not help. The peer is still rejected. What should I do at this stage?

‐‐‐ Original Message ‐‐‐
On Monday, October 26, 2020 6:06 PM, Strahil Nikolov wrote:

> Detaching the arbiter is pointless...
> Quota is an extended file attribute, and thus disabling and re-enabling quota on a volume with millions of files will take a lot of time and lots of IOPS. I would leave it as a last resort.
>
> Also, the following script was mentioned on the list and might help you:
> https://github.com/gluster/glusterfs/blob/devel/extras/quota/quota_fsck.py
>
> You can take a look in the mailing list for usage and more details.
>
> Best Regards,
> Strahil Nikolov
>
> On Monday, 26 October 2020 at 16:40:06 GMT+2, Diego Zuccato diego.zucc...@unibo.it wrote:
>
> On 26/10/20 15:09, mabi wrote:
>
> > Right, seen like that this sounds reasonable. Do you actually remember the exact command you ran in order to remove the brick? I was thinking this should be it:
> > gluster volume remove-brick force
> > but should I use "force" or "start"?
>
> Memory does not serve me well (there are 28 disks, not 26!), but bash history does :)
>
> # gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force
> # gluster peer detach str957-biostq
> # gluster peer probe str957-biostq
> # gluster volume add-brick BigVol replica 3 arbiter 1 str957-biostq:/srv/arbiters/{00..27}/BigVol
>
> You obviously have to wait for remove-brick to complete before detaching the arbiter.
>
> > > IIRC it took about 3 days, but the arbiters are on a VM (8 CPUs, 8GB RAM) that uses an iSCSI disk. More than 80% continuous load on both CPU and RAM.
> >
> > That's quite long I must say, and I am in the same case as you: my arbiter is a VM.
>
> Give all the CPU and RAM you can. Less than 8GB RAM is asking for trouble (in my case).
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Detaching the arbiter is pointless...
Quota is an extended file attribute, and thus disabling and re-enabling quota on a volume with millions of files will take a lot of time and lots of IOPS. I would leave it as a last resort.

Also, the following script was mentioned on the list and might help you:
https://github.com/gluster/glusterfs/blob/devel/extras/quota/quota_fsck.py

You can take a look in the mailing list for usage and more details.

Best Regards,
Strahil Nikolov

On Monday, 26 October 2020 at 16:40:06 GMT+2, Diego Zuccato wrote:

On 26/10/20 15:09, mabi wrote:
> Right, seen like that this sounds reasonable. Do you actually remember the exact command you ran in order to remove the brick? I was thinking this should be it:
> gluster volume remove-brick force
> but should I use "force" or "start"?

Memory does not serve me well (there are 28 disks, not 26!), but bash history does :)

# gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force
# gluster peer detach str957-biostq
# gluster peer probe str957-biostq
# gluster volume add-brick BigVol replica 3 arbiter 1 str957-biostq:/srv/arbiters/{00..27}/BigVol

You obviously have to wait for remove-brick to complete before detaching the arbiter.

>> IIRC it took about 3 days, but the arbiters are on a VM (8 CPUs, 8GB RAM) that uses an iSCSI disk. More than 80% continuous load on both CPU and RAM.
> That's quite long I must say, and I am in the same case as you: my arbiter is a VM.

Give all the CPU and RAM you can. Less than 8GB RAM is asking for trouble (in my case).

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
‐‐‐ Original Message ‐‐‐
On Monday, October 26, 2020 3:39 PM, Diego Zuccato wrote:

> Memory does not serve me well (there are 28 disks, not 26!), but bash history does :)

Yes, I also rely too often on history ;)

> gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force

Thanks for the info; it looks like I was missing the "replica 2" part of the command.

> gluster peer detach str957-biostq
> gluster peer probe str957-biostq

Do I really need to detach and re-probe the arbiter node? I would like to avoid that because I have two other volumes with even more files... that would mean I have to remove the arbiter brick of the two other volumes too...

> Give all the CPU and RAM you can. Less than 8GB RAM is asking for trouble (in my case).

I have added an extra 4 GB of RAM just in case.
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
‐‐‐ Original Message ‐‐‐
On Monday, October 26, 2020 2:56 PM, Diego Zuccato wrote:

> The volume is built from 26 10TB disks with genetic data. I currently don't have exact numbers, but it's still at the beginning, so there is a bit less than 10TB actually used.
> But you're only removing the arbiters; you always have two copies of your files. The worst that can happen is a split-brain condition (avoidable by requiring a 2-node quorum, in which case the worst is that the volume goes read-only).

Right, seen like that this sounds reasonable. Do you actually remember the exact command you ran in order to remove the brick? I was thinking this should be it:

gluster volume remove-brick force

but should I use "force" or "start"?

> IIRC it took about 3 days, but the arbiters are on a VM (8 CPUs, 8GB RAM) that uses an iSCSI disk. More than 80% continuous load on both CPU and RAM.

That's quite long I must say, and I am in the same case as you: my arbiter is a VM.
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
On Monday, October 26, 2020 11:34 AM, Diego Zuccato wrote:

> IIRC it's the same issue I had some time ago.
> I solved it by "degrading" the volume to replica 2, then cleared the arbiter bricks and upgraded again to replica 3 arbiter 1.

Thanks Diego for pointing out this workaround. How much data do you have on that volume in terms of TB and number of files? I have around 3TB of data in 10 million files, so I am a bit worried about taking such drastic measures. How bad was the load on your volume when re-adding the arbiter brick, and how long did it take to sync/heal?

Would another workaround such as turning off quotas on that problematic volume work? That sounds much less scary, but I don't know if it would work...
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
On 26/10/20 15:09, mabi wrote:
> Right, seen like that this sounds reasonable. Do you actually remember the exact command you ran in order to remove the brick? I was thinking this should be it:
> gluster volume remove-brick force
> but should I use "force" or "start"?

Memory does not serve me well (there are 28 disks, not 26!), but bash history does :)

# gluster volume remove-brick BigVol replica 2 str957-biostq:/srv/arbiters/{00..27}/BigVol force
# gluster peer detach str957-biostq
# gluster peer probe str957-biostq
# gluster volume add-brick BigVol replica 3 arbiter 1 str957-biostq:/srv/arbiters/{00..27}/BigVol

You obviously have to wait for remove-brick to complete before detaching the arbiter.

>> IIRC it took about 3 days, but the arbiters are on a VM (8 CPUs, 8GB RAM) that uses an iSCSI disk. More than 80% continuous load on both CPU and RAM.
> That's quite long I must say, and I am in the same case as you: my arbiter is a VM.

Give all the CPU and RAM you can. Less than 8GB RAM is asking for trouble (in my case).

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
On 26/10/20 14:46, mabi wrote:
>> I solved it by "degrading" the volume to replica 2, then cleared the arbiter bricks and upgraded again to replica 3 arbiter 1.
> Thanks Diego for pointing out this workaround. How much data do you have on that volume in terms of TB and files? Because I have around 3TB of data in 10 million files. So I am a bit worried of taking such drastic measures.

The volume is built from 26 10TB disks with genetic data. I currently don't have exact numbers, but it's still at the beginning, so there is a bit less than 10TB actually used.
But you're only removing the arbiters; you always have two copies of your files. The worst that can happen is a split-brain condition (avoidable by requiring a 2-node quorum, in which case the worst is that the volume goes read-only).

> How bad was the load on your volume when re-adding the arbiter brick? And how long did it take to sync/heal?

IIRC it took about 3 days, but the arbiters are on a VM (8 CPUs, 8GB RAM) that uses an iSCSI disk. More than 80% continuous load on both CPU and RAM.

> Would another workaround such as turning off quotas on that problematic volume work? That sounds much less scary but I don't know if that would work...

I don't know, sorry.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
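Diego's aside about requiring a 2-node quorum maps onto GlusterFS's client-quorum volume options; a hedged sketch, using his volume name BigVol (whether these exact values fit your setup should be verified against the gluster docs):

```shell
# Require both remaining replicas to be up for writes; with only one
# replica reachable the volume turns read-only instead of split-braining.
gluster volume set BigVol cluster.quorum-type fixed
gluster volume set BigVol cluster.quorum-count 2

# After the arbiter brick is added back, revert to the default
# auto quorum (majority of replicas, arbiter counts as a vote):
gluster volume set BigVol cluster.quorum-type auto
```

The trade-off is availability for safety: with a fixed quorum of 2 on a 2-brick replica, losing either data node blocks writes, which is exactly the "worst case" read-only behaviour Diego describes.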
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Dear all, Thanks to this fix I could successfully upgrade from GlusterFS 6.9 to 7.8 but now, 1 week later after the upgrade, I have rebooted my third node (arbiter node) and unfortunately the bricks do not want to come up on that node. I get the same following error message: [2020-10-26 06:21:59.726705] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node2.domain [2020-10-26 06:21:59.726871] I [MSGID: 106493] [glusterd-handler.c:3715:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to node2.domain (0), ret: 0, op_ret: -1 [2020-10-26 06:21:59.728164] I [MSGID: 106490] [glusterd-handler.c:2434:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5f4ccbf4-33f6-4298-8b31-213553223349 [2020-10-26 06:21:59.728969] E [MSGID: 106012] [glusterd-utils.c:3682:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvol-private differ. local cksum = 0, remote cksum = 66908910 on peer node1.domain [2020-10-26 06:21:59.729099] I [MSGID: 106493] [glusterd-handler.c:3715:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to node1.domain (0), ret: 0, op_ret: -1 Can someone please advise what I need to do in order to have my arbiter node up and running again as soon as possible? Thank you very much in advance for your help. Best regards, Mabi ‐‐‐ Original Message ‐‐‐ On Monday, September 7, 2020 5:49 AM, Sanju Rakonde wrote: > Hi, > > issue https://github.com/gluster/glusterfs/issues/1332 is fixed now with > https://github.com/gluster/glusterfs/commit/865cca1190e233381f975ff36118f46e29477dcf. > > It will be backported to release-7 and release-8 branches soon. > > On Mon, Sep 7, 2020 at 1:14 AM Strahil Nikolov wrote: > >> Your e-mail got in the spam... 
>> >> If you haven't fixed the issue, check Hari's topic about quota issues (based >> on the error message you provided) : >> https://medium.com/@harigowtham/glusterfs-quota-fix-accounting-840df33fcd3a >> >> Most probably there is a quota issue and you need to fix it. >> >> Best Regards, >> Strahil Nikolov >> >> В неделя, 23 август 2020 г., 11:05:27 Гринуич+3, mabi >> написа: >> >> Hello, >> >> So to be precise I am exactly having the following issue: >> >> https://github.com/gluster/glusterfs/issues/1332 >> >> I could not wait any longer to find some workarounds or quick fixes so I >> decided to downgrade my rejected from 7.7 back to 6.9 which worked. >> >> I would be really glad if someone could fix this issue or provide me a >> workaround which works because version 6 of GlusterFS is not supported >> anymore so I would really like to move on to the stable version 7. >> >> Thank you very much in advance. >> >> Best regards, >> Mabi >> >> ‐‐‐ Original Message ‐‐‐ >> >> On Saturday, August 22, 2020 7:53 PM, mabi wrote: >> >>> Hello, >>> >>> I just started an upgrade of my 3 nodes replica (incl arbiter) of GlusterFS >>> from 6.9 to 7.7 but unfortunately after upgrading the first node, that node >>> gets rejected due to the following error: >>> >>> [2020-08-22 17:43:00.240990] E [MSGID: 106012] >>> [glusterd-utils.c:3537:glusterd_compare_friend_volume] 0-management: Cksums >>> of quota configuration of volume myvolume differ. local cksum = 3013120651, >>> remote cksum = 0 on peer myfirstnode.domain.tld >>> >>> So glusterd process is running but not glusterfsd. >>> >>> I am exactly in the same issue as described here: >>> >>> https://www.gitmemory.com/Adam2Marsh >>> >>> But I do not see any solutions or workaround. So now I am stuck with a >>> degraded GlusterFS cluster. >>> >>> Could someone please advise me as soon as possible on what I should do? Is >>> there maybe any workarounds? >>> >>> Thank you very much in advance for your response. 
>>>
>>> Best regards,
>>> Mabi

--
Thanks,
Sanju

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
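[Editor's note] For anyone else hitting this "Cksums of quota configuration ... differ" rejection, a workaround commonly suggested on this list is to copy the quota configuration files for the affected volume from a healthy peer onto the rejected node and restart glusterd. The sketch below is only an outline under assumptions, not an official procedure: the volume name and peer hostname are taken from the log messages above, and the paths assume glusterd's default working directory under /var/lib/glusterd.

```shell
# On the rejected (arbiter) node: stop glusterd first
systemctl stop glusterd

# Copy quota.conf and quota.cksum for the affected volume from a healthy peer
# (volume name and hostname taken from the logs above; adjust to your setup)
scp node2.domain:/var/lib/glusterd/vols/myvol-private/quota.conf \
    /var/lib/glusterd/vols/myvol-private/
scp node2.domain:/var/lib/glusterd/vols/myvol-private/quota.cksum \
    /var/lib/glusterd/vols/myvol-private/

# Restart and check whether the peer is accepted again
systemctl start glusterd
gluster peer status
```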
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
On 26/10/20 07:40, mabi wrote:
> Thanks to this fix I could successfully upgrade from GlusterFS 6.9 to
> 7.8, but now, one week after the upgrade, I have rebooted my third
> node (arbiter node) and unfortunately the bricks do not want to come up
> on that node. I get the same error message as before:

IIRC it's the same issue I had some time ago. I solved it by "degrading" the volume to replica 2, then clearing the arbiter bricks and upgrading the volume again to replica 3 arbiter 1.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
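[Editor's note] Diego's degrade-and-reattach approach could be sketched roughly as below. The volume and brick names are hypothetical; `force` is typically needed when removing a brick that is down, and the old brick directory should be wiped so that no stale metadata or xattrs survive (per Strahil's advice above, recreating the filesystem with mkfs is even safer than rm -rf).

```shell
# Drop the arbiter brick: replica 3 arbiter 1 -> plain replica 2
gluster volume remove-brick myvol replica 2 \
    arbiternode.domain.tld:/srv/glusterfs/myvol/brick force

# On the arbiter node: recreate the brick directory from scratch
rm -rf /srv/glusterfs/myvol/brick
mkdir -p /srv/glusterfs/myvol/brick

# Re-attach the arbiter brick and let self-heal repopulate it
gluster volume add-brick myvol replica 3 arbiter 1 \
    arbiternode.domain.tld:/srv/glusterfs/myvol/brick
gluster volume heal myvol info
```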
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Hi,

Issue https://github.com/gluster/glusterfs/issues/1332 is fixed now with
https://github.com/gluster/glusterfs/commit/865cca1190e233381f975ff36118f46e29477dcf.

It will be backported to the release-7 and release-8 branches soon.

On Mon, Sep 7, 2020 at 1:14 AM Strahil Nikolov wrote:

> Your e-mail got caught in the spam...
>
> If you haven't fixed the issue yet, check Hari's post about quota issues
> (based on the error message you provided):
> https://medium.com/@harigowtham/glusterfs-quota-fix-accounting-840df33fcd3a
>
> Most probably there is a quota issue and you need to fix it.
>
> Best Regards,
> Strahil Nikolov
>
> On Sunday, 23 August 2020, 11:05:27 GMT+3, mabi wrote:
>
> Hello,
>
> So to be precise, I am facing exactly the following issue:
>
> https://github.com/gluster/glusterfs/issues/1332
>
> I could not wait any longer for a workaround or quick fix, so I
> decided to downgrade my rejected node from 7.7 back to 6.9, which worked.
>
> I would be really glad if someone could fix this issue or provide me with a
> workaround that works, because version 6 of GlusterFS is not supported
> anymore and I would really like to move on to the stable version 7.
>
> Thank you very much in advance.
>
> Best regards,
> Mabi
>
> ‐‐‐ Original Message ‐‐‐
>
> On Saturday, August 22, 2020 7:53 PM, mabi wrote:
>
> > Hello,
> >
> > I just started an upgrade of my 3-node replica (incl. arbiter) of
> > GlusterFS from 6.9 to 7.7, but unfortunately after upgrading the first
> > node, that node gets rejected due to the following error:
> >
> > [2020-08-22 17:43:00.240990] E [MSGID: 106012]
> > [glusterd-utils.c:3537:glusterd_compare_friend_volume] 0-management: Cksums
> > of quota configuration of volume myvolume differ. local cksum = 3013120651,
> > remote cksum = 0 on peer myfirstnode.domain.tld
> >
> > So the glusterd process is running but not glusterfsd.
> >
> > I am facing exactly the same issue as described here:
> >
> > https://www.gitmemory.com/Adam2Marsh
> >
> > But I do not see any solution or workaround, so now I am stuck with a
> > degraded GlusterFS cluster.
> >
> > Could someone please advise me as soon as possible on what I should do?
> > Is there maybe any workaround?
> >
> > Thank you very much in advance for your response.
> >
> > Best regards,
> > Mabi

--
Thanks,
Sanju
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Your e-mail got caught in the spam...

If you haven't fixed the issue yet, check Hari's post about quota issues (based on the error message you provided):
https://medium.com/@harigowtham/glusterfs-quota-fix-accounting-840df33fcd3a

Most probably there is a quota issue and you need to fix it.

Best Regards,
Strahil Nikolov

On Sunday, 23 August 2020, 11:05:27 GMT+3, mabi wrote:

Hello,

So to be precise, I am facing exactly the following issue:

https://github.com/gluster/glusterfs/issues/1332

I could not wait any longer for a workaround or quick fix, so I decided to downgrade my rejected node from 7.7 back to 6.9, which worked.

I would be really glad if someone could fix this issue or provide me with a workaround that works, because version 6 of GlusterFS is not supported anymore and I would really like to move on to the stable version 7.

Thank you very much in advance.

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On Saturday, August 22, 2020 7:53 PM, mabi wrote:

> Hello,
>
> I just started an upgrade of my 3-node replica (incl. arbiter) of GlusterFS
> from 6.9 to 7.7, but unfortunately after upgrading the first node, that node
> gets rejected due to the following error:
>
> [2020-08-22 17:43:00.240990] E [MSGID: 106012]
> [glusterd-utils.c:3537:glusterd_compare_friend_volume] 0-management: Cksums
> of quota configuration of volume myvolume differ. local cksum = 3013120651,
> remote cksum = 0 on peer myfirstnode.domain.tld
>
> So the glusterd process is running but not glusterfsd.
>
> I am facing exactly the same issue as described here:
>
> https://www.gitmemory.com/Adam2Marsh
>
> But I do not see any solution or workaround, so now I am stuck with a
> degraded GlusterFS cluster.
>
> Could someone please advise me as soon as possible on what I should do? Is
> there maybe any workaround?
>
> Thank you very much in advance for your response.
>
> Best regards,
> Mabi
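[Editor's note] In outline, the quota-side reset that Hari's article discusses might look like the following. This is a sketch with a hypothetical volume name, and an important caveat: disabling quota drops all configured limits, so record them first and re-apply them afterwards.

```shell
# Record the current limits before touching anything
gluster volume quota myvolume list

# Disabling and re-enabling quota forces the accounting to be rebuilt
gluster volume quota myvolume disable
gluster volume quota myvolume enable

# Re-apply each limit that was configured before (example path and size)
gluster volume quota myvolume limit-usage /somedir 100GB
```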
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Dear Nikhil,

Thank you for your answer.

So does this mean that all my FUSE clients where I have the volume mounted will not lose their connection at any time during the whole upgrade procedure of all 3 nodes? I am asking because, if I understand correctly, there will be an overlap in time where more than one node is not running the glusterfsd (brick) process, which means that quorum is lost and my FUSE clients will lose connection to the volume? I just want to be sure that there will not be any downtime.

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On Monday, August 24, 2020 11:14 AM, Nikhil Ladha wrote:

> Hello Mabi
>
> You don't need to follow the offline upgrade procedure. Please follow the
> online upgrade procedure only. Upgrade the nodes one by one; you will notice
> the `Peer Rejected` state after upgrading one node or so, but once all the
> nodes are upgraded it will be back to `Peer in Cluster (Connected)`. Also, if
> any of the shds are not online, you can try restarting that node to fix it.
> I have tried this on my own setup, so I am pretty sure it should work for you
> as well.
> This is the workaround for the time being so that you are able to upgrade; we
> are working on the issue to come up with a fix for it ASAP.
>
> And yes, if you face any issues even after upgrading all the nodes to 7.7,
> you will be able to downgrade back to 6.9, which I think you have already
> tried and which works as per your previous mail.
>
> Regards
> Nikhil Ladha
[Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Hello Mabi

You don't need to follow the offline upgrade procedure. Please follow the online upgrade procedure only. Upgrade the nodes one by one; you will notice the `Peer Rejected` state after upgrading one node or so, but once all the nodes are upgraded it will be back to `Peer in Cluster (Connected)`. Also, if any of the shds are not online, you can try restarting that node to fix it. I have tried this on my own setup, so I am pretty sure it should work for you as well.

This is the workaround for the time being so that you are able to upgrade; we are working on the issue to come up with a fix for it ASAP.

And yes, if you face any issues even after upgrading all the nodes to 7.7, you will be able to downgrade back to 6.9, which I think you have already tried and which works as per your previous mail.

Regards
Nikhil Ladha
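[Editor's note] A rough per-node version of the online upgrade Nikhil describes, assuming a Debian-style system (the package and service names are assumptions; the key point is to let self-heal finish before moving on to the next node):

```shell
# Repeat on one node at a time, never in parallel:
systemctl stop glusterd
killall glusterfsd glusterfs     # stop any remaining brick/mount processes

apt-get update && apt-get install glusterfs-server   # pulls in the 7.x packages

systemctl start glusterd

# Wait here until pending heals drain before upgrading the next node
gluster volume heal myvolume info
gluster peer status   # 'Peer Rejected' can appear mid-upgrade; it clears
                      # once all nodes run the same version
```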
Re: [Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Hello,

So to be precise, I am facing exactly the following issue:

https://github.com/gluster/glusterfs/issues/1332

I could not wait any longer for a workaround or quick fix, so I decided to downgrade my rejected node from 7.7 back to 6.9, which worked.

I would be really glad if someone could fix this issue or provide me with a workaround that works, because version 6 of GlusterFS is not supported anymore and I would really like to move on to the stable version 7.

Thank you very much in advance.

Best regards,
Mabi

‐‐‐ Original Message ‐‐‐
On Saturday, August 22, 2020 7:53 PM, mabi wrote:

> Hello,
>
> I just started an upgrade of my 3-node replica (incl. arbiter) of GlusterFS
> from 6.9 to 7.7, but unfortunately after upgrading the first node, that node
> gets rejected due to the following error:
>
> [2020-08-22 17:43:00.240990] E [MSGID: 106012]
> [glusterd-utils.c:3537:glusterd_compare_friend_volume] 0-management: Cksums
> of quota configuration of volume myvolume differ. local cksum = 3013120651,
> remote cksum = 0 on peer myfirstnode.domain.tld
>
> So the glusterd process is running but not glusterfsd.
>
> I am facing exactly the same issue as described here:
>
> https://www.gitmemory.com/Adam2Marsh
>
> But I do not see any solution or workaround, so now I am stuck with a
> degraded GlusterFS cluster.
>
> Could someone please advise me as soon as possible on what I should do? Is
> there maybe any workaround?
>
> Thank you very much in advance for your response.
>
> Best regards,
> Mabi
[Gluster-users] Upgrade from 6.9 to 7.7 stuck (peer is rejected)
Hello,

I just started an upgrade of my 3-node replica (incl. arbiter) of GlusterFS from 6.9 to 7.7, but unfortunately after upgrading the first node, that node gets rejected due to the following error:

[2020-08-22 17:43:00.240990] E [MSGID: 106012] [glusterd-utils.c:3537:glusterd_compare_friend_volume] 0-management: Cksums of quota configuration of volume myvolume differ. local cksum = 3013120651, remote cksum = 0 on peer myfirstnode.domain.tld

So the glusterd process is running but not glusterfsd.

I am facing exactly the same issue as described here:

https://www.gitmemory.com/Adam2Marsh

But I do not see any solution or workaround, so now I am stuck with a degraded GlusterFS cluster.

Could someone please advise me as soon as possible on what I should do? Is there maybe any workaround?

Thank you very much in advance for your response.

Best regards,
Mabi