Re: [Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 05:50:49PM +0530, Karthik Subrahmanya wrote:
> gluster volume add-brick <volname> replica 3 arbiter 1 <arbiter bricks>
>
> is the command. It will convert the existing volume to an arbiter
> volume and add the specified bricks as arbiter bricks to the existing
> subvols. Once they are successfully added, self heal should start
> automatically and you can check the status of heal using the command,
>
> gluster volume heal <volname> info

OK, done and the heal is in progress. Thanks again for your help!

--
Dave Sherohman
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users
Re: [Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 5:35 PM, Dave Sherohman wrote:
> On Tue, Feb 27, 2018 at 04:59:36PM +0530, Karthik Subrahmanya wrote:
> > > > Since arbiter bricks need not be of the same size as the data
> > > > bricks, if you can configure three more arbiter bricks based on
> > > > the guidelines in the doc [1], you can do it live and you will
> > > > have the distribution count also unchanged.
> > >
> > > I can probably find one or more machines with a few hundred GB
> > > free which could be allocated for arbiter bricks if it would be
> > > significantly simpler and safer than repurposing the existing
> > > bricks (and I'm getting the impression that it probably would be).
> >
> > Yes, it is the simpler and safer way of doing that.
> >
> > > Does it particularly matter whether the arbiters are all on the
> > > same node or on three separate nodes?
> >
> > No, it doesn't matter as long as the bricks of the same replica
> > subvol are not on the same nodes.
>
> OK, great. So basically just install the gluster server on the new
> node(s), do a peer probe to add them to the cluster, and then
>
> gluster volume create palantir replica 3 arbiter 1 [saruman brick]
> [gandalf brick] [arbiter 1] [azathoth brick] [yog-sothoth brick]
> [arbiter 2] [cthulhu brick] [mordiggian brick] [arbiter 3]

gluster volume add-brick <volname> replica 3 arbiter 1 <arbiter bricks>

is the command. It will convert the existing volume to an arbiter
volume and add the specified bricks as arbiter bricks to the existing
subvols. Once they are successfully added, self heal should start
automatically and you can check the status of heal using the command,

gluster volume heal <volname> info

Regards,
Karthik
Re: [Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 04:59:36PM +0530, Karthik Subrahmanya wrote:
> > > Since arbiter bricks need not be of the same size as the data
> > > bricks, if you can configure three more arbiter bricks based on
> > > the guidelines in the doc [1], you can do it live and you will
> > > have the distribution count also unchanged.
> >
> > I can probably find one or more machines with a few hundred GB free
> > which could be allocated for arbiter bricks if it would be
> > significantly simpler and safer than repurposing the existing
> > bricks (and I'm getting the impression that it probably would be).
>
> Yes, it is the simpler and safer way of doing that.
>
> > Does it particularly matter whether the arbiters are all on the
> > same node or on three separate nodes?
>
> No, it doesn't matter as long as the bricks of the same replica
> subvol are not on the same nodes.

OK, great. So basically just install the gluster server on the new
node(s), do a peer probe to add them to the cluster, and then

gluster volume create palantir replica 3 arbiter 1 [saruman brick]
[gandalf brick] [arbiter 1] [azathoth brick] [yog-sothoth brick]
[arbiter 2] [cthulhu brick] [mordiggian brick] [arbiter 3]

Or is there more to it than that?

--
Dave Sherohman
Re: [Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 4:18 PM, Dave Sherohman wrote:
> On Tue, Feb 27, 2018 at 03:20:25PM +0530, Karthik Subrahmanya wrote:
> > If you want to use the first two bricks as arbiter, then you need
> > to be aware of the following things:
> > - Your distribution count will be decreased to 2.
>
> What's the significance of this? I'm trying to find documentation on
> distribution counts in gluster, but my google-fu is failing me.

More distribution, better load balancing.

> > - Your data on the first subvol i.e., replica subvol - 1 will be
> > unavailable till it is copied to the other subvols after removing
> > the bricks from the cluster.
>
> Hmm, ok. I was sure I had seen a reference at some point to a command
> for migrating data off bricks to prepare them for removal.
>
> Is there an easy way to get a list of all files which are present on
> a given brick, then, so that I can see which data would be
> unavailable during this transfer?

The easiest way is by doing "ls" on the back end brick.

> > Since arbiter bricks need not be of the same size as the data
> > bricks, if you can configure three more arbiter bricks based on
> > the guidelines in the doc [1], you can do it live and you will
> > have the distribution count also unchanged.
>
> I can probably find one or more machines with a few hundred GB free
> which could be allocated for arbiter bricks if it would be
> significantly simpler and safer than repurposing the existing bricks
> (and I'm getting the impression that it probably would be).

Yes, it is the simpler and safer way of doing that.

> Does it particularly matter whether the arbiters are all on the same
> node or on three separate nodes?

No, it doesn't matter as long as the bricks of the same replica subvol
are not on the same nodes.

Regards,
Karthik

> --
> Dave Sherohman
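[Editor's note] The "more distribution, better load balancing" point can be illustrated with a toy model. This is not Gluster's actual DHT implementation (the real translator hashes names against per-directory layout ranges); `pick_subvol` is a hypothetical stand-in showing why dropping from 3 replica subvols to 2 concentrates the same file set, and its I/O, on fewer brick pairs:

```python
import hashlib

def pick_subvol(filename, dist_count):
    """Map a file name to one of `dist_count` replica subvolumes.

    Conceptual stand-in for Gluster's DHT: each file lands on exactly
    one replica subvol, chosen by a hash of its name.
    """
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return h % dist_count

# With 3 subvols, load spreads over 3 brick pairs; with 2, the same
# files (and their reads/writes) pile onto 2 pairs.
files = [f"vm-disk-{i}.img" for i in range(1000)]
for count in (3, 2):
    buckets = [0] * count
    for f in files:
        buckets[pick_subvol(f, count)] += 1
    print(count, buckets)
```

With a hash this uniform, each subvol gets roughly `1000 / count` files; the per-subvol share grows as the distribution count shrinks.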
Re: [Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 03:20:25PM +0530, Karthik Subrahmanya wrote:
> If you want to use the first two bricks as arbiter, then you need to
> be aware of the following things:
> - Your distribution count will be decreased to 2.

What's the significance of this? I'm trying to find documentation on
distribution counts in gluster, but my google-fu is failing me.

> - Your data on the first subvol i.e., replica subvol - 1 will be
> unavailable till it is copied to the other subvols after removing
> the bricks from the cluster.

Hmm, ok. I was sure I had seen a reference at some point to a command
for migrating data off bricks to prepare them for removal.

Is there an easy way to get a list of all files which are present on a
given brick, then, so that I can see which data would be unavailable
during this transfer?

> Since arbiter bricks need not be of the same size as the data bricks,
> if you can configure three more arbiter bricks based on the
> guidelines in the doc [1], you can do it live and you will have the
> distribution count also unchanged.

I can probably find one or more machines with a few hundred GB free
which could be allocated for arbiter bricks if it would be
significantly simpler and safer than repurposing the existing bricks
(and I'm getting the impression that it probably would be). Does it
particularly matter whether the arbiters are all on the same node or
on three separate nodes?

--
Dave Sherohman
Re: [Gluster-users] Quorum in distributed-replicate volume
On Tue, Feb 27, 2018 at 1:40 PM, Dave Sherohman wrote:
> On Tue, Feb 27, 2018 at 12:00:29PM +0530, Karthik Subrahmanya wrote:
> > I will try to explain how you can end up in split-brain even with
> > cluster wide quorum:
>
> Yep, the explanation made sense. I hadn't considered the possibility
> of alternating outages. Thanks!
>
> > > > It would be great if you can consider configuring an arbiter
> > > > or replica 3 volume.
> > >
> > > I can. My bricks are 2x850G and 4x11T, so I can repurpose the
> > > small bricks as arbiters with minimal effect on capacity. What
> > > would be the sequence of commands needed to:
> > >
> > > 1) Move all data off of bricks 1 & 2
> > > 2) Remove that replica from the cluster
> > > 3) Re-add those two bricks as arbiters
> > >
> > > (And did I miss any additional steps?)
> > >
> > > Unfortunately, I've been running a few months already with the
> > > current configuration and there are several virtual machines
> > > running off the existing volume, so I'll need to reconfigure it
> > > online if possible.
> >
> > Without knowing the volume configuration it is difficult to
> > suggest the configuration change, and since it is a live system
> > you may end up in data unavailability or data loss.
> > Can you give the output of "gluster volume info <volname>" and
> > which brick is of what size.
> Volume Name: palantir
> Type: Distributed-Replicate
> Volume ID: 48379a50-3210-41b4-9a77-ae143c8bcac0
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x 2 = 6
> Transport-type: tcp
> Bricks:
> Brick1: saruman:/var/local/brick0/data
> Brick2: gandalf:/var/local/brick0/data
> Brick3: azathoth:/var/local/brick0/data
> Brick4: yog-sothoth:/var/local/brick0/data
> Brick5: cthulhu:/var/local/brick0/data
> Brick6: mordiggian:/var/local/brick0/data
> Options Reconfigured:
> features.scrub: Inactive
> features.bitrot: off
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on
> network.ping-timeout: 1013
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> features.shard: on
> cluster.data-self-heal-algorithm: full
> storage.owner-uid: 64055
> storage.owner-gid: 64055
>
> For brick sizes, saruman/gandalf have
>
> $ df -h /var/local/brick0
> Filesystem                   Size  Used Avail Use% Mounted on
> /dev/mapper/gandalf-gluster  885G   55G  786G   7% /var/local/brick0
>
> and the other four have
>
> $ df -h /var/local/brick0
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1        11T  254G   11T   3% /var/local/brick0

If you want to use the first two bricks as arbiter, then you need to
be aware of the following things:
- Your distribution count will be decreased to 2.
- Your data on the first subvol i.e., replica subvol - 1 will be
  unavailable till it is copied to the other subvols after removing
  the bricks from the cluster.

Since arbiter bricks need not be of the same size as the data bricks,
if you can configure three more arbiter bricks based on the guidelines
in the doc [1], you can do it live and you will have the distribution
count also unchanged.

One more thing from the volume info: only the options which are
reconfigured will appear in the volume info output.
The quorum-type is in that list, which means it was manually
reconfigured.

[1] http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing

Regards,
Karthik

> --
> Dave Sherohman
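[Editor's note] A rough illustration of why the arbiter bricks can be small: the arbiter stores only file names and metadata, never data blocks, so its capacity requirement scales with file count rather than data volume. The 4 KB-per-file figure below is an assumed ballpark, not an exact rule; see the sizing guide in [1] for the authoritative guideline:

```python
def arbiter_size_gb(expected_files, bytes_per_file=4096):
    """Rough arbiter brick sizing estimate.

    Assumption: roughly 4 KB of metadata per file on the arbiter
    brick. Capacity scales with how many files the volume holds,
    not how large they are.
    """
    return expected_files * bytes_per_file / 1024**3

# e.g. a million files needs on the order of 4 GB per arbiter brick,
# regardless of whether the data bricks hold 850 GB or 11 TB.
print(round(arbiter_size_gb(1_000_000), 2))
```

So even a few hundred GB of free space is far more than three arbiter bricks would need for this volume.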
Re: [Gluster-users] Quorum in distributed-replicate volume
On Mon, Feb 26, 2018 at 6:14 PM, Dave Sherohman wrote:
> On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > > "In a replica 2 volume... If we set the client-quorum option to
> > > auto, then the first brick must always be up, irrespective of the
> > > status of the second brick. If only the second brick is up, the
> > > subvolume becomes read-only."
> >
> > By default client-quorum is "none" in replica 2 volume.
>
> I'm not sure where I saw the directions saying to set it, but I do
> have "cluster.quorum-type: auto" in my volume configuration. (And I
> think that's client quorum, but feel free to correct me if I've
> misunderstood the docs.)

If it is "auto" then I think it was reconfigured. In replica 2 it will
be "none".

> > It applies to all the replica 2 volumes even if they have just 2
> > bricks or more. Total brick count in the volume doesn't matter for
> > the quorum; what matters is the number of bricks which are up in
> > the particular replica subvol.
>
> Thanks for confirming that.
>
> > If I understood your configuration correctly it should look
> > something like this:
> > (Please correct me if I am wrong)
> > replica-1: bricks 1 & 2
> > replica-2: bricks 3 & 4
> > replica-3: bricks 5 & 6
>
> Yes, that's correct.
>
> > Since quorum is per replica, if it is set to auto then it needs the
> > first brick of the particular replica subvol to be up to perform
> > the fop.
> >
> > In replica 2 volumes you can end up in split-brains.
>
> How would that happen if bricks which are not in (cluster-wide)
> quorum refuse to accept writes? I'm not seeing the reason for using
> individual subvolume quorums instead of full-volume quorum.

Split-brains happen within the replica pair. I will try to explain how
you can end up in split-brain even with cluster-wide quorum:

Let's say you have a 6-brick (replica 2) volume and you always have at
least a quorum number of bricks up & running.
Bricks 1 & 2 are part of replica subvol-1
Bricks 3 & 4 are part of replica subvol-2
Bricks 5 & 6 are part of replica subvol-3

- Brick 1 goes down and a write comes on a file which is part of
  replica subvol-1
- Quorum is met, since 5 out of 6 bricks are running
- Brick 2 says brick 1 is bad
- Brick 2 goes down and brick 1 comes up. Heal did not happen
- A write comes on the same file; quorum is met, and now brick 1 says
  brick 2 is bad
- When both bricks 1 & 2 are up, each blames the other brick ->
  *split-brain*

> > It would be great if you can consider configuring an arbiter or
> > replica 3 volume.
>
> I can. My bricks are 2x850G and 4x11T, so I can repurpose the small
> bricks as arbiters with minimal effect on capacity. What would be
> the sequence of commands needed to:
>
> 1) Move all data off of bricks 1 & 2
> 2) Remove that replica from the cluster
> 3) Re-add those two bricks as arbiters
>
> (And did I miss any additional steps?)
>
> Unfortunately, I've been running a few months already with the
> current configuration and there are several virtual machines running
> off the existing volume, so I'll need to reconfigure it online if
> possible.

Without knowing the volume configuration it is difficult to suggest
the configuration change, and since it is a live system you may end up
in data unavailability or data loss. Can you give the output of
"gluster volume info <volname>" and which brick is of what size.

Note: The arbiter bricks need not be of bigger size. [1] gives
information about how you can provision the arbiter brick.

[1] http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing

Regards,
Karthik

> --
> Dave Sherohman
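[Editor's note] The alternating-outage sequence above can be sketched as a tiny simulation. The `Brick`/`write` model is hypothetical (real AFR tracks pending changelog xattrs per file, and a heal would clear them), but it captures why cluster-wide quorum blocks neither write, yet the pair still ends up with mutual blame:

```python
class Brick:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.blames_peer = False  # stand-in for an AFR pending changelog

def write(a, b):
    """A write to a replica-2 pair: each brick that is up records
    that its down peer missed the write (simplified model)."""
    if a.up and not b.up:
        a.blames_peer = True
    if b.up and not a.up:
        b.blames_peer = True

b1, b2 = Brick("brick1"), Brick("brick2")

b1.up = False                 # brick 1 goes down
write(b1, b2)                 # cluster-wide quorum met (5 of 6 up); brick 2 blames brick 1
b1.up, b2.up = True, False    # brick 1 back, brick 2 down; no heal ran in between
write(b1, b2)                 # quorum met again; brick 1 blames brick 2
b1.up = b2.up = True

split_brain = b1.blames_peer and b2.blames_peer
print("split-brain!" if split_brain else "ok")
```

Note that at every step at least 5 of 6 bricks were up, so a volume-wide quorum check never rejects either write; only a per-subvol quorum (or a third copy/arbiter to break the tie) prevents the mutual blame.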
Re: [Gluster-users] Quorum in distributed-replicate volume
On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > "In a replica 2 volume... If we set the client-quorum option to
> > auto, then the first brick must always be up, irrespective of the
> > status of the second brick. If only the second brick is up, the
> > subvolume becomes read-only."
>
> By default client-quorum is "none" in replica 2 volume.

I'm not sure where I saw the directions saying to set it, but I do
have "cluster.quorum-type: auto" in my volume configuration. (And I
think that's client quorum, but feel free to correct me if I've
misunderstood the docs.)

> It applies to all the replica 2 volumes even if they have just 2
> bricks or more. Total brick count in the volume doesn't matter for
> the quorum; what matters is the number of bricks which are up in the
> particular replica subvol.

Thanks for confirming that.

> If I understood your configuration correctly it should look
> something like this:
> (Please correct me if I am wrong)
> replica-1: bricks 1 & 2
> replica-2: bricks 3 & 4
> replica-3: bricks 5 & 6

Yes, that's correct.

> Since quorum is per replica, if it is set to auto then it needs the
> first brick of the particular replica subvol to be up to perform the
> fop.
>
> In replica 2 volumes you can end up in split-brains.

How would that happen if bricks which are not in (cluster-wide) quorum
refuse to accept writes? I'm not seeing the reason for using
individual subvolume quorums instead of full-volume quorum.

> It would be great if you can consider configuring an arbiter or
> replica 3 volume.

I can. My bricks are 2x850G and 4x11T, so I can repurpose the small
bricks as arbiters with minimal effect on capacity. What would be the
sequence of commands needed to:

1) Move all data off of bricks 1 & 2
2) Remove that replica from the cluster
3) Re-add those two bricks as arbiters

(And did I miss any additional steps?)
Unfortunately, I've been running a few months already with the current
configuration and there are several virtual machines running off the
existing volume, so I'll need to reconfigure it online if possible.

--
Dave Sherohman
Re: [Gluster-users] Quorum in distributed-replicate volume
Hi Dave,

On Mon, Feb 26, 2018 at 4:45 PM, Dave Sherohman wrote:
> I've configured 6 bricks as distributed-replicated with replica 2,
> expecting that all active bricks would be usable so long as a quorum
> of at least 4 live bricks is maintained.

The client quorum is configured per replica subvolume and not for the
entire volume. Since you have a distributed-replicated volume with
replica 2, the data will have 2 copies, and taking quorum on the total
number of bricks, as in your scenario, will lead to split-brains.

> However, I have just found
>
> http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/
>
> which states that "In a replica 2 volume... If we set the
> client-quorum option to auto, then the first brick must always be
> up, irrespective of the status of the second brick. If only the
> second brick is up, the subvolume becomes read-only."

By default client-quorum is "none" in replica 2 volume.

> Does this apply only to a two-brick replica 2 volume or does it
> apply to all replica 2 volumes, even if they have, say, 6 bricks
> total?

It applies to all the replica 2 volumes even if they have just 2
bricks or more. Total brick count in the volume doesn't matter for the
quorum; what matters is the number of bricks which are up in the
particular replica subvol.

> If it does apply to distributed-replicated volumes with >2 bricks,
> what's the reasoning for it? I would expect that, if the cluster
> splits into brick 1 by itself and bricks 2-3-4-5-6 still together,
> then brick 1 will recognize that it doesn't have volume-wide quorum
> and reject writes, thus allowing brick 2 to remain authoritative and
> able to accept writes.
If I understood your configuration correctly it should look something
like this:
(Please correct me if I am wrong)
replica-1: bricks 1 & 2
replica-2: bricks 3 & 4
replica-3: bricks 5 & 6

Since quorum is per replica, if it is set to auto then it needs the
first brick of the particular replica subvol to be up to perform the
fop.

In replica 2 volumes you can end up in split-brains. It would be great
if you can consider configuring an arbiter or replica 3 volume. You
can find more details about their advantages over replica 2 volume in
the same document.

Regards,
Karthik

> --
> Dave Sherohman
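[Editor's note] The per-subvol client-quorum rule described above can be sketched as follows. `subvol_writable` is a hypothetical helper reflecting a simplified reading of the docs, not Gluster's actual code: with quorum-type "auto", more than half the bricks of the subvol must be up, and on an exact tie (the replica 2 case with one brick up) the first brick must be among them:

```python
def subvol_writable(up_bricks, replica_count, quorum_type="auto"):
    """Writability of a single replica subvol under client-quorum.

    `up_bricks` is a list of booleans, index 0 being the first brick
    of the subvol. Simplified model: "auto" requires a strict majority,
    with the first brick breaking an exact tie; "none" only needs any
    one copy to be reachable.
    """
    if quorum_type == "none":
        return any(up_bricks)
    up = sum(up_bricks)
    if up * 2 > replica_count:
        return True
    if up * 2 == replica_count:
        return up_bricks[0]  # tie: first brick must be up
    return False

# Replica 2 with quorum-type auto:
print(subvol_writable([True, False], 2))   # only first brick up  -> True (writable)
print(subvol_writable([False, True], 2))   # only second brick up -> False (read-only)
```

This is why the state of the other four bricks in the volume is irrelevant: the decision is made entirely within the two-brick subvol that owns the file.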
[Gluster-users] Quorum in distributed-replicate volume
I've configured 6 bricks as distributed-replicated with replica 2,
expecting that all active bricks would be usable so long as a quorum
of at least 4 live bricks is maintained.

However, I have just found

http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/

which states that "In a replica 2 volume... If we set the
client-quorum option to auto, then the first brick must always be up,
irrespective of the status of the second brick. If only the second
brick is up, the subvolume becomes read-only."

Does this apply only to a two-brick replica 2 volume or does it apply
to all replica 2 volumes, even if they have, say, 6 bricks total?

If it does apply to distributed-replicated volumes with >2 bricks,
what's the reasoning for it? I would expect that, if the cluster
splits into brick 1 by itself and bricks 2-3-4-5-6 still together,
then brick 1 will recognize that it doesn't have volume-wide quorum
and reject writes, thus allowing brick 2 to remain authoritative and
able to accept writes.

--
Dave Sherohman