Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-27 Thread Dave Sherohman
On Tue, Feb 27, 2018 at 05:50:49PM +0530, Karthik Subrahmanya wrote:
> gluster volume add-brick <volname> replica 3 arbiter 1 <brick1> <brick2> <brick3>
> is the command. It will convert the existing volume to an arbiter volume and
> add the specified bricks as arbiter bricks to the existing subvols.
> Once they are successfully added, self-heal should start automatically and
> you can check the status of the heal using the command,
> gluster volume heal <volname> info

OK, done and the heal is in progress.  Thanks again for your help!
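
To keep an eye on it until the pending counts reach zero, something like the
following should do (the statistics sub-command may depend on the Gluster
release in use):

  gluster volume heal palantir info                     # entries still pending heal, per brick
  gluster volume heal palantir statistics heal-count    # just the per-brick counts
  watch -n 60 'gluster volume heal palantir statistics heal-count'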

-- 
Dave Sherohman


Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-27 Thread Karthik Subrahmanya
On Tue, Feb 27, 2018 at 5:35 PM, Dave Sherohman  wrote:

> On Tue, Feb 27, 2018 at 04:59:36PM +0530, Karthik Subrahmanya wrote:
> > > > Since arbiter bricks need not be of the same size as the data bricks,
> > > > if you can configure three more arbiter bricks
> > > > based on the guidelines in the doc [1], you can do it live and you
> > > > will have the distribution count also unchanged.
> > >
> > > I can probably find one or more machines with a few hundred GB free
> > > which could be allocated for arbiter bricks if it would be significantly
> > > simpler and safer than repurposing the existing bricks (and I'm getting
> > > the impression that it probably would be).
> >
> > Yes, it is the simpler and safer way of doing it.
> >
> > >   Does it particularly matter
> > > whether the arbiters are all on the same node or on three separate
> > > nodes?
> > >
> > No, it doesn't matter as long as the bricks of the same replica subvol
> > are not on the same nodes.
>
> OK, great.  So basically just install the gluster server on the new
> node(s), do a peer probe to add them to the cluster, and then
>
> gluster volume create palantir replica 3 arbiter 1 [saruman brick]
> [gandalf brick] [arbiter 1] [azathoth brick] [yog-sothoth brick]
> [arbiter 2] [cthulhu brick] [mordiggian brick] [arbiter 3]
>
gluster volume add-brick <volname> replica 3 arbiter 1 <brick1> <brick2> <brick3>
is the command. It will convert the existing volume to an arbiter volume and
add the specified bricks as arbiter bricks to the existing subvols.
Once they are successfully added, self-heal should start automatically and
you can check the status of the heal using the command,
gluster volume heal <volname> info
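
For your volume that would look something like the sketch below; the arbiter
host names and brick paths here are placeholders, not names taken from your
setup:

  gluster peer probe arbiter1
  gluster peer probe arbiter2
  gluster peer probe arbiter3
  gluster volume add-brick palantir replica 3 arbiter 1 \
      arbiter1:/var/local/arbiter0/data \
      arbiter2:/var/local/arbiter0/data \
      arbiter3:/var/local/arbiter0/data
  gluster volume heal palantir info

The new bricks are assigned to the existing subvols in order: the first
becomes the arbiter for the saruman/gandalf subvol, the second for
azathoth/yog-sothoth, and the third for cthulhu/mordiggian, so list them
in that order.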

Regards,
Karthik

Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-27 Thread Dave Sherohman
On Tue, Feb 27, 2018 at 04:59:36PM +0530, Karthik Subrahmanya wrote:
> > > Since arbiter bricks need not be of the same size as the data bricks, if you
> > > can configure three more arbiter bricks
> > > based on the guidelines in the doc [1], you can do it live and you will
> > > have the distribution count also unchanged.
> >
> > I can probably find one or more machines with a few hundred GB free
> > which could be allocated for arbiter bricks if it would be significantly
> > simpler and safer than repurposing the existing bricks (and I'm getting
> > the impression that it probably would be).
> 
> Yes, it is the simpler and safer way of doing it.
> 
> >   Does it particularly matter
> > whether the arbiters are all on the same node or on three separate
> > nodes?
> >
> No, it doesn't matter as long as the bricks of the same replica subvol are
> not on the same nodes.

OK, great.  So basically just install the gluster server on the new
node(s), do a peer probe to add them to the cluster, and then

gluster volume create palantir replica 3 arbiter 1 [saruman brick]
[gandalf brick] [arbiter 1] [azathoth brick] [yog-sothoth brick]
[arbiter 2] [cthulhu brick] [mordiggian brick] [arbiter 3]

Or is there more to it than that?

-- 
Dave Sherohman


Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-27 Thread Karthik Subrahmanya
On Tue, Feb 27, 2018 at 4:18 PM, Dave Sherohman  wrote:

> On Tue, Feb 27, 2018 at 03:20:25PM +0530, Karthik Subrahmanya wrote:
> > If you want to use the first two bricks as arbiter, then you need to be
> > aware of the following things:
> > - Your distribution count will be decreased to 2.
>
> What's the significance of this?  I'm trying to find documentation on
> distribution counts in gluster, but my google-fu is failing me.
>
More distribution means better load balancing: with a lower distribute count,
the same data and I/O are spread across fewer replica subvols.
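
For reference, the distribute count is the first number in the "Number of
Bricks" line of "gluster volume info". A sketch of how it changes under the
two options discussed here, based on the volume info you shared:

  Number of Bricks: 3 x 2 = 6        # current: 3 distribute subvols, 2-way replica
  Number of Bricks: 2 x (2 + 1) = 6  # if bricks 1 & 2 are repurposed as arbiters
  Number of Bricks: 3 x (2 + 1) = 9  # if three new arbiter bricks are added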

>
> > - Your data on the first subvol i.e., replica subvol - 1 will be
> > unavailable till it is copied to the other subvols
> > after removing the bricks from the cluster.
>
> Hmm, ok.  I was sure I had seen a reference at some point to a command
> for migrating data off bricks to prepare them for removal.
>
> Is there an easy way to get a list of all files which are present on a
> given brick, then, so that I can see which data would be unavailable
> during this transfer?
>
The easiest way is by doing "ls" on the back end brick.
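
A sketch of that, using your brick path; the .glusterfs directory at the
brick root is Gluster's internal metadata, not user data, so it is pruned:

  # run on one node of the replica subvol in question, e.g. saruman
  find /var/local/brick0/data -path '*/.glusterfs' -prune -o -type f -print

Note that since features.shard is on for this volume, pieces of large files
also live under a .shard directory at the brick root, named by GFID rather
than by file name.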

>
> > Since arbiter bricks need not be of the same size as the data bricks, if you
> > can configure three more arbiter bricks
> > based on the guidelines in the doc [1], you can do it live and you will
> > have the distribution count also unchanged.
>
> I can probably find one or more machines with a few hundred GB free
> which could be allocated for arbiter bricks if it would be significantly
> simpler and safer than repurposing the existing bricks (and I'm getting
> the impression that it probably would be).

Yes, it is the simpler and safer way of doing it.

>   Does it particularly matter
> whether the arbiters are all on the same node or on three separate
> nodes?
>
No, it doesn't matter as long as the bricks of the same replica subvol are not
on the same nodes.

Regards,
Karthik

>
> --
> Dave Sherohman
>

Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-27 Thread Dave Sherohman
On Tue, Feb 27, 2018 at 03:20:25PM +0530, Karthik Subrahmanya wrote:
> If you want to use the first two bricks as arbiter, then you need to be
> aware of the following things:
> - Your distribution count will be decreased to 2.

What's the significance of this?  I'm trying to find documentation on
distribution counts in gluster, but my google-fu is failing me.

> - Your data on the first subvol i.e., replica subvol - 1 will be
> unavailable till it is copied to the other subvols
> after removing the bricks from the cluster.

Hmm, ok.  I was sure I had seen a reference at some point to a command
for migrating data off bricks to prepare them for removal.

Is there an easy way to get a list of all files which are present on a
given brick, then, so that I can see which data would be unavailable
during this transfer?

> Since arbiter bricks need not be of the same size as the data bricks, if you
> can configure three more arbiter bricks
> based on the guidelines in the doc [1], you can do it live and you will
> have the distribution count also unchanged.

I can probably find one or more machines with a few hundred GB free
which could be allocated for arbiter bricks if it would be significantly
simpler and safer than repurposing the existing bricks (and I'm getting
the impression that it probably would be).  Does it particularly matter
whether the arbiters are all on the same node or on three separate
nodes?

-- 
Dave Sherohman


Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-27 Thread Karthik Subrahmanya
On Tue, Feb 27, 2018 at 1:40 PM, Dave Sherohman  wrote:

> On Tue, Feb 27, 2018 at 12:00:29PM +0530, Karthik Subrahmanya wrote:
> > I will try to explain how you can end up in split-brain even with cluster
> > wide quorum:
>
> Yep, the explanation made sense.  I hadn't considered the possibility of
> alternating outages.  Thanks!
>
> > > > It would be great if you can consider configuring an arbiter or
> > > > replica 3 volume.
> > >
> > > I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
> > > bricks as arbiters with minimal effect on capacity.  What would be the
> > > sequence of commands needed to:
> > >
> > > 1) Move all data off of bricks 1 & 2
> > > 2) Remove that replica from the cluster
> > > 3) Re-add those two bricks as arbiters
> > >
> > > (And did I miss any additional steps?)
> > >
> > > Unfortunately, I've been running a few months already with the current
> > > configuration and there are several virtual machines running off the
> > > existing volume, so I'll need to reconfigure it online if possible.
> > >
> > Without knowing the volume configuration it is difficult to suggest the
> > configuration change, and since it is a live system you may end up with
> > data unavailability or data loss.
> > Can you give the output of "gluster volume info <volname>"
> > and say which brick is of what size?
>
> Volume Name: palantir
> Type: Distributed-Replicate
> Volume ID: 48379a50-3210-41b4-9a77-ae143c8bcac0
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x 2 = 6
> Transport-type: tcp
> Bricks:
> Brick1: saruman:/var/local/brick0/data
> Brick2: gandalf:/var/local/brick0/data
> Brick3: azathoth:/var/local/brick0/data
> Brick4: yog-sothoth:/var/local/brick0/data
> Brick5: cthulhu:/var/local/brick0/data
> Brick6: mordiggian:/var/local/brick0/data
> Options Reconfigured:
> features.scrub: Inactive
> features.bitrot: off
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on
> network.ping-timeout: 1013
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> features.shard: on
> cluster.data-self-heal-algorithm: full
> storage.owner-uid: 64055
> storage.owner-gid: 64055
>
>
> For brick sizes, saruman/gandalf have
>
> $ df -h /var/local/brick0
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/mapper/gandalf-gluster  885G   55G  786G   7% /var/local/brick0
>
> and the other four have
>
> $ df -h /var/local/brick0
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdb1    11T  254G   11T   3% /var/local/brick0
>

If you want to use the first two bricks as arbiter, then you need to be
aware of the following things:
- Your distribution count will be decreased to 2.
- Your data on the first subvol i.e., replica subvol - 1 will be
unavailable till it is copied to the other subvols
after removing the bricks from the cluster.

Since arbiter bricks need not be of the same size as the data bricks, if you
can configure three more arbiter bricks
based on the guidelines in the doc [1], you can do it live and you will
have the distribution count also unchanged.

One more thing from the volume info: only the options which have been
reconfigured appear in the volume info output.
Since quorum-type is in that list, it was manually reconfigured.

[1]
http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing
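
As a rough worked example of the sizing guideline in [1] (the file count is
hypothetical, and ~4 KB of metadata per file is the usual rule of thumb for
arbiter sizing):

  # suppose a data brick holds about 2 million files
  echo "2000000 * 4 / 1024 / 1024" | bc -l    # ~7.6 GiB of arbiter space
  # so a brick of a few hundred GB is far more than enough for an arbiter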

Regards,
Karthik

>
>
> --
> Dave Sherohman
>

Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-26 Thread Karthik Subrahmanya
On Mon, Feb 26, 2018 at 6:14 PM, Dave Sherohman  wrote:

> On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > > "In a replica 2 volume... If we set the client-quorum option to
> > > auto, then the first brick must always be up, irrespective of the
> > > status of the second brick. If only the second brick is up, the
> > > subvolume becomes read-only."
> > >
> > By default client-quorum is "none" in replica 2 volume.
>
> I'm not sure where I saw the directions saying to set it, but I do have
> "cluster.quorum-type: auto" in my volume configuration.  (And I think
> that's client quorum, but feel free to correct me if I've misunderstood
> the docs.)
>
If it is "auto" then I think it is reconfigured. In replica 2 it will be
"none".

>
> > It applies to all replica 2 volumes, whether they have just 2 bricks or
> > more. The total brick count in the volume doesn't matter for the quorum;
> > what matters is the number of bricks which are up in the particular
> > replica subvol.
>
> Thanks for confirming that.
>
> > If I understood your configuration correctly it should look something
> like
> > this:
> > (Please correct me if I am wrong)
> > replica-1:  bricks 1 & 2
> > replica-2: bricks 3 & 4
> > replica-3: bricks 5 & 6
>
> Yes, that's correct.
>
> > Since quorum is per replica, if it is set to auto then it needs the first
> > brick of the particular replica subvol to be up to perform the fop.
> >
> > In replica 2 volumes you can end up in split-brains.
>
> How would that happen if bricks which are not in (cluster-wide) quorum
> refuse to accept writes?  I'm not seeing the reason for using individual
> subvolume quorums instead of full-volume quorum.
>
Split-brains happen within a replica pair.
I will try to explain how you can end up in split-brain even with
cluster-wide quorum:
Let's say you have a 6-brick (replica 2) volume and you always have at least
the quorum number of bricks up & running.
Bricks 1 & 2 are part of replica subvol-1
Bricks 3 & 4 are part of replica subvol-2
Bricks 5 & 6 are part of replica subvol-3

- Brick 1 goes down and a write comes on a file which is part of
replica subvol-1
- Quorum is met since 5 out of 6 bricks are running
- Brick 2 records that brick 1 is bad
- Brick 2 goes down and brick 1 comes up. The heal has not happened yet
- A write comes on the same file; quorum is met, and now brick 1 records
that brick 2 is bad
- When both bricks 1 & 2 are up again, each blames the other brick -
*split-brain*
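
If a file does end up in that state, it shows up in the heal output and, on
reasonably recent releases, can be resolved from the CLI. A sketch, with a
hypothetical file path:

  gluster volume heal <volname> info split-brain
  gluster volume heal <volname> split-brain latest-mtime /path/inside/volume/file.img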

>
> > It would be great if you can consider configuring an arbiter or
> > replica 3 volume.
>
> I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
> bricks as arbiters with minimal effect on capacity.  What would be the
> sequence of commands needed to:
>
> 1) Move all data off of bricks 1 & 2
> 2) Remove that replica from the cluster
> 3) Re-add those two bricks as arbiters
>
>
(And did I miss any additional steps?)
>
> Unfortunately, I've been running a few months already with the current
> configuration and there are several virtual machines running off the
> existing volume, so I'll need to reconfigure it online if possible.
>
Without knowing the volume configuration it is difficult to suggest the
configuration change, and since it is a live system you may end up with
data unavailability or data loss.
Can you give the output of "gluster volume info <volname>"
and say which brick is of what size?
Note: The arbiter bricks need not be as big as the data bricks.
[1] gives information about how you can provision the arbiter brick.

[1]
http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#arbiter-bricks-sizing
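
A minimal sketch of provisioning such a brick, assuming a spare device
/dev/sdc and a path layout similar to your existing bricks (both are
assumptions, not taken from your setup):

  mkfs.xfs -i size=512 /dev/sdc        # XFS with 512-byte inodes, as commonly used for bricks
  mkdir -p /var/local/arbiter0
  mount /dev/sdc /var/local/arbiter0
  mkdir -p /var/local/arbiter0/data    # this directory is what gets passed to add-brick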

Regards,
Karthik

>
> --
> Dave Sherohman
>

Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-26 Thread Dave Sherohman
On Mon, Feb 26, 2018 at 05:45:27PM +0530, Karthik Subrahmanya wrote:
> > "In a replica 2 volume... If we set the client-quorum option to
> > auto, then the first brick must always be up, irrespective of the
> > status of the second brick. If only the second brick is up, the
> > subvolume becomes read-only."
> >
> By default client-quorum is "none" in replica 2 volume.

I'm not sure where I saw the directions saying to set it, but I do have
"cluster.quorum-type: auto" in my volume configuration.  (And I think
that's client quorum, but feel free to correct me if I've misunderstood
the docs.)

> It applies to all replica 2 volumes, whether they have just 2 bricks or more.
> The total brick count in the volume doesn't matter for the quorum; what matters
> is the number of bricks which are up in the particular replica subvol.

Thanks for confirming that.

> If I understood your configuration correctly it should look something like
> this:
> (Please correct me if I am wrong)
> replica-1:  bricks 1 & 2
> replica-2: bricks 3 & 4
> replica-3: bricks 5 & 6

Yes, that's correct.

> Since quorum is per replica, if it is set to auto then it needs the first
> brick of the particular replica subvol to be up to perform the fop.
> 
> In replica 2 volumes you can end up in split-brains.

How would that happen if bricks which are not in (cluster-wide) quorum
refuse to accept writes?  I'm not seeing the reason for using individual
subvolume quorums instead of full-volume quorum.

> It would be great if you can consider configuring an arbiter or
> replica 3 volume.

I can.  My bricks are 2x850G and 4x11T, so I can repurpose the small
bricks as arbiters with minimal effect on capacity.  What would be the
sequence of commands needed to:

1) Move all data off of bricks 1 & 2
2) Remove that replica from the cluster
3) Re-add those two bricks as arbiters

(And did I miss any additional steps?)

Unfortunately, I've been running a few months already with the current
configuration and there are several virtual machines running off the
existing volume, so I'll need to reconfigure it online if possible.

-- 
Dave Sherohman


Re: [Gluster-users] Quorum in distributed-replicate volume

2018-02-26 Thread Karthik Subrahmanya
Hi Dave,

On Mon, Feb 26, 2018 at 4:45 PM, Dave Sherohman  wrote:

> I've configured 6 bricks as distributed-replicated with replica 2,
> expecting that all active bricks would be usable so long as a quorum of
> at least 4 live bricks is maintained.
>
The client quorum is configured per replica subvolume and not for the
entire volume.
Since you have a distributed-replicated volume with replica 2, the data
has 2 copies, and taking quorum on the total number of bricks, as in your
scenario, can lead to split-brains.

>
> However, I have just found
>
http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/
>
> Which states that "In a replica 2 volume... If we set the client-quorum
> option to auto, then the first brick must always be up, irrespective of
> the status of the second brick. If only the second brick is up, the
> subvolume becomes read-only."
>
By default client-quorum is "none" in replica 2 volume.

>
> Does this apply only to a two-brick replica 2 volume or does it apply to
> all replica 2 volumes, even if they have, say, 6 bricks total?
>
It applies to all replica 2 volumes, whether they have just 2 bricks or more.
The total brick count in the volume doesn't matter for the quorum; what matters
is the number of bricks which are up in the particular replica subvol.

>
> If it does apply to distributed-replicated volumes with >2 bricks,
> what's the reasoning for it?  I would expect that, if the cluster splits
> into brick 1 by itself and bricks 2-3-4-5-6 still together, then brick 1
> will recognize that it doesn't have volume-wide quorum and reject
> writes, thus allowing brick 2 to remain authoritative and able to accept
> writes.
>
If I understood your configuration correctly it should look something like
this:
(Please correct me if I am wrong)
replica-1:  bricks 1 & 2
replica-2: bricks 3 & 4
replica-3: bricks 5 & 6
Since quorum is per replica, if it is set to auto then it needs the first
brick of the particular replica subvol to be up to perform the fop.

In replica 2 volumes you can end up in split-brains. It would be great if
you can consider configuring an arbiter or replica 3 volume.
You can find more details about their advantages over replica 2 volume in
the same document.
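
For reference, the relevant options can be inspected and changed per volume;
a sketch (option names as used by Gluster; "volume get" needs a reasonably
recent release):

  gluster volume get <volname> cluster.quorum-type           # client quorum
  gluster volume get <volname> cluster.server-quorum-type    # server quorum
  gluster volume set <volname> cluster.quorum-type auto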

Regards,
Karthik

>
> --
> Dave Sherohman

[Gluster-users] Quorum in distributed-replicate volume

2018-02-26 Thread Dave Sherohman
I've configured 6 bricks as distributed-replicated with replica 2,
expecting that all active bricks would be usable so long as a quorum of
at least 4 live bricks is maintained.

However, I have just found

http://docs.gluster.org/en/latest/Administrator%20Guide/Split%20brain%20and%20ways%20to%20deal%20with%20it/

Which states that "In a replica 2 volume... If we set the client-quorum
option to auto, then the first brick must always be up, irrespective of
the status of the second brick. If only the second brick is up, the
subvolume becomes read-only."

Does this apply only to a two-brick replica 2 volume or does it apply to
all replica 2 volumes, even if they have, say, 6 bricks total?

If it does apply to distributed-replicated volumes with >2 bricks,
what's the reasoning for it?  I would expect that, if the cluster splits
into brick 1 by itself and bricks 2-3-4-5-6 still together, then brick 1
will recognize that it doesn't have volume-wide quorum and reject
writes, thus allowing brick 2 to remain authoritative and able to accept
writes.

-- 
Dave Sherohman
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users