Re: [ceph-users] Two datacenter resilient design with a quorum site

2018-01-18 Thread Gregory Farnum
On Thu, Jan 18, 2018 at 5:57 AM, Alex Gorbachev wrote:
> On Tue, Jan 16, 2018 at 2:17 PM, Gregory Farnum  wrote:
>> On Tue, Jan 16, 2018 at 6:07 AM Alex Gorbachev wrote:
>>>
>>> I found a few WAN RBD cluster design discussions, but not a local one,
>>> so was wondering if anyone has experience with a resilience-oriented
>>> short distance (<10 km, redundant fiber connections) cluster in two
>>> datacenters with a third site for quorum purposes only?
>>>
>>> I can see two types of scenarios:
>>>
>>> 1. Two (or any even number of) OSD nodes at each site, 4x replication
>>> (size 4, min_size 2).  Three MONs, one at each site to handle split
>>> brain.
>>>
>>> Question: How does the cluster handle the loss of communication
>>> between the OSD sites A and B, while both can communicate with the
>>> quorum site C?  It seems one of the sites should suspend, as OSDs
>>> will not be able to communicate between sites.
>>
>>
>> Sadly this won't work — the OSDs on each side will report their peers on the
>> other side down, but both will be able to connect to a live monitor.
>> (Assuming the quorum site holds the leader monitor, anyway — if one of the
>> main sites holds what should be the leader, you'll get into a monitor
>> election storm instead.) You'll need your own netsplit monitoring to shut
>> down one site if that kind of connection cut is a possibility.
>
> What about running a split-brain-aware tool, such as Pacemaker, and
> running a copy of the same VM as a mon at each site?  In case of a
> split-brain network separation, Pacemaker would (aware via the third
> site) stop the mon on site A and bring up the mon on site B (or
> whatever the rules are set to).  I read earlier that a mon with the
> same IP, name and keyring would just look to the Ceph cluster like a
> very old mon, but one still able to vote for quorum.

It probably is, but don't do that: just use your network monitoring to
shut down the site you've decided is less important. No need to try to
replace its monitor on the primary site or anything like that. (It
would leave you with a mess when trying to restore the secondary
site!)
If you're worried about handling an additional monitor failure, you
can do two per site (plus the quorum tiebreaker).
-Greg
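
A quick sketch of the quorum arithmetic behind that last suggestion
(plain Python, no Ceph calls; the layouts are illustrative): three
monitors tolerate one arbitrary failure, while two per main site plus
the tiebreaker tolerate two, including the loss of an entire main
site.

def quorum_holds(total: int, alive: int) -> bool:
    """Ceph monitors form quorum only with a strict majority of the monmap."""
    return alive > total // 2

# One monitor per site versus two per main site plus the tiebreaker.
for label, total in (("1+1+1", 3), ("2+2+1", 5)):
    tolerated = (total - 1) // 2
    print(f"{label}: {total} mons tolerate {tolerated} arbitrary failure(s)")

# With 2+2+1, losing an entire main site (2 mons) still leaves 3 of 5:
print("2+2+1, whole main site down:", quorum_holds(5, 3))     # True
# ...but one further monitor failure on top of that breaks quorum:
print("2+2+1, main site + 1 more down:", quorum_holds(5, 2))  # False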

>
> Vincent Godin also described an HSRP-based method, which would
> accomplish this goal via network routing.  That seems like a good
> approach too; I just need to check on HSRP availability.
>
>>
>>>
>>>
>>> 2. 3x replication for performance or cost (size 3, min_size 2 - or
>>> even min_size 1 and strict monitoring).  Two replicas and two MONs at
>>> one site and one replica and one MON at the other site.
>>>
>>> Question: in case of a permanent failure of the main site (with two
>>> replicas), how to manually force the other site (with one replica and
>>> MON) to provide storage?  I would think a CRUSH map change and
>>> modifying ceph.conf to include just one MON, then build two more MONs
>>> locally and add them?
>>
>>
>> Yep, pretty much that. You won't need to change ceph.conf to just one mon so
>> much as to include the current set of mons and update the monmap. I believe
>> that process is in the disaster recovery section of the docs.
>
> Thank you.
>
> Alex
>
>> -Greg
>>
>>>
>>>
>>> --
>>> Alex Gorbachev
>>> Storcium


Re: [ceph-users] Two datacenter resilient design with a quorum site

2018-01-18 Thread Alex Gorbachev
On Tue, Jan 16, 2018 at 2:17 PM, Gregory Farnum  wrote:
> On Tue, Jan 16, 2018 at 6:07 AM Alex Gorbachev wrote:
>>
>> I found a few WAN RBD cluster design discussions, but not a local one,
>> so was wondering if anyone has experience with a resilience-oriented
>> short distance (<10 km, redundant fiber connections) cluster in two
>> datacenters with a third site for quorum purposes only?
>>
>> I can see two types of scenarios:
>>
>> 1. Two (or any even number of) OSD nodes at each site, 4x replication
>> (size 4, min_size 2).  Three MONs, one at each site to handle split
>> brain.
>>
>> Question: How does the cluster handle the loss of communication
>> between the OSD sites A and B, while both can communicate with the
>> quorum site C?  It seems one of the sites should suspend, as OSDs
>> will not be able to communicate between sites.
>
>
> Sadly this won't work — the OSDs on each side will report their peers on the
> other side down, but both will be able to connect to a live monitor.
> (Assuming the quorum site holds the leader monitor, anyway — if one of the
> main sites holds what should be the leader, you'll get into a monitor
> election storm instead.) You'll need your own netsplit monitoring to shut
> down one site if that kind of connection cut is a possibility.

What about running a split-brain-aware tool, such as Pacemaker, and
running a copy of the same VM as a mon at each site?  In case of a
split-brain network separation, Pacemaker would (aware via the third
site) stop the mon on site A and bring up the mon on site B (or
whatever the rules are set to).  I read earlier that a mon with the
same IP, name and keyring would just look to the Ceph cluster like a
very old mon, but one still able to vote for quorum.

Vincent Godin also described an HSRP-based method, which would
accomplish this goal via network routing.  That seems like a good
approach too; I just need to check on HSRP availability.

>
>>
>>
>> 2. 3x replication for performance or cost (size 3, min_size 2 - or
>> even min_size 1 and strict monitoring).  Two replicas and two MONs at
>> one site and one replica and one MON at the other site.
>>
>> Question: in case of a permanent failure of the main site (with two
>> replicas), how to manually force the other site (with one replica and
>> MON) to provide storage?  I would think a CRUSH map change and
>> modifying ceph.conf to include just one MON, then build two more MONs
>> locally and add them?
>
>
> Yep, pretty much that. You won't need to change ceph.conf to just one mon so
> much as to include the current set of mons and update the monmap. I believe
> that process is in the disaster recovery section of the docs.

Thank you.

Alex

> -Greg
>
>>
>>
>> --
>> Alex Gorbachev
>> Storcium


Re: [ceph-users] Two datacenter resilient design with a quorum site

2018-01-16 Thread Gregory Farnum
On Tue, Jan 16, 2018 at 6:07 AM Alex Gorbachev wrote:

> I found a few WAN RBD cluster design discussions, but not a local one,
> so was wondering if anyone has experience with a resilience-oriented
> short distance (<10 km, redundant fiber connections) cluster in two
> datacenters with a third site for quorum purposes only?
>
> I can see two types of scenarios:
>
> 1. Two (or any even number of) OSD nodes at each site, 4x replication
> (size 4, min_size 2).  Three MONs, one at each site to handle split
> brain.
>
> Question: How does the cluster handle the loss of communication
> between the OSD sites A and B, while both can communicate with the
> quorum site C?  It seems one of the sites should suspend, as OSDs
> will not be able to communicate between sites.
>

Sadly this won't work — the OSDs on each side will report their peers on
the other side down, but both will be able to connect to a live monitor.
(Assuming the quorum site holds the leader monitor, anyway — if one of the
main sites holds what should be the leader, you'll get into a monitor
election storm instead.) You'll need your own netsplit monitoring to shut
down one site if that kind of connection cut is a possibility.
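
To make that concrete, here is a rough sketch of what such netsplit
monitoring could look like: a small Python watchdog run from the
quorum site that, when it can still reach both main sites but they can
no longer reach each other, stops Ceph on the site you have agreed to
sacrifice.  The hostnames are placeholders, a single host stands in
for each site, and a real deployment would want proper fencing rather
than a bare script like this.

import subprocess

SITE_A_HOST = "osd-a1.example.com"   # hypothetical host at site A
SITE_B_HOST = "osd-b1.example.com"   # hypothetical host at site B
LOSING_SITE_HOST = SITE_B_HOST       # the site we agree to sacrifice

def reachable_from_here(host: str) -> bool:
    """Ping a host from the quorum site (site C)."""
    return subprocess.run(["ping", "-c", "2", "-W", "2", host],
                          stdout=subprocess.DEVNULL).returncode == 0

def reachable_from(host: str, target: str) -> bool:
    """Ask `host` over ssh whether it can still ping `target`."""
    return subprocess.run(["ssh", host, "ping -c 2 -W 2 " + target],
                          stdout=subprocess.DEVNULL).returncode == 0

def check_once() -> None:
    both_visible = (reachable_from_here(SITE_A_HOST)
                    and reachable_from_here(SITE_B_HOST))
    cross_link_ok = reachable_from(SITE_A_HOST, SITE_B_HOST)
    if both_visible and not cross_link_ok:
        # Classic split: C sees both sites, but A and B cannot see each
        # other.  Stop the Ceph daemons on the losing site so only one
        # side keeps accepting writes.
        subprocess.run(["ssh", LOSING_SITE_HOST,
                        "sudo systemctl stop ceph.target"])

if __name__ == "__main__":
    check_once()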


>
> 2. 3x replication for performance or cost (size 3, min_size 2 - or
> even min_size 1 and strict monitoring).  Two replicas and two MONs at
> one site and one replica and one MON at the other site.
>
> Question: in case of a permanent failure of the main site (with two
> replicas), how to manually force the other site (with one replica and
> MON) to provide storage?  I would think a CRUSH map change and
> modifying ceph.conf to include just one MON, then build two more MONs
> locally and add them?
>

Yep, pretty much that. You won't need to change ceph.conf to just one mon
so much as to include the current set of mons and update the monmap. I
believe that process is in the disaster recovery section of the docs.
-Greg
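
For reference, a rough sketch of that surviving-site procedure as a
small Python wrapper around the documented CLI steps.  The monitor
IDs, the names of the lost monitors and the pool name are
placeholders, and the disaster recovery section of the docs remains
the authoritative source; this is not a tested runbook.

import subprocess

SURVIVING_MON = "mon-b1"            # hypothetical monitor at the surviving site
DEAD_MONS = ["mon-a1", "mon-a2"]    # hypothetical monitors at the lost site
MONMAP = "/tmp/monmap"

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop the surviving monitor so its map can be edited safely.
run("systemctl", "stop", "ceph-mon@" + SURVIVING_MON)

# 2. Extract its monmap, drop the monitors that no longer exist, and
#    inject the trimmed map back.
run("ceph-mon", "-i", SURVIVING_MON, "--extract-monmap", MONMAP)
for dead in DEAD_MONS:
    run("monmaptool", MONMAP, "--rm", dead)
run("ceph-mon", "-i", SURVIVING_MON, "--inject-monmap", MONMAP)

# 3. Bring the lone monitor back up; it is now a quorum of one.  Also
#    update mon_host / mon_initial_members in ceph.conf on the
#    remaining nodes and clients so they only look for this monitor.
run("systemctl", "start", "ceph-mon@" + SURVIVING_MON)

# 4. With size 3 / min_size 2 and only one surviving replica, PGs stay
#    inactive until min_size is lowered or the data recovers onto new
#    OSDs.  The pool name here is illustrative.
run("ceph", "osd", "pool", "set", "rbd", "min_size", "1")

# New monitors can then be built locally and added back, and size /
# min_size restored once replacement OSD capacity is in place.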


>
> --
> Alex Gorbachev
> Storcium
>


[ceph-users] Two datacenter resilient design with a quorum site

2018-01-16 Thread Alex Gorbachev
I found a few WAN RBD cluster design discussions, but not a local one,
so was wondering if anyone has experience with a resilience-oriented
short distance (<10 km, redundant fiber connections) cluster in two
datacenters with a third site for quorum purposes only?

I can see two types of scenarios:

1. Two (or any even number of) OSD nodes at each site, 4x replication
(size 4, min_size 2).  Three MONs, one at each site to handle split
brain.  (A placement sketch for this layout appears below.)

Question: How does the cluster handle the loss of communication
between the OSD sites A and B, while both can communicate with the
quorum site C?  It seems one of the sites should suspend, as OSDs
will not be able to communicate between sites.

2. 3x replication for performance or cost (size 3, min_size 2 - or
even min_size 1 and strict monitoring).  Two replicas and two MONs at
one site and one replica and one MON at the other site.

Question: in case of a permanent failure of the main site (with two
replicas), how to manually force the other site (with one replica and
MON) to provide storage?  I would think a CRUSH map change and
modifying ceph.conf to include just one MON, then build two more MONs
locally and add them?
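
For scenario 1, a rough sketch of what the placement setup might look
like, assuming the CRUSH tree already models the two sites as
datacenter buckets under the default root; the rule name and id, pool
name and PG counts are placeholders rather than a tested recipe.

import subprocess

# CRUSH rule that picks two datacenters, then two hosts in each, for
# four replicas total with two per site.
RULE = """
rule rbd_two_per_dc {
    id 10
    type replicated
    min_size 2
    max_size 4
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
"""

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Decompile the current CRUSH map, append the rule, recompile, inject.
run("ceph", "osd", "getcrushmap", "-o", "/tmp/crushmap.bin")
run("crushtool", "-d", "/tmp/crushmap.bin", "-o", "/tmp/crushmap.txt")
with open("/tmp/crushmap.txt", "a") as f:
    f.write(RULE)
run("crushtool", "-c", "/tmp/crushmap.txt", "-o", "/tmp/crushmap.new")
run("ceph", "osd", "setcrushmap", "-i", "/tmp/crushmap.new")

# Create an RBD pool on the new rule with 4 copies, 2 required for I/O.
run("ceph", "osd", "pool", "create", "rbd-dc", "256", "256",
    "replicated", "rbd_two_per_dc")
run("ceph", "osd", "pool", "set", "rbd-dc", "size", "4")
run("ceph", "osd", "pool", "set", "rbd-dc", "min_size", "2")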

--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com