Re: [ceph-users] mirror OSD configuration

2018-02-28 Thread David Turner
A more common search term for this might be Rack failure domain.  The
premise is the same for room as it is for rack; both can hold hosts and be
set as the failure domain.  There is a fair bit of discussion on how to
achieve multi-rack/room/datacenter setups.  Datacenter setups are more
likely to have rules that coincide with only having 2 failure domains, like
you have here.  When you read about them, just s/datacenter/room/ and
you're good.

I'm going to second the concern of only having 2 failure domains.  Unless
you guarantee 2 copies in each room, you're looking at allowing min_size=1,
which is just bad practice and a common way to lose data, as witnessed on
this list multiple times.  Running with size=4 and min_size=2 seems to be
your best bet here.  It's a lot of overhead, but you can handle an entire
room going down.  Higher levels of redundancy usually come with some cost;
this is the cost here.
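
As a rough sketch of what that could look like (untested; the rule name, the
"default" root, and <pool> are placeholders to adjust for your cluster), a
CRUSH rule that pins 2 copies to each room, plus the matching pool settings:

# edit the decompiled crush map (ceph osd getcrushmap -o cm;
# crushtool -d cm -o cm.txt), add a rule like this, recompile with
# crushtool -c cm.txt -o cm.new and load it with ceph osd setcrushmap -i cm.new
rule replicated_two_rooms {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 2 type room        # pick both rooms
    step chooseleaf firstn 2 type host    # 2 copies on distinct hosts per room
    step emit
}

# point the pool at the rule and set the sizes
ceph osd pool set <pool> crush_rule replicated_two_rooms
ceph osd pool set <pool> size 4
ceph osd pool set <pool> min_size 2

With 2 copies pinned to each room, losing a room still leaves you at
min_size=2, so client I/O should keep flowing without ever dropping to
min_size=1.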

On Wed, Feb 28, 2018 at 12:38 PM Gregory Farnum  wrote:

> On Wed, Feb 28, 2018 at 3:02 AM Zoran Bošnjak <
> zoran.bosn...@sloveniacontrol.si> wrote:
>
>> I am aware of the monitor consensus requirement. It is taken care of (there
>> is a third room with only a monitor node). My problem is about OSD
>> redundancy, since I can only use 2 server rooms for OSDs.
>>
>> I could use EC-pools, lrc, or any other ceph configuration. But I could
>> not find a configuration that would address the issue. The write
>> acknowledge rule should read something like this:
>> 1. If both rooms are "up", do not acknowledge the write until an ack is
>> received from both rooms.
>> 2. If only one room is "up" (forget rule 1), acknowledge the write on the
>> first ack.
>
>
> This check is performed when PGs go active, not on every write (once a PG
> goes active, it needs a commit from everybody in the set before writes are
> done, or else it has to go through peering again), but that is the standard
> behavior for Ceph if you configure CRUSH to place data redundantly in both
> rooms.
>
>
>>
>> The ceph documentation talks about recursively defined locality sets, so
>> I assume it allows for different rules on room/rack/host... levels.
>> But as far as I can see, it cannot depend on "room" availability.
>>
>> Is this possible to configure?
>> I would appreciate example configuration commands.
>>
>> regards,
>> Zoran
>>
>> 
>> From: Eino Tuominen 
>> Sent: Wednesday, February 28, 2018 8:47 AM
>> To: Zoran Bošnjak; ceph-us...@ceph.com
>> Subject: Re: mirror OSD configuration
>>
>> > Is it possible to configure crush map such that it will tolerate "room"
>> failure? In my case, there is one
>> > network switch per room and one power supply per room, which makes a
>> single point of (room) failure.
>>
>> Hi,
>>
>> You cannot achieve real room redundancy with just two rooms. At minimum
>> you'll need a third room (witness) from which you'll need independent
>> network connections to the two server rooms. Otherwise it's impossible to
>> have monitor quorum when one of the two rooms fails. And then you'd need to
>> consider osd redundancy. You could do with replica size = 4, min_size = 2
>> (or any min_size = n, size = 2*n ), but that's not perfect as you lose
>> exactly half of the replicas in case of a room failure. If you were able to
>> use EC-pools you'd have more options with LRC coding (
>> http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/).
>>
>> We run ceph in a 3 room configuration with 3 monitors, size=3,
>> min_size=2. It works, but it's not without hassle either.
>>
>> --
>>   Eino Tuominen


Re: [ceph-users] mirror OSD configuration

2018-02-28 Thread Gregory Farnum
On Wed, Feb 28, 2018 at 3:02 AM Zoran Bošnjak <
zoran.bosn...@sloveniacontrol.si> wrote:

> I am aware of the monitor consensus requirement. It is taken care of (there
> is a third room with only a monitor node). My problem is about OSD
> redundancy, since I can only use 2 server rooms for OSDs.
>
> I could use EC-pools, lrc, or any other ceph configuration. But I could not
> find a configuration that would address the issue. The write acknowledge
> rule should read something like this:
> 1. If both rooms are "up", do not acknowledge the write until an ack is
> received from both rooms.
> 2. If only one room is "up" (forget rule 1), acknowledge the write on the
> first ack.


This check is performed when PGs go active, not on every write (once a PG
goes active, it needs a commit from everybody in the set before writes are
done, or else it has to go through peering again), but that is the standard
behavior for Ceph if you configure CRUSH to place data redundantly in both
rooms.
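
If it helps, a quick way to sanity-check the placement once such a rule is
in place (pool and object names below are only examples):

# show the room/host hierarchy CRUSH is drawing from
ceph osd tree

# compute the mapping of an example object and check that the up/acting
# set spans OSDs in both rooms
ceph osd map <pool> some-object

# per-PG view of the up/acting sets
ceph pg dump pgs_brief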


>
> The ceph documentation talks about recursively defined locality sets, so I
> assume it allows for different rules on room/rack/host... levels.
> But as far as I can see, it cannot depend on "room" availability.
>
> Is this possible to configure?
> I would appreciate example configuration commands.
>
> regards,
> Zoran
>
> 
> From: Eino Tuominen 
> Sent: Wednesday, February 28, 2018 8:47 AM
> To: Zoran Bošnjak; ceph-us...@ceph.com
> Subject: Re: mirror OSD configuration
>
> > Is it possible to configure crush map such that it will tolerate "room"
> failure? In my case, there is one
> > network switch per room and one power supply per room, which makes a
> single point of (room) failure.
>
> Hi,
>
> You cannot achieve real room redundancy with just two rooms. At minimum
> you'll need a third room (witness) from which you'll need independent
> network connections to the two server rooms. Otherwise it's impossible to
> have monitor quorum when one of the two rooms fails. And then you'd need to
> consider osd redundancy. You could do with replica size = 4, min_size = 2
> (or any min_size = n, size = 2*n ), but that's not perfect as you lose
> exactly half of the replicas in case of a room failure. If you were able to
> use EC-pools you'd have more options with LRC coding (
> http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/).
>
> We run ceph in a 3 room configuration with 3 monitors, size=3, min_size=2.
> It works, but it's not without hassle either.
>
> --
>   Eino Tuominen


Re: [ceph-users] mirror OSD configuration

2018-02-28 Thread Zoran Bošnjak
I am aware of the monitor consensus requirement. It is taken care of (there is a
third room with only a monitor node). My problem is about OSD redundancy, since I
can only use 2 server rooms for OSDs.

I could use EC-pools, lrc, or any other ceph configuration. But I could not find
a configuration that would address the issue. The write acknowledge rule should
read something like this:
1. If both rooms are "up", do not acknowledge the write until an ack is received
from both rooms.
2. If only one room is "up" (forget rule 1), acknowledge the write on the first ack.

The ceph documentation talks about recursively defined locality sets, so I 
assume it allows for different rules on room/rack/host... levels.
But as far as I can see, it cannot depend on "room" availability.

Is this possible to configure?
I would appreciate example configuration commands.

regards,
Zoran


From: Eino Tuominen 
Sent: Wednesday, February 28, 2018 8:47 AM
To: Zoran Bošnjak; ceph-us...@ceph.com
Subject: Re: mirror OSD configuration

> Is it possible to configure crush map such that it will tolerate "room" 
> failure? In my case, there is one
> network switch per room and one power supply per room, which makes a single 
> point of (room) failure.

Hi,

You cannot achieve real room redundancy with just two rooms. At minimum you'll 
need a third room (witness) from which you'll need independent network 
connections to the two server rooms. Otherwise it's impossible to have monitor 
quorum when one of the two rooms fails. And then you'd need to consider osd 
redundancy. You could do with replica size = 4, min_size = 2 (or any min_size = 
n, size = 2*n ), but that's not perfect as you lose exactly half of the 
replicas in case of a room failure. If you were able to use EC-pools you'd have 
more options with LRC coding 
(http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/).

We run ceph in a 3 room configuration with 3 monitors, size=3, min_size=2. It 
works, but it's not without hassle either.

--
  Eino Tuominen


Re: [ceph-users] mirror OSD configuration

2018-02-27 Thread Eino Tuominen
> Is it possible to configure crush map such that it will tolerate "room" 
> failure? In my case, there is one
> network switch per room and one power supply per room, which makes a single 
> point of (room) failure.

Hi,

You cannot achieve real room redundancy with just two rooms. At minimum you'll 
need a third room (witness) from which you'll need independent network 
connections to the two server rooms. Otherwise it's impossible to have monitor 
quorum when one of the two rooms fails. And then you'd need to consider osd 
redundancy. You could do with replica size = 4, min_size = 2 (or any min_size = 
n, size = 2*n ), but that's not perfect as you lose exactly half of the 
replicas in case of a room failure. If you were able to use EC-pools you'd have 
more options with LRC coding 
(http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/). 
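
As a (completely untested) illustration, and with k/m/l values that are only
an example and would need to fit your OSD layout, such a profile could be
created along these lines:

# profile name, pool name and pg counts are placeholders
ceph osd erasure-code-profile set lrc_rooms \
    plugin=lrc \
    k=4 m=2 l=3 \
    crush-locality=room \
    crush-failure-domain=host

# create a pool that uses the profile
ceph osd pool create lrc_pool 128 128 erasure lrc_rooms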

We run ceph in a 3 room configuration with 3 monitors, size=3, min_size=2. It 
works, but it's not without hassle either.

-- 
  Eino Tuominen


[ceph-users] mirror OSD configuration

2018-02-27 Thread Zoran Bošnjak
This is my planned OSD configuration:

root
  room1
    OSD host1
    OSD host2
  room2
    OSD host3
    OSD host4

There are 6 OSDs per host.
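
(For reference, I am planning to build that hierarchy with commands along the
following lines; the root, room and host names are just my placeholders:)

# create the room buckets and hang them under the root
ceph osd crush add-bucket room1 room
ceph osd crush add-bucket room2 room
ceph osd crush move room1 root=default
ceph osd crush move room2 root=default
# move the OSD hosts into their rooms
ceph osd crush move host1 room=room1
ceph osd crush move host2 room=room1
ceph osd crush move host3 room=room2
ceph osd crush move host4 room=room2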

Is it possible to configure crush map such that it will tolerate "room" 
failure? In my case, there is one network switch per room and one power supply 
per room, which makes a single point of (room) failure. This is what I would 
like to mitigate.

I could not find any crush rule that would make this configuration redundant 
and safe.

Namely, to tolerate a sudden room (switch, power) failure, there must be a rule 
to "ack" a write only after BOTH rooms have acknowledged it. The problem is that 
this rule holds only while both rooms are up. As soon as one room goes down (with 
a rule like this), the cluster won't be able to write any more data, since the 
"ack" is not allowed by the rule. It looks like an impossible task with a fixed 
crush map rule. The cluster would somehow need to switch rules to make this 
work. What am I missing?

In general: can ceph tolerate the sudden loss of half of the OSDs?
If not, what is the best redundancy I could get out of my configuration?
Is there any workaround, maybe with some external tools, to detect such a failure 
and reconfigure ceph automatically?

regards,
Zoran