Hi,
I should clarify. When you worry about concurrent OSD failures, the likely
source is shared infrastructure: network, rack, or power. You organize your
OSDs across those failure domains and tell CRUSH to put each replica in a
separate one. For example: you have three or more racks, each with its own
top-of-rack switch and hopefully its own power circuit, and you tell CRUSH
to spread your three replicas so that they land in three separate racks.

We do run min_size=2, size=3, but with OSDs spread across multiple racks
and CRUSH requiring the three replicas to be in three different racks. Our
reasoning: two or more machines failing at the same instant for a reason
other than a shared switch or power circuit is unlikely enough that we'll
happily live with it, and so far that has served us well.

-KJ

On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen <kje...@medallia.com> wrote:
>
> For the most part I'm assuming min_size=2, size=3. With min_size=3 and
> size=3 this changes.
>
> size is how many replicas of an object to maintain; min_size is how many
> writes need to succeed before the primary can ack the operation to the
> client.
>
> A larger min_size most likely means higher latency for writes.
>
> On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carhe...@ucar.edu> wrote:
>
>> On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kje...@medallia.com>
>> wrote:
>>
>> >> c. Reads can continue from the single online OSD even in PGs that
>> >> happened to have two of 3 OSDs offline.
>> >
>> > Hypothetically (this is partially informed guessing on my part):
>> > if the survivor happens to be the acting primary and it was up to date
>> > at the time, it can in theory serve reads. (Only the primary serves
>> > reads.)
>>
>> It makes no sense that only the primary could serve reads. That would
>> mean that even if only a single OSD failed, all PGs for which that OSD
>> was primary would be unreadable.
>
> Acting [1, 2, 3]: the primary is 1, and only 1 serves reads. If 1 fails,
> 2 is the new primary.
> It'll probably check with 3 to determine whether there were any writes it
> itself is unaware of, and peer if there were. Promotion should be near
> instantaneous (well, you'd in all likelihood be able to measure it).
>
>> There must be an algorithm to appoint a new primary. So in a 2-OSD
>> failure scenario, a new primary should be appointed after the first
>> failure, no? Would the final remaining OSD not appoint itself as
>> primary after the 2nd failure?
>
> Assuming min_size=2, size=3: if 2 OSDs fail at the same instant, you
> have no guarantee that the survivor has all writes.
>
> Assuming min_size=3 and size=3: then yes, you're good; the surviving
> OSD can safely be promoted. You're severely degraded, but it can safely
> be promoted.
>
> If you genuinely worry about concurrent failures of 2 machines, run with
> min_size=3; the price you pay is slightly increased mean/median latency
> for writes.
>
>> This makes sense in the context of Ceph's synchronous writes too. A
>> write isn't complete until all 3 OSDs in the PG have the data,
>> correct? So shouldn't any one of them be able to act as primary at any
>> time?
>
> See the distinction between size and min_size above.
>
>> I don't see how that would change even if 2 of 3 OSDs fail at exactly
>> the same time.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Kjetil Joergensen <kje...@medallia.com>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580

--
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
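P.S. For anyone following along, the "one replica per rack" placement
described above is expressed as a CRUSH rule. A rough sketch of what that
rule could look like in the decompiled-crushmap text format (names and
numbers are illustrative, not our actual crushmap):

```
rule replicated_across_racks {
    ruleset 1            # rule id; unrelated to the pool's min_size
    type replicated
    min_size 1           # rule applies to pools with size in [1, 10];
    max_size 10          # NOT the pool-level min_size discussed above
    step take default    # start at the root bucket containing the racks
    step chooseleaf firstn 0 type rack   # one OSD from each of `size` distinct racks
    step emit
}
```

`step chooseleaf firstn 0 type rack` is what makes CRUSH pick each replica's
OSD under a different rack bucket, so losing a whole rack (switch, power)
costs you at most one replica per PG.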
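P.P.S. The size/min_size counting argument above can be sketched in a few
lines of Python. This is a toy model of the arithmetic, not Ceph's actual
peering logic, and all names are made up:

```python
# Toy model: every acknowledged write exists on at least min_size of the
# PG's replicas, so after some number of simultaneous OSD failures we can
# lower-bound how many copies of each acked write are guaranteed to survive.

def copies_guaranteed_surviving(min_size, simultaneous_failures):
    """Lower bound on surviving copies of any acknowledged write."""
    return max(0, min_size - simultaneous_failures)

def survivor_safely_promotable(size, min_size, simultaneous_failures):
    """True if the surviving OSDs collectively still hold every acked
    write, so a new primary can be promoted (peering with the other
    survivors fills any gaps) without losing acknowledged writes."""
    assert simultaneous_failures < size, "need at least one survivor"
    return copies_guaranteed_surviving(min_size, simultaneous_failures) >= 1

# min_size=2, size=3, two simultaneous failures: the lone survivor may be
# missing writes that were already acked to clients.
print(survivor_safely_promotable(3, 2, 2))  # -> False

# min_size=3, size=3: every acked write is on all three OSDs, so the lone
# survivor is safe -- severely degraded, but safe.
print(survivor_safely_promotable(3, 3, 2))  # -> True

# A single failure with min_size=2 is fine: at least one survivor holds
# each acked write, and peering recovers the rest.
print(survivor_safely_promotable(3, 2, 1))  # -> True
```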