Hi,
I should clarify. When you worry about concurrent OSD failures, the likely
source is shared infrastructure: network, rack, or power. You organize your
OSDs across those failure domains and tell CRUSH to put each replica in a
separate one. For example: you have three or more racks, each with its own
top-of-rack switch and hopefully its own power circuit, and you tell CRUSH
to spread your three replicas so that they land in three separate racks.

We do run min_size=2, size=3, but with OSDs spread across multiple racks
and CRUSH requiring the three replicas to be in three different racks. Our
reasoning: two or more machines failing at the same instant for a reason
other than a shared switch or power circuit is unlikely enough that we'll
happily live with it, and so far that has served us well.

-KJ

On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen <kje...@medallia.com> wrote:
>
> For the most part I'm assuming min_size=2, size=3. With min_size=3 and
> size=3 this changes.
>
> size is how many replicas of an object to maintain; min_size is how many
> writes need to succeed before the primary can ack the operation to the
> client.
>
> A larger min_size most likely means higher latency for writes.
>
> On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carhe...@ucar.edu> wrote:
>
>> On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kje...@medallia.com>
>> wrote:
>>
>> >> c. Reads can continue from the single online OSD even in PGs that
>> >> happened to have two of 3 OSDs offline.
>> >
>> > Hypothetically (this is partially informed guessing on my part):
>> > if the survivor happens to be the acting primary and it was up to date
>> > at the time, it can in theory serve reads. (Only the primary serves
>> > reads.)
>>
>> It makes no sense that only the primary could serve reads. That would
>> mean that even if only a single OSD failed, all PGs for which that OSD
>> was primary would be unreadable.
>
> Acting [1, 2, 3]: the primary is 1, and only 1 serves reads. If 1 fails,
> 2 is the new primary.
> It'll probably check with 3 to determine whether there were any writes it
> itself is unaware of, and peer if there were. Promotion should be near
> instantaneous (well, you'd in all likelihood be able to measure it).
>
>> There must be an algorithm to appoint a new primary. So in a 2-OSD
>> failure scenario, a new primary should be appointed after the first
>> failure, no? Would the final remaining OSD not appoint itself as
>> primary after the 2nd failure?
>
> Assuming min_size=2, size=3: if 2 OSDs fail at the same instant, you
> have no guarantee that the survivor has all writes.
>
> Assuming min_size=3 and size=3: then yes, you're good; the surviving
> OSD can safely be promoted. You're severely degraded, but it can safely
> be promoted.
>
> If you genuinely worry about concurrent failures of 2 machines, run with
> min_size=3; the price you pay is slightly increased mean/median latency
> for writes.
>
>> This makes sense in the context of Ceph's synchronous writes too. A
>> write isn't complete until all 3 OSDs in the PG have the data,
>> correct? So shouldn't any one of them be able to act as primary at any
>> time?
>
> See the distinction between size and min_size above.
>
>> I don't see how that would change even if 2 of 3 OSDs fail at exactly
>> the same time.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Kjetil Joergensen <kje...@medallia.com>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580

--
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
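P.S. For anyone following along, the "one replica per rack" placement
described above is expressed as a CRUSH rule. A rough sketch of what that
rule could look like in the decompiled-crushmap text format (names and
numbers are illustrative, not our actual crushmap):

```
rule replicated_across_racks {
    ruleset 1            # rule id; unrelated to the pool's min_size
    type replicated
    min_size 1           # rule applies to pools with size in [1, 10];
    max_size 10          # NOT the pool-level min_size discussed above
    step take default    # start at the root bucket containing the racks
    step chooseleaf firstn 0 type rack   # one OSD from each of `size` distinct racks
    step emit
}
```

`step chooseleaf firstn 0 type rack` is what makes CRUSH pick each replica's
OSD under a different rack bucket, so losing a whole rack (switch, power)
costs you at most one replica per PG.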
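P.P.S. The size/min_size counting argument above can be sketched in a few
lines of Python. This is a toy model of the arithmetic, not Ceph's actual
peering logic, and all names are made up:

```python
# Toy model: every acknowledged write exists on at least min_size of the
# PG's replicas, so after some number of simultaneous OSD failures we can
# lower-bound how many copies of each acked write are guaranteed to survive.

def copies_guaranteed_surviving(min_size, simultaneous_failures):
    """Lower bound on surviving copies of any acknowledged write."""
    return max(0, min_size - simultaneous_failures)

def survivor_safely_promotable(size, min_size, simultaneous_failures):
    """True if the surviving OSDs collectively still hold every acked
    write, so a new primary can be promoted (peering with the other
    survivors fills any gaps) without losing acknowledged writes."""
    assert simultaneous_failures < size, "need at least one survivor"
    return copies_guaranteed_surviving(min_size, simultaneous_failures) >= 1

# min_size=2, size=3, two simultaneous failures: the lone survivor may be
# missing writes that were already acked to clients.
print(survivor_safely_promotable(3, 2, 2))  # -> False

# min_size=3, size=3: every acked write is on all three OSDs, so the lone
# survivor is safe -- severely degraded, but safe.
print(survivor_safely_promotable(3, 3, 2))  # -> True

# A single failure with min_size=2 is fine: at least one survivor holds
# each acked write, and peering recovers the rest.
print(survivor_safely_promotable(3, 2, 1))  # -> True
```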