" if min_size is k and you lose an OSD during recovery after a failure of m
OSDs, data will become unavailable"

In that situation the data wouldn't just become unavailable; it would be lost.

Having a min_size of k+1 provides a buffer between the data being
active+writeable and the data being lost. That in-between state is called
inactive.

That buffer prevents data from being written to a PG when you are only one
disk/shard away from data loss.
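
If you want to change that on an existing pool, it is a one-liner. A quick
sketch (the pool name "ecpool" is just a placeholder):

    ceph osd pool get ecpool min_size     # check the current value
    ceph osd pool set ecpool min_size 5   # k+1 for a 4+2 profile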

Imagine the scenario of 4+2 with a min_size of 4. The cluster is 6 servers
filled with OSDs.
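
If you wanted to reproduce that layout on a test cluster, it would look
roughly like this (the profile and pool names here are made up):

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec42
    ceph osd pool set ecpool min_size 4   # the risky setting in this example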

You have brought 2 servers down for maintenance (not a good idea, but this
is an example). Your PGs are all degraded, with only 4 shards of clean data,
but still active because min_size = k. Data is being written to the pool.

As you are booting your 2 servers back up out of maintenance, an OSD/disk on
another server fails, and fails hard. Because that OSD was part of the
acting set, the cluster only wrote four shards, and now one of them is lost.

You now have only 3 shards of that data in a 4+2 pool, so some subset of
your data is lost.

Now imagine a 4+2 with min_size = 5.

You wouldn't bring down more than 1 host, because "ceph osd ok-to-stop"
would return false if you tried to take more than 1 host down for
maintenance.
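
For reference, you can give ok-to-stop the OSD ids you plan to take down
before you start (the ids below are only an example):

    ceph osd ok-to-stop 12 13 14 15   # the OSDs on the host you want to service

If stopping them would leave PGs unable to serve I/O, the command reports
that instead of an OK.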

Let's say you did bring down two hosts against the advice of the ok-to-stop
command: your PGs would become inactive and would not accept writes. Once
you boot your 2 servers back up, the cluster heals.

Now let's say you heed the advice of ok-to-stop and only bring 1 host down
for maintenance at a time. Your data is degraded with 5/6 shards healthy,
and new data is being written with 5 shards able to be written out.

As you are booting your server back up out of maintenance, an OSD on another
host dies and its shards are lost forever. The PGs from that lost OSD now
have 4 healthy shards, which is enough to recover the data from (though you
would have some PGs inactive for a bit until recovery finishes).
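
While that recovery runs, you can watch for the temporarily inactive PGs
with something along these lines:

    ceph health detail | grep -i inactive
    ceph pg dump_stuck inactive   # lists PGs that are currently not active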

Hope this helps to answer the min_size question a bit.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Mon, Nov 20, 2023 at 2:03 PM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:

> Could someone help me understand why it's a bad idea to set min_size of
> erasure-coded pools to k?
>
> From what I've read, the argument for k+1 is that if min_size is k and you
> lose an OSD during recovery after a failure of m OSDs, data will become
> unavailable. But how does setting min_size to k+1 help? If m=2, if you
> experience a double failure followed by another failure during recovery you
> still lost 3 OSDs and therefore your data because the pool wasn't set up to
> handle 3 concurrent failures, and the value of min_size is irrelevant.
>
> https://github.com/ceph/ceph/pull/8008 mentions inability to peer if
> min_size = k, but I don't understand why. Does that mean that if min_size=k
> and I lose m OSDs, and then an OSD is restarted during recovery, PGs will
> not peer even after the restarted OSD comes back online?
>
>
> Vlad
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
