Oliver, Here's the commit info:
https://github.com/ceph/ceph/commit/48e40fcde7b19bab98821ab8d604eab920591284

Caspar

2018-02-27 14:28 GMT+01:00 Oliver Freyermuth <freyerm...@physik.uni-bonn.de>:

> On 27.02.2018 at 14:16, Caspar Smit wrote:
> > Oliver,
> >
> > Be aware that for k=4,m=2 the min_size will be 5 (k+1), so after a node
> > failure the min_size is already reached. Any OSD failure beyond the node
> > failure will probably result in some PGs becoming incomplete (I/O freeze)
> > until the incomplete PGs' data is recovered to another OSD in that node.
> >
> > So please reconsider your statement "one host + x safety", as the x
> > safety (with I/O freeze) is probably not what you want.
> >
> > Forcing to run with min_size=4 could also be dangerous for other reasons.
> > (There's a reason why min_size = k+1.)
>
> Thanks for pointing this out!
> Yes, indeed, in case we need to take down a host for a longer period (we
> would hope this never has to happen for > 24 hours... but you never know),
> and in case disks start to fail, we would indeed have to degrade to
> min_size=4 to keep running.
>
> What exactly are the implications?
> It should still be possible to ensure the data is not corrupt (with the
> checksums), and recovery to k+1 copies should start automatically once a
> disk fails - so what is the actual implication?
> Of course pg repair cannot work in that case (if a PG for which the
> additional disk failed is corrupted), but in general, when there is a need
> to reinstall a host, we would try to bring it back with the OSD data
> intact - which should then allow us to postpone the repair until that
> point.
>
> Is there a danger I miss in my reasoning?
>
> Cheers and many thanks!
> Oliver
>
> > Caspar
> >
> > 2018-02-27 0:17 GMT+01:00 Oliver Freyermuth <freyerm...@physik.uni-bonn.de>:
> >
> > On 27.02.2018 at 00:10, Gregory Farnum wrote:
> > > On Mon, Feb 26, 2018 at 2:59 PM Oliver Freyermuth
> > > <freyerm...@physik.uni-bonn.de> wrote:
> > >
> > > > > > Does this match expectations?
> > > > >
> > > > > Can you get the output of eg "ceph pg 2.7cd query"? Want to make
> > > > > sure the backfilling versus acting sets and things are correct.
> > > >
> > > > You'll find attached:
> > > > query_allwell) Output of "ceph pg 2.7cd query" when all OSDs are up
> > > > and everything is healthy.
> > > > query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs
> > > > 164-195 (one host) are down and out.
> > >
> > > Yep, that's what we want to see. So when everything's well, we have
> > > OSDs 91, 63, 33, 163, 192, 103. That corresponds to chassis 3, 2, 1,
> > > 5, 6, 4.
> > >
> > > When marking out a host, we have OSDs 91, 63, 33, 163, 123, UNMAPPED.
> > > That corresponds to chassis 3, 2, 1, 5, 4, UNMAPPED.
> > >
> > > So what's happened is that with the new map, when choosing the home
> > > for shard 4, we selected host 4 instead of host 6 (which is gone). And
> > > now shard 5 can't map properly. But of course we still have shard 5
> > > available on host 4, so host 4 is going to end up properly owning
> > > shard 4, but also just carrying that shard 5 around as a remapped
> > > location.
> > >
> > > So this is as we expect. Whew.
> > > -Greg
> >
> > Understood. Thanks for explaining step by step :-).
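The up-versus-acting comparison Greg walks through can be read straight out of
the query output. A minimal sketch, assuming jq is available and that (as in
recent releases) the query JSON carries top-level "up" and "acting" arrays:

    # Compare what CRUSH computed ("up") with who currently serves the PG
    # ("acting"); for an EC pool the position in each array is the shard id,
    # and 2147483647 marks a shard CRUSH could not map.
    ceph pg 2.7cd query | jq '{state: .state, up: .up, acting: .acting}'
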
> > It's of course a bit weird that this happens, since in the end this
> > really means data is moved (or rather, a shard is recreated) and takes
> > up space without increasing redundancy (well, it might, if it lands on a
> > different OSD than shard 5, but that's not really ensured). I'm unsure
> > whether this can be solved "better" in any way.
> >
> > Anyway, it seems this would be another reason why running with
> > k+m = number of hosts should not be a general recommendation. For us it
> > is fine for now, especially since we want to keep the cluster open for
> > later extension with more OSDs, and we now know the gotchas - and I
> > don't see a better EC configuration at the moment which would
> > accommodate our wishes (one host + x safety, don't reduce space too
> > much).
> >
> > So thanks again!
> >
> > Cheers,
> > Oliver
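For reference, the setup being debated boils down to a handful of commands; a
rough sketch with placeholder profile, pool, and PG-count values (k=4, m=2,
failure domain "host", and min_size ending up at k+1 = 5 as Caspar describes):

    # EC profile and pool along the lines discussed in the thread
    # (profile/pool names and PG counts are placeholders).
    ceph osd erasure-code-profile set ec_k4m2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 1024 1024 erasure ec_k4m2
    ceph osd pool get ecpool min_size    # expected: min_size: 5, i.e. k+1
    # Dropping to min_size = k keeps I/O flowing after "one host down plus a
    # further OSD failure", but leaves no shard to spare - the trade-off
    # Caspar warns about:
    #   ceph osd pool set ecpool min_size 4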