Hi Frank,

We have a lot of small objects in the cluster, and RocksDB has issues with
compaction that cause high disk load. That's why we are performing
manual compaction.
See https://github.com/ceph/ceph/pull/37496
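
For anyone who wants to do the same: a minimal sketch of an offline compaction,
assuming a systemd deployment; the OSD id below is just a placeholder:

  # Stop the OSD first; offline compaction needs exclusive access to the store.
  systemctl stop ceph-osd@123              # 123 = placeholder OSD id
  # Compact the OSD's RocksDB (the BlueStore metadata store).
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-123 compact
  systemctl start ceph-osd@123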

Br,

Kristof


On Mon, 26 Oct 2020 at 12:14, Frank Schilder <fr...@dtu.dk> wrote:

> Hi Kristof,
>
> I missed that: why do you need to do manual compaction?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kristof Coucke <kristof.cou...@gmail.com>
> Sent: 26 October 2020 11:33:52
> To: Frank Schilder; a.jazdzew...@googlemail.com
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Question about expansion existing Ceph cluster -
> adding OSDs
>
> Hi Ansgar, Frank, all,
>
> Thanks for the feedback in the first place.
>
> In the meantime, I've added all the disks and the cluster is rebalancing
> itself... which will take ages, as you mentioned. Last week after this
> conversation it was a little over 50%; today it's around 44.5%.
> Every day I have to take the cluster down to run manual compaction on
> some disks :-(, but that's a known bug that Igor is working on. (Kudos to
> him when I get my sleep back at night for this one...)
>
> However, I'm still seeing an issue that I don't completely understand.
> In the Ceph dashboard, under OSDs, I can see the number of PGs for a
> specific OSD. Does someone know how this is calculated? It seems
> incorrect...
> E.g. a specific disk shows 189 PGs in the dashboard. However, examining
> the pg dump output, I count 145 PGs where that disk is in the "up" set
> and 168 PGs where it is in the "acting" set. Of those two lists, 135 are
> in common, meaning 10 PGs still need to be moved to that disk, while 33
> PGs need to be moved away.
> I can't figure out how the dashboard arrives at 189...
> The same delta between the pg dump output and the dashboard numbers shows
> up on other disks as well.
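>
> For reference, I count those up/acting sets from the plain pg dump output
> roughly like this (a quick sketch; the OSD id is a placeholder):
>
>   ceph pg dump pgs_brief 2>/dev/null | awk -v id=123 '
>       BEGIN { up = acting = 0 }
>       $3 ~ "(\\[|,)" id "(,|\\])" { up++ }        # column 3 = UP set
>       $5 ~ "(\\[|,)" id "(,|\\])" { acting++ }    # column 5 = ACTING set
>       END { print "up:", up, "acting:", acting }'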
>
> Another example is a disk I've set to weight 0 because it is predicted to
> fail in the near future. Its "up" count is 0 (which is correct), and it
> appears in the acting set of 49 PGs, which also seems correct, as those 49
> PGs need to be moved away. However, the Ceph dashboard says there are 71
> PGs on that disk...
>
> So:
> - How does the Ceph dashboard get that number in the first place?
> - Is it possible that "orphaned" PG parts are left behind on a particular
> OSD?
> - If orphaned parts of a PG can be left behind on a disk, how do I clean
> them up?
>
> I've also tried examining the osdmap; however, the output seems to be
> limited(??). I only see the PGs for pools 1 and 2. (I don't know whether
> the file gets truncated when exporting the osdmap, or whether it's the
> osdmaptool --print output that is limited.)
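> The commands I used were along these lines (sketch; the file name is
> arbitrary):
>
>   ceph osd getmap -o /tmp/osdmap.bin        # export the current osdmap
>   osdmaptool --print /tmp/osdmap.bin        # decode and print it
>   # if --print only shows a subset, --test-map-pgs-dump should list the
>   # full per-PG mappings (I have not verified this on this cluster):
>   osdmaptool --test-map-pgs-dump /tmp/osdmap.bin
>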
>
> The cluster is running Nautilus v14.2.11, all on the same version.
>
> I'll make some time to write documentation and to record the findings from
> the journey of the last two weeks... Kristof in Ceph's wonderland...
>
> Thanks for all your input so far!
>
> Regards,
>
> Kristof
>
>
>
> On Wed, 21 Oct 2020 at 14:01, Frank Schilder <fr...@dtu.dk> wrote:
> There have been threads on exactly this; it might depend a bit on your Ceph
> version. We are running Mimic and have no issues doing the following (see
> the sketch after the list):
>
> - set noout, norebalance, nobackfill
> - add all OSDs (with weight 1)
> - wait for peering to complete
> - unset all flags and let the rebalance loose
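>
> In command form that is roughly (a sketch; flag names as in current
> releases):
>
>   ceph osd set noout
>   ceph osd set norebalance
>   ceph osd set nobackfill
>   # ... add the new OSDs here, then watch peering finish:
>   ceph status                 # wait until no PGs are shown as "peering"
>   ceph osd unset nobackfill
>   ceph osd unset norebalance
>   ceph osd unset noout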
>
> Starting with Nautilus there seem to be issues with this procedure; mainly,
> the peering phase can cause a collapse of the cluster. In your case, it
> sounds like you have added the OSDs already. You should be able to do the
> following relatively safely (sketch after the list):
>
> - set noout, norebalance, nobackfill
> - set the weight of the OSDs to 1 one by one, waiting for peering to
> complete each time
> - unset all flags and let the rebalance loose
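>
> A sketch of the one-by-one step (the OSD ids and the target weight are
> placeholders; the usual target crush weight is the disk size in TiB):
>
>   for id in 182 183 184; do                  # placeholder ids of the new OSDs
>       ceph osd crush reweight osd.$id 1.0    # placeholder target weight
>       # crude wait until peering has settled before touching the next OSD
>       while ceph status | grep -q peering; do sleep 10; done
>   done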
>
> I believe that once peering has succeeded without crashes, the rebalancing
> will just work fine. You can easily control how much rebalancing is going on.
>
> I noted that Ceph seems to have a strange concept of priority, though. I
> needed to gain capacity by adding OSDs, and Ceph consistently moved PGs off
> the fullest OSDs last, the opposite of what should happen. Thus, it took
> ages for additional capacity to become available, and the backfill_toofull
> warnings stayed the whole time. You can influence this to some degree by
> using force_recovery commands on PGs on the fullest OSDs.
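>
> For example, something like this could be used to bump the PGs sitting on a
> nearly full OSD (a sketch; the OSD id is a placeholder):
>
>   for pg in $(ceph pg dump pgs_brief 2>/dev/null |
>               awk -v id=123 '$5 ~ "(\\[|,)" id "(,|\\])" { print $1 }'); do
>       ceph pg force-recovery $pg       # or: ceph pg force-backfill $pg
>   done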
>
> Best regards and good luck,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kristof Coucke <kristof.cou...@gmail.com>
> Sent: 21 October 2020 13:29:00
> To: ceph-users@ceph.io
> Subject: [ceph-users] Question about expansion existing Ceph cluster -
> adding OSDs
>
> Hi,
>
> I have a cluster with 182 OSDs; this has been expanded to 282 OSDs.
> Some disks were near full.
> The new disks have been added with an initial weight of 0.
> The original plan was to increase this slowly towards their full weight
> using the gentle reweight script. However, this is going way too slow, and
> I'm also having issues now with "backfill_toofull".
> Can I just add all the OSDs with their full weight, or will I run into a
> lot of issues if I do that?
> I know that a lot of PGs will have to be moved, but increasing the weight
> slowly will take a year at the current speed. I'm already playing with the
> max backfill settings to increase the speed, but every time I increase the
> weight it takes a lot of time again...
> I can face the fact that there will be a performance decrease.
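>
> (For what it's worth, the knobs I'm adjusting are along these lines; a
> sketch, the values are just examples:)
>
>   ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2'
>   # or persistently, via the config database:
>   ceph config set osd osd_max_backfills 2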
>
> Looking forward to your comments!
>
> Regards,
>
> Kristof
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
