Hi Kristof,

I missed that: why do you need to do manual compaction?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kristof Coucke <kristof.cou...@gmail.com>
Sent: 26 October 2020 11:33:52
To: Frank Schilder; a.jazdzew...@googlemail.com
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Question about expansion existing Ceph cluster - 
adding OSDs

Hi Ansgar, Frank, all,

Thanks for the feedback in the first place.

In the meantime, I've added all the disks and the cluster is rebalancing 
itself... which will take ages, as you mentioned. Last week, after this 
conversation, it was a little over 50%; today it's around 44.5%.
Every day I have to take the cluster down to run a manual compaction on some 
disks :-(, but that's a known bug that Igor is working on. (Kudos to him once 
I get my sleep back at night for this one...)
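
For reference, this is roughly the compaction routine I've been using (the osd 
id and path are placeholders, and the offline variant of course assumes the 
OSD is stopped first):

  # online compaction of one OSD's RocksDB (the OSD stays up, but is slow while it runs)
  ceph tell osd.12 compact

  # offline compaction; the OSD must be down, path is the usual default location
  systemctl stop ceph-osd@12
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
  systemctl start ceph-osd@12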

That said, I'm still running into an issue I don't completely understand.
When I look at the Ceph dashboard under OSDs, I can see the number of PGs for a 
specific OSD. Does anyone know how this number is calculated? It seems 
incorrect...
E.g. a specific disk shows 189 PGs in the dashboard. However, examining the 
pg dump output, I count 145 PGs where that disk is in the "up" list and 168 PGs 
where it is in the "acting" list... Those two lists have 135 PGs in common, 
meaning 10 PGs still need to be moved onto that disk, while 33 PGs need to be 
moved away...
I can't figure out how the dashboard arrives at 189... The same happens on other 
disks (a delta between the pg dump output and the number shown in the Ceph 
dashboard).
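
In case someone wants to reproduce the comparison: this is roughly how I count 
PGs per OSD from pg dump (needs jq; the JSON layout differs a bit between 
releases, hence the ".pg_stats // ." fallback, and osd 37 is just a placeholder 
id):

  OSD=37
  ceph pg dump pgs_brief -f json 2>/dev/null | jq --argjson osd "$OSD" '
      (.pg_stats // .) as $pgs
      | { up:     [$pgs[] | select(.up     | index($osd)) | .pgid] | length,
          acting: [$pgs[] | select(.acting | index($osd)) | .pgid] | length,
          either: [$pgs[] | select((.up + .acting) | index($osd)) | .pgid] | length }'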

Another example is a disk which I've set to weight 0 because it's flagged with a 
predicted failure... Its "up" count is 0 (which is correct), and it appears in 
the "acting" set of 49 PGs. That also looks correct, as those 49 PGs need to be 
moved away. However, the Ceph dashboard says there are 71 PGs on that disk...

So:
- How does the Ceph dashboard get that number in the 1st place?
- Is there a possibility that there are "orphaned" PG-parts left behind on a 
particular OSD?
- If it is possible that orphaned parts of a PG are left behind on a disk, how 
do I clean them up? (See the check sketched below for how I'd try to detect 
them.)
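
The only check I can think of for the orphaned-PG question is to list what is 
physically present on a (stopped) OSD with ceph-objectstore-tool and compare 
that with the up/acting sets from pg dump. Stopping an OSD for this is 
intrusive, so treat it as a sketch (osd id is again a placeholder):

  systemctl stop ceph-osd@37
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 --op list-pgs | sort > /tmp/osd37.pgs
  systemctl start ceph-osd@37
  # any pgid in /tmp/osd37.pgs that no longer has osd.37 in its up or acting set
  # (see the jq snippet above) would be a leftover copy on that disk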

I've also tried examining the osdmap, but the output seems to be limited(??). 
I only see PGs for pools 1 and 2. (I don't know whether the limitation comes 
from exporting the osdmap or from the osdmaptool --print output itself.)
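
For completeness, this is how I've been looking at the map. As far as I 
understand, --print only lists the explicitly pinned entries (pg_temp / 
pg_upmap), which might explain why I only see a couple of pools there, whereas 
--test-map-pgs-dump computes the full PG-to-OSD mapping:

  ceph osd getmap -o /tmp/osdmap
  osdmaptool /tmp/osdmap --print                # pools, osds, pg_temp/pg_upmap entries
  osdmaptool /tmp/osdmap --test-map-pgs-dump    # full PG -> up/acting mapping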

The cluster is running Nautilus v14.2.11, all on the same version.

I'll make some time to write up and document the findings from my journey of 
the last two weeks... Kristof in Ceph's wunderland...

Thanks for all your input so far!

Regards,

Kristof



On Wed 21 Oct 2020 at 14:01, Frank Schilder <fr...@dtu.dk> wrote:
There have been threads on exactly this. It might depend a bit on your Ceph 
version. We are running Mimic and have no issues doing the following:

- set noout, norebalance, nobackfill
- add all OSDs (with weight 1)
- wait for peering to complete
- unset all flags and let the rebalance loose

Starting with Nautilus there seem to be issues with this procedure; mainly, the 
peering phase can cause a collapse of the cluster. In your case, it sounds 
like you added the OSDs already. You should be able to do the following 
relatively safely:

- set noout, norebalance, nobackfill
- set the weight of the OSDs to 1, one by one, and wait for peering to complete each time
- unset all flags and let the rebalance loose

I believe that once peering has succeeded without crashes, the rebalancing will 
just work fine. You can easily control how much rebalancing is going on.
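
In commands, the procedure is roughly the following (osd ids and the target 
weight are placeholders; whether you need "ceph osd crush reweight" or 
"ceph osd reweight" depends on which weight you zeroed when adding the disks):

  ceph osd set noout; ceph osd set norebalance; ceph osd set nobackfill

  # bring the new OSDs in one at a time, e.g. for crush-weight-0 disks:
  ceph osd crush reweight osd.183 7.3    # 7.3 = size in TiB, just an example value
  ceph -s                                # wait until nothing is peering before the next one

  ceph osd unset nobackfill; ceph osd unset norebalance; ceph osd unset noout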

I noted that Ceph seems to have a strange concept of priority, though. I needed 
to gain capacity by adding OSDs, and Ceph consistently moved PGs off the fullest 
OSDs last, which is the opposite of what should happen. As a result, it took 
ages for the additional capacity to become available, and the backfill_toofull 
warnings stayed the whole time. You can influence this to some degree by using 
force_recovery commands on PGs on the fullest OSDs.
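
For example, something along these lines pushes the PGs sitting on the fullest 
OSD to the front of the queue (osd.42 is a placeholder, and the grep is crude, 
so check the list before feeding it to xargs):

  ceph pg dump pgs_brief 2>/dev/null | grep backfill | grep -w 42 | awk '{print $1}' \
      | xargs -r ceph pg force-backfill
  # degraded PGs can be prioritised the same way with: ceph pg force-recovery <pgid> ...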

Best regards and good luck,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Kristof Coucke <kristof.cou...@gmail.com>
Sent: 21 October 2020 13:29:00
To: ceph-users@ceph.io
Subject: [ceph-users] Question about expansion existing Ceph cluster - adding 
OSDs

Hi,

I have a cluster with 182 OSDs, which has been expanded to 282 OSDs.
Some disks were nearly full.
The new disks have been added with an initial weight of 0.
The original plan was to increase this slowly to their full weight using the 
gentle reweight script. However, this is going way too slow, and I'm now also 
running into "backfill_toofull" issues.
Can I just give all the new OSDs their full weight at once, or will that cause 
a lot of problems?
I know that a lot of PGs will have to be moved, but increasing the weight 
slowly would take a year at the current speed. I'm already playing with the 
max backfill settings to increase the speed, but every time I increase the 
weight it takes a long time again...
I can live with the fact that there will be a performance decrease.
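
For reference, these are the knobs I'm playing with (the values are just what 
I'm currently trying, not a recommendation):

  ceph config set osd osd_max_backfills 2
  ceph config set osd osd_recovery_max_active 4
  # or inject into the running OSDs directly:
  ceph tell 'osd.*' injectargs '--osd_max_backfills 2 --osd_recovery_max_active 4'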

Looking forward to your comments!

Regards,

Kristof
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io