[ceph-users] Re: Impact of large PG splits

2024-04-29 Thread Eugen Block

The split process completed over the weekend, the balancer did a great job:

MIN PGs | MAX PGs | MIN USE % | MAX USE %
322     | 338     | 73.3      | 75.5

Although the number of PGs per OSD differs a bit, the usage per OSD is
quite even (which is the more important metric). The new hardware has
also arrived, so there will soon be some more remapping. :-)
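
A minimal sketch of how numbers like the above can be pulled on any
cluster; both commands are stock ceph CLI, nothing cluster-specific is
assumed:

  ceph osd df tree       # %USE and PGS columns show the per-OSD spread
  ceph balancer status   # shows the balancer mode and whether it is active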

So I would consider this thread as closed, all good.

Zitat von Eugen Block :

No, we didn’t change much, just increased the max PGs per OSD to
avoid warnings and inactive PGs in case a node failed during this
process. And the max backfills, of course.
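
The thread doesn't name the exact options; assuming the usual knobs
were meant, the changes would look roughly like this (values are
examples only):

  ceph config set global mon_max_pg_per_osd 400   # assumed value; default is 250 on recent releases
  ceph config set osd osd_max_backfills 8         # raised step by step during the split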


Zitat von Frédéric Nass :


Hello Eugen,

Thanks for sharing the good news. Did you have to raise  
mon_osd_nearfull_ratio temporarily?
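
For context, raising that ratio temporarily, had it been needed, would
look roughly like this (the 0.90 value is only an example):

  ceph osd dump | grep ratio          # current full/backfillfull/nearfull ratios
  ceph osd set-nearfull-ratio 0.90    # remember to set it back afterwards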


Frédéric.

- Le 25 Avr 24, à 12:35, Eugen Block ebl...@nde.ag a écrit :


For those interested, just a short update: the split process is
approaching its end, two days ago there were around 230 PGs left
(the target is 4096 PGs). So far there have been no complaints and no
cluster impact was reported (the cluster load is quite moderate, but
still sensitive). Every now and then a single OSD (not always the same
one) reaches the 85% nearfull ratio, but that was expected since the
first nearfull OSD was the root cause of this operation. I expect the
balancer to kick in as soon as the backfill has completed or when
fewer than 5% of the objects are misplaced.
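
A hedged sketch of how that progress can be watched; <poolname> is a
placeholder, the commands themselves are stock ceph CLI:

  ceph -s                                          # misplaced percentage and recovery rate
  ceph osd pool ls detail | grep <poolname>        # pg_num vs. pg_num_target shows split progress
  ceph config get mgr target_max_misplaced_ratio   # 0.05 (5%) by default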

Zitat von Anthony D'Atri :


One can up the ratios temporarily but it's all too easy to forget to
reduce them later, or think that it's okay to run all the time with
reduced headroom.

Until a host blows up and you don't have enough space to recover into.


On Apr 12, 2024, at 05:01, Frédéric Nass
 wrote:


Oh, and yeah, considering "The fullest OSD is already at 85% usage",
the best move for now would be to add new hardware/OSDs (to avoid
reaching the backfill-too-full limit) prior to starting the PG splits,
and enable the upmap balancer before or after, depending on how well
the PGs got rebalanced after adding the new OSDs.

BTW, what ceph version is this? You should make sure you're running
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
https://tracker.ceph.com/issues/53729
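
A quick pre-flight check for that (stock command, nothing assumed
beyond an admin keyring):

  ceph versions    # confirm every daemon reports >= 16.2.11 (Pacific) or >= 17.2.4 (Quincy)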

Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass
frederic.n...@univ-lorraine.fr a écrit :


Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph
daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do
have a real impact on the recovery/backfilling speed. Just lower
osd_max_backfills to 1 before doing that.
If mClock scheduler then you might want to use a specific mClock
profile as suggested by Gregory (as osd_recovery_sleep* are not
considered when using mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and
this cluster only has 240, increasing osd_max_backfills to any values
higher than 2-3 will not help much with the recovery/backfilling speed.

All the way, you'll have to be patient. :-)
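
A minimal sketch of the knobs mentioned above; the numeric values are
illustrative only, and `ceph daemon` has to be run on the host carrying
osd.0:

  cephadm shell -- ceph daemon osd.0 config show | grep osd_op_queue   # wpq or mclock_scheduler

  # WPQ: throttle via backfills and recovery sleep
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_sleep_hdd 0    # 0 = fastest recovery, higher values throttle it

  # mClock: pick a profile instead (osd_recovery_sleep* is ignored here)
  ceph config set osd osd_mclock_profile high_recovery_ops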

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :


Thank you for the input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling at
around 180 MB/s, not much really, but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours
and then reevaluate the situation and maybe increase the backfills a
bit more.

Thanks!

Zitat von Gregory Orange :


We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVME RocksDB, used exclusively for RGWs, holding about 60b
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.

The above behaviour is because while they share the same threshold
setting, the autoscaler only runs every minute, and it won't run
when misplaced are over the threshold. Meanwhile, checks for the
next PG to split happen much more frequently, so the balancer never
wins that race.
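
A hedged sketch of the pause-and-resume trick described above;
<poolname> and the ratio value are placeholders/examples:

  ceph osd pool ls detail | grep <poolname>              # current pg_num vs. pg_num_target
  ceph osd pool set <poolname> pg_num <current_pg_num>   # effectively pauses further splitting
  ceph config set mgr target_max_misplaced_ratio 0.015   # 1.5%, the value used in this thread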


We didn't know how long to expect it all to take, but decided 

[ceph-users] Re: Impact of large PG splits

2024-04-25 Thread Eugen Block
No, we didn’t change much, just increased the max pg per osd to avoid  
warnings and inactive PGs in case a node would fail during this  
process. And the max backfills, of course.


Zitat von Frédéric Nass :


Hello Eugen,

Thanks for sharing the good news. Did you have to raise  
mon_osd_nearfull_ratio temporarily?


Frédéric.

- Le 25 Avr 24, à 12:35, Eugen Block ebl...@nde.ag a écrit :


For those interested, just a short update: the split process is
approaching its end, two days ago there were around 230 PGs left
(target are 4096 PGs). So far there were no complaints, no cluster
impact was reported (the cluster load is quite moderate, but still
sensitive). Every now and then a single OSD (not the same) reaches 85%
nearfull ratio, but that was expected since the first nearfull OSD was
the root cause of this operation. I expect the balancer to kick in as
soon as the backfill has completed or when there are less than 5%
misplaced objects.

Zitat von Anthony D'Atri :


One can up the ratios temporarily but it's all too easy to forget to
reduce them later, or think that it's okay to run all the time with
reduced headroom.

Until a host blows up and you don't have enough space to recover into.


On Apr 12, 2024, at 05:01, Frédéric Nass
 wrote:


Oh, and yeah, considering "The fullest OSD is already at 85% usage"
best move for now would be to add new hardware/OSDs (to avoid
reaching the backfill too full limit), prior to start the splitting
PGs before or after enabling upmap balancer depending on how the
PGs got rebalanced (well enough or not) after adding new OSDs.

BTW, what ceph version is this? You should make sure you're running
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
https://tracker.ceph.com/issues/53729

Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass
frederic.n...@univ-lorraine.fr a écrit :


Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph
daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do
have a real impact on the recovery/backfilling speed. Just lower
osd_max_backfills to 1 before doing that.
If mClock scheduler then you might want to use a specific mClock
profile as suggested by Gregory (as osd_recovery_sleep* are not
considered when using mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and
this cluster only has 240, increasing osd_max_backfills to any values
higher than 2-3 will not help much with the recovery/backfilling speed.

All the way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :


Thank you for input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling with
around 180 MB/s, not really much but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours.
Then reevaluate the situation and maybe increase the backfills a bit
more.

Thanks!

Zitat von Gregory Orange :


We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVME RocksDB, used exclusively for RGWs, holding about 60b
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.

The above behaviour is because while they share the same threshold
setting, the autoscaler only runs every minute, and it won't run
when misplaced are over the threshold. Meanwhile, checks for the
next PG to split happen much more frequently, so the balancer never
wins that race.


We didn't know how long to expect it all to take, but decided that
any improvement in PG size was worth starting. We now estimate it
will take another 2-3 weeks to complete, for a total of 4-5 weeks
total.

We have lost a drive or two during the process, and of course
degraded objects went up, and more backfilling work got going. We
paused splits for at least one of those, to make sure the degraded
objects were sorted out as quick as possible. We can't be sure it
went any faster 

[ceph-users] Re: Impact of large PG splits

2024-04-25 Thread Frédéric Nass

Hello Eugen,

Thanks for sharing the good news. Did you have to raise mon_osd_nearfull_ratio 
temporarily? 

Frédéric.

- Le 25 Avr 24, à 12:35, Eugen Block ebl...@nde.ag a écrit :

> For those interested, just a short update: the split process is
> approaching its end, two days ago there were around 230 PGs left
> (target are 4096 PGs). So far there were no complaints, no cluster
> impact was reported (the cluster load is quite moderate, but still
> sensitive). Every now and then a single OSD (not the same) reaches 85%
> nearfull ratio, but that was expected since the first nearfull OSD was
> the root cause of this operation. I expect the balancer to kick in as
> soon as the backfill has completed or when there are less than 5%
> misplaced objects.
> 
> Zitat von Anthony D'Atri :
> 
>> One can up the ratios temporarily but it's all too easy to forget to
>> reduce them later, or think that it's okay to run all the time with
>> reduced headroom.
>>
>> Until a host blows up and you don't have enough space to recover into.
>>
>>> On Apr 12, 2024, at 05:01, Frédéric Nass
>>>  wrote:
>>>
>>>
>>> Oh, and yeah, considering "The fullest OSD is already at 85% usage"
>>> best move for now would be to add new hardware/OSDs (to avoid
>>> reaching the backfill too full limit), prior to start the splitting
>>> PGs before or after enabling upmap balancer depending on how the
>>> PGs got rebalanced (well enough or not) after adding new OSDs.
>>>
>>> BTW, what ceph version is this? You should make sure you're running
>>> v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
>>> https://tracker.ceph.com/issues/53729
>>>
>>> Cheers,
>>> Frédéric.
>>>
>>> - Le 12 Avr 24, à 10:41, Frédéric Nass
>>> frederic.n...@univ-lorraine.fr a écrit :
>>>
 Hello Eugen,

 Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph
 daemon osd.0
 config show | grep osd_op_queue)

 If WPQ, you might want to tune osd_recovery_sleep* values as they
 do have a real
 impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
 before doing that.
 If mClock scheduler then you might want to use a specific mClock profile as
 suggested by Gregory (as osd_recovery_sleep* are not considered when using
 mClock).

 Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
 cluster only has 240, increasing osd_max_backfills to any values
 higher than
 2-3 will not help much with the recovery/backfilling speed.

 All the way, you'll have to be patient. :-)

 Cheers,
 Frédéric.

 - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :

> Thank you for input!
> We started the split with max_backfills = 1 and watched for a few
> minutes, then gradually increased it to 8. Now it's backfilling with
> around 180 MB/s, not really much but since client impact has to be
> avoided if possible, we decided to let that run for a couple of hours.
> Then reevaluate the situation and maybe increase the backfills a bit
> more.
>
> Thanks!
>
> Zitat von Gregory Orange :
>
>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>> objects. We are splitting for the same reason as you - improved
>> balance. We also thought long and hard before we began, concerned
>> about impact, stability etc.
>>
>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>> retain some control and stop it again fairly quickly if we weren't
>> happy with the behaviour. It also serves to limit the performance
>> impact on the cluster, but unfortunately it also makes the whole
>> process slower.
>>
>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>> issues with the cluster. We could go higher, but are not in a rush
>> at this point. Sometimes nearfull osd warnings get high and MAX
>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>> interrupt it. So, we set pg_num to whatever the current value is
>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>> gets to work once the misplaced objects drop below the ratio, and
>> things balance out. Nearfull osds drop usually to zero, and MAX
>> AVAIL goes up again.
>>
>> The above behaviour is because while they share the same threshold
>> setting, the autoscaler only runs every minute, and it won't run
>> when misplaced are over the threshold. Meanwhile, checks for the
>> next PG to split happen much more frequently, so the balancer never
>> wins that race.
>>
>>
>> We didn't know how long to expect it all to take, but decided that
>> any improvement in PG size was worth starting. We now estimate it
>> will take another 2-3 weeks to complete, for a total of 

[ceph-users] Re: Impact of large PG splits

2024-04-25 Thread Eugen Block
For those interested, just a short update: the split process is
approaching its end, two days ago there were around 230 PGs left
(the target is 4096 PGs). So far there have been no complaints and no
cluster impact was reported (the cluster load is quite moderate, but
still sensitive). Every now and then a single OSD (not always the same
one) reaches the 85% nearfull ratio, but that was expected since the
first nearfull OSD was the root cause of this operation. I expect the
balancer to kick in as soon as the backfill has completed or when
fewer than 5% of the objects are misplaced.


Zitat von Anthony D'Atri :

One can up the ratios temporarily but it's all too easy to forget to  
reduce them later, or think that it's okay to run all the time with  
reduced headroom.


Until a host blows up and you don't have enough space to recover into.

On Apr 12, 2024, at 05:01, Frédéric Nass  
 wrote:



Oh, and yeah, considering "The fullest OSD is already at 85% usage"  
best move for now would be to add new hardware/OSDs (to avoid  
reaching the backfill too full limit), prior to start the splitting  
PGs before or after enabling upmap balancer depending on how the  
PGs got rebalanced (well enough or not) after adding new OSDs.


BTW, what ceph version is this? You should make sure you're running  
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:  
https://tracker.ceph.com/issues/53729


Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass  
frederic.n...@univ-lorraine.fr a écrit :



Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph
daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do
have a real impact on the recovery/backfilling speed. Just lower
osd_max_backfills to 1 before doing that.
If mClock scheduler then you might want to use a specific mClock
profile as suggested by Gregory (as osd_recovery_sleep* are not
considered when using mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and
this cluster only has 240, increasing osd_max_backfills to any values
higher than 2-3 will not help much with the recovery/backfilling speed.

All the way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :


Thank you for input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling with
around 180 MB/s, not really much but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours.
Then reevaluate the situation and maybe increase the backfills a bit
more.

Thanks!

Zitat von Gregory Orange :


We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVME RocksDB, used exclusively for RGWs, holding about 60b
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.

The above behaviour is because while they share the same threshold
setting, the autoscaler only runs every minute, and it won't run
when misplaced are over the threshold. Meanwhile, checks for the
next PG to split happen much more frequently, so the balancer never
wins that race.


We didn't know how long to expect it all to take, but decided that
any improvement in PG size was worth starting. We now estimate it
will take another 2-3 weeks to complete, for a total of 4-5 weeks
total.

We have lost a drive or two during the process, and of course
degraded objects went up, and more backfilling work got going. We
paused splits for at least one of those, to make sure the degraded
objects were sorted out as quick as possible. We can't be sure it
went any faster though - there's always a long tail on that sort of
thing.

Inconsistent objects are found at least a couple of times a week,
and to get them repairing we disable scrubs, wait until they're
stopped, then set the repair going and reenable scrubs. I don't know
if this is special to the current higher splitting load, but we
haven't noticed it before.

HTH,
Greg.
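
A hedged sketch of the scrub/repair sequence Greg describes; <pgid> is
a placeholder and the exact flags used are not stated in the thread:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # wait until `ceph -s` no longer shows active scrubs, then:
  ceph pg repair <pgid>
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub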



[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Anthony D'Atri
One can up the ratios temporarily but it's all too easy to forget to reduce 
them later, or think that it's okay to run all the time with reduced headroom.

Until a host blows up and you don't have enough space to recover into.

> On Apr 12, 2024, at 05:01, Frédéric Nass  
> wrote:
> 
> 
> Oh, and yeah, considering "The fullest OSD is already at 85% usage" best move 
> for now would be to add new hardware/OSDs (to avoid reaching the backfill too 
> full limit), prior to start the splitting PGs before or after enabling upmap 
> balancer depending on how the PGs got rebalanced (well enough or not) after 
> adding new OSDs.
> 
> BTW, what ceph version is this? You should make sure you're running v16.2.11+ 
> or v17.2.4+ before splitting PGs to avoid this nasty bug: 
> https://tracker.ceph.com/issues/53729
> 
> Cheers,
> Frédéric.
> 
> - Le 12 Avr 24, à 10:41, Frédéric Nass frederic.n...@univ-lorraine.fr a 
> écrit :
> 
>> Hello Eugen,
>> 
>> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon 
>> osd.0
>> config show | grep osd_op_queue)
>> 
>> If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
>> real
>> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
>> before doing that.
>> If mClock scheduler then you might want to use a specific mClock profile as
>> suggested by Gregory (as osd_recovery_sleep* are not considered when using
>> mClock).
>> 
>> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
>> cluster only has 240, increasing osd_max_backfills to any values higher than
>> 2-3 will not help much with the recovery/backfilling speed.
>> 
>> All the way, you'll have to be patient. :-)
>> 
>> Cheers,
>> Frédéric.
>> 
>> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
>> 
>>> Thank you for input!
>>> We started the split with max_backfills = 1 and watched for a few
>>> minutes, then gradually increased it to 8. Now it's backfilling with
>>> around 180 MB/s, not really much but since client impact has to be
>>> avoided if possible, we decided to let that run for a couple of hours.
>>> Then reevaluate the situation and maybe increase the backfills a bit
>>> more.
>>> 
>>> Thanks!
>>> 
>>> Zitat von Gregory Orange :
>>> 
 We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
 with NVME RocksDB, used exclusively for RGWs, holding about 60b
 objects. We are splitting for the same reason as you - improved
 balance. We also thought long and hard before we began, concerned
 about impact, stability etc.
 
 We set target_max_misplaced_ratio to 0.1% initially, so we could
 retain some control and stop it again fairly quickly if we weren't
 happy with the behaviour. It also serves to limit the performance
 impact on the cluster, but unfortunately it also makes the whole
 process slower.
 
 We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
 issues with the cluster. We could go higher, but are not in a rush
 at this point. Sometimes nearfull osd warnings get high and MAX
 AVAIL on the data pool in `ceph df` gets low enough that we want to
 interrupt it. So, we set pg_num to whatever the current value is
 (ceph osd pool ls detail), and let it stabilise. Then the balancer
 gets to work once the misplaced objects drop below the ratio, and
 things balance out. Nearfull osds drop usually to zero, and MAX
 AVAIL goes up again.
 
 The above behaviour is because while they share the same threshold
 setting, the autoscaler only runs every minute, and it won't run
 when misplaced are over the threshold. Meanwhile, checks for the
 next PG to split happen much more frequently, so the balancer never
 wins that race.
 
 
 We didn't know how long to expect it all to take, but decided that
 any improvement in PG size was worth starting. We now estimate it
 will take another 2-3 weeks to complete, for a total of 4-5 weeks
 total.
 
 We have lost a drive or two during the process, and of course
 degraded objects went up, and more backfilling work got going. We
 paused splits for at least one of those, to make sure the degraded
 objects were sorted out as quick as possible. We can't be sure it
 went any faster though - there's always a long tail on that sort of
 thing.
 
 Inconsistent objects are found at least a couple of times a week,
 and to get them repairing we disable scrubs, wait until they're
 stopped, then set the repair going and reenable scrubs. I don't know
 if this is special to the current higher splitting load, but we
 haven't noticed it before.
 
 HTH,
 Greg.
 
 
 On 10/4/24 14:42, Eugen Block wrote:
> Thank you, Janne.
> I believe the default 5% target_max_misplaced_ratio would work as
> well, we've had good experience with that in the past, without the

[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Eugen Block

Thanks for chiming in.
They are on version 16.2.13 (I was already made aware of the bug you  
mentioned, thanks!) with wpq.
So far I haven't gotten an emergency call, so I assume everything is
calm (I hope). New hardware has been ordered, but it will take a couple
of weeks until it's delivered, installed and integrated; that's why we
decided to take action now.

I'll update the thread when I know more.

Thanks again!
Eugen

Zitat von Frédéric Nass :

Oh, and yeah, considering "The fullest OSD is already at 85% usage"  
best move for now would be to add new hardware/OSDs (to avoid  
reaching the backfill too full limit), prior to start the splitting  
PGs before or after enabling upmap balancer depending on how the PGs  
got rebalanced (well enough or not) after adding new OSDs.


BTW, what ceph version is this? You should make sure you're running  
v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:  
https://tracker.ceph.com/issues/53729


Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass  
frederic.n...@univ-lorraine.fr a écrit :



Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph
daemon osd.0 config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do
have a real impact on the recovery/backfilling speed. Just lower
osd_max_backfills to 1 before doing that.
If mClock scheduler then you might want to use a specific mClock profile as
suggested by Gregory (as osd_recovery_sleep* are not considered when using
mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
cluster only has 240, increasing osd_max_backfills to any values higher than
2-3 will not help much with the recovery/backfilling speed.

All the way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :


Thank you for input!
We started the split with max_backfills = 1 and watched for a few
minutes, then gradually increased it to 8. Now it's backfilling with
around 180 MB/s, not really much but since client impact has to be
avoided if possible, we decided to let that run for a couple of hours.
Then reevaluate the situation and maybe increase the backfills a bit
more.

Thanks!

Zitat von Gregory Orange :


We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
with NVME RocksDB, used exclusively for RGWs, holding about 60b
objects. We are splitting for the same reason as you - improved
balance. We also thought long and hard before we began, concerned
about impact, stability etc.

We set target_max_misplaced_ratio to 0.1% initially, so we could
retain some control and stop it again fairly quickly if we weren't
happy with the behaviour. It also serves to limit the performance
impact on the cluster, but unfortunately it also makes the whole
process slower.

We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
issues with the cluster. We could go higher, but are not in a rush
at this point. Sometimes nearfull osd warnings get high and MAX
AVAIL on the data pool in `ceph df` gets low enough that we want to
interrupt it. So, we set pg_num to whatever the current value is
(ceph osd pool ls detail), and let it stabilise. Then the balancer
gets to work once the misplaced objects drop below the ratio, and
things balance out. Nearfull osds drop usually to zero, and MAX
AVAIL goes up again.

The above behaviour is because while they share the same threshold
setting, the autoscaler only runs every minute, and it won't run
when misplaced are over the threshold. Meanwhile, checks for the
next PG to split happen much more frequently, so the balancer never
wins that race.


We didn't know how long to expect it all to take, but decided that
any improvement in PG size was worth starting. We now estimate it
will take another 2-3 weeks to complete, for a total of 4-5 weeks
total.

We have lost a drive or two during the process, and of course
degraded objects went up, and more backfilling work got going. We
paused splits for at least one of those, to make sure the degraded
objects were sorted out as quick as possible. We can't be sure it
went any faster though - there's always a long tail on that sort of
thing.

Inconsistent objects are found at least a couple of times a week,
and to get them repairing we disable scrubs, wait until they're
stopped, then set the repair going and reenable scrubs. I don't know
if this is special to the current higher splitting load, but we
haven't noticed it before.

HTH,
Greg.


On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as
well, we've had good experience with that in the past, without the
autoscaler. I just haven't dealt with such large PGs, I've been
warning them for two years (when the PGs were only almost half this
size) and now they finally started to listen. Well, they would
still ignore it if it wouldn't impact all kinds of 

[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Frédéric Nass

Oh, and yeah, considering "The fullest OSD is already at 85% usage", the best
move for now would be to add new hardware/OSDs (to avoid reaching the
backfill-too-full limit) prior to starting the PG splits, and enable the upmap
balancer before or after, depending on how well the PGs got rebalanced after
adding the new OSDs.

BTW, what ceph version is this? You should make sure you're running v16.2.11+ 
or v17.2.4+ before splitting PGs to avoid this nasty bug: 
https://tracker.ceph.com/issues/53729

Cheers,
Frédéric.

- Le 12 Avr 24, à 10:41, Frédéric Nass frederic.n...@univ-lorraine.fr a 
écrit :

> Hello Eugen,
> 
> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon 
> osd.0
> config show | grep osd_op_queue)
> 
> If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
> real
> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
> before doing that.
> If mClock scheduler then you might want to use a specific mClock profile as
> suggested by Gregory (as osd_recovery_sleep* are not considered when using
> mClock).
> 
> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
> cluster only has 240, increasing osd_max_backfills to any values higher than
> 2-3 will not help much with the recovery/backfilling speed.
> 
> All the way, you'll have to be patient. :-)
> 
> Cheers,
> Frédéric.
> 
> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
> 
>> Thank you for input!
>> We started the split with max_backfills = 1 and watched for a few
>> minutes, then gradually increased it to 8. Now it's backfilling with
>> around 180 MB/s, not really much but since client impact has to be
>> avoided if possible, we decided to let that run for a couple of hours.
>> Then reevaluate the situation and maybe increase the backfills a bit
>> more.
>> 
>> Thanks!
>> 
>> Zitat von Gregory Orange :
>> 
>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>>> objects. We are splitting for the same reason as you - improved
>>> balance. We also thought long and hard before we began, concerned
>>> about impact, stability etc.
>>>
>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>> retain some control and stop it again fairly quickly if we weren't
>>> happy with the behaviour. It also serves to limit the performance
>>> impact on the cluster, but unfortunately it also makes the whole
>>> process slower.
>>>
>>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>>> issues with the cluster. We could go higher, but are not in a rush
>>> at this point. Sometimes nearfull osd warnings get high and MAX
>>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>>> interrupt it. So, we set pg_num to whatever the current value is
>>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>>> gets to work once the misplaced objects drop below the ratio, and
>>> things balance out. Nearfull osds drop usually to zero, and MAX
>>> AVAIL goes up again.
>>>
>>> The above behaviour is because while they share the same threshold
>>> setting, the autoscaler only runs every minute, and it won't run
>>> when misplaced are over the threshold. Meanwhile, checks for the
>>> next PG to split happen much more frequently, so the balancer never
>>> wins that race.
>>>
>>>
>>> We didn't know how long to expect it all to take, but decided that
>>> any improvement in PG size was worth starting. We now estimate it
>>> will take another 2-3 weeks to complete, for a total of 4-5 weeks
>>> total.
>>>
>>> We have lost a drive or two during the process, and of course
>>> degraded objects went up, and more backfilling work got going. We
>>> paused splits for at least one of those, to make sure the degraded
>>> objects were sorted out as quick as possible. We can't be sure it
>>> went any faster though - there's always a long tail on that sort of
>>> thing.
>>>
>>> Inconsistent objects are found at least a couple of times a week,
>>> and to get them repairing we disable scrubs, wait until they're
>>> stopped, then set the repair going and reenable scrubs. I don't know
>>> if this is special to the current higher splitting load, but we
>>> haven't noticed it before.
>>>
>>> HTH,
>>> Greg.
>>>
>>>
>>> On 10/4/24 14:42, Eugen Block wrote:
 Thank you, Janne.
 I believe the default 5% target_max_misplaced_ratio would work as
 well, we've had good experience with that in the past, without the
 autoscaler. I just haven't dealt with such large PGs, I've been
 warning them for two years (when the PGs were only almost half this
 size) and now they finally started to listen. Well, they would
 still ignore it if it wouldn't impact all kinds of things now. ;-)

 Thanks,
 Eugen

 Zitat von Janne Johansson :

> Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :
>> I'm 

[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Frédéric Nass

Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon osd.0 
config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
real impact on the recovery/backfilling speed. Just lower osd_max_backfills to 
1 before doing that.
If mClock scheduler then you might want to use a specific mClock profile as 
suggested by Gregory (as osd_recovery_sleep* are not considered when using 
mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this 
cluster only has 240, increasing osd_max_backfills to any values higher than 
2-3 will not help much with the recovery/backfilling speed.

All the way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :

> Thank you for input!
> We started the split with max_backfills = 1 and watched for a few
> minutes, then gradually increased it to 8. Now it's backfilling with
> around 180 MB/s, not really much but since client impact has to be
> avoided if possible, we decided to let that run for a couple of hours.
> Then reevaluate the situation and maybe increase the backfills a bit
> more.
> 
> Thanks!
> 
> Zitat von Gregory Orange :
> 
>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>> objects. We are splitting for the same reason as you - improved
>> balance. We also thought long and hard before we began, concerned
>> about impact, stability etc.
>>
>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>> retain some control and stop it again fairly quickly if we weren't
>> happy with the behaviour. It also serves to limit the performance
>> impact on the cluster, but unfortunately it also makes the whole
>> process slower.
>>
>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>> issues with the cluster. We could go higher, but are not in a rush
>> at this point. Sometimes nearfull osd warnings get high and MAX
>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>> interrupt it. So, we set pg_num to whatever the current value is
>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>> gets to work once the misplaced objects drop below the ratio, and
>> things balance out. Nearfull osds drop usually to zero, and MAX
>> AVAIL goes up again.
>>
>> The above behaviour is because while they share the same threshold
>> setting, the autoscaler only runs every minute, and it won't run
>> when misplaced are over the threshold. Meanwhile, checks for the
>> next PG to split happen much more frequently, so the balancer never
>> wins that race.
>>
>>
>> We didn't know how long to expect it all to take, but decided that
>> any improvement in PG size was worth starting. We now estimate it
>> will take another 2-3 weeks to complete, for a total of 4-5 weeks
>> total.
>>
>> We have lost a drive or two during the process, and of course
>> degraded objects went up, and more backfilling work got going. We
>> paused splits for at least one of those, to make sure the degraded
>> objects were sorted out as quick as possible. We can't be sure it
>> went any faster though - there's always a long tail on that sort of
>> thing.
>>
>> Inconsistent objects are found at least a couple of times a week,
>> and to get them repairing we disable scrubs, wait until they're
>> stopped, then set the repair going and reenable scrubs. I don't know
>> if this is special to the current higher splitting load, but we
>> haven't noticed it before.
>>
>> HTH,
>> Greg.
>>
>>
>> On 10/4/24 14:42, Eugen Block wrote:
>>> Thank you, Janne.
>>> I believe the default 5% target_max_misplaced_ratio would work as
>>> well, we've had good experience with that in the past, without the
>>> autoscaler. I just haven't dealt with such large PGs, I've been
>>> warning them for two years (when the PGs were only almost half this
>>> size) and now they finally started to listen. Well, they would
>>> still ignore it if it wouldn't impact all kinds of things now. ;-)
>>>
>>> Thanks,
>>> Eugen
>>>
>>> Zitat von Janne Johansson :
>>>
 Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :
> I'm trying to estimate the possible impact when large PGs are
> splitted. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

 If you ask for small increases of pg_num, it will only split that many
 PGs at a time, so while there will be a lot of data movement, (50% due
 to half of the data needs to go to another newly made PG, and on top
 of that, PGs per OSD will change, but also the balancing can now work
 better) it will not be affecting the whole cluster if you increase
 with 

[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Gregory Orange
Setting osd_max_backfills at much more than 1 on HDD spinners seems 
anathema to me, and I recall reading others saying the same thing. 
That's because seek time is a major constraint on them, so keeping 
activity as contiguous as possible is going to help performance. Maybe 
pushing it to 2-3 is okay, but we haven't seen a lot of throughput 
benefit. YMMV.


The major aggregate speed improver for us was to increase 
target_max_misplaced_ratio because of the increased parallelism it 
induces. Also changing osd_mclock_profile is useful, at the following 
rough ratios, being aware that it can impact client traffic:

* high_client_ops 100%
* balanced 150%
* high_recovery_ops 200%

I've just read the help again (thank you whoever implemented ceph config 
help ...) and have been reminded again that due to primary and 
non-primary reservations, setting it to e.g. 1 means it could see 2 
shards doing recovery IO on the same OSD.
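
A short, hedged illustration of the knobs Greg mentions; the profile
value is one of the three listed above, not a recommendation:

  ceph config help osd_mclock_profile                        # the built-in help referred to above
  ceph config set osd osd_mclock_profile high_recovery_ops   # switch profile cluster-wide
  ceph config set osd osd_max_backfills 1                    # conservative value for HDD spinners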



On 10/4/24 18:54, Eugen Block wrote:

Thank you for input!
We started the split with max_backfills = 1 and watched for a few 
minutes, then gradually increased it to 8. Now it's backfilling with 
around 180 MB/s, not really much but since client impact has to be 
avoided if possible, we decided to let that run for a couple of hours. 
Then reevaluate the situation and maybe increase the backfills a bit more.


Thanks!

Zitat von Gregory Orange :

We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs 
with NVME RocksDB, used exclusively for RGWs, holding about 60b 
objects. We are splitting for the same reason as you - improved 
balance. We also thought long and hard before we began, concerned 
about impact, stability etc.


We set target_max_misplaced_ratio to 0.1% initially, so we could 
retain some control and stop it again fairly quickly if we weren't 
happy with the behaviour. It also serves to limit the performance 
impact on the cluster, but unfortunately it also makes the whole 
process slower.


We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No 
issues with the cluster. We could go higher, but are not in a rush at 
this point. Sometimes nearfull osd warnings get high and MAX AVAIL on 
the data pool in `ceph df` gets low enough that we want to interrupt 
it. So, we set pg_num to whatever the current value is (ceph osd pool 
ls detail), and let it stabilise. Then the balancer gets to work once 
the misplaced objects drop below the ratio, and things balance out. 
Nearfull osds drop usually to zero, and MAX AVAIL goes up again.


The above behaviour is because while they share the same threshold 
setting, the autoscaler only runs every minute, and it won't run when 
misplaced are over the threshold. Meanwhile, checks for the next PG to 
split happen much more frequently, so the balancer never wins that race.



We didn't know how long to expect it all to take, but decided that any 
improvement in PG size was worth starting. We now estimate it will 
take another 2-3 weeks to complete, for a total of 4-5 weeks total.


We have lost a drive or two during the process, and of course degraded 
objects went up, and more backfilling work got going. We paused splits 
for at least one of those, to make sure the degraded objects were 
sorted out as quick as possible. We can't be sure it went any faster 
though - there's always a long tail on that sort of thing.


Inconsistent objects are found at least a couple of times a week, and 
to get them repairing we disable scrubs, wait until they're stopped, 
then set the repair going and reenable scrubs. I don't know if this is 
special to the current higher splitting load, but we haven't noticed 
it before.


HTH,
Greg.


On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as 
well, we've had good experience with that in the past, without the 
autoscaler. I just haven't dealt with such large PGs, I've been 
warning them for two years (when the PGs were only almost half this 
size) and now they finally started to listen. Well, they would still 
ignore it if it wouldn't impact all kinds of things now. ;-)


Thanks,
Eugen

Zitat von Janne Johansson :


Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

I'm trying to estimate the possible impact when large PGs are
splitted. Here's one example of such a PG:

PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]


If you ask for small increases of pg_num, it will only split that many
PGs at a time, so while there will be a lot of data movement, (50% due
to half of the data needs to go to another newly made PG, and on top
of that, PGs per OSD will change, but also the balancing can now work
better) it will not be affecting the whole cluster if you increase
with say, 8 pg_nums at a time. As per the other reply, if you bump the
number with a small 

[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Eugen Block

Thank you for input!
We started the split with max_backfills = 1 and watched for a few  
minutes, then gradually increased it to 8. Now it's backfilling with  
around 180 MB/s, not really much but since client impact has to be  
avoided if possible, we decided to let that run for a couple of hours.  
Then reevaluate the situation and maybe increase the backfills a bit  
more.


Thanks!

Zitat von Gregory Orange :

We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs  
with NVME RocksDB, used exclusively for RGWs, holding about 60b  
objects. We are splitting for the same reason as you - improved  
balance. We also thought long and hard before we began, concerned  
about impact, stability etc.


We set target_max_misplaced_ratio to 0.1% initially, so we could  
retain some control and stop it again fairly quickly if we weren't  
happy with the behaviour. It also serves to limit the performance  
impact on the cluster, but unfortunately it also makes the whole  
process slower.


We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No  
issues with the cluster. We could go higher, but are not in a rush  
at this point. Sometimes nearfull osd warnings get high and MAX  
AVAIL on the data pool in `ceph df` gets low enough that we want to  
interrupt it. So, we set pg_num to whatever the current value is  
(ceph osd pool ls detail), and let it stabilise. Then the balancer  
gets to work once the misplaced objects drop below the ratio, and  
things balance out. Nearfull osds drop usually to zero, and MAX  
AVAIL goes up again.


The above behaviour is because while they share the same threshold  
setting, the autoscaler only runs every minute, and it won't run  
when misplaced are over the threshold. Meanwhile, checks for the  
next PG to split happen much more frequently, so the balancer never  
wins that race.



We didn't know how long to expect it all to take, but decided that  
any improvement in PG size was worth starting. We now estimate it  
will take another 2-3 weeks to complete, for a total of 4-5 weeks  
total.


We have lost a drive or two during the process, and of course  
degraded objects went up, and more backfilling work got going. We  
paused splits for at least one of those, to make sure the degraded  
objects were sorted out as quick as possible. We can't be sure it  
went any faster though - there's always a long tail on that sort of  
thing.


Inconsistent objects are found at least a couple of times a week,  
and to get them repairing we disable scrubs, wait until they're  
stopped, then set the repair going and reenable scrubs. I don't know  
if this is special to the current higher splitting load, but we  
haven't noticed it before.


HTH,
Greg.


On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as  
well, we've had good experience with that in the past, without the  
autoscaler. I just haven't dealt with such large PGs, I've been  
warning them for two years (when the PGs were only almost half this  
size) and now they finally started to listen. Well, they would  
still ignore it if it wouldn't impact all kinds of things now. ;-)


Thanks,
Eugen

Zitat von Janne Johansson :


Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

I'm trying to estimate the possible impact when large PGs are
splitted. Here's one example of such a PG:

PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]


If you ask for small increases of pg_num, it will only split that many
PGs at a time, so while there will be a lot of data movement, (50% due
to half of the data needs to go to another newly made PG, and on top
of that, PGs per OSD will change, but also the balancing can now work
better) it will not be affecting the whole cluster if you increase
with say, 8 pg_nums at a time. As per the other reply, if you bump the
number with a small amount - wait for HEALTH_OK - bump some more it
will take a lot of calendar time, but have rather small impact. My
view of it is basically that this will be far less impactful than if
you lose a whole OSD, and hopefully your cluster can survive this
event, so it should be able to handle a slow trickle of PG splits too.

You can set a target number for the pool and let the autoscaler run a
few splits at a time, there are some settings to look at on how
aggressive the autoscaler will be, so it doesn't have to be
manual/scripted, but it's not very hard to script it if you are unsure
about the amount of work the autoscaler will start at any given time.



--
May the most significant bit of your life be positive.





--
Gregory Orange

System Administrator, Scientific 

[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Konstantin Shalygin


> On 10 Apr 2024, at 01:00, Eugen Block  wrote:
> 
> I appreciate your message, it really sounds tough (9 months, really?!). But 
> thanks for the reassurance :-)

Yes, the total "make this project great again" effort took 16 months, I think.
This was my work.

The first problem, once PGs held over 1M objects, was deletion [1]: it's just
impossible to delete objects from such a 'stray' PG.
The second was that the code handling nearfull & backfillfull just doesn't
work for these OSDs [2], because it uses the DATA field (the objects) instead
of the RAW field (DATA + the RocksDB database) for its computations.
The third was a minor but puzzling statistics metric issue [3].
And last but not least (and still present in master): when an object lock is
acquired and the object is absent on the primary OSD, the replica OSDs in the
acting set crash [4]. This can ruin client IO until the OSDs restart and
recover.

As of now, not all collection_list fixes have been merged [5], but since
14.2.22 things are much better than before...

> They don’t have any other options so we’ll have to start that process anyway, 
> probably tomorrow. We’ll see how it goes…


Yes, you just have to start, and then we’ll see


Thanks,
k

[1] https://tracker.ceph.com/issues/47044 + 
https://tracker.ceph.com/issues/45765 -> https://tracker.ceph.com/issues/50466
[2] https://tracker.ceph.com/issues/50533
[3] https://tracker.ceph.com/issues/52512
[4] https://tracker.ceph.com/issues/52513
[5] https://tracker.ceph.com/issues/58274


[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Gregory Orange
We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs with 
NVME RocksDB, used exclusively for RGWs, holding about 60b objects. We 
are splitting for the same reason as you - improved balance. We also 
thought long and hard before we began, concerned about impact, stability 
etc.


We set target_max_misplaced_ratio to 0.1% initially, so we could retain 
some control and stop it again fairly quickly if we weren't happy with 
the behaviour. It also serves to limit the performance impact on the 
cluster, but unfortunately it also makes the whole process slower.


We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No 
issues with the cluster. We could go higher, but are not in a rush at 
this point. Sometimes nearfull osd warnings get high and MAX AVAIL on 
the data pool in `ceph df` gets low enough that we want to interrupt it. 
So, we set pg_num to whatever the current value is (ceph osd pool ls 
detail), and let it stabilise. Then the balancer gets to work once the 
misplaced objects drop below the ratio, and things balance out. Nearfull 
osds drop usually to zero, and MAX AVAIL goes up again.


The above behaviour is because while they share the same threshold 
setting, the autoscaler only runs every minute, and it won't run when 
misplaced are over the threshold. Meanwhile, checks for the next PG to 
split happen much more frequently, so the balancer never wins that race.



We didn't know how long to expect it all to take, but decided that any 
improvement in PG size was worth starting. We now estimate it will take 
another 2-3 weeks to complete, for a total of 4-5 weeks total.


We have lost a drive or two during the process, and of course degraded 
objects went up, and more backfilling work got going. We paused splits 
for at least one of those, to make sure the degraded objects were sorted 
out as quick as possible. We can't be sure it went any faster though - 
there's always a long tail on that sort of thing.


Inconsistent objects are found at least a couple of times a week, and to 
get them repairing we disable scrubs, wait until they're stopped, then 
set the repair going and reenable scrubs. I don't know if this is 
special to the current higher splitting load, but we haven't noticed it 
before.


HTH,
Greg.


On 10/4/24 14:42, Eugen Block wrote:

Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as well, 
we've had good experience with that in the past, without the autoscaler. 
I just haven't dealt with such large PGs, I've been warning them for two 
years (when the PGs were only almost half this size) and now they 
finally started to listen. Well, they would still ignore it if it 
wouldn't impact all kinds of things now. ;-)


Thanks,
Eugen

Zitat von Janne Johansson :


Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

I'm trying to estimate the possible impact when large PGs are
splitted. Here's one example of such a PG:

PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]


If you ask for small increases of pg_num, it will only split that many
PGs at a time, so while there will be a lot of data movement, (50% due
to half of the data needs to go to another newly made PG, and on top
of that, PGs per OSD will change, but also the balancing can now work
better) it will not be affecting the whole cluster if you increase
with say, 8 pg_nums at a time. As per the other reply, if you bump the
number with a small amount - wait for HEALTH_OK - bump some more it
will take a lot of calendar time, but have rather small impact. My
view of it is basically that this will be far less impactful than if
you lose a whole OSD, and hopefully your cluster can survive this
event, so it should be able to handle a slow trickle of PG splits too.

You can set a target number for the pool and let the autoscaler run a
few splits at a time, there are some settings to look at on how
aggressive the autoscaler will be, so it doesn't have to be
manual/scripted, but it's not very hard to script it if you are unsure
about the amount of work the autoscaler will start at any given time.



--
May the most significant bit of your life be positive.





--
Gregory Orange

System Administrator, Scientific Platforms Team
Pawsey Supercomputing Centre, CSIRO


[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Eugen Block

Thank you, Janne.
I believe the default 5% target_max_misplaced_ratio would work as  
well, we've had good experience with that in the past, without the  
autoscaler. I just haven't dealt with such large PGs, I've been  
warning them for two years (when the PGs were only almost half this  
size) and now they finally started to listen. Well, they would still  
ignore it if it wouldn't impact all kinds of things now. ;-)


Thanks,
Eugen

Zitat von Janne Johansson :


Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :

I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:

PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]


If you ask for small increases of pg_num, only that many PGs will be
split at a time. There will still be a lot of data movement (50%,
because half of the data has to move to the newly created PGs, and on
top of that the number of PGs per OSD changes and the balancer can then
work better), but it will not affect the whole cluster at once if you
increase pg_num by, say, 8 at a time. As per the other reply, if you
bump the number by a small amount, wait for HEALTH_OK, then bump some
more, it will take a lot of calendar time but have a rather small
impact. My view is basically that this will be far less impactful than
losing a whole OSD, which your cluster should hopefully survive, so it
should also be able to handle a slow trickle of PG splits.

You can also set a target number for the pool and let the autoscaler run
a few splits at a time; there are settings that control how aggressive
the autoscaler will be, so the process doesn't have to be manual or
scripted. That said, it's not very hard to script if you are unsure how
much work the autoscaler will start at any given time.



--
May the most significant bit of your life be positive.





[ceph-users] Re: Impact of large PG splits

2024-04-10 Thread Janne Johansson
Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :
> I'm trying to estimate the possible impact when large PGs are
> split. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

If you ask for small increases of pg_num, only that many PGs will be
split at a time. There will still be a lot of data movement (50%,
because half of the data has to move to the newly created PGs, and on
top of that the number of PGs per OSD changes and the balancer can then
work better), but it will not affect the whole cluster at once if you
increase pg_num by, say, 8 at a time. As per the other reply, if you
bump the number by a small amount, wait for HEALTH_OK, then bump some
more, it will take a lot of calendar time but have a rather small
impact. My view is basically that this will be far less impactful than
losing a whole OSD, which your cluster should hopefully survive, so it
should also be able to handle a slow trickle of PG splits.
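
A rough sketch of that bump-a-little, wait-for-HEALTH_OK, bump-again
loop (pool name, step size and target are only examples; requires jq).
On recent releases you can also just set pg_num to the final target and
let the mgr raise pgp_num gradually, bounded by
target_max_misplaced_ratio.

```
POOL=rgw.buckets.data   # example pool name
TARGET=2048             # example final pg_num
STEP=8

CUR=$(ceph osd pool get "$POOL" pg_num -f json | jq -r .pg_num)
while [ "$CUR" -lt "$TARGET" ]; do
    NEXT=$(( CUR + STEP > TARGET ? TARGET : CUR + STEP ))
    ceph osd pool set "$POOL" pg_num "$NEXT"
    # give the cluster a moment to start moving data,
    # then wait until it is back to HEALTH_OK before the next bump
    sleep 60
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
    CUR=$NEXT
done
```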

You can also set a target number for the pool and let the autoscaler run
a few splits at a time; there are settings that control how aggressive
the autoscaler will be, so the process doesn't have to be manual or
scripted. That said, it's not very hard to script if you are unsure how
much work the autoscaler will start at any given time.
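
For the autoscaler route, the relevant knobs look roughly like this
(pool name and values are only examples; option names as in recent Ceph
releases):

```
# let the autoscaler manage pg_num for the pool
ceph osd pool set rgw.buckets.data pg_autoscale_mode on

# tell it how much of the cluster the pool is expected to use,
# so it can pick a pg_num target on its own
ceph osd pool set rgw.buckets.data target_size_ratio 0.8

# how aggressively PGs are split/moved is bounded by the mgr's
# target_max_misplaced_ratio (default 0.05 = 5% misplaced objects)
ceph config get mgr target_max_misplaced_ratio
```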



-- 
May the most significant bit of your life be positive.


[ceph-users] Re: Impact of large PG splits

2024-04-09 Thread Eugen Block

Hi,
I appreciate your message, it really sounds tough (9 months,  
really?!). But thanks for the reassurance :-)
They don’t have any other options so we’ll have to start that process  
anyway, probably tomorrow. We’ll see how it goes…


Zitat von Konstantin Shalygin :


Hi Eugen!

I have a case where PGs have millions of objects each, like this:

```
root@host# ./show_osd_pool_pg_usage.sh  | less | head
id  used_mbytes used_objects  omap_used_mbytes  omap_used_keys
--  ---     --
17.c91  1213.2482748031616  2539152   0 0
17.9ae  1213.3145303726196  2539025   0 0
17.1a4  1213.432228088379   2539752   0 0
17.8f4  1213.4958791732788  2539831   0 0
17.f9   1213.5339193344116  2539837   0 0
17.c9d  1213.564414024353   2540014   0 0
17.89   1213.6339054107666  2540183   0 0
17.412  1213.6393299102783  2539797   0 0
```

And the OSDs were very small, like 1TB, with ~150-200GB of RocksDB.
What you see here are the already split PGs. So one OSD was serving
64 PGs * 4M objects = 256,000,000 objects...


The main problem was: to remove something, you need to move something,
and while the move is in progress, nothing is deleted. Also, deleting is
slower than writing, so doing everything in a single operation was
impossible. I did it manually over 9 months. After the splitting of some
PGs was completed, I took other PGs away from the most crowded (from the
operator's point of view, problematic) OSDs. The pgremapper [1] helped
me with this. As far as I remember, in this way I went from 2048 to 3000
PGs, then I was able to set 4096 PGs, after which it became possible to
move to 4TB NVMe.


Your case doesn't look that scary. Firstly, your 85% means that you
still have hundreds of free gigabytes on each of those 8TB drives. If no
new data arrives, the reservation mechanism is sufficient and after some
time the process will end. On the other hand, I had a replicated pool,
so compared to EC my case was simpler.


In any case, it's worth trying to use the upmap mechanism to its full
potential.



Good luck,
k

[1] https://github.com/digitalocean/pgremapper


On 9 Apr 2024, at 11:39, Eugen Block  wrote:

I'm trying to estimate the possible impact when large PGs are
split. Here's one example of such a PG:


PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]


Their main application is RGW on EC (currently 1024 PGs on 240 OSDs),
8TB HDDs backed by SSDs. There are 6 RGWs running behind HAProxies. It
took me a while to convince them to do a PG split and now they're trying
to assess how big the impact could be. The fullest OSD is already at 85%
usage, the least filled one at 59%, so there is definitely room for
better balancing, which will be necessary until the new hardware
arrives. The current distribution is around 100 PGs per OSD, which would
usually be fine, but since the PGs are that large, even a difference of
a few PGs has a huge impact on the OSD utilization.


I'm targeting 2048 PGs for that pool for now and will probably do
another split once the new hardware has been integrated.

Any comments are appreciated!
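
To put rough numbers on the PG size (back-of-the-envelope only, based on
the single example PG above and ignoring how the data is spread across
the EC shards):

```
414403098409 bytes ≈ 386 GiB of logical data per PG at pg_num = 1024
after a split to pg_num = 2048:   ≈ 193 GiB per PG
after another split to pg_num = 4096: ≈  96 GiB per PG
```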





[ceph-users] Re: Impact of large PG splits

2024-04-09 Thread Konstantin Shalygin
Hi Eugen!

I have a case where PGs have millions of objects each, like this:

```
root@host# ./show_osd_pool_pg_usage.sh  | less | head
id  used_mbytes used_objects  omap_used_mbytes  omap_used_keys
--  ---     --
17.c91  1213.2482748031616  2539152   0 0
17.9ae  1213.3145303726196  2539025   0 0
17.1a4  1213.432228088379   2539752   0 0
17.8f4  1213.4958791732788  2539831   0 0
17.f9   1213.5339193344116  2539837   0 0
17.c9d  1213.564414024353   2540014   0 0
17.89   1213.6339054107666  2540183   0 0
17.412  1213.6393299102783  2539797   0 0
```

And the OSDs were very small, like 1TB, with ~150-200GB of RocksDB. What you see
here are the already split PGs. So one OSD was serving 64 PGs * 4M objects =
256,000,000 objects...

The main problem was: to remove something, you need to move something, and while
the move is in progress, nothing is deleted. Also, deleting is slower than
writing, so doing everything in a single operation was impossible. I did it
manually over 9 months. After the splitting of some PGs was completed, I took
other PGs away from the most crowded (from the operator's point of view,
problematic) OSDs. The pgremapper [1] helped me with this. As far as I remember,
in this way I went from 2048 to 3000 PGs, then I was able to set 4096 PGs, after
which it became possible to move to 4TB NVMe.
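
For anyone who wants to do the same kind of manual PG moves: as far as I
understand, pgremapper manipulates upmap entries under the hood, so a
minimal sketch of doing one move by hand looks like this (the PG ID is
taken from the listing above, the OSD numbers are made up):

```
# upmap entries require at least luminous-capable clients
ceph osd set-require-min-compat-client luminous

# remap PG 17.c91 away from a crowded osd.12 onto an emptier osd.34
ceph osd pg-upmap-items 17.c91 12 34

# drop the mapping again once the cluster has been rebalanced properly
ceph osd rm-pg-upmap-items 17.c91
```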

Your case doesn't look that scary. Firstly, your 85% means that you still have
hundreds of free gigabytes on each of those 8TB drives. If no new data arrives,
the reservation mechanism is sufficient and after some time the process will end.
On the other hand, I had a replicated pool, so compared to EC my case was simpler.

In any case, it's worth trying to use the upmap mechanism to its full potential.
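
If the balancer module isn't running in upmap mode yet, enabling it is
straightforward (a sketch; it also respects the mgr's
target_max_misplaced_ratio, so it won't move more than that fraction of
objects at a time):

```
ceph balancer status
ceph balancer mode upmap
ceph balancer on
```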


Good luck,
k

[1] https://github.com/digitalocean/pgremapper

> On 9 Apr 2024, at 11:39, Eugen Block  wrote:
> 
> I'm trying to estimate the possible impact when large PGs are split.
> Here's one example of such a PG:
> 
> PG_STAT  OBJECTS  BYTES         OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  UP
> 86.3ff   277708   414403098409  0            0           3092  3092      [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]
> 
> Their main application is RGW on EC (currently 1024 PGs on 240 OSDs), 8TB
> HDDs backed by SSDs. There are 6 RGWs running behind HAProxies. It took me a
> while to convince them to do a PG split and now they're trying to assess how
> big the impact could be. The fullest OSD is already at 85% usage, the least
> filled one at 59%, so there is definitely room for better balancing, which
> will be necessary until the new hardware arrives. The current distribution is
> around 100 PGs per OSD, which would usually be fine, but since the PGs are
> that large, even a difference of a few PGs has a huge impact on the OSD
> utilization.
> 
> I'm targeting 2048 PGs for that pool for now and will probably do another
> split once the new hardware has been integrated.
> Any comments are appreciated!
