[ceph-users] Re: Modify pgp number after pg_num increased

2021-09-22 Thread Eugen Block

Hi,

IIRC in a different thread you pasted your max-backfill config and it  
was the lowest possible value (1), right? That's why your backfill is  
slow.
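For reference, a minimal sketch of how to check and temporarily raise that
setting on a running cluster (the value 4 below is only an example, not a
recommendation):

  # show the current value stored in the config database
  ceph config get osd osd_max_backfills
  # raise it cluster-wide
  ceph config set osd osd_max_backfills 4
  # or inject it into the running OSDs only
  ceph tell 'osd.*' injectargs '--osd-max-backfills 4'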



Zitat von "Szabo, Istvan (Agoda)" :


Hi,

By default, in newer versions of Ceph, when you increase the
pg_num the cluster will slowly increase the pgp_num up to
the value of the pg_num.
I've increased the EC data pool's pg_num from 32 to 128, but 1 node has
been added to the cluster and it's very slow.


pool 28 'hkg.rgw.buckets.data' erasure profile data-ec size 6
min_size 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 55
pgp_num_target 128 autoscale_mode warn last_change 16443 lfor
0/0/14828 flags hashpspool stripe_width 16384 application rgw

At the moment, 55 of the 128 PGs have been done.
Is it safe to set the pgp_num to 64 at this stage and wait until the
data has been rebalanced to the newly added node?


Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Eugen Block

Thanks for the summary, Dan!

I'm still hesitant to upgrade our production environment from N to O,
though your experience sounds reassuring. I have one question: did you
also switch to cephadm and containerize all daemons? We haven't made a
decision yet, but I guess at some point we'll have to switch anyway,
so we could also just get over it. :-D We'll need to discuss it with
the team...


Thanks,
Eugen


Quoting Dan van der Ster :


Dear friends,

This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14,
successfully, following the procedure at
https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus
It's a 400TB cluster which is 10% used with 72 osds (block=hdd,
block.db=ssd) and 40M objects.

* The mons upgraded cleanly as expected.
* One minor surprise was that the mgrs respawned themselves moments
after the leader restarted into octopus:

2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0)
16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons
cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256
in quorum (ranks 0,1,2)

2021-09-21 10:16:39.046 7fae3caf8700  1 mgr handle_mgr_map respawning
because set of enabled modules changed!

This didn't create any problems AFAICT.

* The osds performed the expected fsck after restarting. Their logs
are spammed with things like

2021-09-21T11:15:23.233+0200 7f85901bd700 -1
bluestore(/var/lib/ceph/osd/ceph-1) fsck warning:
#174:1e024a6e:::10009663a55.:head# has omap that is not
per-pool or pgmeta

but that is fully expected AFAIU. Each osd took just under 10  
minutes to fsck:


2021-09-21T11:22:27.188+0200 7f85a3a2bf00  1
bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0
errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056
seconds

For reference, this cluster was created many major releases ago (maybe
firefly) but osds were probably re-created in luminous.
The memory usage was quite normal, we didn't suffer any OOMs.

* The active mds restarted into octopus without incident.

In summary it was a very smooth upgrade. After a week of observation
we'll proceed with more production clusters.
For our largest S3 cluster with slow hdds, we expect huge fsck
transactions, so will wait for https://github.com/ceph/ceph/pull/42958
to be merged before upgrading.

Best Regards, and thanks to all the devs for their work,

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Balancer vs. Autoscaler

2021-09-22 Thread Jan-Philipp Litza
Hi everyone,

I had the autoscale_mode set to "on" and the autoscaler went to work and
started adjusting the number of PGs in that pool. Since this implies a
huge shift in data, the reweights that the balancer had carefully
adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
become nearful (we sadly have very different sized OSDs).

Now apparently both manager modules, balancer and pg_autoscaler, have
the same threshold for operation, namely target_max_misplaced_ratio. So
the balancer won't become active as long as the pg_autoscaler is still
adjusting the number of PGs.

I already set the autoscale_mode to "warn" on all pools, but apparently
the autoscaler is determined to finish what it started.

Is there any way to pause the autoscaler so the balancer has a chance of
fixing the reweights? Because even in manual mode (ceph balancer
optimize), the balancer won't compute a plan when the misplaced ratio is
higher than target_max_misplaced_ratio.

I know about "ceph osd reweight-*", but they adjust the reweights
(visible in "ceph osd tree"), whereas the balancer adjusts the "compat
weight-set", which I don't know how to convert back to the old-style
reweights.
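For context, a minimal sketch of the knobs discussed above, assuming a pool
named testpool (the pool name is only illustrative); none of these pause the
autoscaler, they only steer the two modules:

  # put the autoscaler into warn-only mode for one pool (already done here)
  ceph osd pool set testpool pg_autoscale_mode warn
  # inspect the shared misplaced-ratio threshold both modules respect
  ceph config get mgr target_max_misplaced_ratio
  # ask the balancer for a manual plan once the misplaced ratio drops below it
  ceph balancer optimize myplan
  ceph balancer show myplan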

Best regards,
Jan-Philipp
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Dan van der Ster
Hi Eugen,

All of our prod clusters are still old school rpm packages managed by
our private puppet manifests. Even our newest pacific pre-prod cluster
is still managed like that.
We have a side project to test and move to cephadm / containers but
that is still a WIP. (Our situation is complicated by the fact that
we'll need to continue puppet managing things like firewall with
cephadm doing the daemon placement).

Cheers, Dan


On Wed, Sep 22, 2021 at 10:32 AM Eugen Block  wrote:
>
> Thanks for the summary, Dan!
>
> I'm still hesitating upgrading our production environment from N to O,
> your experience sounds reassuring though. I have one question, did you
> also switch to cephadm and containerize all daemons? We haven't made a
> decision yet, but I guess at some point we'll have to switch anyway,
> so we could also just get over it. :-D We'll need to discuss it with
> the team...
>
> Thanks,
> Eugen
>
>
> Quoting Dan van der Ster :
>
> > Dear friends,
> >
> > This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14,
> > successfully, following the procedure at
> > https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus
> > It's a 400TB cluster which is 10% used with 72 osds (block=hdd,
> > block.db=ssd) and 40M objects.
> >
> > * The mons upgraded cleanly as expected.
> > * One minor surprise was that the mgrs respawned themselves moments
> > after the leader restarted into octopus:
> >
> > 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0)
> > 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons
> > cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256
> > in quorum (ranks 0,1,2)
> >
> > 2021-09-21 10:16:39.046 7fae3caf8700  1 mgr handle_mgr_map respawning
> > because set of enabled modules changed!
> >
> > This didn't create any problems AFAICT.
> >
> > * The osds performed the expected fsck after restarting. Their logs
> > are spammed with things like
> >
> > 2021-09-21T11:15:23.233+0200 7f85901bd700 -1
> > bluestore(/var/lib/ceph/osd/ceph-1) fsck warning:
> > #174:1e024a6e:::10009663a55.:head# has omap that is not
> > per-pool or pgmeta
> >
> > but that is fully expected AFAIU. Each osd took just under 10
> > minutes to fsck:
> >
> > 2021-09-21T11:22:27.188+0200 7f85a3a2bf00  1
> > bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0
> > errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056
> > seconds
> >
> > For reference, this cluster was created many major releases ago (maybe
> > firefly) but osds were probably re-created in luminous.
> > The memory usage was quite normal, we didn't suffer any OOMs.
> >
> > * The active mds restarted into octopus without incident.
> >
> > In summary it was a very smooth upgrade. After a week of observation
> > we'll proceed with more production clusters.
> > For our largest S3 cluster with slow hdds, we expect huge fsck
> > transactions, so will wait for https://github.com/ceph/ceph/pull/42958
> > to be merged before upgrading.
> >
> > Best Regards, and thanks to all the devs for their work,
> >
> > Dan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Dan van der Ster
Hi Andras,

I'm not aware of any showstoppers to move directly to pacific. Indeed
we already run pacific on a new cluster we built for our users to try
cephfs snapshots at scale. That cluster was created with octopus a few
months ago then upgraded to pacific at 16.2.4 to take advantage of the
stray dentry splitting.

Why octopus and not pacific directly for the existing bulk of our prod
clusters? We're just being conservative, especially concerning
the fsck omap upgrade on all the osds. Since it went well for this
cluster, I expect it will similarly go well for the other rbd and
cephfs clusters. We'll tread more carefully for the S3 clusters, but
with the PR mentioned earlier I expect it to go well.

My expectation is that we'll only run octopus for a short while before
we move to pacific in one of the next point releases there.
Before octopus we usually haven't moved our most critical clusters to
the next major release until around ~.8 -- it's usually by then that
all major issues have been flushed out, AFAICT.

Cheers, Dan


On Wed, Sep 22, 2021 at 11:19 AM Andras Pataki
 wrote:
>
> Hi Dan,
>
> This is excellent to hear - we've also been a bit hesitant to upgrade
> from Nautilus (which has been working so well for us).  One question:
> did you/would you consider upgrading straight to Pacific from Nautilus?
> Can you share your thoughts that lead you to Octopus first?
>
> Thanks,
>
> Andras
>
>
> On 9/21/21 06:09, Dan van der Ster wrote:
> > Dear friends,
> >
> > This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14,
> > successfully, following the procedure at
> > https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus
> > It's a 400TB cluster which is 10% used with 72 osds (block=hdd,
> > block.db=ssd) and 40M objects.
> >
> > * The mons upgraded cleanly as expected.
> > * One minor surprise was that the mgrs respawned themselves moments
> > after the leader restarted into octopus:
> >
> > 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0)
> > 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons
> > cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256
> > in quorum (ranks 0,1,2)
> >
> > 2021-09-21 10:16:39.046 7fae3caf8700  1 mgr handle_mgr_map respawning
> > because set of enabled modules changed!
> >
> > This didn't create any problems AFAICT.
> >
> > * The osds performed the expected fsck after restarting. Their logs
> > are spammed with things like
> >
> > 2021-09-21T11:15:23.233+0200 7f85901bd700 -1
> > bluestore(/var/lib/ceph/osd/ceph-1) fsck warning:
> > #174:1e024a6e:::10009663a55.:head# has omap that is not
> > per-pool or pgmeta
> >
> > but that is fully expected AFAIU. Each osd took just under 10 minutes to 
> > fsck:
> >
> > 2021-09-21T11:22:27.188+0200 7f85a3a2bf00  1
> > bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0
> > errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056
> > seconds
> >
> > For reference, this cluster was created many major releases ago (maybe
> > firefly) but osds were probably re-created in luminous.
> > The memory usage was quite normal, we didn't suffer any OOMs.
> >
> > * The active mds restarted into octopus without incident.
> >
> > In summary it was a very smooth upgrade. After a week of observation
> > we'll proceed with more production clusters.
> > For our largest S3 cluster with slow hdds, we expect huge fsck
> > transactions, so will wait for https://github.com/ceph/ceph/pull/42958
> > to be merged before upgrading.
> >
> > Best Regards, and thanks to all the devs for their work,
> >
> > Dan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Why set osd flag to noout during upgrade ?

2021-09-22 Thread Francois Legrand

Hello everybody,

I have a "stupid" question. Why is it recommended in the docs to set the 
osd flag to noout during an upgrade/maintainance (and especially during 
an osd upgrade/maintainance) ?


In my understanding, if an osd goes down, after a while (600s by
default) it's marked out and the cluster will start to rebuild its
content elsewhere in the cluster to maintain the redundancy of the
data. This generates some transfer and load on other osds, but that's
not a big deal!


As soon as the osd is back, it's marked in again and ceph is able to
determine which data is back and stop the recovery to reuse the
unchanged data. Generally, the recovery is about as fast as with the
noout flag (because with noout, the data modified during the down
period still has to be copied to the returning osd).


Thus, is there any other reason apart from limiting the transfer and
the load on other osds during the downtime?
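(For readers following along, a minimal sketch of the commands behind this
question; the interval shown is simply the default mentioned above:)

  # prevent down OSDs from being marked out during the maintenance window
  ceph osd set noout
  # ... do the upgrade/maintenance ...
  ceph osd unset noout
  # the 600s down-to-out interval referred to above
  ceph config get mon mon_osd_down_out_interval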


F

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] High overwrite latency

2021-09-22 Thread Erwin Ceph
Hi,

We do run several Ceph clusters, but one has a strange problem.

It is running Octopus 15.2.14 on 9 (HP 360 Gen 8, 64 GB, 10 Gbps) servers, 48 
OSDs (all 2 TB Samsung SSDs with Bluestore). Monitoring in Grafana shows these 
three latency values 
over 7 days:

ceph_osd_op_r_latency_sum: avg 1.16 ms, max 9.95 ms
ceph_osd_op_w_latency_sum: avg 5.85 ms, max 26.2 ms
ceph_osd_op_rw_latency_sum: avg 110 ms, max 388 ms

Average throughput is around 30 MB/sec read and 40 MB/sec write. Both with 2000 
iops.

On another cluster (hardware almost the same, identical software versions), but 
25% lower load, there the values are:

ceph_osd_op_r_latency_sum: avg 1.09 ms, max 6.55 ms
ceph_osd_op_w_latency_sum: avg 4.46 ms, max 14.4 ms
ceph_osd_op_rw_latency_sum: avg 4.94 ms, max 17.6 ms

I can't find any difference in HBA controller settings, network or 
kernel tuning. Has someone got any ideas?

Regards,
Erwin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Eugen Block

I understand, thanks for sharing!


Quoting Dan van der Ster :


Hi Eugen,

All of our prod clusters are still old school rpm packages managed by
our private puppet manifests. Even our newest pacific pre-prod cluster
is still managed like that.
We have a side project to test and move to cephadm / containers but
that is still a WIP. (Our situation is complicated by the fact that
we'll need to continue puppet managing things like firewall with
cephadm doing the daemon placement).

Cheers, Dan


On Wed, Sep 22, 2021 at 10:32 AM Eugen Block  wrote:


Thanks for the summary, Dan!

I'm still hesitating upgrading our production environment from N to O,
your experience sounds reassuring though. I have one question, did you
also switch to cephadm and containerize all daemons? We haven't made a
decision yet, but I guess at some point we'll have to switch anyway,
so we could also just get over it. :-D We'll need to discuss it with
the team...

Thanks,
Eugen


Quoting Dan van der Ster :

> Dear friends,
>
> This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14,
> successfully, following the procedure at
>  
https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus

> It's a 400TB cluster which is 10% used with 72 osds (block=hdd,
> block.db=ssd) and 40M objects.
>
> * The mons upgraded cleanly as expected.
> * One minor surprise was that the mgrs respawned themselves moments
> after the leader restarted into octopus:
>
> 2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0)
> 16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons
>  
cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256

> in quorum (ranks 0,1,2)
>
> 2021-09-21 10:16:39.046 7fae3caf8700  1 mgr handle_mgr_map respawning
> because set of enabled modules changed!
>
> This didn't create any problems AFAICT.
>
> * The osds performed the expected fsck after restarting. Their logs
> are spammed with things like
>
> 2021-09-21T11:15:23.233+0200 7f85901bd700 -1
> bluestore(/var/lib/ceph/osd/ceph-1) fsck warning:
> #174:1e024a6e:::10009663a55.:head# has omap that is not
> per-pool or pgmeta
>
> but that is fully expected AFAIU. Each osd took just under 10
> minutes to fsck:
>
> 2021-09-21T11:22:27.188+0200 7f85a3a2bf00  1
> bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0
> errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056
> seconds
>
> For reference, this cluster was created many major releases ago (maybe
> firefly) but osds were probably re-created in luminous.
> The memory usage was quite normal, we didn't suffer any OOMs.
>
> * The active mds restarted into octopus without incident.
>
> In summary it was a very smooth upgrade. After a week of observation
> we'll proceed with more production clusters.
> For our largest S3 cluster with slow hdds, we expect huge fsck
> transactions, so will wait for https://github.com/ceph/ceph/pull/42958
> to be merged before upgrading.
>
> Best Regards, and thanks to all the devs for their work,
>
> Dan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Etienne Menguy
Hello,

From my experience, I see three reasons : 
- You don’t want to recover data if you already have them on a down OSD, 
rebalancing can have a big impact on performance
- If upgrade/maintenance goes wrong you will want to focus on this issue and 
not have to deal with things done by Ceph meanwhile. 
- During an upgrade you have an ‘unusual’ cluster with different versions. It’s 
supposed to work, but you probably want to keep it ‘boring’.

-
Etienne Menguy
etienne.men...@croit.io




> On 22 Sep 2021, at 11:51, Francois Legrand  wrote:
> 
> Hello everybody,
> 
> I have a "stupid" question. Why is it recommended in the docs to set the osd 
> flag to noout during an upgrade/maintainance (and especially during an osd 
> upgrade/maintainance) ?
> 
> In my understanding, if an osd goes down, after a while (600s by default) 
> it's marked out and the cluster will start to rebuild it's content elsewhere 
> in the cluster to maintain the redondancy of the datas. This generate some 
> transfer and load on other osds, but that's not a big deal !
> 
> As soon as the osd is back, it's marked in again and ceph is able to 
> determine which data is back and stop the recovery to reuse the unchanged 
> datas which are back. Generally, the recovery is as fast as with noout flag 
> (because with noout, the data modified during the down period still have be 
> copied to the back osd).
> 
> Thus is there an other reason apart from limiting the transfer and others 
> osds load durind the downtime ?
> 
> F
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Dan van der Ster
Yeah you don't want to deal with backfilling while the cluster is
upgrading. At best it can delay the upgrade, at worst mixed version
backfilling has (rarely) caused issues in the past.

We additionally `set noin` and disable the balancer: `ceph balancer off`.
The former prevents broken osds from re-entering the cluster, and both of
these similarly prevent backfilling from starting mid-upgrade.
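As a sketch, the full set of pre-upgrade commands described above would look
something like this, with the reverse once the upgrade is done:

  # before the upgrade
  ceph osd set noout
  ceph osd set noin
  ceph balancer off
  # after all daemons are upgraded
  ceph osd unset noout
  ceph osd unset noin
  ceph balancer on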


.. Dan


On Wed, 22 Sep 2021, 12:18 Etienne Menguy,  wrote:

> Hello,
>
> From my experience, I see three reasons :
> - You don’t want to recover data if you already have them on a down OSD,
> rebalancing can have a big impact on performance
> - If upgrade/maintenance goes wrong you will want to focus on this issue
> and not have to deal with things done by Ceph meanwhile.
> - During an upgrade you have an ‘unusual’ cluster with different versions.
> It’s supposed to work, but you probably want to keep it ‘boring’.
>
> -
> Etienne Menguy
> etienne.men...@croit.io
>
>
>
>
> > On 22 Sep 2021, at 11:51, Francois Legrand  wrote:
> >
> > Hello everybody,
> >
> > I have a "stupid" question. Why is it recommended in the docs to set the
> osd flag to noout during an upgrade/maintainance (and especially during an
> osd upgrade/maintainance) ?
> >
> > In my understanding, if an osd goes down, after a while (600s by
> default) it's marked out and the cluster will start to rebuild it's content
> elsewhere in the cluster to maintain the redondancy of the datas. This
> generate some transfer and load on other osds, but that's not a big deal !
> >
> > As soon as the osd is back, it's marked in again and ceph is able to
> determine which data is back and stop the recovery to reuse the unchanged
> datas which are back. Generally, the recovery is as fast as with noout flag
> (because with noout, the data modified during the down period still have be
> copied to the back osd).
> >
> > Thus is there an other reason apart from limiting the transfer and
> others osds load durind the downtime ?
> >
> > F
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?

2021-09-22 Thread Kai Stian Olstad

On 21.09.2021 09:11, Kobi Ginon wrote:

for sure the balancer affects the status


Of course, but setting several PGs to degraded is something else.



I doubt that your customers will be writing so many objects at the same
rate as the test.


I only need 2 hosts running rados bench to get several PGs into a
degraded state.




maybe you need to play with the balancer configuration a bit.


Maybe, but a balancer should not set the cluster health to warning with 
several PGs in a degraded state.
It should be possible to do this cleanly: copy the data and delete the 
source when the copy is OK.




Could start with this
The balancer mode can be changed to crush-compat mode, which is backward
compatible with older clients, and will make small changes to the data
distribution over time to ensure that OSDs are equally utilized.
https://docs.ceph.com/en/latest/rados/operations/balancer/


I will probably just turn it off before I set the cluster in production.



Side note: I am indeed using an old version of Ceph (Nautilus) with the
balancer configured and run rados benchmarks, but did not see such a problem.
On the other hand I am not using pg_autoscaler;
I set the pools' PG numbers in advance according to an assumption of the
percentage each pool will be using.
Could be that you do use this mode and the combination of autoscaler
and balancer is what reveals this issue.


If you look at my initial post you will see that the pool is created with 
--autoscale-mode=off
The cluster is running 16.2.5 and is empty except for one pool with one 
PG created by Cephadm.



--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer vs. Autoscaler

2021-09-22 Thread Dan van der Ster
To get an idea how much work is left, take a look at `ceph osd pool ls
detail`. There should be pg_num_target... The osds will merge or split PGs
until pg_num matches that value.
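As an illustration (the grep pattern is just a convenience to pick out the
relevant fields from the pool lines):

  ceph osd pool ls detail | grep -E 'pg_num|pgp_num'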

.. Dan


On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza,  wrote:

> Hi everyone,
>
> I had the autoscale_mode set to "on" and the autoscaler went to work and
> started adjusting the number of PGs in that pool. Since this implies a
> huge shift in data, the reweights that the balancer had carefully
> adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
> become nearful (we sadly have very different sized OSDs).
>
> Now apparently both manager modules, balancer and pg_autoscaler, have
> the same threshold for operation, namely target_max_misplaced_ratio. So
> the balancer won't become active as long as the pg_autoscaler is still
> adjusting the number of PGs.
>
> I already set the autoscale_mode to "warn" on all pools, but apparently
> the autoscaler is determined to finish what it started.
>
> Is there any way to pause the autoscaler so the balancer has a chance of
> fixing the reweights? Because even in manual mode (ceph balancer
> optimize), the balancer won't compute a plan when the misplaced ratio is
> higher than target_max_misplaced_ratio.
>
> I know about "ceph osd reweight-*", but they adjust the reweights
> (visible in "ceph osd tree"), whereas the balancer adjusts the "compat
> weight-set", which I don't know how to convert back to the old-style
> reweights.
>
> Best regards,
> Jan-Philipp
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Modify pgp number after pg_num increased

2021-09-22 Thread Szabo, Istvan (Agoda)
Hi,

By default, in newer versions of Ceph, when you increase the pg_num the
cluster will slowly increase the pgp_num up to the value of the pg_num.
I've increased the EC data pool's pg_num from 32 to 128, but 1 node has been
added to the cluster and it's very slow.

pool 28 'hkg.rgw.buckets.data' erasure profile data-ec size 6 min_size 5
crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 55 pgp_num_target 128
autoscale_mode warn last_change 16443 lfor 0/0/14828 flags hashpspool
stripe_width 16384 application rgw

At the moment, 55 of the 128 PGs have been done.
Is it safe to set the pgp_num to 64 at this stage and wait until the data has
been rebalanced to the newly added node?
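(If one were to do that, the command would be along these lines, using the
pool shown above; whether it is advisable is exactly the question being asked:)

  # set pgp_num explicitly instead of waiting for the gradual increase
  ceph osd pool set hkg.rgw.buckets.data pgp_num 64
  # watch progress
  ceph osd pool ls detail | grep hkg.rgw.buckets.data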

Thank you
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Successful Upgrade from 14.2.22 to 15.2.14

2021-09-22 Thread Andras Pataki

Hi Dan,

This is excellent to hear - we've also been a bit hesitant to upgrade 
from Nautilus (which has been working so well for us).  One question: 
did you/would you consider upgrading straight to Pacific from Nautilus?  
Can you share your thoughts that led you to Octopus first?


Thanks,

Andras


On 9/21/21 06:09, Dan van der Ster wrote:

Dear friends,

This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14,
successfully, following the procedure at
https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus
It's a 400TB cluster which is 10% used with 72 osds (block=hdd,
block.db=ssd) and 40M objects.

* The mons upgraded cleanly as expected.
* One minor surprise was that the mgrs respawned themselves moments
after the leader restarted into octopus:

2021-09-21T10:16:38.992219+0200 mon.cephdwight-mon-1633994557 (mon.0)
16 : cluster [INF] mon.cephdwight-mon-1633994557 is new leader, mons
cephdwight-mon-1633994557,cephdwight-mon-f7df6839c6,cephdwight-mon-d8788e3256
in quorum (ranks 0,1,2)

2021-09-21 10:16:39.046 7fae3caf8700  1 mgr handle_mgr_map respawning
because set of enabled modules changed!

This didn't create any problems AFAICT.

* The osds performed the expected fsck after restarting. Their logs
are spammed with things like

2021-09-21T11:15:23.233+0200 7f85901bd700 -1
bluestore(/var/lib/ceph/osd/ceph-1) fsck warning:
#174:1e024a6e:::10009663a55.:head# has omap that is not
per-pool or pgmeta

but that is fully expected AFAIU. Each osd took just under 10 minutes to fsck:

2021-09-21T11:22:27.188+0200 7f85a3a2bf00  1
bluestore(/var/lib/ceph/osd/ceph-1) _fsck_on_open <<>> with 0
errors, 197756 warnings, 197756 repaired, 0 remaining in 475.083056
seconds

For reference, this cluster was created many major releases ago (maybe
firefly) but osds were probably re-created in luminous.
The memory usage was quite normal, we didn't suffer any OOMs.

* The active mds restarted into octopus without incident.

In summary it was a very smooth upgrade. After a week of observation
we'll proceed with more production clusters.
For our largest S3 cluster with slow hdds, we expect huge fsck
transactions, so will wait for https://github.com/ceph/ceph/pull/42958
to be merged before upgrading.

Best Regards, and thanks to all the devs for their work,

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Change max backfills

2021-09-22 Thread Pascal Weißhaupt
Hi,



I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max 
backfills via 



ceph tell 'osd.*' injectargs '--osd-max-backfills 1'



I get no output:



root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
osd.0: {}
osd.1: {}
osd.2: {}
osd.3: {}
osd.4: {}
osd.5: {}
osd.6: {}
osd.7: {}
osd.8: {}
osd.9: {}
osd.10: {}
osd.11: {}
osd.12: {}
osd.13: {}
osd.14: {}
osd.15: {}
osd.16: {}
osd.17: {}
osd.18: {}
osd.19: {}



If I remember correctly, with Ceph 15 I got something like "changed max 
backfills to 1" or so.



Is that command not supported anymore or is the empty output correct?



Regards,

Pascal
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change max backfills

2021-09-22 Thread Etienne Menguy
Hi,

In the past you got this output if the value was not changing; try with another 
value.
I don’t know if things changed with latest Ceph version.
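One way to double-check, as a sketch: query a single OSD for the value that is
actually in effect after the injectargs call:

  ceph tell osd.0 config get osd_max_backfills
  # or the same via the central config view
  ceph config show osd.0 osd_max_backfills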

-
Etienne Menguy
etienne.men...@croit.io




> On 22 Sep 2021, at 15:34, Pascal Weißhaupt  
> wrote:
> 
> Hi,
> 
> 
> 
> I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max 
> backfills via 
> 
> 
> 
> ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
> 
> 
> 
> I get no output:
> 
> 
> 
> root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
> osd.0: {}
> osd.1: {}
> osd.2: {}
> osd.3: {}
> osd.4: {}
> osd.5: {}
> osd.6: {}
> osd.7: {}
> osd.8: {}
> osd.9: {}
> osd.10: {}
> osd.11: {}
> osd.12: {}
> osd.13: {}
> osd.14: {}
> osd.15: {}
> osd.16: {}
> osd.17: {}
> osd.18: {}
> osd.19: {}
> 
> 
> 
> If I remember correctly, with Ceph 15 I got something like "changed max 
> backfills to 1" or so.
> 
> 
> 
> Is that command not supported anymore or is the empty output correct?
> 
> 
> 
> Regards,
> 
> Pascal
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change max backfills

2021-09-22 Thread Pascal Weißhaupt
God damn...you are absolutely right - my bad.



Sorry and thanks for that...



-Original Message-
From: Etienne Menguy 
Sent: Wednesday, 22 September 2021 15:48
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Change max backfills



Hi,

In the past you had this output if value was not changing, try with another 
value.
I don’t know if things changed with latest Ceph version.

-
Etienne Menguy
etienne.men...@croit.io




> On 22 Sep 2021, at 15:34, Pascal Weißhaupt  
> wrote:
> 
> Hi,
> 
> 
> 
> I recently upgraded from Ceph 15 to Ceph 16 and when I want to change the max 
> backfills via 
> 
> 
> 
> ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
> 
> 
> 
> I get no output:
> 
> 
> 
> root@pve01:~# ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
> osd.0: {}
> osd.1: {}
> osd.2: {}
> osd.3: {}
> osd.4: {}
> osd.5: {}
> osd.6: {}
> osd.7: {}
> osd.8: {}
> osd.9: {}
> osd.10: {}
> osd.11: {}
> osd.12: {}
> osd.13: {}
> osd.14: {}
> osd.15: {}
> osd.16: {}
> osd.17: {}
> osd.18: {}
> osd.19: {}
> 
> 
> 
> If I remember correctly, with Ceph 15 I got something like "changed max 
> backfills to 1" or so.
> 
> 
> 
> Is that command not supported anymore or is the empty output correct?
> 
> 
> 
> Regards,
> 
> Pascal
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Modify pgp number after pg_num increased

2021-09-22 Thread Szabo, Istvan (Agoda)
That's already been increased to 4.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block  
Sent: Wednesday, September 22, 2021 2:51 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Modify pgp number after pg_num increased

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi,

IIRC in a different thread you pasted your max-backfill config and it was the 
lowest possible value (1), right? That's why your backfill is slow.


Zitat von "Szabo, Istvan (Agoda)" :

> Hi,
>
> By default in the newer versions of ceph when you increase the pg_num 
> the cluster will start to increase the pgp_num slowly up to the value 
> of the pg_num.
> I've increased the ec-code data pool from 32 to 128 but 1 node has 
> been added to the cluster and it's very slow.
>
> pool 28 'hkg.rgw.buckets.data' erasure profile data-ec size 6 min_size 
> 5 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 55 
> pgp_num_target 128 autoscale_mode warn last_change 16443 lfor
> 0/0/14828 fl
> ags hashpspool stripe_width 16384 application rgw
>
> At the moment there has been done 55 out of the 128 pg.
> Is it safe to set the pgp_num at this stage to 64 and wait until the 
> data will be rebalanced to the newly added node?
>
> Thank you
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] IO500 SC’21 Call for Submission

2021-09-22 Thread IO500 Committee

Stabilization period: Friday, 17th September - Friday, 1st October

Submission deadline: Monday, 1st November 2021 AoE

The IO500 [1] is now accepting and encouraging submissions for the 
upcoming 9th semi-annual IO500 list, in conjunction with SC'21. Once 
again, we are also accepting submissions to the 10 Node Challenge to 
encourage the submission of small-scale results. The new ranked lists 
will be announced via live-stream at a virtual session during "The IO500 
and the Virtual Institute of I/O" BoF [3]. We hope to see many new 
results.


What's New
Since ISC21, the IO500 follows a two-staged approach. First, there will 
be a two-week stabilization period during which we encourage the 
community to verify that the benchmark runs properly on a variety of 
storage systems. During this period the benchmark may be updated based 
upon feedback from the community. The final benchmark will then be 
released. We expect that submissions compliant with the rules made 
during the stabilization period will be valid as a final submission 
unless a significant defect is found.


We are now creating a more detailed schema to describe the hardware and 
software of the system under test and provide the first set of tools to 
ease capturing of this information for inclusion with the submission. 
Further details will be released on the submission page [2].


We are evaluating the inclusion of optional test phases for additional 
key workloads - split easy/hard find phases, 4KB and 1MB random 
read/write phases, and concurrent metadata operations. This is called an 
extended run. At the moment, we collect the information to verify that 
additional phases do not significantly impact the results of the 
standard IO500 run. We encourage every participant to submit results 
from both a standard run and an extended run to facilitate comparisons 
between the existing and new benchmark phases. In a future release, we 
may include some or all of these results as part of the standard 
benchmark. The extended results are not currently included in the 
scoring of any ranked list.

Background

The benchmark suite is designed to be easy to run and the community has 
multiple active support channels to help with any questions. Please note 
that submissions of all sizes are welcome; the site has customizable 
sorting, so it is possible to submit on a small system and still get a 
very good per-client score, for example. Additionally, the list is about 
much more than just the raw rank; all submissions help the community by 
collecting and publishing a wider corpus of data. More details below.


Following the success of the Top500 in collecting and analyzing 
historical trends in supercomputer technology and evolution, the IO500 
was created in 2017, published its first list at SC17, and has grown 
continually since then. The need for such an initiative has long been 
known within High-Performance Computing; however, defining appropriate 
benchmarks has long been challenging. Despite this challenge, the 
community, after long and spirited discussion, finally reached a 
consensus on a suite of benchmarks and a metric for resolving the scores 
into a single ranking.


The multi-fold goals of the benchmark suite are as follows:
- Maximizing simplicity in running the benchmark suite
- Encouraging optimization and documentation of tuning parameters for 
performance
- Allowing submitters to highlight their "hero run" performance numbers
- Forcing submitters to simultaneously report performance for challenging 
IO patterns.
Specifically, the benchmark suite includes a hero run of both IOR and 
MDTest configured however possible to maximize performance and 
establish an upper bound for performance. It also includes an IOR and 
MDTest run with highly constrained parameters forcing a difficult usage 
pattern in an attempt to determine a lower-bound. Finally, it includes a 
namespace search as this has been determined to be a highly sought-after 
feature in HPC storage systems that has historically not been 
well-measured. Submitters are encouraged to share their tuning insights 
for publication.


The goals of the community are also multi-fold:
- Gather historical data for the sake of analysis and to aid predictions 
of storage futures
- Collect tuning data to share valuable performance optimizations across 
the community
- Encourage vendors and designers to optimize for workloads beyond "hero 
runs"
- Establish bounded expectations for users, procurers, and administrators

10 Node I/O Challenge
The 10 Node Challenge is conducted using the regular IO500 benchmark, 
however, with the rule that exactly 10 client nodes must be used to run 
the benchmark. You may use any shared storage with any number of 
servers. When submitting for the IO500 list, you can opt-in for 
"Participate in the 10 compute node challenge only", then we will not 
include the results in the ranked list. Other 10-node submissions 
will be included in the full list and in the ranked list.

[ceph-users] "Remaining time" under-estimates by 100x....

2021-09-22 Thread Harry G. Coin
Is there a way to re-calibrate the various 'global recovery event' and 
related 'remaining time' estimators?


For the last three days I've been assured that a 19h event will be over 
in under 3 hours...


Previously I think Microsoft held the record for the most incorrect 
'please wait' progress indicators.  Ceph may take that crown this year, 
unless...


Thanks

Harry


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
We'd worked on pushing a change to fix
https://tracker.ceph.com/issues/50526 for a deadlock in remoto here:
https://github.com/alfredodeza/remoto/pull/63

A new version, 1.2.1, was built to help with this. With the Ceph
release 16.2.6 (at least), we see 1.1.4 is again part of the
containers. Looking at EPEL8, all that is built now is 1.1.4. We're
not sure what happened, but would it be possible to get 1.2.1 pushed
to EPEL8 again, and figure out why it was removed? We'd then need a
rebuild of the 16.2.6 containers to 'fix' this bug.

This is definitely a high urgency bug, as it impacts any deployments
with medium to large counts of OSDs or split db/wal devices, like many
modern deployments.

https://koji.fedoraproject.org/koji/packageinfo?packageID=18747
https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/
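(As a way to check which remoto actually shipped in a given image, something
like the following should work; the image tag and package name here are
assumptions based on the above, not verified output:)

  # query the remoto rpm inside the 16.2.6 container image
  podman run --rm --entrypoint rpm quay.io/ceph/ceph:v16.2.6 -q python3-remoto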

Respectfully,
David Orman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
I'm wondering if this was installed using pip/pypi before, and now
switched to using EPEL? That would explain it - 1.2.1 may never have
been pushed to EPEL.

David

On Wed, Sep 22, 2021 at 11:26 AM David Orman  wrote:
>
> We'd worked on pushing a change to fix
> https://tracker.ceph.com/issues/50526 for a deadlock in remoto here:
> https://github.com/alfredodeza/remoto/pull/63
>
> A new version, 1.2.1, was built to help with this. With the Ceph
> release 16.2.6 (at least), we see 1.1.4 is again part of the
> containers. Looking at EPEL8, all that is built now is 1.1.4. We're
> not sure what happened, but would it be possible to get 1.2.1 pushed
> to EPEL8 again, and figure out why it was removed? We'd then need a
> rebuild of the 16.2.6 containers to 'fix' this bug.
>
> This is definitely a high urgency bug, as it impacts any deployments
> with medium to large counts of OSDs or split db/wal devices, like many
> modern deployments.
>
> https://koji.fedoraproject.org/koji/packageinfo?packageID=18747
> https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/
>
> Respectfully,
> David Orman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] One PG keeps going inconsistent (stat mismatch)

2021-09-22 Thread Simon Ironside

Hi All,

I have a recurring single PG that keeps going inconsistent. A scrub is 
enough to pick up the problem. The primary OSD log shows something like:


2021-09-22 18:08:18.502 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 
1.3ff scrub starts
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff scrub : stat mismatch, got 3243/3244 objects, 67/67 clones, 
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2021-09-22 18:08:18.880 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff scrub 1 errors


It always repairs ok when I run ceph pg repair 1.3ff:

2021-09-22 18:08:47.533 7f5bdcb11700  0 log_channel(cluster) log [DBG] : 
1.3ff repair starts
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff repair : stat mismatch, got 3243/3244 objects, 67/67 clones, 
3243/3244 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 1/1 
whiteouts, 13247338496/13251516416 bytes, 0/0 manifest objects, 0/0 
hit_set_archive bytes.
2021-09-22 18:15:58.218 7f5bdcb11700 -1 log_channel(cluster) log [ERR] : 
1.3ff repair 1 errors, 1 fixed


It's happened multiple times and always with the same PG number, no 
other PG is doing this. It's a Nautilus v14.2.5 cluster using spinning 
disks with separate DB/WAL on SSDs. I don't believe there's an 
underlying hardware problem but in a bid to make sure I reweighted the 
primary OSD for this PG to 0 to get it to move to another disk. The 
backfilling is complete but on manually scrubbing the PG again it showed 
inconsistent as above.


In case it's relevant the only major activity I've performed recently 
has been gradually adding new OSD nodes and disks to the cluster, prior 
to this it had been up without issue for well over a year. The primary 
OSD for this PG was on the first new OSD I added when this issue first 
presented. The inconsistent PG issue didn't start happening immediately 
after adding it though, it was some weeks later.


Any suggestions as to how I can get rid of this problem?
Should I try reweighting the other two OSDs for this PG to 0?
Or is this a known bug that requires some specific work or just an upgrade?
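(For anyone debugging something similar, a hedged sketch of non-destructive
commands that can help inspect the inconsistency before deciding on a repair:)

  # re-run a deep scrub on the affected PG
  ceph pg deep-scrub 1.3ff
  # list the inconsistent objects recorded by the last scrub, if any
  rados list-inconsistent-obj 1.3ff --format=json-pretty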

Thanks,
Simon.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Anthony D'Atri

Indeed.  In a large enough cluster, even a few minutes of extra 
backfill/recovery per OSD adds up.  Say you have 100 OSD nodes, and just 3 
minutes of unnecessary backfill per node.  That prolongs your upgrade by 5 hours.



> Yeah you don't want to deal with backfilling while the cluster is
> upgrading. At best it can delay the upgrade, at worst mixed version
> backfilling has (rarely) caused issues in the past.
> 
> We additionally `set noin` and disable the balancer: `ceph balancer off`.
> The former prevents broken osds from re-entering the cluster, and both of
> these similarly prevent backfilling from starting mid-upgrade.
> 
> 
> .. Dan
> 
> 
> On Wed, 22 Sep 2021, 12:18 Etienne Menguy,  wrote:
> 
>> Hello,
>> 
>> From my experience, I see three reasons :
>> - You don’t want to recover data if you already have them on a down OSD,
>> rebalancing can have a big impact on performance
>> - If upgrade/maintenance goes wrong you will want to focus on this issue
>> and not have to deal with things done by Ceph meanwhile.
>> - During an upgrade you have an ‘unusual’ cluster with different versions.
>> It’s supposed to work, but you probably want to keep it ‘boring’.
>> 
>> -
>> Etienne Menguy
>> etienne.men...@croit.io
>> 
>> 
>> 
>> 
>>> On 22 Sep 2021, at 11:51, Francois Legrand  wrote:
>>> 
>>> Hello everybody,
>>> 
>>> I have a "stupid" question. Why is it recommended in the docs to set the
>> osd flag to noout during an upgrade/maintainance (and especially during an
>> osd upgrade/maintainance) ?
>>> 
>>> In my understanding, if an osd goes down, after a while (600s by
>> default) it's marked out and the cluster will start to rebuild it's content
>> elsewhere in the cluster to maintain the redondancy of the datas. This
>> generate some transfer and load on other osds, but that's not a big deal !
>>> 
>>> As soon as the osd is back, it's marked in again and ceph is able to
>> determine which data is back and stop the recovery to reuse the unchanged
>> datas which are back. Generally, the recovery is as fast as with noout flag
>> (because with noout, the data modified during the down period still have be
>> copied to the back osd).
>>> 
>>> Thus is there an other reason apart from limiting the transfer and
>> others osds load durind the downtime ?
>>> 
>>> F
>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why set osd flag to noout during upgrade ?

2021-09-22 Thread Frank Schilder
In addition, from my experience:

I often set noout, norebalance and nobackfill before doing maintenance. This 
greatly speeds up peering (when adding new OSDs) and reduces unnecessary load 
from all daemons. In particular, if there is heavy client IO going on at the 
same time, the ceph daemons are much more stable with these settings. I had, 
after shutting down one host, more OSDs crashing under combined 
peering+backfill load causing a cascade of even more OSDs crashing. The above 
settings have prevented such things from happening.

As mentioned before, it also avoids unnecessary rebuilds of objects that are 
not even modified during the service window. Having an OSD down even for 30 
minutes usually requires only a few seconds to minutes to catch up with the 
latest diffs of modified objects instead of starting a full rebuild of all 
objects regardless of their modification state.
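As a sketch, the flag set described above, set before and cleared after the
service window:

  ceph osd set noout
  ceph osd set norebalance
  ceph osd set nobackfill
  # ... maintenance ...
  ceph osd unset nobackfill
  ceph osd unset norebalance
  ceph osd unset noout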

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Etienne Menguy 
Sent: 22 September 2021 12:17:39
To: ceph-users
Subject: [ceph-users] Re: Why set osd flag to noout during upgrade ?

Hello,

From my experience, I see three reasons :
- You don’t want to recover data if you already have them on a down OSD, 
rebalancing can have a big impact on performance
- If upgrade/maintenance goes wrong you will want to focus on this issue and 
not have to deal with things done by Ceph meanwhile.
- During an upgrade you have an ‘unusual’ cluster with different versions. It’s 
supposed to work, but you probably want to keep it ‘boring’.

-
Etienne Menguy
etienne.men...@croit.io




> On 22 Sep 2021, at 11:51, Francois Legrand  wrote:
>
> Hello everybody,
>
> I have a "stupid" question. Why is it recommended in the docs to set the osd 
> flag to noout during an upgrade/maintainance (and especially during an osd 
> upgrade/maintainance) ?
>
> In my understanding, if an osd goes down, after a while (600s by default) 
> it's marked out and the cluster will start to rebuild it's content elsewhere 
> in the cluster to maintain the redondancy of the datas. This generate some 
> transfer and load on other osds, but that's not a big deal !
>
> As soon as the osd is back, it's marked in again and ceph is able to 
> determine which data is back and stop the recovery to reuse the unchanged 
> datas which are back. Generally, the recovery is as fast as with noout flag 
> (because with noout, the data modified during the down period still have be 
> copied to the back osd).
>
> Thus is there an other reason apart from limiting the transfer and others 
> osds load durind the downtime ?
>
> F
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remoto 1.1.4 in Ceph 16.2.6 containers

2021-09-22 Thread David Orman
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2021-4b2736a28c

^^ if people want to test and provide feedback for a potential merge
to EPEL8 stable.

David

On Wed, Sep 22, 2021 at 11:43 AM David Orman  wrote:
>
> I'm wondering if this was installed using pip/pypi before, and now
> switched to using EPEL? That would explain it - 1.2.1 may never have
> been pushed to EPEL.
>
> David
>
> On Wed, Sep 22, 2021 at 11:26 AM David Orman  wrote:
> >
> > We'd worked on pushing a change to fix
> > https://tracker.ceph.com/issues/50526 for a deadlock in remoto here:
> > https://github.com/alfredodeza/remoto/pull/63
> >
> > A new version, 1.2.1, was built to help with this. With the Ceph
> > release 16.2.6 (at least), we see 1.1.4 is again part of the
> > containers. Looking at EPEL8, all that is built now is 1.1.4. We're
> > not sure what happened, but would it be possible to get 1.2.1 pushed
> > to EPEL8 again, and figure out why it was removed? We'd then need a
> > rebuild of the 16.2.6 containers to 'fix' this bug.
> >
> > This is definitely a high urgency bug, as it impacts any deployments
> > with medium to large counts of OSDs or split db/wal devices, like many
> > modern deployments.
> >
> > https://koji.fedoraproject.org/koji/packageinfo?packageID=18747
> > https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/p/
> >
> > Respectfully,
> > David Orman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer vs. Autoscaler

2021-09-22 Thread Richard Bade
If you look at the current pg_num in that pool ls detail command that
Dan mentioned, you can set the pool's pg_num to that current value,
which will effectively pause the PG changes. I did this recently
when decreasing the number of PGs in a pool, which took several weeks
to complete. This let me get some other maintenance done before
setting the pg_num back to the target number again.
This works well for reduction, but I'm not sure if it works as well for
an increase, as I think the pg_num may reach the target much faster and
then just the pgp_num changes until they match.
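A minimal sketch of that trick, assuming a pool named testpool whose pg_num is
currently part-way through a change (the placeholders are to be filled in from
the ls detail output):

  # see the current vs. target values
  ceph osd pool ls detail | grep testpool
  # pin pg_num at its current value to pause further merging/splitting
  ceph osd pool set testpool pg_num <current value>
  # later, resume by setting the original target again
  ceph osd pool set testpool pg_num <target value>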

Rich

On Wed, 22 Sept 2021 at 23:06, Dan van der Ster  wrote:
>
> To get an idea how much work is left, take a look at `ceph osd pool ls
> detail`. There should be pg_num_target... The osds will merge or split PGs
> until pg_num matches that value.
>
> .. Dan
>
>
> On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza,  wrote:
>
> > Hi everyone,
> >
> > I had the autoscale_mode set to "on" and the autoscaler went to work and
> > started adjusting the number of PGs in that pool. Since this implies a
> > huge shift in data, the reweights that the balancer had carefully
> > adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
> > become nearful (we sadly have very different sized OSDs).
> >
> > Now apparently both manager modules, balancer and pg_autoscaler, have
> > the same threshold for operation, namely target_max_misplaced_ratio. So
> > the balancer won't become active as long as the pg_autoscaler is still
> > adjusting the number of PGs.
> >
> > I already set the autoscale_mode to "warn" on all pools, but apparently
> > the autoscaler is determined to finish what it started.
> >
> > Is there any way to pause the autoscaler so the balancer has a chance of
> > fixing the reweights? Because even in manual mode (ceph balancer
> > optimize), the balancer won't compute a plan when the misplaced ratio is
> > higher than target_max_misplaced_ratio.
> >
> > I know about "ceph osd reweight-*", but they adjust the reweights
> > (visible in "ceph osd tree"), whereas the balancer adjusts the "compat
> > weight-set", which I don't know how to convert back to the old-style
> > reweights.
> >
> > Best regards,
> > Jan-Philipp
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io