[ceph-users] Ceph v.15.2.15 (Octopus, stable) - OSD_SCRUB_ERRORS: 6 scrub errors

2022-04-13 Thread PenguinOS

Hello,

My Ceph cluster with 3 nodes is showing a HEALTH_ERR, with the following 
errors:


 * OSD_SCRUB_ERRORS: 6 scrub errors
 * PG_DAMAGED: Possible data damage: 6 pgs inconsistent
 * CEPHADM_FAILED_DAEMON: 3 failed cephadm daemon(s)
 * MON_CLOCK_SKEW: clock skew detected on mon.ceph3
 * MON_DOWN: 1/3 mons down, quorum ceph2,ceph3
 * OSD_NEARFULL: 4 nearfull osd(s)
 * PG_NOT_DEEP_SCRUBBED: 2 pgs not deep-scrubbed in time

All 18 OSDs are up though, and I don't see any errors in the servers' 
dmesg logs pointing to hard drive issues.


The cluster's status page is currently showing: Scrubbing: Active

Is the problem recoverable?


Thanks,

Mike

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using CephFS in High Performance (and Throughput) Compute Use Cases

2022-04-13 Thread Mark Nelson

Hi Manuel!


I'm the one that submitted the IO500 results for Red Hat.


Ok, so a couple of things.  First, be aware that vendors are not required 
to use any form of replication whatsoever for IO500 test 
submissions.  Our results are thus using 1x replication. :) But!  2x 
should only hurt your write numbers, and maybe not as much if you are 
front-end network limited.  EC will likely hurt fairly badly.


Next, I used ephemeral pinning for the easy tests with something like 
30-40 active/active MDS daemons.  I tested up to 100 (10 per server!), but 
beyond about 30-40, contention starts becoming a really big problem.  
Ephemeral pinning helped improve performance on the easy tests but 
actually hurt on the hard tests, where we have to split and export dirfrags.
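
For reference, the setup was along these lines (a rough sketch from memory; 
the filesystem name, directory path, and MDS count are illustrative, not the 
exact values from the runs):

```
# raise the number of active MDS ranks (count illustrative)
ceph fs set cephfs max_mds 40

# spread subdirectories of the io500 working directory across ranks
# via distributed ephemeral pinning
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/io500
```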


AMD (dual socket) nodes may be a bit more challenging.  There was a 
presentation at the Ceph BOF at SC2021 by Andras Pataki at the Flatiron 
Institute about running Ceph on dual-socket AMD Rome setups with large 
numbers of NVMe drives.  He explained that they were seeing performance 
variations when running all of the NVMe drives together and believes 
they tracked it down to PCIe scheduling issues.  He noted that they 
could get good consistent throughput to one device if the "Preferred 
I/O" bios setting was set, but at even further expense to all of the 
other devices in the system. I'm not sure if this is simply due to using 
some of the PCIe lanes for cross-socket communication or a bigger issue 
with the controllers.


Finally, the results I got across many tests were fairly chaotic.  When 
you have 20-30 active/active MDSes, the (hard) benchmark results end up 
dominated by how quickly dynamic subtree partitioning takes place, and 
it turns out that I could essentially DDoS the authoritative MDS for the 
parent directory, preventing dirfrag exports from happening at all.  Even 
after the splits take place, there are these performance oscillations 
that may stem from continued dirfrag splits (or something else!).  I 
also saw that the primary MDS was extremely busy with encoding of 
subtrees for journal writes.  I had an experimental PR floating around 
that pre-created and exported dirfrags based on an "expected size" 
directory xattr but we never ended up merging it. Zheng also made a PR 
to remove the subtree map from journal updates that may be a big win 
here too, but that also never merged.


The other big limitation I saw was that we had extremely 
inconsistent client performance, and the IO500 penalizes you heavily for 
that.  Some clients were taking nearly twice as long to complete their 
work as others, and this can really drag your score down.


Ultimately, the results that were sent to the IO500 were the best from 
something like 10 practice runs (and that was after something like 200 
debugging runs), and there was pretty high variation between them.  I was 
pretty happy with the numbers we got, but if we could fix some of the 
issues encountered, I suspect we could have gotten an overall score 
2x-3x higher, and likely far more consistently.



Mark


On 4/13/22 04:56, Manuel Holtgrewe wrote:

Dear all,

I want to revive this old thread. I now have the following hardware at hand:

- 8 servers, each with
  - AMD 7413 processors (24 cores, 48 hw threads)
  - 260 GB of RAM
  - 10 NVMe drives
  - dual 25GbE towards the clients and
  - dual 100GbE for the cluster network

The 2x25GbE NICs go into a dedicated VLT switch pair which is connected
upstream with a total of 4x100GbE DAC, so the network path from the Ceph
cluster to my clients is limited to about 40GB/sec in network bandwidth.

I realize that the docs recommend against splitting frontend/cluster
network but I can change this later on.

My focus is not on having multiple programs write to the same file, so
lazy I/O is not that interesting for me, I believe. I would be happy to
later run a patched version of io500, though. I found [1] from croit to be
quite useful.

The vendor gives 3.5GB/sec sequential write and 7GB/sec sequential read per
disk (which I usually read as "it won't get faster but real-world
performance is a different pair of shoes"). So the theoretical maximum per
node would be 35GB/sec write and 70GB/sec read which I will never achieve,
but also the 2x25GbE network should not be saturated immediately either.

I've done some fio benchmarks of the NVMe drives and get ~2GB/sec write
performance when run like this with NP=16 threads. I have attached a copy of
the output.

```
# fio --filename=/dev/nvme0n1 --direct=1 --fsync=1 --rw=write --bs=4k \
    --numjobs=$NP --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=4k-sync-write-$NP
```

I'm running Ceph 16.2.7 deployed via cephadm. Client and server run on
Rocky Linux 8, clients currently run kernel 4.18.0-348.7.1.el8_5.x86_64
while the servers run 4.18.0-348.2.1.el8_5.x86_64.

I changed the following configuration options from their defaults:

osd_memory_target   osd:16187498659
mds_cache_memory_limit  global
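
(For reference, overrides like these are applied with `ceph config set`; a
minimal sketch, where the mds_cache_memory_limit value is purely illustrative
since it was cut off above:)

```
ceph config set osd osd_memory_target 16187498659
ceph config set global mds_cache_memory_limit 17179869184   # illustrative value
```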

[ceph-users] Cephadm + OpenStack Keystone Authentication

2022-04-13 Thread Marcus Bahn
Hello everyone, 

I'm currently having a problem using cephadm to integrate the RadosGW and 
object storage into OpenStack. 
If I try to use object storage via Swift in OpenStack, it does not work. While 
trying in Horizon, I simply get logged out of the admin user with the error 
messages "Unauthorized. Redirect to login." and "Unable to get the Swift 
container listing.". 
On an OpenStack node, to test the authentication: 
``` 
[root@xxx ~]# swift list 
Account GET failed: 
https://PublicIP:8080/swift/v1/AUTH_c72e4eab833447ea92816a3f9925cd0b?format=json
 401 Unauthorized [first 60 chars of response] 
b'{"Code":"AccessDenied","RequestId":"tx019cf2e2cfa84bc21-' 
Failed Transaction ID: tx019cf2e2cfa84bc21-006256e07e-a79117-default 
``` 

All RGWs are up and running: 
```
ceph orch ls
rgw.name ?:8000 2/2 61s ago 9d host01;host02
```

Just FYI: the RGWs use port 8000, but in my haproxy.cfg for my public server I 
expose port 8080, which leads to the RGWs on port 8000. That works, as I 
tested it with an S3 client. 

What I did: 
On the OpenStack side: 
``` 
openstack service create --name=swift --description="Swift Service" object-store 
openstack endpoint create --region RegionOne object-store public "https://publicIP:8080/swift/v1/AUTH_%(tenant_id)s" 
openstack endpoint create --region RegionOne object-store internal "https://publicIP:8080/swift/v1/AUTH_%(tenant_id)s" 
openstack endpoint create --region RegionOne object-store admin "https://publicIP:8080/swift/v1/AUTH_%(tenant_id)s" 
``` 
I also created the user `object` with a password. This user has the admin role in 
the service project and in my project. 
The port 8080 itself is open and functioning. 
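
A quick sanity check on the OpenStack side looks roughly like this (a sketch, 
output omitted; the user name matches the one created above):

```
openstack endpoint list --service object-store
openstack role assignment list --user object --names
```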


On the Ceph node: 
``` 
ceph config set mgr rgw_keystone_api_version 3 
ceph config set mgr rgw_keystone_url https://publicIP:5000 
ceph config set mgr rgw_keystone_admin_user object 
ceph config set mgr rgw_keystone_password XXX 
ceph config set mgr rgw_keystone_admin_password XXX 
ceph config set mgr rgw_keystone_admin_domain Default 
ceph config set mgr rgw_keystone_admin_project service 
ceph config set mgr rgw_keystone_accepted_roles admin,member,_member_ 
ceph config set mgr rgw_keystone_token_cache_size 100 
ceph config set mgr rgw_keystone_implicit_tenants false 
ceph config set mgr rgw_s3_auth_use_keystone true 
ceph config set mgr rgw_keystone_verify_ssl false 
ceph config set mgr rgw_swift_account_in_url true 
ceph orch redeploy rgw.xxx 
``` 
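
A hedged aside: the same options can presumably also be scoped to the RGW 
daemons rather than the mgr, e.g. as sketched below. Whether mgr or client.rgw 
is the right scope is part of my question further down:

```
ceph config set client.rgw rgw_keystone_url https://publicIP:5000
ceph config set client.rgw rgw_keystone_api_version 3
ceph orch redeploy rgw.name
```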


I used this documentation as a reference: 
https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone 
Sadly, I can't find any documentation that is cephadm-specific. Or am I 
overlooking something? 

Does anybody have an idea what I did wrong, and where? 
Is the use of `ceph config set mgr ...` right? 

cephadm version 
Using recent ceph image quay.io/ceph/ceph@sha256:xxx 
ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable) 

OpenStack Version: Wallaby 

I hope that everything is included that's needed. 

Thanks and best regards, 
Marcus 

- 
Marcus Bahn 
Fraunhofer-Institut für Algorithmen 
und Wissenschaftliches Rechnen - SCAI 

Schloss Birlinghoven 
53757 Sankt Augustin 
Germany 
Phone: +49 2241 14-4202 
E-Mail: [ mailto:marcus.b...@scai.fraunhofer.de | 
marcus.b...@scai.fraunhofer.de ] 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
One more thing, could you please also share the `ceph osd pool
autoscale-status` ?


On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham
 wrote:
>
> Thank you Dan! I will definitely disable autoscaler on the rest of our pools. 
> I can't get the PG numbers today, but I will try to get them tomorrow. We 
> definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea. At least until 
> https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> susceptible to that -- but better safe than sorry).
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current 
> value. This will allow the ongoing merge complete and prevent further merges 
> from starting.
> From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> pgp_num_target... If you share the current values of those we can help advise 
> what you need to set the pg_num to to effectively pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg 
> autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> out a split/merge operation and avoid taking one-way decisions without 
> permission from the administrator. The autoscaler is meant to be helpful, not 
> degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
>  wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > rebalancing of misplaced objects is overwhelming the cluster and impacting 
> > MON DB compaction, deep scrub repairs and us upgrading legacy bluestore 
> > OSDs. We have to pause the rebalancing if misplaced objects or we're going 
> > to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > will take us over 100 days to complete at our current recovery speed. We 
> > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > already on the path to the lower PG count and won't stop adding to our 
> > misplaced count after drop below 5%. What can we do to stop the cluster 
> > from finding more misplaced objects to rebalance? Should we set the PG num 
> > manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing 
> > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy 
> > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact 
> > when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Ray Cunningham
Perfect timing, I was just about to reply. We have disabled the autoscaler on all 
pools now. 
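
(For reference, the per-pool toggle is along these lines; the pool name is a 
placeholder:)

```
ceph osd pool set <pool> pg_autoscale_mode off
```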

Unfortunately, I can't just copy and paste from this system... 

In `ceph osd pool ls detail`, only 2 pools have any difference: 
pool1: pg_num 940, pg_num_target 256, pgp_num 926, pgp_num_target 256
pool7: pg_num 2048, pg_num_target 2048, pgp_num 883, pgp_num_target 2048

` ceph osd pool autoscale-status`
Size is defined
target size is empty
Rate is 7 for all pools except pool7, which is 1.333730697632
Raw capacity is defined
Ratio for pool1 is .0177, pool7 is .4200, and all others are 0
Target and Effective Ratio are empty
Bias is 1.0 for all
PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32. 
New PG_NUM is empty
Autoscale is now off for all
Profile is scale-up


We have set norebalance and nobackfill and are watching to see what happens. 

Thank you,
Ray 

-Original Message-
From: Dan van der Ster  
Sent: Wednesday, April 13, 2022 10:00 AM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Stop Rebalancing

One more thing, could you please also share the `ceph osd pool 
autoscale-status` ?


On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham  
wrote:
>
> Thank you Dan! I will definitely disable autoscaler on the rest of our pools. 
> I can't get the PG numbers today, but I will try to get them tomorrow. We 
> definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea. At least until 
> https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> susceptible to that -- but better safe than sorry).
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current 
> value. This will allow the ongoing merge complete and prevent further merges 
> from starting.
> From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> pgp_num_target... If you share the current values of those we can help advise 
> what you need to set the pg_num to to effectively pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg 
> autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> out a split/merge operation and avoid taking one-way decisions without 
> permission from the administrator. The autoscaler is meant to be helpful, not 
> degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
>  wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > rebalancing of misplaced objects is overwhelming the cluster and impacting 
> > MON DB compaction, deep scrub repairs and us upgrading legacy bluestore 
> > OSDs. We have to pause the rebalancing if misplaced objects or we're going 
> > to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > will take us over 100 days to complete at our current recovery speed. We 
> > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > already on the path to the lower PG count and won't stop adding to our 
> > misplaced count after drop below 5%. What can we do to stop the cluster 
> > from finding more misplaced objects to rebalance? Should we set the PG num 
> > manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing 
> > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy 
> > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact 
> > when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Developer Summit - Reef

2022-04-13 Thread Mike Perez
Hi everyone,

The summit continues in five minutes with topics on RGW:

https://bluejeans.com/908675367/
https://pad.ceph.com/p/cds-reef

On Tue, Apr 12, 2022 at 8:11 AM Mike Perez  wrote:
>
> Hi everyone,
>
> The Ceph Developer Summit for Reef is now starting on discussions on
> the Orchestrator
> https://ceph.io/en/community/events/2022/ceph-developer-summit-reef/
>
> On Fri, Apr 8, 2022 at 1:22 PM Mike Perez  wrote:
> >
> > Hi everyone,
> >
> > If you're a contributor to Ceph, please join us for the next summit on
> > April 12-22nd. The topics range from the different components,
> > Teuthology, and governance.
> >
> > Information such as schedule and etherpads can be found on the blog
> > post or directly on the main etherpad:
> >
> > https://pad.ceph.com/p/cds-reef
> > https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
> >
> > I am looking forward to the discussions!
> >
> > --
> > Mike Perez
>
>
>
> --
> Mike Perez



-- 
Mike Perez

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Ray Cunningham
All pools have gone backfillfull.


Thank you,

Ray Cunningham



Systems Engineering and Services Manager

keepertechnology

(571) 223-7242


From: Ray Cunningham
Sent: Wednesday, April 13, 2022 10:15:56 AM
To: Dan van der Ster 
Cc: ceph-users@ceph.io 
Subject: RE: [ceph-users] Stop Rebalancing

Perfect timing, I was just about to reply. We have disabled autoscaler on all 
pools now.

Unfortunately, I can't just copy and paste from this system...

`ceph osd pool ls detail` only 2 pools have any difference.
pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
pool7:  pgnum 2048, pgnum target 2048, pgpnum883, pgpnum target 2048

` ceph osd pool autoscale-status`
Size is defined
target size is empty
Rate is 7 for all pools except pool7, which is 1.333730697632
Raw capacity is defined
Ratio for pool1 is .0177, pool7 is .4200 and all others is 0
Target and Effective Ratio is empty
Bias is 1.0 for all
PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
New PG_NUM is empty
Autoscale is now off for all
Profile is scale-up


We have set norebalance and nobackfill and are watching to see what happens.

Thank you,
Ray

-Original Message-
From: Dan van der Ster 
Sent: Wednesday, April 13, 2022 10:00 AM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Stop Rebalancing

One more thing, could you please also share the `ceph osd pool 
autoscale-status` ?


On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham  
wrote:
>
> Thank you Dan! I will definitely disable autoscaler on the rest of our pools. 
> I can't get the PG numbers today, but I will try to get them tomorrow. We 
> definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea. At least until 
> https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> susceptible to that -- but better safe than sorry).
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current 
> value. This will allow the ongoing merge complete and prevent further merges 
> from starting.
> From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> pgp_num_target... If you share the current values of those we can help advise 
> what you need to set the pg_num to to effectively pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg 
> autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> out a split/merge operation and avoid taking one-way decisions without 
> permission from the administrator. The autoscaler is meant to be helpful, not 
> degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
>  wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > rebalancing of misplaced objects is overwhelming the cluster and impacting 
> > MON DB compaction, deep scrub repairs and us upgrading legacy bluestore 
> > OSDs. We have to pause the rebalancing if misplaced objects or we're going 
> > to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > will take us over 100 days to complete at our current recovery speed. We 
> > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > already on the path to the lower PG count and won't stop adding to our 
> > misplaced count after drop below 5%. What can we do to stop the cluster 
> > from finding more misplaced objects to rebalance? Should we set the PG num 
> > manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing 
> > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy 
> > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact 
> > when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Ray Cunningham
No repair IO, and misplaced objects are increasing, with norebalance and nobackfill 
set.


Thank you,

Ray


From: Ray Cunningham 
Sent: Wednesday, April 13, 2022 10:38:29 AM
To: Dan van der Ster 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Stop Rebalancing

All pools have gone backfillfull.


Thank you,

Ray Cunningham



Systems Engineering and Services Manager

keepertechnology

(571) 223-7242


From: Ray Cunningham
Sent: Wednesday, April 13, 2022 10:15:56 AM
To: Dan van der Ster 
Cc: ceph-users@ceph.io 
Subject: RE: [ceph-users] Stop Rebalancing

Perfect timing, I was just about to reply. We have disabled autoscaler on all 
pools now.

Unfortunately, I can't just copy and paste from this system...

`ceph osd pool ls detail` only 2 pools have any difference.
pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
pool7:  pgnum 2048, pgnum target 2048, pgpnum883, pgpnum target 2048

` ceph osd pool autoscale-status`
Size is defined
target size is empty
Rate is 7 for all pools except pool7, which is 1.333730697632
Raw capacity is defined
Ratio for pool1 is .0177, pool7 is .4200 and all others is 0
Target and Effective Ratio is empty
Bias is 1.0 for all
PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
New PG_NUM is empty
Autoscale is now off for all
Profile is scale-up


We have set norebalance and nobackfill and are watching to see what happens.

Thank you,
Ray

-Original Message-
From: Dan van der Ster 
Sent: Wednesday, April 13, 2022 10:00 AM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Stop Rebalancing

One more thing, could you please also share the `ceph osd pool 
autoscale-status` ?


On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham  
wrote:
>
> Thank you Dan! I will definitely disable autoscaler on the rest of our pools. 
> I can't get the PG numbers today, but I will try to get them tomorrow. We 
> definitely want to get this under control.
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Tuesday, April 12, 2022 2:46 PM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi Ray,
>
> Disabling the autoscaler on all pools is probably a good idea. At least until 
> https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> susceptible to that -- but better safe than sorry).
>
> To pause the ongoing PG merges, you can indeed set the pg_num to the current 
> value. This will allow the ongoing merge complete and prevent further merges 
> from starting.
> From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> pgp_num_target... If you share the current values of those we can help advise 
> what you need to set the pg_num to to effectively pause things where they are.
>
> BTW -- I'm going to create a request in the tracker that we improve the pg 
> autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> out a split/merge operation and avoid taking one-way decisions without 
> permission from the administrator. The autoscaler is meant to be helpful, not 
> degrade a cluster for 100 days!
>
> Cheers, Dan
>
>
>
> On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
>  wrote:
> >
> > Hi Everyone,
> >
> > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > rebalancing of misplaced objects is overwhelming the cluster and impacting 
> > MON DB compaction, deep scrub repairs and us upgrading legacy bluestore 
> > OSDs. We have to pause the rebalancing if misplaced objects or we're going 
> > to fall over.
> >
> > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > will take us over 100 days to complete at our current recovery speed. We 
> > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > already on the path to the lower PG count and won't stop adding to our 
> > misplaced count after drop below 5%. What can we do to stop the cluster 
> > from finding more misplaced objects to rebalance? Should we set the PG num 
> > manually to what our current count is? Or will that cause even more havoc?
> >
> > Any other thoughts or ideas? My goals are to stop the rebalancing 
> > temporarily so we can deep scrub and repair inconsistencies, upgrade legacy 
> > bluestore OSDs and compact our MON DBs (supposedly MON DBs don't compact 
> > when you aren't 100% active+clean).
> >
> > Thank you,
> > Ray
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
Hi, Thanks.

norebalance/nobackfill are useful to pause ongoing backfilling, but
aren't the best option now to get the PGs to go active+clean and let
the mon db come back under control. Unset those before continuing.

I think you need to set the pg_num for pool1 to something close to but
less than 926. (Or whatever the pg_num_target is when you run the
command below).
The idea is to let a few more merges complete successfully, and then,
once all PGs are active+clean, decide on the other
interventions you want to carry out.
So this ought to be good:
ceph osd pool set pool1 pg_num 920

Then for pool7, it looks like splitting is ongoing. You should be
able to pause that by setting the pg_num to something just above 883.
I would do:
ceph osd pool set pool7 pg_num 890

It may even be fastest to just set those pg_num values to exactly what
the current pgp_num_target is. You can try it.

Once your cluster is stable again, you should set those to the
nearest power of two.
Personally I would wait for #53729 to be fixed before embarking on
future pg_num changes.
(You'll have to mute a warning in the meantime -- check the docs after
the warning appears).
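
(In case it helps: the warning in question should be the pg_num
not-a-power-of-two one; muting it looks roughly like this, with the health
code name being my best guess:)

```
ceph health mute POOL_PG_NUM_NOT_POWER_OF_TWO 1w
```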

Cheers, dan

On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham
 wrote:
>
> Perfect timing, I was just about to reply. We have disabled autoscaler on all 
> pools now.
>
> Unfortunately, I can't just copy and paste from this system...
>
> `ceph osd pool ls detail` only 2 pools have any difference.
> pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
> pool7:  pgnum 2048, pgnum target 2048, pgpnum883, pgpnum target 2048
>
> ` ceph osd pool autoscale-status`
> Size is defined
> target size is empty
> Rate is 7 for all pools except pool7, which is 1.333730697632
> Raw capacity is defined
> Ratio for pool1 is .0177, pool7 is .4200 and all others is 0
> Target and Effective Ratio is empty
> Bias is 1.0 for all
> PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> New PG_NUM is empty
> Autoscale is now off for all
> Profile is scale-up
>
>
> We have set norebalance and nobackfill and are watching to see what happens.
>
> Thank you,
> Ray
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Wednesday, April 13, 2022 10:00 AM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> One more thing, could you please also share the `ceph osd pool 
> autoscale-status` ?
>
>
> On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham 
>  wrote:
> >
> > Thank you Dan! I will definitely disable autoscaler on the rest of our 
> > pools. I can't get the PG numbers today, but I will try to get them 
> > tomorrow. We definitely want to get this under control.
> >
> > Thank you,
> > Ray
> >
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Tuesday, April 12, 2022 2:46 PM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > Hi Ray,
> >
> > Disabling the autoscaler on all pools is probably a good idea. At least 
> > until https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> > susceptible to that -- but better safe than sorry).
> >
> > To pause the ongoing PG merges, you can indeed set the pg_num to the 
> > current value. This will allow the ongoing merge complete and prevent 
> > further merges from starting.
> > From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> > pgp_num_target... If you share the current values of those we can help 
> > advise what you need to set the pg_num to to effectively pause things where 
> > they are.
> >
> > BTW -- I'm going to create a request in the tracker that we improve the pg 
> > autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> > out a split/merge operation and avoid taking one-way decisions without 
> > permission from the administrator. The autoscaler is meant to be helpful, 
> > not degrade a cluster for 100 days!
> >
> > Cheers, Dan
> >
> >
> >
> > On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
> >  wrote:
> > >
> > > Hi Everyone,
> > >
> > > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > > rebalancing of misplaced objects is overwhelming the cluster and 
> > > impacting MON DB compaction, deep scrub repairs and us upgrading legacy 
> > > bluestore OSDs. We have to pause the rebalancing if misplaced objects or 
> > > we're going to fall over.
> > >
> > > Autoscaler-status tells us that we are reducing our PGs by 700'ish which 
> > > will take us over 100 days to complete at our current recovery speed. We 
> > > disabled autoscaler on our biggest pool, but I'm concerned that it's 
> > > already on the path to the lower PG count and won't stop adding to our 
> > > misplaced count after drop below 5%. What can we do to stop the cluster 
> > > from finding more misplaced objects to rebalance? Should we set the PG 
> > > num manually to what our current count is? Or will that cause even m

[ceph-users] Call for Submissions IO500 ISC 2022 list

2022-04-13 Thread IO500 Committee



Stabilization Period: Monday, April 4th - Friday, April 15th
Submission Deadline: Friday, May 13th, 2022 AoE

The IO500 is now accepting and encouraging submissions for the upcoming 
10th semi-annual IO500 list, in conjunction with ISC-HPC'22. Once again, 
we are also accepting submissions to the 10 Node Challenge to encourage 
the submission of small scale results. The new ranked lists will be 
announced via live-stream during "The IO500 and the Virtual Institute of 
I/O" BoF [1]. We hope to see many new results.


What's New
With ISC'22, we are proposing to split the list into separate 
Production and Research lists, to better reflect the important 
distinction between storage systems that run in production environments 
and those that may use more experimental hardware and software 
configurations.


Since ISC'21, the IO500 follows a two-staged approach. First, there will 
be a two-week stabilization period during which we encourage the 
community to verify that the benchmark runs properly on a variety of 
storage systems. During this period the benchmark may be updated based 
upon feedback from the community. The final benchmark will then be 
released. We expect that runs compliant with the rules made during the 
stabilization period will be valid as a final submission unless a 
significant defect is found.


We are now creating a more detailed schema to describe the hardware and 
software of the system under test, and providing a first set of tools to 
ease capturing this information for inclusion with the submission. 
Further details will be released on the submission page [2].


We are evaluating the inclusion of optional test phases for additional 
key workloads - split easy/hard find phases, 4KB and 1MB random 
read/write phases, and concurrent metadata operations. This is called an 
extended run. At the moment, we collect the information to verify that 
additional phases do not significantly impact the results of a standard 
run and an extended run to facilitate comparisons between the existing 
and new benchmark phases. In a future release, we may include some or 
all of these results as part of the standard benchmark. The extended 
results are not currently included in the scoring of any ranked list.


Background
The benchmark suite is designed to be easy to run and the community has 
multiple active support channels to help with any questions. Please note 
that submissions of all sizes are welcome; the site has customizable 
sorting, so it is possible to submit on a small system and still get a 
very good per-client score, for example. Additionally, the list is about 
much more than just the raw rank; all submissions help the community by 
collecting and publishing a wider corpus of data. More details below.


Following the success of the Top500 in collecting and analyzing 
historical trends in supercomputer technology and evolution, the IO500 
was created in 2017, published its first list at SC17, and has grown 
continually since then. The need for such an initiative has long been 
known within High-Performance Computing; however, defining appropriate 
benchmarks has long been challenging. Despite this challenge, the 
community, after long and spirited discussion, finally reached consensus 
on a suite of benchmarks and a metric for resolving the scores into a 
single ranking.


The multi-fold goals of the benchmark suite are as follows:
 * Maximizing simplicity in running the benchmark suite
 * Encouraging optimization and documentation of tuning parameters for 
performance
 * Allowing submitters to highlight their "hero run" performance numbers
 * Forcing submitters to simultaneously report performance for challenging 
IO patterns.
Specifically, the benchmark suite includes a hero-run of both IOR and 
mdtest configured however possible to maximize performance and establish 
an upper-bound for performance. It also includes an IOR and mdtest run 
with highly prescribed parameters in an attempt to determine a lower 
performance bound. Finally, it includes a namespace search as this has 
been determined to be a highly sought-after feature in HPC storage 
systems that has historically not been well-measured. Submitters are 
encouraged to share their tuning insights for publication.


The goals of the community are also multi-fold:
 * Gather historical data for the sake of analysis and to aid predictions 
of storage futures
 * Collect tuning information to share valuable performance optimizations 
across the community
 * Encourage vendors and designers to optimize for workloads beyond "hero 
runs"
 * Establish bounded expectations for users, procurers, and administrators

10 Node I/O Challenge
The 10 Node Challenge is conducted using the regular IO500 benchmark, 
however, with the rule that exactly 10 client nodes must be used to run 
the benchmark. You may use any shared storage with any number of 
servers. When submitting for the IO500 list, you can opt-in for 
"Participate in the 10 compute node challeng

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Ray Cunningham
Thank you so much, Dan! 

Can you confirm for me that for pool7, which has 2048/2048 for pg_num and 
883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be 
different for a single pool, or do pg_num and pgp_num always have to be the 
same? 

If we just set pgp_num to 890, we will have pg_num at 2048 and pgp_num at 890; 
is that OK? Because if we reduce pg_num by ~1200, it will just start a whole 
new load of misplaced-object rebalancing, won't it? 

Thank you,
Ray 
 

-Original Message-
From: Dan van der Ster  
Sent: Wednesday, April 13, 2022 11:11 AM
To: Ray Cunningham 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Stop Rebalancing

Hi, Thanks.

norebalance/nobackfill are useful to pause ongoing backfilling, but aren't the 
best option now to get the PGs to go active+clean and let the mon db come back 
under control. Unset those before continuing.

I think you need to set the pg_num for pool1 to something close to but less 
than 926. (Or whatever the pg_num_target is when you run the command below).
The idea is to let a few more merges complete successfully but then once all 
PGs are active+clean to take a decision about the other interventions you want 
to carry out.
So this ought to be good:
ceph osd pool set pool1 pg_num 920

Then for pool7 this looks like splitting is ongoing. You should be able to 
pause that by setting the pg_num to something just above 883.
I would do:
ceph osd pool set pool7 pg_num 890

It may even be fastest to just set those pg_num values to exactly what the 
current pgp_num_target is. You can try it.

Once your cluster is stable again, then you should set those to the nearest 
power of two.
Personally I would wait for #53729 to be fixed before embarking on future 
pg_num changes.
(You'll have to mute a warning in the meantime -- check the docs after the 
warning appears).

Cheers, dan

On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham  
wrote:
>
> Perfect timing, I was just about to reply. We have disabled autoscaler on all 
> pools now.
>
> Unfortunately, I can't just copy and paste from this system...
>
> `ceph osd pool ls detail` only 2 pools have any difference.
> pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
> pool7:  pgnum 2048, pgnum target 2048, pgpnum883, pgpnum target 2048
>
> ` ceph osd pool autoscale-status`
> Size is defined
> target size is empty
> Rate is 7 for all pools except pool7, which is 1.333730697632 Raw 
> capacity is defined Ratio for pool1 is .0177, pool7 is .4200 and all 
> others is 0 Target and Effective Ratio is empty Bias is 1.0 for all
> PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> New PG_NUM is empty
> Autoscale is now off for all
> Profile is scale-up
>
>
> We have set norebalance and nobackfill and are watching to see what happens.
>
> Thank you,
> Ray
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Wednesday, April 13, 2022 10:00 AM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> One more thing, could you please also share the `ceph osd pool 
> autoscale-status` ?
>
>
> On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham 
>  wrote:
> >
> > Thank you Dan! I will definitely disable autoscaler on the rest of our 
> > pools. I can't get the PG numbers today, but I will try to get them 
> > tomorrow. We definitely want to get this under control.
> >
> > Thank you,
> > Ray
> >
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Tuesday, April 12, 2022 2:46 PM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > Hi Ray,
> >
> > Disabling the autoscaler on all pools is probably a good idea. At least 
> > until https://tracker.ceph.com/issues/53729 is fixed. (You are likely not 
> > susceptible to that -- but better safe than sorry).
> >
> > To pause the ongoing PG merges, you can indeed set the pg_num to the 
> > current value. This will allow the ongoing merge complete and prevent 
> > further merges from starting.
> > From `ceph osd pool ls detail` you'll see pg_num, pgp_num, pg_num_target, 
> > pgp_num_target... If you share the current values of those we can help 
> > advise what you need to set the pg_num to to effectively pause things where 
> > they are.
> >
> > BTW -- I'm going to create a request in the tracker that we improve the pg 
> > autoscaler heuristic. IMHO the autoscaler should estimate the time to carry 
> > out a split/merge operation and avoid taking one-way decisions without 
> > permission from the administrator. The autoscaler is meant to be helpful, 
> > not degrade a cluster for 100 days!
> >
> > Cheers, Dan
> >
> >
> >
> > On Tue, Apr 12, 2022 at 9:04 PM Ray Cunningham 
> >  wrote:
> > >
> > > Hi Everyone,
> > >
> > > We just upgraded our 640 OSD cluster to Ceph 16.2.7 and the resulting 
> > > rebalancing of misplaced objects is overwhelming the cluster and 
> > > impacting MON DB compaction, deep scrub repairs and 

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
I would set the pg_num, not pgp_num. In older versions of ceph you could
manipulate these things separately, but in pacific I'm not confident about
what setting pgp_num directly will do in this exact scenario.

To understand: the difference between these two depends on whether you're
splitting or merging.
First, definitions: pg_num is the number of PGs and pgp_num is the number
used for placing objects.

So if pgp_num < pg_num, then at steady state only pgp_num pgs actually
store data, and the other pg_num-pgp_num PGs are sitting empty.

To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer pgs,
then decreases pg_num as the PGs are emptied to actually delete the now
empty PGs.

Splitting is similar but in reverse: first, Ceph creates new empty PGs by
increasing pg_num. Then it gradually increases pgp_num to start sending
data to the new PGs.

That's the general idea, anyway.

Long story short, set pg_num to something close to the current
pgp_num_target.
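
(For reference, a quick way to see the current values and targets per pool is
something like this; it is just a sketch over the plain-text pool listing:)

```
ceph osd pool ls detail | grep -E 'pg_num|pgp_num'
```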

.. Dan


On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
wrote:

> Thank you so much, Dan!
>
> Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> different for a single pool, or does pg_num and pgp_num have to always be
> the same?
>
> IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> 890, is that ok? Because if we reduce the pg_num by 1200 it will just start
> a whole new load of misplaced object rebalancing. Won't it?
>
> Thank you,
> Ray
>
>
> -Original Message-
> From: Dan van der Ster 
> Sent: Wednesday, April 13, 2022 11:11 AM
> To: Ray Cunningham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Stop Rebalancing
>
> Hi, Thanks.
>
> norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> the best option now to get the PGs to go active+clean and let the mon db
> come back under control. Unset those before continuing.
>
> I think you need to set the pg_num for pool1 to something close to but
> less than 926. (Or whatever the pg_num_target is when you run the command
> below).
> The idea is to let a few more merges complete successfully but then once
> all PGs are active+clean to take a decision about the other interventions
> you want to carry out.
> So this ought to be good:
> ceph osd pool set pool1 pg_num 920
>
> Then for pool7 this looks like splitting is ongoing. You should be able to
> pause that by setting the pg_num to something just above 883.
> I would do:
> ceph osd pool set pool7 pg_num 890
>
> It may even be fastest to just set those pg_num values to exactly what the
> current pgp_num_target is. You can try it.
>
> Once your cluster is stable again, then you should set those to the
> nearest power of two.
> Personally I would wait for #53729 to be fixed before embarking on future
> pg_num changes.
> (You'll have to mute a warning in the meantime -- check the docs after the
> warning appears).
>
> Cheers, dan
>
> On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham <
> ray.cunning...@keepertech.com> wrote:
> >
> > Perfect timing, I was just about to reply. We have disabled autoscaler
> on all pools now.
> >
> > Unfortunately, I can't just copy and paste from this system...
> >
> > `ceph osd pool ls detail` only 2 pools have any difference.
> > pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
> > pool7:  pgnum 2048, pgnum target 2048, pgpnum883, pgpnum target 2048
> >
> > ` ceph osd pool autoscale-status`
> > Size is defined
> > target size is empty
> > Rate is 7 for all pools except pool7, which is 1.333730697632 Raw
> > capacity is defined Ratio for pool1 is .0177, pool7 is .4200 and all
> > others is 0 Target and Effective Ratio is empty Bias is 1.0 for all
> > PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> > New PG_NUM is empty
> > Autoscale is now off for all
> > Profile is scale-up
> >
> >
> > We have set norebalance and nobackfill and are watching to see what
> happens.
> >
> > Thank you,
> > Ray
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Wednesday, April 13, 2022 10:00 AM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > One more thing, could you please also share the `ceph osd pool
> autoscale-status` ?
> >
> >
> > On Tue, Apr 12, 2022 at 9:50 PM Ray Cunningham <
> ray.cunning...@keepertech.com> wrote:
> > >
> > > Thank you Dan! I will definitely disable autoscaler on the rest of our
> pools. I can't get the PG numbers today, but I will try to get them
> tomorrow. We definitely want to get this under control.
> > >
> > > Thank you,
> > > Ray
> > >
> > >
> > > -Original Message-
> > > From: Dan van der Ster 
> > > Sent: Tuesday, April 12, 2022 2:46 PM
> > > To: Ray Cunningham 
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] Stop Rebalancing
> > >
> > > Hi Ray,
> > >
> > > Disabling the autoscaler on all pools is probably a go

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Gregory Farnum
On Wed, Apr 13, 2022 at 10:01 AM Dan van der Ster  wrote:
>
> I would set the pg_num, not pgp_num. In older versions of ceph you could
> manipulate these things separately, but in pacific I'm not confident about
> what setting pgp_num directly will do in this exact scenario.
>
> To understand, the difference between these two depends on if you're
> splitting or merging.
> First, definitions: pg_num is the number of PGs and pgp_num is the number
> used for placing objects.
>
> So if pgp_num < pg_num, then at steady state only pgp_num pgs actually
> store data, and the other pg_num-pgp_num PGs are sitting empty.

Wait, what? That's not right! pgp_num is the PG *placement* number; it
controls how we map PGs to OSDs. But the full PG still exists as its
own thing on the OSD and has its own data structures and objects. If
the cluster currently has a reduced pgp_num, it has changed the locations
of PGs, but it hasn't merged any PGs together. Changing the pg_num and
causing merges will invoke a whole new workload, which can be pretty
substantial.
-Greg

>
> To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer pgs,
> then decreases pg_num as the PGs are emptied to actually delete the now
> empty PGs.
>
> Splitting is similar but in reverse: first, Ceph creates new empty PGs by
> increasing pg_num. Then it gradually increases pgp_num to start sending
> data to the new PGs.
>
> That's the general idea, anyway.
>
> Long story short, set pg_num to something close to the current
> pgp_num_target.
>
> .. Dan
>
>
> On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
> wrote:
>
> > Thank you so much, Dan!
> >
> > Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> > 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> > different for a single pool, or does pg_num and pgp_num have to always be
> > the same?
> >
> > IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> > 890, is that ok? Because if we reduce the pg_num by 1200 it will just start
> > a whole new load of misplaced object rebalancing. Won't it?
> >
> > Thank you,
> > Ray
> >
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Wednesday, April 13, 2022 11:11 AM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > Hi, Thanks.
> >
> > norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> > the best option now to get the PGs to go active+clean and let the mon db
> > come back under control. Unset those before continuing.
> >
> > I think you need to set the pg_num for pool1 to something close to but
> > less than 926. (Or whatever the pg_num_target is when you run the command
> > below).
> > The idea is to let a few more merges complete successfully but then once
> > all PGs are active+clean to take a decision about the other interventions
> > you want to carry out.
> > So this ought to be good:
> > ceph osd pool set pool1 pg_num 920
> >
> > Then for pool7 this looks like splitting is ongoing. You should be able to
> > pause that by setting the pg_num to something just above 883.
> > I would do:
> > ceph osd pool set pool7 pg_num 890
> >
> > It may even be fastest to just set those pg_num values to exactly what the
> > current pgp_num_target is. You can try it.
> >
> > Once your cluster is stable again, then you should set those to the
> > nearest power of two.
> > Personally I would wait for #53729 to be fixed before embarking on future
> > pg_num changes.
> > (You'll have to mute a warning in the meantime -- check the docs after the
> > warning appears).
> >
> > Cheers, dan
> >
> > On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham <
> > ray.cunning...@keepertech.com> wrote:
> > >
> > > Perfect timing, I was just about to reply. We have disabled autoscaler
> > on all pools now.
> > >
> > > Unfortunately, I can't just copy and paste from this system...
> > >
> > > `ceph osd pool ls detail` only 2 pools have any difference.
> > > pool1:  pgnum 940, pgnum target 256, pgpnum 926 pgpnum target 256
> > > pool7:  pgnum 2048, pgnum target 2048, pgpnum883, pgpnum target 2048
> > >
> > > ` ceph osd pool autoscale-status`
> > > Size is defined
> > > target size is empty
> > > Rate is 7 for all pools except pool7, which is 1.333730697632 Raw
> > > capacity is defined Ratio for pool1 is .0177, pool7 is .4200 and all
> > > others is 0 Target and Effective Ratio is empty Bias is 1.0 for all
> > > PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> > > New PG_NUM is empty
> > > Autoscale is now off for all
> > > Profile is scale-up
> > >
> > >
> > > We have set norebalance and nobackfill and are watching to see what
> > happens.
> > >
> > > Thank you,
> > > Ray
> > >
> > > -Original Message-
> > > From: Dan van der Ster 
> > > Sent: Wednesday, April 13, 2022 10:00 AM
> > > To: Ray Cunningham 
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] Stop Rebalancing
> > >
> > > One mor

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Ray Cunningham
Ok, so in our situation with a high pg_num and a low pgp_num, is there any way we 
can make it stop backfilling temporarily? The system is already operating with 
different pg_num and pgp_num values, so I'm thinking it won't kill the cluster if we 
just set the pgp_num and make it stop splitting for the moment. 

Thank you,
Ray 

-Original Message-
From: Gregory Farnum  
Sent: Wednesday, April 13, 2022 12:07 PM
To: Dan van der Ster 
Cc: Ray Cunningham ; Ceph Users 

Subject: Re: [ceph-users] Re: Stop Rebalancing

On Wed, Apr 13, 2022 at 10:01 AM Dan van der Ster  wrote:
>
> I would set the pg_num, not pgp_num. In older versions of ceph you 
> could manipulate these things separately, but in pacific I'm not 
> confident about what setting pgp_num directly will do in this exact scenario.
>
> To understand, the difference between these two depends on if you're 
> splitting or merging.
> First, definitions: pg_num is the number of PGs and pgp_num is the 
> number used for placing objects.
>
> So if pgp_num < pg_num, then at steady state only pgp_num pgs actually 
> store data, and the other pg_num-pgp_num PGs are sitting empty.

Wait, what? That's not right! pgp_num is pg *placement* number; it controls how 
we map PGs to OSDs. But the full pg still exists as its own thing on the OSD 
and has its own data structures and objects. If currently the cluster has 
reduced pgp_num it has changed the locations of PGs, but it hasn't merged any 
PGs together. Changing the pg_num and causing merges will invoke a whole new 
workload which can be pretty substantial.
-Greg

>
> To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer 
> pgs, then decreases pg_num as the PGs are emptied to actually delete 
> the now empty PGs.
>
> Splitting is similar but in reverse: first, Ceph creates new empty PGs 
> by increasing pg_num. Then it gradually increases pgp_num to start 
> sending data to the new PGs.
>
> That's the general idea, anyway.
>
> Long story short, set pg_num to something close to the current 
> pgp_num_target.
>
> .. Dan
>
>
> On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
> 
> wrote:
>
> > Thank you so much, Dan!
> >
> > Can you confirm for me that for pool7, which has 2048/2048 for 
> > pg_num and
> > 883/2048 for pgp_num, we should change pg_num or pgp_num? And can 
> > they be different for a single pool, or does pg_num and pgp_num have 
> > to always be the same?
> >
> > IF we just set pgp_num to 890 we will have pg_num at 2048 and 
> > pgp_num at 890, is that ok? Because if we reduce the pg_num by 1200 
> > it will just start a whole new load of misplaced object rebalancing. Won't 
> > it?
> >
> > Thank you,
> > Ray
> >
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Wednesday, April 13, 2022 11:11 AM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > Hi, Thanks.
> >
> > norebalance/nobackfill are useful to pause ongoing backfilling, but 
> > aren't the best option now to get the PGs to go active+clean and let 
> > the mon db come back under control. Unset those before continuing.
> >
> > I think you need to set the pg_num for pool1 to something close to 
> > but less than 926. (Or whatever the pg_num_target is when you run 
> > the command below).
> > The idea is to let a few more merges complete successfully but then 
> > once all PGs are active+clean to take a decision about the other 
> > interventions you want to carry out.
> > So this ought to be good:
> > ceph osd pool set pool1 pg_num 920
> >
> > Then for pool7 this looks like splitting is ongoing. You should be 
> > able to pause that by setting the pg_num to something just above 883.
> > I would do:
> > ceph osd pool set pool7 pg_num 890
> >
> > It may even be fastest to just set those pg_num values to exactly 
> > what the current pgp_num_target is. You can try it.
> >
> > Once your cluster is stable again, then you should set those to the 
> > nearest power of two.
> > Personally I would wait for #53729 to be fixed before embarking on 
> > future pg_num changes.
> > (You'll have to mute a warning in the meantime -- check the docs 
> > after the warning appears).
> >
> > Cheers, dan
> >
> > On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham < 
> > ray.cunning...@keepertech.com> wrote:
> > >
> > > Perfect timing, I was just about to reply. We have disabled 
> > > autoscaler
> > on all pools now.
> > >
> > > Unfortunately, I can't just copy and paste from this system...
> > >
> > > `ceph osd pool ls detail` only 2 pools have any difference.
> > > pool1:  pgnum 940, pgnum target 256, pgpnum 926, pgpnum target 256
> > > pool7:  pgnum 2048, pgnum target 2048, pgpnum 883, pgpnum target 2048
> > >
> > > ` ceph osd pool autoscale-status`
> > > Size is defined
> > > target size is empty
> > > Rate is 7 for all pools except pool7, which is 1.333730697632 
> > > Raw capacity is defined. Ratio for pool1 is .0177, pool7 is .4200, 
> > > and all others are 0 

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Dan van der Ster
On Wed, Apr 13, 2022 at 7:07 PM Gregory Farnum  wrote:
>
> On Wed, Apr 13, 2022 at 10:01 AM Dan van der Ster  wrote:
> >
> > I would set the pg_num, not pgp_num. In older versions of ceph you could
> > manipulate these things separately, but in pacific I'm not confident about
> > what setting pgp_num directly will do in this exact scenario.
> >
> > To understand, the difference between these two depends on if you're
> > splitting or merging.
> > First, definitions: pg_num is the number of PGs and pgp_num is the number
> > used for placing objects.
> >
> > So if pgp_num < pg_num, then at steady state only pgp_num pgs actually
> > store data, and the other pg_num-pgp_num PGs are sitting empty.
>
> Wait, what? That's not right! pgp_num is pg *placement* number; it
> controls how we map PGs to OSDs. But the full pg still exists as its
> own thing on the OSD and has its own data structures and objects. If
> currently the cluster has reduced pgp_num it has changed the locations
> of PGs, but it hasn't merged any PGs together. Changing the pg_num and
> causing merges will invoke a whole new workload which can be pretty
> substantial.

Eek, yes, I got this wrong. Somehow I imagined some orthogonal
implementation based on how it appears to work in practice.

In any case, isn't this still the best approach to make all PGs go
active+clean ASAP in this scenario?

1. turn off the autoscaler (for those pools, or fully)
2. for any pool with pg_num_target or pgp_num_target values, get the
current pgp_num X and use it to `ceph osd pool set <pool> pg_num X`.

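Concretely, that would look roughly like the following (a sketch only,
reusing the pool names and pgp_num values reported earlier in this thread;
query the live pgp_num first and use whatever it actually returns):

ceph osd pool set pool1 pg_autoscale_mode off
ceph osd pool set pool7 pg_autoscale_mode off
ceph osd pool get pool1 pgp_num     # suppose this returns 926
ceph osd pool set pool1 pg_num 926
ceph osd pool get pool7 pgp_num     # suppose this returns 883
ceph osd pool set pool7 pg_num 883
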
Can someone confirm that or recommend something different?

Cheers, Dan



> -Greg
>
> >
> > To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer pgs,
> > then decreases pg_num as the PGs are emptied to actually delete the now
> > empty PGs.
> >
> > Splitting is similar but in reverse: first, Ceph creates new empty PGs by
> > increasing pg_num. Then it gradually increases pgp_num to start sending
> > data to the new PGs.
> >
> > That's the general idea, anyway.
> >
> > Long story short, set pg_num to something close to the current
> > pgp_num_target.
> >
> > .. Dan
> >
> >
> > On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
> > 
> > wrote:
> >
> > > Thank you so much, Dan!
> > >
> > > Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> > > 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> > > different for a single pool, or do pg_num and pgp_num have to always be
> > > the same?
> > >
> > > IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> > > 890, is that ok? Because if we reduce the pg_num by 1200 it will just 
> > > start
> > > a whole new load of misplaced object rebalancing. Won't it?
> > >
> > > Thank you,
> > > Ray
> > >
> > >
> > > -Original Message-
> > > From: Dan van der Ster 
> > > Sent: Wednesday, April 13, 2022 11:11 AM
> > > To: Ray Cunningham 
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] Stop Rebalancing
> > >
> > > Hi, Thanks.
> > >
> > > norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> > > the best option now to get the PGs to go active+clean and let the mon db
> > > come back under control. Unset those before continuing.
> > >
> > > I think you need to set the pg_num for pool1 to something close to but
> > > less than 926. (Or whatever the pg_num_target is when you run the command
> > > below).
> > > The idea is to let a few more merges complete successfully but then once
> > > all PGs are active+clean to take a decision about the other interventions
> > > you want to carry out.
> > > So this ought to be good:
> > > ceph osd pool set pool1 pg_num 920
> > >
> > > Then for pool7 this looks like splitting is ongoing. You should be able to
> > > pause that by setting the pg_num to something just above 883.
> > > I would do:
> > > ceph osd pool set pool7 pg_num 890
> > >
> > > It may even be fastest to just set those pg_num values to exactly what the
> > > current pgp_num_target is. You can try it.
> > >
> > > Once your cluster is stable again, then you should set those to the
> > > nearest power of two.
> > > Personally I would wait for #53729 to be fixed before embarking on future
> > > pg_num changes.
> > > (You'll have to mute a warning in the meantime -- check the docs after the
> > > warning appears).
> > >
> > > Cheers, dan
> > >
> > > On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham <
> > > ray.cunning...@keepertech.com> wrote:
> > > >
> > > > Perfect timing, I was just about to reply. We have disabled autoscaler
> > > on all pools now.
> > > >
> > > > Unfortunately, I can't just copy and paste from this system...
> > > >
> > > > `ceph osd pool ls detail` only 2 pools have any difference.
> > > > pool1:  pgnum 940, pgnum target 256, pgpnum 926, pgpnum target 256
> > > > pool7:  pgnum 2048, pgnum target 2048, pgpnum 883, pgpnum target 2048
> > > >
> > > > ` ceph osd pool autoscale-status`
> > > > Size is defined

[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Anthony D'Atri


> In any case, isn't this still the best approach to make all PGs go
> active+clean ASAP in this scenario?
> 
> 1. turn off the autoscaler (for those pools, or fully)
> 2. for any pool with pg_num_target or pgp_num_target values, get the
> current pgp_num X and use it to `ceph osd pool set <pool> pg_num X`.
> 
> Can someone confirm that or recommend something different?

FWIW that’s what I would do.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Ray Cunningham
We've done that, I'll update with what happens overnight. Thanks everyone!


Thank you,

Ray


From: Anthony D'Atri 
Sent: Wednesday, April 13, 2022 4:49 PM
To: Ceph Users 
Subject: [ceph-users] Re: Stop Rebalancing



> In any case, isn't this still the best approach to make all PGs go
> active+clean ASAP in this scenario?
>
> 1. turn off the autoscaler (for those pools, or fully)
> 2. for any pool with pg_num_target or pgp_num_target values, get the
> current pgp_num X and use it to `ceph osd pool set <pool> pg_num X`.
>
> Can someone confirm that or recommend something different?

FWIW that’s what I would do.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Neha Ojha
For the moment, Dan's workaround sounds good to me, but I'd like to
understand how we got here, in terms of the decisions that were made
by the autoscaler.
We have a config option called "target_max_misplaced_ratio" (default
value is 0.05), which is supposed to limit the number of misplaced
objects in the cluster to 5% of the total. Ray, in your case, does
that seem to have worked, given that you have ~1.3 billion misplaced
objects?

In any case, let's use https://tracker.ceph.com/issues/55303 to
capture some more debug data that can help us understand the actions
of the autoscaler. To start with, it would be helpful if you could
attach the cluster and audit logs, output of ceph -s, ceph df along
with the output of ceph osd pool autoscale-status and ceph osd pool ls
detail. Junior (Kamoltat), is there anything else that will be useful
to capture to get to the bottom of this?

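Something along these lines should collect most of that in one pass (a
sketch only; the file names are arbitrary):

ceph config get mgr target_max_misplaced_ratio > misplaced_ratio.txt   # default is 0.05
ceph -s > ceph_status.txt
ceph df > ceph_df.txt
ceph osd pool autoscale-status > autoscale_status.txt
ceph osd pool ls detail > pool_ls_detail.txt
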
Just for future reference, 16.2.8 and quincy will include a
"noautoscale" cluster-wide flag, which can be used to disable
autoscaling across pools during maintenance periods.
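
Once on 16.2.8 or quincy, the flag would be used roughly like this (a
sketch based on the planned syntax; check the release docs before relying
on it):

ceph osd pool set noautoscale     # pause autoscaling for all pools
ceph osd pool get noautoscale     # confirm the flag is set
ceph osd pool unset noautoscale   # re-enable after maintenance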

Thanks,
Neha


On Wed, Apr 13, 2022 at 1:58 PM Ray Cunningham
 wrote:
>
> We've done that, I'll update with what happens overnight. Thanks everyone!
>
>
> Thank you,
>
> Ray
>
> 
> From: Anthony D'Atri 
> Sent: Wednesday, April 13, 2022 4:49 PM
> To: Ceph Users 
> Subject: [ceph-users] Re: Stop Rebalancing
>
>
>
> > In any case, isn't this still the best approach to make all PGs go
> > active+clean ASAP in this scenario?
> >
> > 1. turn off the autoscaler (for those pools, or fully)
> > 2. for any pool with pg_num_target or pgp_num_target values, get the
> > current pgp_num X and use it to `ceph osd pool set <pool> pg_num X`.
> >
> > Can someone confirm that or recommend something different?
>
> FWIW that’s what I would do.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-13 Thread Venky Shankar
On Mon, Apr 11, 2022 at 7:33 PM Venky Shankar  wrote:
>
> On Fri, Apr 8, 2022 at 3:32 PM Venky Shankar  wrote:
> >
> > On Tue, Apr 5, 2022 at 7:44 AM Venky Shankar  wrote:
> > >
> > > Hey Josh,
> > >
> > > On Tue, Apr 5, 2022 at 4:34 AM Josh Durgin  wrote:
> > > >
> > > > Hi Venky and Ernesto, how are the mount fix and grafana container build 
> > > > looking?
> > >
> > > Currently running into various teuthology related issues when testing
> > > out the mount fix.
> > >
> > > We'll want a test run without these failures to be really sure that we
> > > aren't missing anything.
> >
> > Update: The unrelated failures have been taken care of (updates to
> > testing kernel).  Seeing one failed test with the following PR:
> >
> > https://github.com/ceph/ceph/pull/45689
> >
> > We are working on priority to get that resolved.
>
> PR merged into master.
>
> Yuri, FYI - quincy backport PR is updated:
> https://github.com/ceph/ceph/pull/45780

Merged into quincy.

>
> >
> > >
> > > >
> > > > Josh
> > > >
> > > >
> > > > On Fri, Apr 1, 2022 at 8:22 AM Venky Shankar  
> > > > wrote:
> > > >>
> > > >> On Thu, Mar 31, 2022 at 8:51 PM Venky Shankar  
> > > >> wrote:
> > > >> >
> > > >> > Hi Yuri,
> > > >> >
> > > >> > On Wed, Mar 30, 2022 at 11:24 PM Yuri Weinstein 
> > > >> >  wrote:
> > > >> > >
> > > >> > > We merged rgw, cephadm and core PRs, but some work is still 
> > > >> > > pending on fs and dashboard components.
> > > >> > >
> > > >> > > Seeking approvals for:
> > > >> > >
> > > >> > > smoke - Venky
> > > >> > > fs - Venky
> > > >> >
> > > >> > I approved the latest batch for cephfs PRs:
> > > >> > https://trello.com/c/Iq3WtUK5/1494-wip-yuri-testing-2022-03-29-0741-quincy
> > > >> >
> > > >> > There is one pending (blocker) PR:
> > > >> > https://github.com/ceph/ceph/pull/45689 - I'll let you know when the
> > > >> > backport is available.
> > > >>
> > > >> Smoke test passes with the above PR:
> > > >> https://pulpito.ceph.com/vshankar-2022-04-01_12:29:01-smoke-wip-vshankar-testing1-20220401-123425-testing-default-smithi/
> > > >>
> > > >> Requested Yuri to run FS suite w/ master (jobs were not getting
> > > >> scheduled in my run). Thanks, Yuri!
> > > >>
> > > >> ___
> > > >> ceph-users mailing list -- ceph-users@ceph.io
> > > >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >>
> > >
> > >
> > > --
> > > Cheers,
> > > Venky
> >
> >
> >
> > --
> > Cheers,
> > Venky
>
>
>
> --
> Cheers,
> Venky



-- 
Cheers,
Venky

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io