[ceph-users] Re: Multisite: metadata behind on shards
On 13.05.24 5:26 AM, Szabo, Istvan (Agoda) wrote:
> I wonder what the mechanism behind the sync is, because I need to restart all the gateways on the remote sites every 2 days to keep them in sync. (Octopus 15.2.7)

We've also seen lots of those issues with stuck RGWs with earlier versions, but there have been lots of fixes in this area since ... e.g. https://tracker.ceph.com/issues/39657

Is upgrading Ceph to a more recent version an option for you?

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
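In case it helps to narrow down where the sync gets stuck before resorting to gateway restarts, a few read-only status commands can be run on the site that falls behind (a sketch; the --source-zone value is a placeholder):

    # overall multisite sync state, including how many metadata shards are behind
    radosgw-admin sync status
    # per-shard detail for metadata and data sync
    radosgw-admin metadata sync status
    radosgw-admin data sync status --source-zone=the-other-zone
    # any accumulated sync errors
    radosgw-admin sync error list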
[ceph-users] Re: reef 18.2.3 QE validation status
On 18.04.24 8:13 PM, Laura Flores wrote: Thanks for bringing this to our attention. The leads have decided that since this PR hasn't been merged to main yet and isn't approved, it will not go in v18.2.3, but it will be prioritized for v18.2.4. I've already added the PR to the v18.2.4 milestone so it's sure to be picked up. Thanks a bunch. If you miss the train, you miss the train - fair enough. Nice to know there is another one going soon and that bug is going to be on it ! Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: reef 18.2.3 QE validation status
Hey Laura,

On 17.04.24 4:58 PM, Laura Flores wrote:
> There are two PRs that were added later to the 18.2.3 milestone concerning debian packaging: https://github.com/ceph/ceph/pulls?q=is%3Apr+is%3Aopen+milestone%3Av18.2.3 The user is asking if these can be included.

I know everybody always wants their most anticipated PR in the next point release, but please let me kindly point you to the issue of ceph-crash not working due to a small glitch in its directory permissions:

* ML post: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VACLBNVXTYNSXJSNXJSRAQNZHCHABDF4/
* Bug report: https://tracker.ceph.com/issues/64548
* Non-backport PR fixing this: https://github.com/ceph/ceph/pull/55917

This is potentially a one-liner fix that would allow ceph-crash reports to be sent again. When I noticed this, I had 47 unreported crashes queued up in one of my clusters.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rgw s3 bucket policies limitations (on users)
Hey Garcetto,

On 29.03.24 4:13 PM, garcetto wrote:
> I am trying to set bucket policies to allow different users to access the same bucket with different permissions, but it seems that is not yet supported, am I wrong?
> https://docs.ceph.com/en/reef/radosgw/bucketpolicy/#limitations
> "We do not yet support setting policies on users, groups, or roles."

Maybe check out my previous, somewhat similar question: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/S2TV7GVFJTWPYA6NVRXDL2JXYUIQGMIN/

And PR https://github.com/ceph/ceph/pull/44434 could also be of interest. I would love for RGW to support more detailed bucket policies, especially with external / Keystone authentication.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
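Note that the quoted limitation is about attaching policies *to* users, groups or roles; granting different RGW users different permissions on one bucket can already be expressed in a plain bucket policy with multiple principals. A rough sketch (bucket and user names are made up; applied here with s3cmd, but any S3 client that can set a bucket policy works):

    cat > policy.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {"AWS": ["arn:aws:iam:::user/reader"]},
          "Action": ["s3:ListBucket", "s3:GetObject"],
          "Resource": ["arn:aws:s3:::mybucket", "arn:aws:s3:::mybucket/*"]
        },
        {
          "Effect": "Allow",
          "Principal": {"AWS": ["arn:aws:iam:::user/writer"]},
          "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
          "Resource": ["arn:aws:s3:::mybucket", "arn:aws:s3:::mybucket/*"]
        }
      ]
    }
    EOF
    # run as the bucket owner
    s3cmd setpolicy policy.json s3://mybucket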
[ceph-users] Re: Hanging request in S3
Hi Casey, Interesting, especially since the request it hangs on is a GET request. I set the option and restarted the RGW I test with. The POSTs for deleting take a while, but there are no longer any blocking GET or POST requests. Thank you! Best, Christian PS: Sorry for pressing the wrong reply button, Casey ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Journal size recommendations
On 01.03.22 19:57, Eugen Block wrote:
> Can you be more specific about what exactly you are looking for? Are you talking about the RocksDB size? And what is the unit for 5012? It's really not clear to me what you're asking. And since the recommendations vary between different use cases you might want to share more details about your use case.

FWIW, I suppose the OP was asking about this setting: https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_journal_size

And https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#journal-settings states: "This section applies only to the older Filestore OSD back end. Since Luminous BlueStore has been default and preferred." So it's totally obsolete with BlueStore.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
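If in doubt whether the journal setting still matters for a given cluster, the object store backend of each OSD can be checked; a quick sketch (OSD id 0 is just an example):

    # backend of a single OSD
    ceph osd metadata 0 | grep osd_objectstore
    # count how many OSDs report bluestore across the whole cluster
    ceph osd metadata | grep -c '"osd_objectstore": "bluestore"'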
[ceph-users] Re: rgw dynamic bucket sharding will hang io
On 08.03.24 14:25, Christian Rohmann wrote:
> What do you mean by blocking IO? No bucket actions (read / write) or high IO utilization?

According to https://docs.ceph.com/en/latest/radosgw/dynamicresharding/, "Writes to the target bucket are blocked (but reads are not) briefly during resharding process." Are you observing that it is not quite that "brief" then?

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
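To correlate observed IO stalls with resharding activity, the reshard queue and per-bucket progress can be checked; a small sketch (the bucket name is a placeholder):

    # buckets currently queued for or undergoing a reshard
    radosgw-admin reshard list
    # per-shard reshard status of a specific bucket
    radosgw-admin reshard status --bucket=mybucket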
[ceph-users] Re: rgw dynamic bucket sharding will hang io
On 08.03.24 07:22, nuabo tan wrote: When reshard occurs, io will be blocked, why has this serious problem not been solved? Do you care to elaborate on this a bit more? Which Ceph release are you using? Are you using multisite replication or are you talking about a single RGW site? What do you mean by blocking IO? No bucket actions (read / write) or high IO utilization? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Hanging request in S3
6 Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket get_obj_state: setting s->obj_tag to 107ace7a-a829-4d1c-9cb8-9db30644b786.395658.12884446303569321109
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket bucket index object: rechenzentrum.rgw.buckets.index:.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3.1.34
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket cache get: name=rechenzentrum.rgw.log++bucket.sync-source-hints.sql20 : hit (negative entry)
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket cache get: name=rechenzentrum.rgw.log++bucket.sync-target-hints.sql20 : hit (negative entry)
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket reflect(): flow manager (bucket=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3): adding source pipe: {s={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0},d={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket reflect(): flow manager (bucket=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3): adding dest pipe: {s={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0},d={b=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket reflect(): flow manager (bucket=): adding source pipe: {s={b=*,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0},d={b=*,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket reflect(): flow manager (bucket=): adding dest pipe: {s={b=*,z=107ace7a-a829-4d1c-9cb8-9db30644b786,az=0},d={b=*,z=3caabb9a-4e3b-4b8a-8222-34c33dd63210,az=0}}
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket chain_cache_entry: cache_locator=
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket chain_cache_entry: couldn't find cache locator
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket couldn't put bucket_sync_policy cache entry, might have raced with data changes
Mär 06 19:36:17 radosgw[8318]: req 13321243250692796422 0.336010247s s3:list_bucket RGWDataChangesLog::add_entry() bucket.name=sql20 shard_id=34 now=2024-03-06T18:36:17.978389+ cur_expiration=1970-01-01T00:00:00.00+

I don't see any clear error, but somehow the last few lines look odd to me:

- Where it previously said flow manager (bucket=sql20:3caabb9a-4e3b-4b8a-8222-34c33dd63210.10724501.3), there is now no bucket at all: flow manager (bucket=)
- No cache locator is found. No idea if this is okay or not.
- The cur_expiration a few lines later is set to unix time 0 (1970-01-01T00:00:00.00+)
- I did this multiple times and it always seems to be shard 34 that has the issue.

Did someone see something like this before? Any ideas how to remedy the situation, or at least where or what to look for?

Best,
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: debian-reef_OLD?
On 04.03.24 22:24, Daniel Brown wrote: debian-reef/ Now appears to be: debian-reef_OLD/ Could this have been some sort of "release script" just messing up the renaming / symlinking to the most recent stable? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)
On 23.02.24 16:18, Christian Rohmann wrote:
> I just noticed issues with ceph-crash using the Debian / Ubuntu packages (package: ceph-base): While the /var/lib/ceph/crash/posted folder is created by the package install, it's not properly chowned to ceph:ceph by the postinst script. [...] You might want to check if you are affected as well. Failing to post crashes to the local cluster results in them not being reported back via telemetry.

Sorry to bluntly bump this again, but did nobody else notice this on their clusters? Call me egoistic, but the more clusters return crash reports, the more stable my Ceph likely becomes ;-)

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)
Hey ceph-users,

I just noticed issues with ceph-crash using the Debian / Ubuntu packages (package: ceph-base): While the /var/lib/ceph/crash/posted folder is created by the package install, it's not properly chowned to ceph:ceph by the postinst script. This might also affect RPM-based installs somehow, but I did not look into that. I opened a bug report with all the details and two ideas to fix this: https://tracker.ceph.com/issues/64548

The wrong ownership causes ceph-crash to NOT work at all. I myself missed quite a few crash reports. All of them were just sitting around on the machines, but were reported right after I did:

    chown ceph:ceph /var/lib/ceph/crash/posted
    systemctl restart ceph-crash.service

You might want to check if you are affected as well. Failing to post crashes to the local cluster results in them not being reported back via telemetry.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
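A quick way to check whether a host is affected and whether the fix above took effect could look like this (a sketch; the paths are the default packaging paths):

    # ownership should be ceph:ceph, not root:root
    stat -c '%U:%G %n' /var/lib/ceph/crash /var/lib/ceph/crash/posted
    # after fixing ownership and restarting ceph-crash, previously stuck
    # crash reports should show up cluster-wide
    ceph crash ls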
[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef
On 01.02.24 10:10, Christian Rohmann wrote:
> [...] I am wondering if ceph-exporter [2] is also built and packaged via the ceph packages [3] for installations that use them?
> [2] https://github.com/ceph/ceph/tree/main/src/exporter
> [3] https://docs.ceph.com/en/latest/install/get-packages/

I could not find ceph-exporter in any of the packages or as a single binary, so I opened an issue: https://tracker.ceph.com/issues/64317

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: how can install latest dev release?
On 31.01.24 11:33, garcetto wrote:
> Thank you, but that seems related to Quincy; there is nothing on the latest versions in the doc... maybe the doc is not updated?

I don't understand what you are missing. I just used a documentation link pointing to the Quincy version of this page, yes. The "latest" documentation is at https://docs.ceph.com/en/latest/install/get-packages/#ceph-development-packages. But it seems nothing has changed: there are dev packages available at the URLs mentioned there.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef
This change is documented at https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics [1], which also mentions the deployment of ceph-exporter, now used to collect per-host metrics from the local daemons. While this deployment is done by cephadm if used, I am wondering if ceph-exporter [2] is also built and packaged via the ceph packages [3] for installations that use them?

Regards
Christian

[1] https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
[2] https://github.com/ceph/ceph/tree/main/src/exporter
[3] https://docs.ceph.com/en/latest/install/get-packages/
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
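A quick way to check whether a host actually runs ceph-exporter and whether the packages ship it at all; this is only a sketch, and 9926 is assumed to be the default ceph-exporter port:

    # metrics endpoint of a running ceph-exporter (adjust the port if changed)
    curl -s http://localhost:9926/metrics | head
    # look for the binary in the installed Debian/Ubuntu packages
    dpkg -L ceph-base ceph-common 2>/dev/null | grep -i exporter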
[ceph-users] Re: how can install latest dev release?
On 31.01.24 09:38, garcetto wrote: how can i install latest dev release using cephadm? I suppose you found https://docs.ceph.com/en/quincy/install/get-packages/#ceph-development-packages, but yes, that only seems to target a package installation. Would be nice if there were also dev containers being built somewhere to use with cephadm. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 3 DC with 4+5 EC not quite working
I could be wrong, however as far as I can see you have 9 chunks, which requires 9 failure domains. Your failure domain is set to datacenter, of which you only have 3. So that won't work. You need to set your failure domain to host and then create a crush rule to choose a DC and choose 3 hosts within each DC. Something like this should work:

step choose indep 3 type datacenter
step chooseleaf indep 3 type host

On Fri, 12 Jan 2024 at 20:58, Torkil Svensgaard wrote:
> We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
> quite get it to work. Ceph version 17.2.7. These are the hosts (there
> will eventually be 6 hdd hosts in each datacenter):
>
> -33   886.00842   datacenter 714
>  -7   209.93135       host ceph-hdd1
> -69    69.86389       host ceph-flash1
>  -6   188.09579       host ceph-hdd2
>  -3   233.57649       host ceph-hdd3
> -12   184.54091       host ceph-hdd4
> -34   824.47168   datacenter DCN
> -73    69.86389       host ceph-flash2
>  -2   201.78067       host ceph-hdd5
> -81   288.26501       host ceph-hdd6
> -31   264.56207       host ceph-hdd7
> -36  1284.48621   datacenter TBA
> -77    69.86389       host ceph-flash3
> -21   190.83224       host ceph-hdd8
> -29   199.08838       host ceph-hdd9
> -11   193.85382       host ceph-hdd10
>  -9   237.28154       host ceph-hdd11
> -26   187.19536       host ceph-hdd12
>  -4   206.37102       host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default crush-failure-domain=datacenter crush-device-class=hdd
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg incomplete
>     pg 33.0 is creating+incomplete, acting [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for 'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
>     id 7
>     type erasure
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class hdd
>     step chooseleaf indep 0 type datacenter
>     step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
>     id 7
>     type erasure
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class hdd
>     step choose indep 0 type datacenter
>     step chooseleaf indep 3 type host
>     step emit
> }
> "
>
> Based on some testing and dialogue I had with Red Hat support last year when we were on RHCS, and it seemed to work.
> Then:
>
> ceph fs add_data_pool cephfs cephfs.hdd.data
> ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data
>
> I started copying data to the subvolume and increased pg_num a couple of times:
>
> ceph osd pool set cephfs.hdd.data pg_num 256
> ceph osd pool set cephfs.hdd.data pg_num 2048
>
> But at some point it failed to activate new PGs eventually leading to this:
>
> "
> [WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
>     mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are blocked > 30 secs, oldest blocked for 25455 secs
> [WARN] MDS_TRIM: 1 MDSs behind on trimming
>     mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming (997/128) max_segments: 128, num_segments: 997
> [WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive
>     pg 33.6f6 is stuck inactive for 8h, current state activating+remapped, last acting [50,79,116,299,98,219,164,124,421]
>     pg 33.6fa is stuck inactive for 11h, current state activating+undersized+degraded+remapped, last acting [17,408,NONE,196,223,290,73,39,11]
>     pg 33.705 is stuck inactive for 11h, current state activating+undersized+degraded+remapped, last acting [33,273,71,NONE,411,96,28,7,161]
>     pg 33.721 is stuck inactive for 7h, current state activating+remapped, last acting [283,150,209,423,103,325,118,142,87]
>     pg 33.726 is stuck inactive for 11h, current state activating+undersized+degraded+remapped, last acting [234,NONE,416,121,54,141,277,265,19]
> [WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects degraded (0.000%), 3 pgs degraded, 3 pgs undersized
>     pg 33.6fa is stuck undersized for 7h, current state activating+undersized+degraded+remapped, last acting [17,408,NONE,196,223,290,73,39,11]
>     pg 33.705 is stuck undersized for 7h, current state
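For reference, the manual crush rule change described in the quoted message can be made (and sanity-checked before injecting it) with a workflow roughly like this; rule id 7 and the file names are just examples:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit the rule in crushmap.txt, then recompile and test the resulting mappings
    crushtool -c crushmap.txt -o crushmap-new.bin
    crushtool -i crushmap-new.bin --test --rule 7 --num-rep 9 --show-mappings | head
    ceph osd setcrushmap -i crushmap-new.bin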
[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures
Hey Istvan,

On 10.01.24 03:27, Szabo, Istvan (Agoda) wrote:
> I'm using this in the frontend HTTPS config on haproxy, it works well so far:
>
> stick-table type ip size 1m expire 10s store http_req_rate(10s)
> tcp-request inspect-delay 10s
> tcp-request content track-sc0 src
> http-request deny deny_status 429 if { sc_http_req_rate(0) gt 1 }

But this serves as a basic rate limit for all requests coming from a single IP address, right? My question was rather about limiting clients in regards to authentication requests / unauthorized requests, which end up hammering the auth system (Keystone in my case) at full rate.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures
Happy New Year Ceph-Users!

With the holidays and people likely being away, I take the liberty to bluntly BUMP this question about protecting RGW from DoS below:

On 22.12.23 10:24, Christian Rohmann wrote:

Hey Ceph-Users,

RGW does have options [1] to rate limit ops or bandwidth per bucket or user. But those only come into play when the request is authenticated. I'd like to also protect the authentication subsystem from malicious or invalid requests. So in case e.g. some EC2 credentials are not valid (anymore) and clients start hammering the RGW with those requests, I'd like to make it cheap to deal with those requests.

Especially in case some external authentication like OpenStack Keystone [2] is used, valid access tokens are cached within the RGW. But requests with invalid credentials end up being sent at full rate to the external API [3] as there is no negative caching. And even if there were, that would only limit the external auth requests for the same set of invalid credentials, but it would surely reduce the load in that case. Since the HTTP request is blocking:

[...]
2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to https://keystone.example.com/v3/s3tokens
2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http request
2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0
[...]

this does not only stress the external authentication API (Keystone in this case), but also blocks RGW threads for the duration of the external call.

I am currently looking into using the capabilities of HAProxy to rate limit requests based on their resulting http-response [4]. So in essence to rate-limit or tarpit clients that "produce" a high number of 403 "InvalidAccessKeyId" responses. To have less collateral damage it might make sense to limit based on the presented credentials themselves. But this would require extracting and tracking HTTP headers or URL parameters (presigned URLs) [5] and putting them into tables.

* What are your thoughts on the matter?
* What kind of measures did you put in place?
* Does it make sense to extend RGW's capabilities to deal with those cases itself?
** adding negative caching
** rate limits on concurrent external authentication requests (or is there a pool of connections for those requests?)

Regards
Christian

[1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
[2] https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
[3] https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
[4] https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
[5] https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephadm - podman vs docker
General complaint about docker is usually that it by default stops all running containers when the docker daemon gets shutdown. There is the "live-restore" option (which has been around for a while) but that's turned off by default (and requires a daemon restart to enable). It only supports patch updates (no major version upgrades) though that might be sufficient for you. On Thu, 28 Dec 2023 at 03:30, Murilo Morais wrote: > Good morning everybody! > > Guys, are there any differences or limitations when using Docker instead of > Podman? > > Context: I have a cluster with Debian 11 running Podman (3.0.1), but the > iSCSI service, when restarted, the "tcmu-runner" binary is in "Z State" and > the "rbd-target-api" script enters "D State" and never dies, causing the > service not to start until I perform a reboot. On machines that use > distributions based on Red Hat with podman 4+ this behavior does not > happen. > > I don't want to use a repository that I don't know about just to update > podman. > > I haven't tested it with Debian 12 yet, as we experienced some problems > with bootstrap, so we decided to use Debian 11. > > I'm thinking about testing with Docker but I don't know what the difference > is between both solutions in the CEPH context. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures
Hey Ceph-Users,

RGW does have options [1] to rate limit ops or bandwidth per bucket or user. But those only come into play when the request is authenticated. I'd like to also protect the authentication subsystem from malicious or invalid requests. So in case e.g. some EC2 credentials are not valid (anymore) and clients start hammering the RGW with those requests, I'd like to make it cheap to deal with those requests.

Especially in case some external authentication like OpenStack Keystone [2] is used, valid access tokens are cached within the RGW. But requests with invalid credentials end up being sent at full rate to the external API [3] as there is no negative caching. And even if there were, that would only limit the external auth requests for the same set of invalid credentials, but it would surely reduce the load in that case. Since the HTTP request is blocking:

[...]
2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to https://keystone.example.com/v3/s3tokens
2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http request
2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0
[...]

this does not only stress the external authentication API (Keystone in this case), but also blocks RGW threads for the duration of the external call.

I am currently looking into using the capabilities of HAProxy to rate limit requests based on their resulting http-response [4]. So in essence to rate-limit or tarpit clients that "produce" a high number of 403 "InvalidAccessKeyId" responses. To have less collateral damage it might make sense to limit based on the presented credentials themselves. But this would require extracting and tracking HTTP headers or URL parameters (presigned URLs) [5] and putting them into tables.

* What are your thoughts on the matter?
* What kind of measures did you put in place?
* Does it make sense to extend RGW's capabilities to deal with those cases itself?
** adding negative caching
** rate limits on concurrent external authentication requests (or is there a pool of connections for those requests?)

Regards
Christian

[1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
[2] https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
[3] https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
[4] https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
[5] https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
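As a side note on [1]: the built-in limits are configured per user or bucket and, as described above, only apply after authentication, so they do not help against unauthenticated hammering. A rough sketch of how they are set, assuming a Quincy or newer radosgw-admin (uid and numbers are placeholders):

    radosgw-admin ratelimit set --ratelimit-scope=user --uid=someuser \
        --max-read-ops=1024 --max-write-ops=256
    radosgw-admin ratelimit enable --ratelimit-scope=user --uid=someuser
    radosgw-admin ratelimit get --ratelimit-scope=user --uid=someuser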
[ceph-users] Re: EC Profiles & DR
You can structure your crush map so that you get multiple EC chunks per host in a way that you can still survive a host outage outage even though you have fewer hosts than k+1 For example if you run an EC=4+2 profile on 3 hosts you can structure your crushmap so that you have 2 chunks per host. This means even if one host is down you are still guaranteed to have 4 chunks available. If you then set min_size = 4 you can still operate your cluster in that situation - albeit risky since any additional failure in that time will lead to data loss. However in a highly constrained setup it might be a trade-off that's worth it for you. There have been examples of this on this mailing list in the past. On Wed, 6 Dec 2023 at 12:11, Rich Freeman wrote: > On Tue, Dec 5, 2023 at 6:35 AM Patrick Begou > wrote: > > > > Ok, so I've misunderstood the meaning of failure domain. If there is no > > way to request using 2 osd/node and node as failure domain, with 5 nodes > > k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a > > raid1 setup. A little bit better than replication in the point of view > > of global storage capacity. > > > > I'm not sure what you mean by requesting 2osd/node. If the failure > domain is set to the host, then by default k/m refer to hosts, and the > PGs will be spread across all OSDs on all hosts, but with any > particular PG only being present on one OSD on each host. You can get > fancy with device classes and crush rules and such and be more > specific with how they're allocated, but that would be the typical > behavior. > > Since k/m refer to hosts, then k+m must be less than or equal to the > number of hosts or you'll have a degraded pool because there won't be > enough hosts to allocate them all. It won't ever stack them across > multiple OSDs on the same host with that configuration. > > k=2,m=2 with min=3 would require at least 4 hosts (k+m), and would > allow you to operate degraded with a single host down, and the PGs > would become inactive but would still be recoverable with two hosts > down. While strictly speaking only 4 hosts are required, you'd do > better to have more than that since then the cluster can immediately > recover from a loss, assuming you have sufficient space. As you say > it is no more space-efficient than RAID1 or size=2, and it suffers > write amplification for modifications, but it does allow recovery > after the loss of up to two hosts, and you can operate degraded with > one host down which allows for somewhat high availability. > > -- > Rich > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
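To illustrate the 2-chunks-per-host idea mentioned above: a sketch of a crush rule (name and id are made up) that places 2 chunks of a k=4, m=2 profile on each of 3 hosts, so a single host outage still leaves 4 chunks available:

    rule ec42_two_per_host {
            id 42
            type erasure
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default
            step choose indep 3 type host
            step chooseleaf indep 2 type osd
            step emit
    }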
[ceph-users] Automatic triggering of the Ubuntu SRU process, e.g. for the recent 17.2.7 Quincy point release?
Hey Yuri, hey ceph-users,

first of all, thanks for all your work on developing and maintaining Ceph.

I was just wondering if there is any sort of process or trigger towards the Ubuntu release team following a point release, so that they also create updated packages. If you look at https://packages.ubuntu.com/jammy-updates/ceph, 17.2.6 is still the only update available. There was an [SRU] bug raised for 17.2.6 (https://bugs.launchpad.net/cloud-archive/+bug/2018929), and I now opened a similar one (https://bugs.launchpad.net/cloud-archive/+bug/2043336), hoping this is the right way of triggering the packaging of this point release.

Even though the Ceph team does not build Quincy packages for Ubuntu 22.04 LTS (Jammy) themselves, it would be nice to still treat it somewhat as a release channel and to automatically trigger these kinds of processes.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Packages for 17.2.7 released without release notes / announcement (Re: Re: Status of Quincy 17.2.5 ?)
Sorry to dig up this old thread ...

On 25.01.23 10:26, Christian Rohmann wrote:
> On 20/10/2022 10:12, Christian Rohmann wrote:
>> 1) May I bring up again my remarks about the timing:
>> On 19/10/2022 11:46, Christian Rohmann wrote:
>>> I believe the upload of a new release to the repo prior to the announcement happens quite regularly - it might just be due to the technical process of releasing. But I agree it would be nice to have a more "bit flip" approach to new releases in the repo and not have the packages appear as updates prior to the announcement and final release and update notes. By my observations sometimes there are packages available on the download servers via the "last stable" folders such as https://download.ceph.com/debian-quincy/ quite some time before the announcement of a release is out. I know it's hard to time this right with mirrors requiring some time to sync files, but it would be nice to not see the packages, or have people install them, before the release notes and potential pointers to changes are out.
>
> Today's 16.2.11 release shows the exact issue I described above:
> 1) 16.2.11 packages are already available via e.g. https://download.ceph.com/debian-pacific
> 2) The release notes are not yet merged (https://github.com/ceph/ceph/pull/49839), thus https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ shows a 404 :-)
> 3) No announcement like https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3D3ZTB5C5BJT5WRSJL5CVSD/ has been sent to the ML yet.

I really appreciate the work (implementation and also testing) that goes into each release. But the release of 17.2.7 showed the same issue of "packages available before the news is out":

1) Packages are available on e.g. download.ceph.com
2) There are NO release notes at https://docs.ceph.com/en/latest/releases/ yet
3) And there is no announcement on the ML yet

It would be awesome if you could consider bit-flip releases, with packages only becoming available together with the communication / release notes.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Hardware recommendations for a Ceph cluster
On Mon, 9 Oct 2023 at 14:24, Anthony D'Atri wrote: > > > > AFAIK the standing recommendation for all flash setups is to prefer fewer > > but faster cores > > Hrm, I think this might depend on what you’re solving for. This is the > conventional wisdom for MDS for sure. My sense is that OSDs can use > multiple cores fairly well, so I might look at the cores * GHz product. > Especially since this use-case sounds like long-tail performance probably > isn’t worth thousands. Only four OSD servers, Neutron, Kingston. I don’t > think the OP has stated any performance goals other than being more > suitable to OpenStack instances than LFF spinners. > Well, the 75F3 seems to retail for less than the 7713P, so it should technically be cheaper but then availability and supplier quotes are always an important factor. > > > so something like a 75F3 might be yielding better latency. > > Plus you probably want to experiment with partitioning the NVMEs and > > running multiple OSDs per drive - either 2 or 4. > > Mark Nelson has authored a series of blog posts that explore this in great > detail over a number of releases. TL;DR: with Quincy or Reef, especially, > my sense is that multiple OSDs per NVMe device is not the clear win that it > once was, and just eats more RAM. Mark has also authored detailed posts > about OSD performance vs cores per OSD, though IIRC those are for one OSD > in isolation. In a real-world cluster, especially one this small, I > suspect that replication and the network will be bottlenecks before either > of the factors discussed above. > > Thanks for reminding me of those. One thing I'm missing from https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/ is the NVMe utilization - no point in buying NVMe that are blazingly fast (in terms sustained of random 4k IOPS performance) if you have no chance to actually utilize it. In summary it seems - if you have many cores then multiple OSD/NVME would provide a benefit, with fewer cores not so much. Still, it would also be good to see the same benchmark with a faster CPU (but less cores) and see what the actual difference is but I guess duplicating the test setup with a different CPU is a bit tricky budget-wsie. > ymmv. > > > > > > > On Sat, 7 Oct 2023 at 08:23, Gustavo Fahnle wrote: > > > >> Hi, > >> > >> Currently, I have an OpenStack installation with a Ceph cluster > consisting > >> of 4 servers for OSD, each with 16TB SATA HDDs. My intention is to add a > >> second, independent Ceph cluster to provide faster disks for OpenStack > VMs. > >> The idea for this second cluster is to exclusively provide RBD services > to > >> OpenStack. I plan to start with a cluster composed of 3 mon/mgr nodes > >> similar to what we currently have (3 virtualized servers with VMware) > with > >> 4 cores, 8GB of memory, 80GB disk and 10GB network > >> each server. > >> In the current cluster, these nodes have low resource consumption, less > >> than 10% CPU usage, 40% memory usage, and less than 100Mb/s of network > >> usage. 
> >> > >> For the OSDs, I'm thinking of starting with 3 or 4 servers, specifically > >> Supermicro AS-1114S-WN10RT, each with: > >> > >> 1 AMD EPYC 7713P Gen 3 processor (64 Core, 128 Threads, 2.0GHz) > >> 256GB of RAM > >> 2 x NVME 1TB for the operating system > >> 10 x NVME Kingston DC1500M U.2 7.68TB for the OSDs > >> Two Intel NIC E810-XXVDA2 25GbE Dual Port (2 x SFP28) PCIe 4.0 x8 cards > >> Connected to 2 MikroTik CRS518-16XS-2XQ-RM switches at 100GbE per server > >> Connection to OpenStack would be via 4 x 10GB to our core switch. > >> > >> I would like to hear opinions about this configuration, recommendations, > >> criticisms, etc. > >> > >> If any of you have references or experience with any of the components > in > >> this initial configuration, they would be very welcome. > >> > >> Thank you very much in advance. > >> > >> Gustavo Fahnle > >> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Hardware recommendations for a Ceph cluster
AFAIK the standing recommendation for all flash setups is to prefer fewer but faster cores, so something like a 75F3 might be yielding better latency. Plus you probably want to experiment with partitioning the NVMEs and running multiple OSDs per drive - either 2 or 4. On Sat, 7 Oct 2023 at 08:23, Gustavo Fahnle wrote: > Hi, > > Currently, I have an OpenStack installation with a Ceph cluster consisting > of 4 servers for OSD, each with 16TB SATA HDDs. My intention is to add a > second, independent Ceph cluster to provide faster disks for OpenStack VMs. > The idea for this second cluster is to exclusively provide RBD services to > OpenStack. I plan to start with a cluster composed of 3 mon/mgr nodes > similar to what we currently have (3 virtualized servers with VMware) with > 4 cores, 8GB of memory, 80GB disk and 10GB network > each server. > In the current cluster, these nodes have low resource consumption, less > than 10% CPU usage, 40% memory usage, and less than 100Mb/s of network > usage. > > For the OSDs, I'm thinking of starting with 3 or 4 servers, specifically > Supermicro AS-1114S-WN10RT, each with: > > 1 AMD EPYC 7713P Gen 3 processor (64 Core, 128 Threads, 2.0GHz) > 256GB of RAM > 2 x NVME 1TB for the operating system > 10 x NVME Kingston DC1500M U.2 7.68TB for the OSDs > Two Intel NIC E810-XXVDA2 25GbE Dual Port (2 x SFP28) PCIe 4.0 x8 cards > Connected to 2 MikroTik CRS518-16XS-2XQ-RM switches at 100GbE per server > Connection to OpenStack would be via 4 x 10GB to our core switch. > > I would like to hear opinions about this configuration, recommendations, > criticisms, etc. > > If any of you have references or experience with any of the components in > this initial configuration, they would be very welcome. > > Thank you very much in advance. > > Gustavo Fahnle > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
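If you want to try the multiple-OSDs-per-NVMe approach, ceph-volume can do the splitting itself; a sketch (device paths are examples):

    # two OSDs per NVMe device, ceph-volume creates the LVs; --report shows the plan first
    ceph-volume lvm batch --report --osds-per-device 2 /dev/nvme2n1 /dev/nvme3n1
    ceph-volume lvm batch --yes --osds-per-device 2 /dev/nvme2n1 /dev/nvme3n1
    # with cephadm the equivalent is an OSD service spec containing "osds_per_device: 2"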
[ceph-users] CVE-2023-43040 - Improperly verified POST keys in Ceph RGW?
Hey Ceph-users, I just noticed there is a post to oss-security (https://www.openwall.com/lists/oss-security/2023/09/26/10) about a security issue with Ceph RGW, signed by IBM / Red Hat and including a patch by DO. I also raised an issue on the tracker (https://tracker.ceph.com/issues/63004) about this, as I could not find one yet. It seems a weird way of disclosing such a thing and I am wondering if anybody knows any more about this? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] What is causing *.rgw.log pool to fill up / not be expired (Re: RGW multisite logs (data, md, bilog) not being trimmed automatically?)
I am unfortunately still observing this issue of the RADOS pool "*.rgw.log" filling up with more and more objects:

On 26.06.23 18:18, Christian Rohmann wrote:
> On the primary cluster I am observing an ever-growing (objects and bytes) "sitea.rgw.log" pool, but not on the remote "siteb.rgw.log", which is only 300MiB and around 15k objects with no growth. Metrics show that the growth of the pool on the primary has been linear for at least 6 months, so no sudden spikes or anything. Also, sync status appears to be totally happy. There are also no warnings in regards to large OMAPs or anything similar.

Could anybody kindly point me in the right direction to search for the cause of this? What kinds of logs and data are stored in this pool?

Thanks and with kind regards,
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
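Something along these lines might help to narrow down which of the multisite logs is the one growing (a sketch; the pool name matches the one above and the prefix grouping is only approximate):

    # rough grouping of the objects in the log pool by name prefix
    rados -p sitea.rgw.log ls | cut -d. -f1 | sort | uniq -c | sort -rn | head
    # status of the metadata and data logs, and any sync errors
    radosgw-admin mdlog status
    radosgw-admin datalog status
    radosgw-admin sync error list | head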
[ceph-users] Re: Contionuous spurious repairs without cause?
Hi, interesting, that’s something we can definitely try! Thanks! Christian > On 5. Sep 2023, at 16:37, Manuel Lausch wrote: > > Hi, > > in older versions of ceph with the auto-repair feature the PG state of > scrubbing PGs had always the repair state as well. > With later versions (I don't know exactly at which version) ceph > differentiated scrubbing and repair again in the PG state. > > I think as long as there are no errors loged all should be fine. If > you disable auto repair, the issue should disapear as well. In case of > scrub errors you will then see appropriate states. > > Regards > Manuel > > On Tue, 05 Sep 2023 14:14:56 + > Eugen Block wrote: > >> Hi, >> >> it sounds like you have auto-repair enabled (osd_scrub_auto_repair). I >> guess you could disable that to see what's going on with the PGs and >> their replicas. And/or you could enable debug logs. Are all daemons >> running the same ceph (minor) version? I remember a customer case >> where different ceph minor versions (but overall Octopus) caused >> damaged PGs, a repair fixed them everytime. After they updated all >> daemons to the same minor version those errors were gone. >> >> Regards, >> Eugen >> >> Zitat von Christian Theune : >> >>> Hi, >>> >>> this is a bit older cluster (Nautilus, bluestore only). >>> >>> We’ve noticed that the cluster is almost continuously repairing PGs. >>> However, they all finish successfully with “0 fixed”. We do not see >>> the trigger why Ceph decides to repair the PGs and it’s happening >>> for a lot of PGs, not any specific individual one. >>> >>> Deep-scrubs are generally running, but currently a bit late as we >>> had some recoveries in the last week. >>> >>> Logs look regular aside from the number of repairs. Here’s the last >>> weeks from the perspective of a single PG. There’s one repair, but >>> the same thing seems to happen for all PGs. 
>>> >>> 2023-08-06 16:08:17.870 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-06 16:08:18.270 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-07 21:52:22.299 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-07 21:52:22.711 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-09 00:33:42.587 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-09 00:33:43.049 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-10 09:36:00.590 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 deep-scrub starts >>> 2023-08-10 09:36:28.811 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 deep-scrub ok >>> 2023-08-11 12:59:14.219 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-11 12:59:14.567 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-12 13:52:44.073 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-12 13:52:44.483 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-14 01:51:04.774 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 deep-scrub starts >>> 2023-08-14 01:51:33.113 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 deep-scrub ok >>> 2023-08-15 05:18:16.093 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-15 05:18:16.520 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-16 09:47:38.520 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-16 09:47:38.930 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-17 19:25:45.352 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-17 19:25:45.775 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-19 05:40:43.663 7fc49b1de640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub starts >>> 2023-08-19 05:40:44.073 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scrub ok >>> 2023-08-20 12:06:54.343 7fc49f1e6640 0 log_channel(cluster) log >>> [DBG] : 278.2f3 scr
[ceph-users] Re: Contionuous spurious repairs without cause?
Hi, thanks for the hint. We’re definitely running exact same binaries for all. :) > On 5. Sep 2023, at 16:14, Eugen Block wrote: > > Hi, > > it sounds like you have auto-repair enabled (osd_scrub_auto_repair). I guess > you could disable that to see what's going on with the PGs and their > replicas. And/or you could enable debug logs. Are all daemons running the > same ceph (minor) version? I remember a customer case where different ceph > minor versions (but overall Octopus) caused damaged PGs, a repair fixed them > everytime. After they updated all daemons to the same minor version those > errors were gone. > > Regards, > Eugen > > Zitat von Christian Theune : > >> Hi, >> >> this is a bit older cluster (Nautilus, bluestore only). >> >> We’ve noticed that the cluster is almost continuously repairing PGs. >> However, they all finish successfully with “0 fixed”. We do not see the >> trigger why Ceph decides to repair the PGs and it’s happening for a lot of >> PGs, not any specific individual one. >> >> Deep-scrubs are generally running, but currently a bit late as we had some >> recoveries in the last week. >> >> Logs look regular aside from the number of repairs. Here’s the last weeks >> from the perspective of a single PG. There’s one repair, but the same thing >> seems to happen for all PGs. >> >> 2023-08-06 16:08:17.870 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-06 16:08:18.270 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-07 21:52:22.299 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-07 21:52:22.711 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-09 00:33:42.587 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-09 00:33:43.049 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-10 09:36:00.590 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub starts >> 2023-08-10 09:36:28.811 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub ok >> 2023-08-11 12:59:14.219 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-11 12:59:14.567 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-12 13:52:44.073 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-12 13:52:44.483 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-14 01:51:04.774 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub starts >> 2023-08-14 01:51:33.113 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub ok >> 2023-08-15 05:18:16.093 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-15 05:18:16.520 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-16 09:47:38.520 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-16 09:47:38.930 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-17 19:25:45.352 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-17 19:25:45.775 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-19 05:40:43.663 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-19 05:40:44.073 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-20 12:06:54.343 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 
278.2f3 scrub starts >> 2023-08-20 12:06:54.809 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-21 19:23:10.801 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub starts >> 2023-08-21 19:23:39.936 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub ok >> 2023-08-23 03:43:21.391 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08-23 03:43:21.844 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub ok >> 2023-08-24 04:21:17.004 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub starts >> 2023-08-24 04:21:47.972 7fc49f1e6640 0 log_channel(cluster) log [DBG] : >> 278.2f3 deep-scrub ok >> 2023-08-25 06:55:13.588 7fc49b1de640 0 log_channel(cluster) log [DBG] : >> 278.2f3 scrub starts >> 2023-08
[ceph-users] Contionuous spurious repairs without cause?
) log [DBG] : 278.2f3 scrub starts
2023-09-04 03:16:15.295 7f37ca268640 0 log_channel(cluster) log [DBG] : 278.2f3 scrub ok
2023-09-05 14:50:36.064 7f37ca268640 0 log_channel(cluster) log [DBG] : 278.2f3 repair starts
2023-09-05 14:51:04.407 7f37c6260640 0 log_channel(cluster) log [DBG] : 278.2f3 repair ok, 0 fixed

Googling didn't help, unfortunately, and the bug tracker doesn't appear to have any relevant issue either. Any ideas?

Kind regards,
Christian Theune

--
Christian Theune · c...@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Can ceph-volume manage the LVs optionally used for DB / WAL at all?
On 25.08.23 09:09, Eugen Block wrote: I'm still not sure if we're on the same page.

Maybe not, I'll respond inline to clarify.

By looking at https://docs.ceph.com/en/latest/man/8/ceph-volume/#cmdoption-ceph-volume-lvm-prepare-block.db it seems that ceph-volume wants an LV or partition. So it's apparently not just taking a VG itself? Also, if there were multiple VGs / devices, I likely would need to at least pick those.

ceph-volume creates all required VGs/LVs automatically, and the OSD creation happens in batch mode, for example when run by cephadm:

    ceph-volume lvm batch --yes /dev/sdb /dev/sdc /dev/sdd

In a non-cephadm deployment you can fiddle with ceph-volume manually, where you also can deploy single OSDs, with or without providing your own pre-built VGs/LVs. In a cephadm deployment manually creating OSDs will result in "stray daemons not managed by cephadm" warnings.

1) I am mostly asking about a non-cephadm environment and would just like to know if ceph-volume can also manage the VG of a DB/WAL device that is used for multiple OSDs and create the individual LVs which are used for DB or WAL when creating a single OSD. Below you give an example "before we upgraded to Pacific" in which you run lvcreate manually. Is that not required anymore with >= Quincy?

2) Even with cephadm there are the "db_devices" as part of the drivegroups. But the question remains if cephadm can use a single db_device for multiple OSDs.

Before we upgraded to Pacific we did manage our block.db devices manually with pre-built LVs, e.g.:

    $ lvcreate -L 30G -n bluefsdb-30 ceph-journals
    $ ceph-volume lvm create --data /dev/sdh --block.db ceph-journals/bluefsdb-30

As asked and explained in the paragraph above, this is what I am currently doing (lvcreate + ceph-volume lvm create). My question therefore is whether ceph-volume (!) could somehow create this LV for the DB automagically if I'd just give it a device (or existing VG)?

Thank you very much for your patience in clarifying and responding to my questions.

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
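Regarding 1) and 2) above: at least for the initial creation, ceph-volume's batch mode can manage a shared DB device itself, i.e. create the VG and one block.db LV per OSD on it. A sketch for a non-cephadm setup (device paths are examples; note this handles several OSDs in one call rather than one OSD at a time):

    # show the plan first, then actually create the OSDs
    ceph-volume lvm batch --report /dev/sda /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1
    ceph-volume lvm batch --yes /dev/sda /dev/sdb /dev/sdc /dev/sdd --db-devices /dev/nvme0n1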
[ceph-users] Re: Can ceph-volume manage the LVs optionally used for DB / WAL at all?
On 11.08.23 16:06, Eugen Block wrote: if you deploy OSDs from scratch you don't have to create LVs manually, that is handled entirely by ceph-volume (for example on cephadm based clusters you only provide a drivegroup definition). By looking at https://docs.ceph.com/en/latest/man/8/ceph-volume/#cmdoption-ceph-volume-lvm-prepare-block.db it seems that ceph-volume wants an LV or partition. So it's apparently not just taking a VG itself? Also if there were multiple VGs / devices , I likely would need to at least pick those. But I suppose this orchestration would then require cephadm (https://docs.ceph.com/en/latest/cephadm/services/osd/#drivegroups) and cannot be done via ceph-volume which merely takes care of ONE OSD at a time. I'm not sure if automating db/wal migration has been considered, it might be (too) difficult. But moving the db/wal devices to new/different devices doesn't seem to be a reoccuring issue (corner case?), so maybe having control over that process for each OSD individually is the safe(r) option in case something goes wrong. Sorry for the confusion. I was not talking about any migrations, just the initial creation of spinning rust OSDs with DB or WAL on fast storage. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] When to use the auth profiles simple-rados-client and profile simple-rados-client-with-blocklist?
Hey ceph-users,

1) When configuring Gnocchi to use Ceph storage (see https://gnocchi.osci.io/install.html#ceph-requirements) I was wondering if one could use any of the auth profiles like
* simple-rados-client
* simple-rados-client-with-blocklist
Or are those for different use cases?

2) I was also wondering why the documentation mentions "(Monitor only)" but then says "Gives a user read-only permissions for monitor, OSD, and PG data."?

3) And are those profiles really for "read-only" users? Why don't they have "read-only" in their name, like the rbd profile and its corresponding "rbd-read-only" profile?

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
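For context, the profiles in question are monitor cap profiles; using one for a Gnocchi client key would look roughly like this (the client name and pool are assumptions, not something the Gnocchi docs prescribe):

    ceph auth get-or-create client.gnocchi \
        mon 'profile simple-rados-client-with-blocklist' \
        osd 'allow rwx pool=gnocchi'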
[ceph-users] Can ceph-volume manage the LVs optionally used for DB / WAL at all?
Hey ceph-users,

I was wondering if ceph-volume does anything in regards to the management (creation, setting metadata, ...) of LVs which are used for DB / WAL of an OSD? Reading the documentation at https://docs.ceph.com/en/latest/man/8/ceph-volume/#new-db it seems to indicate that the LV to be used as e.g. DB needs to be created manually (without ceph-volume) and exist prior to using ceph-volume to move the DB to that LV? I suppose the same is true for "ceph-volume lvm create" or "ceph-volume lvm prepare" and "--block.db".

It's not that creating a few LVs is hard... it's just that ceph-volume does apply some structure to the naming of LVM VGs and LVs on the OSD device and also adds metadata. That would then be up to the user, right?

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph-volume lvm new-db fails
On 10/08/2023 13:30, Christian Rohmann wrote:
> It's already fixed in master, but the backports are all still pending ...

There are PRs for the backports now:
* https://tracker.ceph.com/issues/62060
* https://tracker.ceph.com/issues/62061
* https://tracker.ceph.com/issues/62062

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph-volume lvm new-db fails
On 11/05/2022 23:21, Joost Nieuwenhuijse wrote:
> After a reboot the OSD turned out to be corrupt. Not sure if ceph-volume lvm new-db caused the problem, or failed because of another problem.

I just ran into the same issue trying to add a DB to an existing OSD. Apparently this is a known bug: https://tracker.ceph.com/issues/55260

It's already fixed in master, but the backports are all still pending ...

Regards
Christian
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Not all Bucket Shards being used
> Thank you for the information, Christian. When you reshard the bucket id is > updated (with most recent versions of ceph, a generation number is > incremented). The first bucket id matches the bucket marker, but after the > first reshard they diverge. This makes a lot of sense and explains why the large omap objects do not go away. It is the old shards that are too big. > The bucket id is in the names of the currently used bucket index shards. > You’re searching for the marker, which means you’re finding older bucket > index shards. > > Change your commands to these: > > # rados -p raum.rgw.buckets.index ls \ >|grep 3caabb9a-4e3b-4b8a-8222-34c33dd63210.10648356.1 \ >|sort -V > > # rados -p raum.rgw.buckets.index ls \ >|grep 3caabb9a-4e3b-4b8a-8222-34c33dd63210.10648356.1 \ >|sort -V \ >|xargs -IOMAP sh -c \ >'rados -p raum.rgw.buckets.index listomapkeys OMAP | wc -l' I don't think the outputs are very interesting here. They are as expected: - 131 lines of rados objects (omap) - each omap contains about 70k keys (below the 100k limit). > When you refer to the “second zone”, what do you mean? Is this cluster using > multisite? If and only if your answer is “no”, then it’s safe to remove old > bucket index shards. Depending on the version of ceph running when reshard > was run, they were either intentionally left behind (earlier behavior) or > removed automatically (later behavior). Yes, this cluster uses multisite. It is one realm, one zonegroup with two zones (bidirectional sync). Ceph warns about resharding on the non-metadata zone. So I did not do that and only resharded on the metadata zone. The resharding was done using a radosgw-admin v16.2.6 on a ceph cluster running v17.2.5. Is there a way to get rid of the old (big) shards without breaking something? Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
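For completeness, what I found so far for at least listing the leftover index objects is the stale-instances machinery (read-only in this form; there is also a 'stale-instances rm' subcommand, but the documentation flags that as not intended for multisite setups, which is why I have not touched it):
# radosgw-admin reshard stale-instances list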
[ceph-users] Re: Not all Bucket Shards being used
Hi Eric, > 1. I recommend that you *not* issue another bucket reshard until you figure > out what’s going on. Thanks, noted! > 2. Which version of Ceph are you using? 17.2.5 I wanted to get the Cluster to Health OK before upgrading. I didn't see anything that led me to believe that an upgrade could fix the reshard issue. > 3. Can you issue a `radosgw-admin metadata get bucket:<bucket-name>` so we > can verify what the current marker is? # radosgw-admin metadata get bucket:sql20 { "key": "bucket:sql20", "ver": { "tag": "_hGhtgzjcWY9rO9JP7YlWzt8", "ver": 3 }, "mtime": "2023-07-12T15:56:55.226784Z", "data": { "bucket": { "name": "sql20", "marker": "3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9", "bucket_id": "3caabb9a-4e3b-4b8a-8222-34c33dd63210.10648356.1", "tenant": "", "explicit_placement": { "data_pool": "", "data_extra_pool": "", "index_pool": "" } }, "owner": "S3user", "creation_time": "2023-04-26T09:22:01.681646Z", "linked": "true", "has_bucket_info": "false" } } > 4. After you resharded previously, did you get command-line output along the > lines of: > 2023-07-24T13:33:50.867-0400 7f10359f2a80 1 execute INFO: reshard of bucket > “<bucket-name>” completed successfully I think so, at least for the second reshard. But I wouldn't bet my life on it. I fear I might have missed an error on the first one since I have done a radosgw-admin bucket reshard so often and never seen it fail. Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Not all Bucket Shards being used
Hi, I have trouble with large OMAP objects in a cluster in the RGW index pool. Some background information about the cluster: There is CephFS and RBD usage on the main cluster but for this issue I think only S3 is interesting. There is one realm, one zonegroup with two zones which have a bidirectional sync set up. Since this does not allow for autoresharding we have to do it by hand in this cluster – looking forward to Reef! From the logs: cluster 2023-07-17T22:59:03.018722+ osd.75 (osd.75) 623978 : cluster [WRN] Large omap object found. Object: 34:bcec3016:::.dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.5:head PG: 34.680c373d (34.5) Key count: 962091 Size (bytes): 277963182 The offending bucket looks like this: # radosgw-admin bucket stats \ | jq '.[] | select(.marker =="3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9") |"\(.num_shards) \(.usage["rgw.main"].num_objects)"' -r 131 9463833 Last week the number of objects was about 12 million. Which is why I resharded the offending bucket twice, I think. Once to 129 and the second time to 131 because I wanted some leeway (or lieway? scnr, Sage). Unfortunately, even after a week the objects were still too big (the log line above is quite recent), so I looked into it again. # rados -p raum.rgw.buckets.index ls \ |grep .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9 \ |sort -V .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.0 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.1 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.2 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.3 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.4 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.5 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.6 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.7 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.8 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.9 .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9.10 # rados -p raum.rgw.buckets.index ls \ |grep .dir.3caabb9a-4e3b-4b8a-8222-34c33dd63210.10610190.9 \ |sort -V \ |xargs -IOMAP sh -c \ 'rados -p raum.rgw.buckets.index listomapkeys OMAP | wc -l' 1013854 1011007 1012287 1011232 1013565 998262 1012777 1012713 1012230 1010690 997111 Apparently, only 11 shards are in use. This would explain why the "Key count" (from the log line) is about ten times higher than I would expect. How can I deal with this issue? One thing I could try to fix this would be to reshard to a lower number, but I am not sure if there are any risks associated with "downsharding". After that I could reshard to something like 97. Or I could directly "downshard" to 97. Also, the second zone has a similar problem, but as the error message lets me know, this would be a bad idea. Will it just take more time until the sharding is transferred to the second zone? Best, Christian Kugler ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Adding datacenter level to CRUSH tree causes rebalancing
Based on my understanding of CRUSH it basically works down the hierarchy and then randomly (but deterministically for a given CRUSH map) picks buckets (based on the specific selection rule) on that level for the object and then it does this recursively until it ends up at the leaf nodes. Given that you introduced a whole hierarchy level just below the top, objects will now be distributed differently since the pseudo-random hash-based selection strategy may now for example put an object that used to be in node-4 under FSN-DC16 instead So basically when you fiddle with the hierarchy you can generally expect lots of data movement everywhere downstream of your change. On Sun, 16 Jul 2023 at 06:03, Niklas Hambüchen wrote: > Hi Ceph users, > > I have a Ceph 16.2.7 cluster that so far has been replicated over the > `host` failure domain. > All `hosts` have been chosen to be in different `datacenter`s, so that was > sufficient. > > Now I wish to add more hosts, including some in already-used data centers, > so I'm planning to use CRUSH's `datacenter` failure domain instead. > > My problem is that when I add the `datacenter`s into the CRUSH tree, Ceph > decides that it should now rebalance the entire cluster. > This seems unnecessary, and wrong. > > Before, `ceph osd tree` (some OSDs omitted for legibility): > > > ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT > PRI-AFF > -1 440.73514 root default > -3 146.43625 host node-4 >2hdd 14.61089 osd.2up 1.0 > 1.0 >3hdd 14.61089 osd.3up 1.0 > 1.0 > -7 146.43625 host node-5 > 14hdd 14.61089 osd.14 up 1.0 > 1.0 > 15hdd 14.61089 osd.15 up 1.0 > 1.0 > -10 146.43625 host node-6 > 26hdd 14.61089 osd.26 up 1.0 > 1.0 > 27hdd 14.61089 osd.27 up 1.0 > 1.0 > > > After assigning of `datacenter` crush buckets: > > > ID CLASS WEIGHT TYPE NAMESTATUS REWEIGHT > PRI-AFF > -1 440.73514 root default > -18 146.43625 datacenter FSN-DC16 > -7 146.43625 host node-5 > 14hdd 14.61089 osd.14 up 1.0 > 1.0 > 15hdd 14.61089 osd.15 up 1.0 > 1.0 > -17 146.43625 datacenter FSN-DC18 > -10 146.43625 host node-6 > 26hdd 14.61089 osd.26 up 1.0 > 1.0 > 27hdd 14.61089 osd.27 up 1.0 > 1.0 > -16 146.43625 datacenter FSN-DC4 > -3 146.43625 host node-4 >2hdd 14.61089 osd.2up 1.0 > 1.0 >3hdd 14.61089 osd.3up 1.0 > 1.0 > > > This shows that the tree is essentially unchanged, it just "gained a > level". > > In `ceph status` I now get: > > pgs: 1167541260/1595506041 objects misplaced (73.177%) > > If I remove the `datacenter` level again, then the misplacement disappears. > > On a minimal testing cluster, this misplacement issue did not appear. > > Why does Ceph think that these objects are misplaced when I add the > datacenter level? > Is there a more correct way to do this? > > > Thanks! > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
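If you want to gauge the impact before committing such a change, one way (a sketch; pool id and file names are placeholders) is to feed the edited CRUSH map into osdmaptool and compare the resulting PG mappings offline:
# ceph osd getmap -o osdmap.bin
# osdmaptool osdmap.bin --test-map-pgs-dump --pool 1 > mappings-before.txt
# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.txt     (add the datacenter buckets in the text file)
# crushtool -c crush.txt -o crush-new.bin
# cp osdmap.bin osdmap-new.bin
# osdmaptool osdmap-new.bin --import-crush crush-new.bin
# osdmaptool osdmap-new.bin --test-map-pgs-dump --pool 1 > mappings-after.txt
# diff mappings-before.txt mappings-after.txt | wc -l
A large diff tells you up front roughly how many PG mappings would change, without touching the live cluster.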
[ceph-users] Re: RGW accessing real source IP address of a client (e.g. in S3 bucket policies)
Hey Casey, all, On 16/06/2023 17:00, Casey Bodley wrote: But when applying a bucket policy with aws:SourceIp it seems to only work if I set the internal IP of the HAProxy instance, not the public IP of the client. So the actual remote address is NOT used in my case. Did I miss any config setting anywhere? your 'rgw remote addr param' config looks right. with that same config, i was able to set a bucket policy that denied access based on I found the issue. Embarrassingly it was simply a NAT-Hairpin which was applied to the traffic from the server I was testing with. In short: Even though I targeted the public IP from the HAProxy instance the internal IP address of my test server was maintained as source since both machines are on the same network segment. That is why I first thought the LB IP was applied to the policy, but not the actual public source IP of the client. In reality it was simply the private, RFC1918, IP of the test machine that came in as source. Sorry for the noise and thanks for your help. Christian P.S. With IPv6, this would not have happened. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW multisite logs (data, md, bilog) not being trimmed automatically?
There was a similar issue reported at https://tracker.ceph.com/issues/48103 and yet another ML post at https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/5LGXQINAJBIGFUZP5WEINVHNPBJEV5X7 May I second the question whether it's safe to run radosgw-admin autotrim on those logs? If so, why is that required, and why does there seem to be no periodic trimming happening? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Bluestore compression - Which algo to choose? Zstd really still that bad?
Hey Igor, On 27/06/2023 12:06, Igor Fedotov wrote: I can't say anything about your primary question on zstd benefits/drawbacks but I'd like to emphasize that compression ratio at BlueStore is (to a major degree) determined by the input data flow characteristics (primarily write block size), object store allocation unit size (bluestore_min_alloc_size) and some parameters (e.g. maximum blob size) that determine how input data chunks are logically split when landing on disk. E.g. if one has min_alloc_size set to 4K and write block size is in (4K-8K] then resulting compressed block would never be less than 4K. Hence compression ratio is never more than 2. Similarly if min_alloc_size is 64K there would be no benefit in compression at all for the above input since target allocation units are always larger than input blocks. The rationale of the above behavior is that compression is applied exclusively on input blocks - there is no additional processing to merge input and existing data and compress them all together. Thanks for the emphasis on input data and its block-size. Yes, that is certainly the most important factor for the compression efficiency and the choice of a suitable algorithm for a certain use-case. In my case the pool is RBD only, so (by default) the blocks are 4M if I am not mistaken. I also understand that even though larger blocks generally compress better, there is no relation between different blocks in regard to compression dictionaries (going along the lines of de-duplication). In the end in my use-case it boils down to the type of data stored on the RBD images and how compressible that might be. But since those blocks are only written once, I am ready to invest more CPU cycles to reduce the size on disk. I am simply looking for data others might have collected on their similar use-cases. Also I am still wondering if there really is nobody that has worked/played more with zstd since that has become so popular in recent months... Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
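For reference, the knobs Igor mentions can be checked like this (keep in mind that min_alloc_size is baked in when an OSD is created, so the config value only reflects what newly created OSDs would get):
# ceph config get osd bluestore_min_alloc_size_hdd
# ceph config get osd bluestore_compression_max_blob_size_hdd
# ceph config get osd bluestore_compression_required_ratio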
[ceph-users] RGW multisite logs (data, md, bilog) not being trimmed automatically?
Hey ceph-users, I am running two (now) Quincy clusters doing RGW multi-site replication with only one actually being written to by clients. The other site is intended simply as a remote copy. On the primary cluster I am observing an ever-growing (objects and bytes) "sitea.rgw.log" pool, not so on the remote "siteb.rgw.log" which is only 300MiB and around 15k objects with no growth. Metrics show that the growth of the pool on the primary has been linear for at least 6 months, so no sudden spikes or anything. Also sync status appears to be totally happy. There are also no warnings in regards to large OMAPs or anything similar. I was under the impression that RGW will trim its three logs (md, bi, data) automatically and only keep data that has not yet been replicated by the other zonegroup members? The config option "ceph config get mgr rgw_sync_log_trim_interval" is set to 1200, so 20 minutes. So I am wondering if there might be some inconsistency or how I can best analyze what the cause for the accumulation of log data is? There are older questions on the ML, such as [1], but there was not really a solution or root cause identified. I know there is manual trimming, but I'd rather analyze the current situation and figure out what the cause for the lack of auto-trimming is. * Do I need to go through all buckets and count logs and look at their timestamps? Which queries make sense here? * Is there usually any logging of the log trimming activity that I should expect? Or that might indicate why trimming does not happen? Regards Christian [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/WZCFOAMLWV3XCGJ3TVLHGMJFVYNZNKLD/ ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
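For completeness, the obvious starting points I know of so far (the pool name is the one from our setup; the last command just gives a rough breakdown of which log type the objects in the pool belong to):
# radosgw-admin sync status
# radosgw-admin datalog status
# radosgw-admin mdlog status
# rados -p sitea.rgw.log ls | awk -F. '{print $1}' | sort | uniq -c | sort -rn | head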
[ceph-users] Re: Radogw ignoring HTTP_X_FORWARDED_FOR header
Hello Yosr, On 26/06/2023 11:41, Yosr Kchaou wrote: We are facing an issue with getting the right value for the header HTTP_X_FORWARDED_FOR when getting client requests. We need this value to do the source ip check validation. [...] Currently, RGW sees that all requests come from 127.0.0.1. So it is still considering the nginx ip address and not the client who made the request. May I point you to my recent post to this ML about this very question: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/IKGLAROSVWHSRZQSYTLLHVRWFPOLBEGL/ I am still planning to reproduce this issue with simple examples and headers set manually via e.g. curl to rule out anything stupid I might have misconfigured in my case. I just did not find the time yet. But did you sniff any traffic to the backend or verify what the headers look like in your case? Any debug logging "debug rgw = 20" where you can see what RGW thinks of the incoming request? Did you test with S3 bucket policies or how did you come to the conclusion that RGW is not using the X_FORWARDED_FOR header? Or what is your indication that things are not working as expected? From what I can see, the rgw client log does NOT print the external IP from the header, but the source IP of the incoming TCP connection: 2023-06-26T11:14:37.070+ 7f0389e0b700 1 beast: 0x7f051c776660: 192.168.1.1 - someid [26/Jun/2023:11:14:36.990 +] "PUT /bucket/object HTTP/1.1" 200 43248 - "aws-sdk-go/1.27.0 (go1.16.15; linux; amd64) S3Manager" - latency=0.07469s while the rgw ops log does indeed print the remote_address in remote_addr: {"bucket":"bucket","time":"2023-06-26T11:16:08.721465Z","time_local":"2023-06-26T11:16:08.721465+","remote_addr":"xxx.xxx.xxx.xxx","user":"someuser","operation":"put_obj","uri":"PUT /bucket/object HTTP/1.1","http_status":"200","error_code":"","bytes_sent":0,"bytes_received":64413,"object_size":64413,"total_time":155,"user_agent":"aws-sdk-go/1.27.0 (go1.16.15; linux; amd64) S3Manager","referrer":"","trans_id":"REDACTED","authentication_type":"Keystone","access_key_id":"REDACTED","temp_url":false} So in my case it's not that RGW does not receive and log this info, but more about it not applying this in a bucket policy (as far as my analysis of the issue goes). Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
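The manual test I have in mind is something along these lines (IP, host and object names are placeholders), then checking both the 'debug rgw = 20' output and the ops log for which address ends up being used:
# curl -s -o /dev/null -D - -H 'X-Forwarded-For: 203.0.113.42' http://rgw-backend.internal:8080/somebucket/someobject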
[ceph-users] Bluestore compression - Which algo to choose? Zstd really still that bad?
Hey ceph-users, we've been using the default "snappy" to have Ceph compress data on certain pools - namely backups / copies of volumes of a VM environment. So it's write once, and no random access. I am now wondering if switching to another algo (there is snappy, zlib, lz4, or zstd) would improve the compression ratio (significantly)? * Does anybody have any real world data on snappy vs. $anyother? Using zstd is tempting as it's used in various other applications (btrfs, MongoDB, ...) for inline-compression with great success. For Ceph though there is still a warning in the docs ([1]) about it not being recommended. But I am wondering if this still stands with e.g. [2] merged. And there was [3] trying to improve the performance, but this reads as if it only led to a dead-end and no code changes? In any case does anybody have any numbers to help with the decision on the compression algo? Regards Christian [1] https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#confval-bluestore_compression_algorithm [2] https://github.com/ceph/ceph/pull/33790 [3] https://github.com/facebook/zstd/issues/910 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
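For completeness, switching the algorithm per pool and checking the outcome is straightforward (pool name is just an example; the counters come from the BlueStore section of the OSD perf dump):
# ceph osd pool set backups compression_algorithm zstd
# ceph osd pool set backups compression_mode aggressive
# ceph df detail     (compare the USED COMPR / UNDER COMPR columns over time)
# ceph tell osd.0 perf dump | grep -E 'compress_success_count|bluestore_compressed'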
[ceph-users] ceph quincy repo update to debian bookworm...?
Hi ceph users/maintainers, I installed ceph quincy on debian bullseye as a ceph client and now want to update to bookworm. I see that there is at the moment only bullseye supported. https://download.ceph.com/debian-quincy/dists/bullseye/ Will there be an update of deb https://download.ceph.com/debian-quincy/ bullseye main to deb https://download.ceph.com/debian-quincy/ bookworm main in the near future? Regards, Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake
Aaaand another dead end: there is too much meta-data involved (bucket and object ACLs, lifecycle, policy, …) that won’t be possible to perfectly migrate. Also, lifecycles _might_ be affected if mtimes change. So, I’m going to try and go back to a single-cluster multi-zone setup. For that I’m going to change all buckets with explicit placements to remove the explicit placement markers (those were created from old versions of Ceph and weren’t intentional by us, they perfectly reflect the default placement configuration). Here’s the patch I’m going to try on top of our Nautilus branch now: https://github.com/flyingcircusio/ceph/commit/b3a317987e50f089efc4e9694cf6e3d5d9c23bd5 All our buckets with explicit placements conform perfectly to the default placement, so this seems safe. Otherwise Zone migration was perfect until I noticed the objects with explicit placements in our staging and production clusters. (The dev cluster seems to have been purged intermediately, so this wasn’t noticed). I’m actually wondering whether explicit placements are really a sensible thing to have, even in multi-cluster multi-zone setups. AFAICT due to realms you might end up with different zonegroups referring to the same pools and this should only run through proper abstractions … o_O Cheers, Christian > On 14. Jun 2023, at 17:42, Christian Theune wrote: > > Hi, > > further note to self and for posterity … ;) > > This turned out to be a no-go as well, because you can’t silently switch the > pools to a different storage class: the objects will be found, but the index > still refers to the old storage class and lifecycle migrations won’t work. > > I’ve brainstormed for further options and it appears that the last resort is > to use placement targets and copy the buckets explicitly - twice, because on > Nautilus I don’t have renames available, yet. :( > > This will require temporary downtimes prohibiting users to access their > bucket. Fortunately we only have a few very large buckets (200T+) that will > take a while to copy. We can pre-sync them of course, so the downtime will > only be during the second copy. > > Christian > >> On 13. Jun 2023, at 14:52, Christian Theune wrote: >> >> Following up to myself and for posterity: >> >> I’m going to try to perform a switch here using (temporary) storage classes >> and renaming of the pools to ensure that I can quickly change the STANDARD >> class to a better EC pool and have new objects located there. After that >> we’ll add (temporary) lifecycle rules to all buckets to ensure their objects >> will be migrated to the STANDARD class. >> >> Once that is finished we should be able to delete the old pool and the >> temporary storage class. >> >> First tests appear successfull, but I’m a bit struggling to get the bucket >> rules working (apparently 0 days isn’t a real rule … and the debug interval >> setting causes high frequent LC runs but doesn’t seem move objects just yet. >> I’ll play around with that setting a bit more, though, I think I might have >> tripped something that only wants to process objects every so often and on >> an interval of 10 a day is still 2.4 hours … >> >> Cheers, >> Christian >> >>> On 9. Jun 2023, at 11:16, Christian Theune wrote: >>> >>> Hi, >>> >>> we are running a cluster that has been alive for a long time and we tread >>> carefully regarding updates. We are still a bit lagging and our cluster >>> (that started around Firefly) is currently at Nautilus. 
We’re updating and >>> we know we’re still behind, but we do keep running into challenges along >>> the way that typically are still unfixed on main and - as I started with - >>> have to tread carefully. >>> >>> Nevertheless, mistakes happen, and we found ourselves in this situation: we >>> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, >>> m=3, with 17 hosts) but when doing the EC profile selection we missed that >>> our hosts are not evenly balanced (this is a growing cluster and some >>> machines have around 20TiB capacity for the RGW data pool, wheres newer >>> machines have around 160TiB and we rather should have gone with k=4, m=3. >>> In any case, having 13 chunks causes too many hosts to participate in each >>> object. Going for k+m=7 will allow distribution to be more effective as we >>> have 7 hosts that have the 160TiB sizing. >>> >>> Our original migration used the “cache tiering” approach, but that only >>> works once when moving from replicated to EC and can not be used for >>> further
[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake
What got lost is that I need to change the pool’s m/k parameters, which is only possible by creating a new pool and moving all data from the old pool. Changing the crush rule doesn’t allow you to do that. > On 16. Jun 2023, at 23:32, Nino Kotur wrote: > > If you create new crush rule for ssd/nvme/hdd and attach it to existing pool > you should be able to do the migration seamlessly while everything is > online... However impact to user will depend on storage devices load and > network utilization as it will create chaos on cluster network. > > Or did i get something wrong? > > > > > Kind regards, > Nino > > > On Wed, Jun 14, 2023 at 5:44 PM Christian Theune wrote: > Hi, > > further note to self and for posterity … ;) > > This turned out to be a no-go as well, because you can’t silently switch the > pools to a different storage class: the objects will be found, but the index > still refers to the old storage class and lifecycle migrations won’t work. > > I’ve brainstormed for further options and it appears that the last resort is > to use placement targets and copy the buckets explicitly - twice, because on > Nautilus I don’t have renames available, yet. :( > > This will require temporary downtimes prohibiting users to access their > bucket. Fortunately we only have a few very large buckets (200T+) that will > take a while to copy. We can pre-sync them of course, so the downtime will > only be during the second copy. > > Christian > > > On 13. Jun 2023, at 14:52, Christian Theune wrote: > > > > Following up to myself and for posterity: > > > > I’m going to try to perform a switch here using (temporary) storage classes > > and renaming of the pools to ensure that I can quickly change the STANDARD > > class to a better EC pool and have new objects located there. After that > > we’ll add (temporary) lifecycle rules to all buckets to ensure their > > objects will be migrated to the STANDARD class. > > > > Once that is finished we should be able to delete the old pool and the > > temporary storage class. > > > > First tests appear successfull, but I’m a bit struggling to get the bucket > > rules working (apparently 0 days isn’t a real rule … and the debug interval > > setting causes high frequent LC runs but doesn’t seem move objects just > > yet. I’ll play around with that setting a bit more, though, I think I might > > have tripped something that only wants to process objects every so often > > and on an interval of 10 a day is still 2.4 hours … > > > > Cheers, > > Christian > > > >> On 9. Jun 2023, at 11:16, Christian Theune wrote: > >> > >> Hi, > >> > >> we are running a cluster that has been alive for a long time and we tread > >> carefully regarding updates. We are still a bit lagging and our cluster > >> (that started around Firefly) is currently at Nautilus. We’re updating and > >> we know we’re still behind, but we do keep running into challenges along > >> the way that typically are still unfixed on main and - as I started with - > >> have to tread carefully. > >> > >> Nevertheless, mistakes happen, and we found ourselves in this situation: > >> we converted our RGW data pool from replicated (n=3) to erasure coded > >> (k=10, m=3, with 17 hosts) but when doing the EC profile selection we > >> missed that our hosts are not evenly balanced (this is a growing cluster > >> and some machines have around 20TiB capacity for the RGW data pool, wheres > >> newer machines have around 160TiB and we rather should have gone with k=4, > >> m=3. 
In any case, having 13 chunks causes too many hosts to participate > >> in each object. Going for k+m=7 will allow distribution to be more > >> effective as we have 7 hosts that have the 160TiB sizing. > >> > >> Our original migration used the “cache tiering” approach, but that only > >> works once when moving from replicated to EC and can not be used for > >> further migrations. > >> > >> The amount of data is at 215TiB somewhat significant, so using an approach > >> that scales when copying data[1] to avoid ending up with months of > >> migration. > >> > >> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on > >> a rados/pool level) and I guess we can only fix this on an application > >> level using multi-zone replication. > >> > >> I have the setup nailed in general, but I’m running into issues with > >> buckets in ou
[ceph-users] Re: RGW accessing real source IP address of a client (e.g. in S3 bucket policies)
On 15/06/2023 15:46, Casey Bodley wrote: * In case of HTTP via headers like "X-Forwarded-For". This is apparently supported only for logging the source in the "rgw ops log" ([1])? Or is this info used also when evaluating the source IP condition within a bucket policy? yes, the aws:SourceIp condition key does use the value from X-Forwarded-For when present I have an HAProxy in front of the RGWs which has "option forwardfor" set to add the "X-Forwarded-For" header. Then the RGWs have "rgw remote addr param = http_x_forwarded_for" set, according to https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_remote_addr_param and I also see remote_addr properly logged within the rgw ops log. But when applying a bucket policy with aws:SourceIp it seems to only work if I set the internal IP of the HAProxy instance, not the public IP of the client. So the actual remote address is NOT used in my case. Did I miss any config setting anywhere? Regards and thanks for your help Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW accessing real source IP address of a client (e.g. in S3 bucket policies)
Hello Ceph-Users, context or motivation of my question is S3 bucket policies and other cases using the source IP address as condition. I was wondering if and how RadosGW is able to access the source IP address of clients if receiving their connections via a loadbalancer / reverse proxy like HAProxy. So naturally that is where the connection originates from in that case, rendering a policy based on IP addresses useless. Depending on whether the connection balanced as HTTP or TCP there are two ways to carry information about the actual source: * In case of HTTP via headers like "X-Forwarded-For". This is apparently supported only for logging the source in the "rgw ops log" ([1])? Or is this info used also when evaluating the source IP condition within a bucket policy? * In case of TCP loadbalancing, there is the proxy protocol v2. This unfortunately seems not even supposed by the BEAST library which RGW uses. I opened feature requests ... ** https://tracker.ceph.com/issues/59422 ** https://github.com/chriskohlhoff/asio/issues/1091 ** https://github.com/boostorg/beast/issues/2484 but there is no outcome yet. Regards Christian [1] https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_remote_addr_param ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
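For illustration, this is the kind of policy I mean (bucket, user and network are placeholders), applied via any S3 client:
# cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/someuser"]},
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::somebucket", "arn:aws:s3:::somebucket/*"],
    "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}
  }]
}
EOF
# aws --endpoint-url https://rgw.example.com s3api put-bucket-policy --bucket somebucket --policy file://policy.json
Whether aws:SourceIp is then matched against the proxy's address or the client's real address is exactly what the two mechanisms above are about.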
[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake
Hi, further note to self and for posterity … ;) This turned out to be a no-go as well, because you can’t silently switch the pools to a different storage class: the objects will be found, but the index still refers to the old storage class and lifecycle migrations won’t work. I’ve brainstormed for further options and it appears that the last resort is to use placement targets and copy the buckets explicitly - twice, because on Nautilus I don’t have renames available, yet. :( This will require temporary downtimes prohibiting users to access their bucket. Fortunately we only have a few very large buckets (200T+) that will take a while to copy. We can pre-sync them of course, so the downtime will only be during the second copy. Christian > On 13. Jun 2023, at 14:52, Christian Theune wrote: > > Following up to myself and for posterity: > > I’m going to try to perform a switch here using (temporary) storage classes > and renaming of the pools to ensure that I can quickly change the STANDARD > class to a better EC pool and have new objects located there. After that > we’ll add (temporary) lifecycle rules to all buckets to ensure their objects > will be migrated to the STANDARD class. > > Once that is finished we should be able to delete the old pool and the > temporary storage class. > > First tests appear successfull, but I’m a bit struggling to get the bucket > rules working (apparently 0 days isn’t a real rule … and the debug interval > setting causes high frequent LC runs but doesn’t seem move objects just yet. > I’ll play around with that setting a bit more, though, I think I might have > tripped something that only wants to process objects every so often and on an > interval of 10 a day is still 2.4 hours … > > Cheers, > Christian > >> On 9. Jun 2023, at 11:16, Christian Theune wrote: >> >> Hi, >> >> we are running a cluster that has been alive for a long time and we tread >> carefully regarding updates. We are still a bit lagging and our cluster >> (that started around Firefly) is currently at Nautilus. We’re updating and >> we know we’re still behind, but we do keep running into challenges along the >> way that typically are still unfixed on main and - as I started with - have >> to tread carefully. >> >> Nevertheless, mistakes happen, and we found ourselves in this situation: we >> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, >> m=3, with 17 hosts) but when doing the EC profile selection we missed that >> our hosts are not evenly balanced (this is a growing cluster and some >> machines have around 20TiB capacity for the RGW data pool, wheres newer >> machines have around 160TiB and we rather should have gone with k=4, m=3. >> In any case, having 13 chunks causes too many hosts to participate in each >> object. Going for k+m=7 will allow distribution to be more effective as we >> have 7 hosts that have the 160TiB sizing. >> >> Our original migration used the “cache tiering” approach, but that only >> works once when moving from replicated to EC and can not be used for further >> migrations. >> >> The amount of data is at 215TiB somewhat significant, so using an approach >> that scales when copying data[1] to avoid ending up with months of migration. >> >> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a >> rados/pool level) and I guess we can only fix this on an application level >> using multi-zone replication. 
>> >> I have the setup nailed in general, but I’m running into issues with buckets >> in our staging and production environment that have `explicit_placement` >> pools attached, AFAICT is this an outdated mechanisms but there are no >> migration tools around. I’ve seen some people talk about patched versions of >> the `radosgw-admin metadata put` variant that (still) prohibits removing >> explicit placements. >> >> AFAICT those explicit placements will be synced to the secondary zone and >> the effect that I’m seeing underpins that theory: the sync runs for a while >> and only a few hundred objects show up in the new zone, as the >> buckets/objects are already found in the old pool that the new zone uses due >> to the explicit placement rule. >> >> I’m currently running out of ideas, but open for any other options. >> >> Looking at >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z >> I’m wondering whether the relevant patch is available somewhere, or whether >> I’ll have to try building that patch again on my own. &
[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake
Following up to myself and for posterity: I’m going to try to perform a switch here using (temporary) storage classes and renaming of the pools to ensure that I can quickly change the STANDARD class to a better EC pool and have new objects located there. After that we’ll add (temporary) lifecycle rules to all buckets to ensure their objects will be migrated to the STANDARD class. Once that is finished we should be able to delete the old pool and the temporary storage class. First tests appear successfull, but I’m a bit struggling to get the bucket rules working (apparently 0 days isn’t a real rule … and the debug interval setting causes high frequent LC runs but doesn’t seem move objects just yet. I’ll play around with that setting a bit more, though, I think I might have tripped something that only wants to process objects every so often and on an interval of 10 a day is still 2.4 hours … Cheers, Christian > On 9. Jun 2023, at 11:16, Christian Theune wrote: > > Hi, > > we are running a cluster that has been alive for a long time and we tread > carefully regarding updates. We are still a bit lagging and our cluster (that > started around Firefly) is currently at Nautilus. We’re updating and we know > we’re still behind, but we do keep running into challenges along the way that > typically are still unfixed on main and - as I started with - have to tread > carefully. > > Nevertheless, mistakes happen, and we found ourselves in this situation: we > converted our RGW data pool from replicated (n=3) to erasure coded (k=10, > m=3, with 17 hosts) but when doing the EC profile selection we missed that > our hosts are not evenly balanced (this is a growing cluster and some > machines have around 20TiB capacity for the RGW data pool, wheres newer > machines have around 160TiB and we rather should have gone with k=4, m=3. In > any case, having 13 chunks causes too many hosts to participate in each > object. Going for k+m=7 will allow distribution to be more effective as we > have 7 hosts that have the 160TiB sizing. > > Our original migration used the “cache tiering” approach, but that only works > once when moving from replicated to EC and can not be used for further > migrations. > > The amount of data is at 215TiB somewhat significant, so using an approach > that scales when copying data[1] to avoid ending up with months of migration. > > I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a > rados/pool level) and I guess we can only fix this on an application level > using multi-zone replication. > > I have the setup nailed in general, but I’m running into issues with buckets > in our staging and production environment that have `explicit_placement` > pools attached, AFAICT is this an outdated mechanisms but there are no > migration tools around. I’ve seen some people talk about patched versions of > the `radosgw-admin metadata put` variant that (still) prohibits removing > explicit placements. > > AFAICT those explicit placements will be synced to the secondary zone and the > effect that I’m seeing underpins that theory: the sync runs for a while and > only a few hundred objects show up in the new zone, as the buckets/objects > are already found in the old pool that the new zone uses due to the explicit > placement rule. > > I’m currently running out of ideas, but open for any other options. 
> > Looking at > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z > I’m wondering whether the relevant patch is available somewhere, or whether > I’ll have to try building that patch again on my own. > > Going through the docs and the code I’m actually wondering whether > `explicit_placement` is actually a really crufty residual piece that won’t > get used in newer clusters but older clusters don’t really have an option to > get away from? > > In my specific case, the placement rules are identical to the explicit > placements that are stored on (apparently older) buckets and the only thing I > need to do is to remove them. I can accept a bit of downtime to avoid any > race conditions if needed, so maybe having a small tool to just remove those > entries while all RGWs are down would be fine. A call to `radosgw-admin > bucket stat` takes about 18s for all buckets in production and I guess that > would be a good comparison for what timing to expect when running an update > on the metadata. > > I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to > other suggestions. > > Hugs, > Christian > > [1] We currently have 215TiB data in 230M objects. Using the “official” > “cache-flush-evict-all” appr
[ceph-users] RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake
Hi, we are running a cluster that has been alive for a long time and we tread carefully regarding updates. We are still a bit lagging and our cluster (that started around Firefly) is currently at Nautilus. We’re updating and we know we’re still behind, but we do keep running into challenges along the way that typically are still unfixed on main and - as I started with - have to tread carefully. Nevertheless, mistakes happen, and we found ourselves in this situation: we converted our RGW data pool from replicated (n=3) to erasure coded (k=10, m=3, with 17 hosts) but when doing the EC profile selection we missed that our hosts are not evenly balanced (this is a growing cluster and some machines have around 20TiB capacity for the RGW data pool, whereas newer machines have around 160TiB) and we rather should have gone with k=4, m=3. In any case, having 13 chunks causes too many hosts to participate in each object. Going for k+m=7 will allow distribution to be more effective as we have 7 hosts that have the 160TiB sizing. Our original migration used the “cache tiering” approach, but that only works once when moving from replicated to EC and cannot be used for further migrations. The amount of data is, at 215TiB, somewhat significant, so we need an approach that scales when copying data[1] to avoid ending up with months of migration. I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a rados/pool level) and I guess we can only fix this on an application level using multi-zone replication. I have the setup nailed in general, but I’m running into issues with buckets in our staging and production environment that have `explicit_placement` pools attached; AFAICT this is an outdated mechanism but there are no migration tools around. I’ve seen some people talk about patched versions of the `radosgw-admin metadata put` variant that (still) prohibits removing explicit placements. AFAICT those explicit placements will be synced to the secondary zone and the effect that I’m seeing underpins that theory: the sync runs for a while and only a few hundred objects show up in the new zone, as the buckets/objects are already found in the old pool that the new zone uses due to the explicit placement rule. I’m currently running out of ideas, but am open to any other options. Looking at https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z I’m wondering whether the relevant patch is available somewhere, or whether I’ll have to try building that patch again on my own. Going through the docs and the code I’m actually wondering whether `explicit_placement` is actually a really crufty residual piece that won’t get used in newer clusters but older clusters don’t really have an option to get away from? In my specific case, the placement rules are identical to the explicit placements that are stored on (apparently older) buckets and the only thing I need to do is to remove them. I can accept a bit of downtime to avoid any race conditions if needed, so maybe having a small tool to just remove those entries while all RGWs are down would be fine. A call to `radosgw-admin bucket stat` takes about 18s for all buckets in production and I guess that would be a good comparison for what timing to expect when running an update on the metadata. I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to other suggestions. Hugs, Christian [1] We currently have 215TiB data in 230M objects. 
Using the “official” “cache-flush-evict-all” approach was unfeasible here as it only yielded around 50MiB/s. Using cache limits and targeting the cache sizes to 0 caused proper parallelization and was able to flush/evict at almost constant 1GiB/s in the cluster. -- Christian Theune · c...@flyingcircus.io · +49 345 219401 0 Flying Circus Internet Operations GmbH · https://flyingcircus.io Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Encryption per user Howto
Hm, this thread is confusing. In the context of S3, client-side encryption means the user is responsible for encrypting the data with their own keys before submitting it. As far as I'm aware, client-side encryption doesn't require any specific server support - it's a function of the client SDK used which provides the convenience of encrypting your data before upload and decrypting it after download - https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingClientSideEncryption.html But you can always encrypt your data and then upload it via RGW, there is nothing anywhere that prevents that since uploaded objects are just a sequence of bytes; metadata won't be encrypted then, though. You can also do server-side encryption by bringing your own keys - https://docs.ceph.com/en/quincy/radosgw/encryption/#customer-provided-keys I suspect you're asking for server-side encryption with keys managed by ceph on a per-user basis? On Tue, 23 May 2023 at 03:28, huxia...@horebdata.cn wrote: > Hi, Stefan, > > Thanks a lot for the message. It seems that client-side encryption (or per > use) is still on the way and not ready yet for today. > > Are there practical methods to implement encryption for CephFS with > today' technique? e.g using LUKS or other tools? > > Kind regards, > > > Samuel > > > > > huxia...@horebdata.cn > > From: Stefan Kooman > Date: 2023-05-22 17:19 > To: Alexander E. Patrakov; huxia...@horebdata.cn > CC: ceph-users > Subject: Re: [ceph-users] Re: Encryption per user Howto > On 5/21/23 15:44, Alexander E. Patrakov wrote: > > Hello Samuel, > > > > On Sun, May 21, 2023 at 3:48 PM huxia...@horebdata.cn > > wrote: > >> > >> Dear Ceph folks, > >> > >> Recently one of our clients approached us with a request on encrpytion > per user, i.e. using individual encrytion key for each user and encryption > files and object store. > >> > >> Does anyone know (or have experience) how to do with CephFS and Ceph > RGW? > > > > For CephFS, this is unachievable. > > For a couple of years already, work is being done to have fscrypt > support for CephFS [1]. When that work ends up in mainline kernel (and > distro kernels at some point) this will be possible. > > Gr. Stefan > > [1]: https://lwn.net/Articles/829448/ > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
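As an illustration of the customer-provided-keys variant (SSE-C) against RGW — endpoint, bucket and file names are placeholders, the client has to keep the key since RGW never stores it, and RGW will by default refuse SSE-C over plain HTTP (rgw_crypt_require_ssl):
# openssl rand -out sse-c.key 32
# aws --endpoint-url https://rgw.example.com s3 cp ./backup.img s3://somebucket/backup.img --sse-c AES256 --sse-c-key fileb://sse-c.key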
[ceph-users] Re: pg_autoscaler using uncompressed bytes as pool current total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings?
Hey ceph-users, may I ask (nag) again about this issue? I am wondering if anybody can confirm my observations? I raised a bug https://tracker.ceph.com/issues/54136, but apart from the assignment to a dev a while ago here was not response yet. Maybe I am just holding it wrong, please someone enlighten me. Thank you and with kind regards Christian On 02/02/2022 20:10, Christian Rohmann wrote: Hey ceph-users, I am debugging a mgr pg_autoscaler WARN which states a target_size_bytes on a pool would overcommit the available storage. There is only one pool with value for target_size_bytes (=5T) defined and that apparently would consume more than the available storage: --- cut --- # ceph health detail HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes [WRN] POOL_TARGET_SIZE_BYTES_OVERCOMMITTED: 1 subtrees have overcommitted pool target_size_bytes Pools ['backups', 'images', 'device_health_metrics', '.rgw.root', 'redacted.rgw.control', 'redacted.rgw.meta', 'redacted.rgw.log', 'redacted.rgw.otp', 'redacted.rgw.buckets.index', 'redacted.rgw.buckets.data', 'redacted.rgw.buckets.non-ec'] overcommit available storage by 1.011x due to target_size_bytes 15.0T on pools ['redacted.rgw.buckets.data']. --- cut --- But then looking at the actual usage it seems strange that 15T (5T * 3 replicas) should not fit onto the remaining 122 TiB AVAIL: --- cut --- # ceph df detail --- RAW STORAGE --- CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 293 TiB 122 TiB 171 TiB 171 TiB 58.44 TOTAL 293 TiB 122 TiB 171 TiB 171 TiB 58.44 --- POOLS --- POOL ID PGS STORED (DATA) (OMAP) OBJECTS USED (DATA) (OMAP) %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR backups 1 1024 92 TiB 92 TiB 3.8 MiB 28.11M 156 TiB 156 TiB 11 MiB 64.77 28 TiB N/A N/A N/A 39 TiB 123 TiB images 2 64 1.7 TiB 1.7 TiB 249 KiB 471.72k 5.2 TiB 5.2 TiB 748 KiB 5.81 28 TiB N/A N/A N/A 0 B 0 B device_health_metrics 19 1 82 MiB 0 B 82 MiB 43 245 MiB 0 B 245 MiB 0 28 TiB N/A N/A N/A 0 B 0 B .rgw.root 21 32 23 KiB 23 KiB 0 B 25 4.1 MiB 4.1 MiB 0 B 0 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.control 22 32 0 B 0 B 0 B 8 0 B 0 B 0 B 0 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.meta 23 32 1.7 MiB 394 KiB 1.3 MiB 1.38k 237 MiB 233 MiB 3.9 MiB 0 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.log 24 32 53 MiB 500 KiB 53 MiB 7.60k 204 MiB 47 MiB 158 MiB 0 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.otp 25 32 5.2 KiB 0 B 5.2 KiB 0 16 KiB 0 B 16 KiB 0 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.buckets.index 26 32 1.2 GiB 0 B 1.2 GiB 7.46k 3.5 GiB 0 B 3.5 GiB 0 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.buckets.data 27 128 3.1 TiB 3.1 TiB 0 B 3.53M 9.5 TiB 9.5 TiB 0 B 10.11 28 TiB N/A N/A N/A 0 B 0 B redacted.rgw.buckets.non-ec 28 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 28 TiB N/A N/A N/A 0 B 0 B --- cut --- I then looked at how those values are determined at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L509. Apparently "total_bytes" are compared with the capacity of the root_map. I added a debug line and found that the total in my cluster was already at: total=325511007759696 so in excess of 300 TiB - Looking at "ceph df" again this usage seems strange. Looking at how this total is calculated at https://github.com/ceph/ceph/blob/9f723519257eca039126a20aa6a2a7d2dbfb5dba/src/pybind/mgr/pg_autoscaler/module.py#L441, you see that the larger value (max) of "actual_raw_used" vs. "target_bytes*raw_used_rate" is determined and then summed up. 
I dumped the values for all pools my cluster with yet another line of debug code: ---cut --- pool_id 1 - actual_raw_used=303160109187420.0, target_bytes=0 raw_used_rate=3.0 pool_id 2 - actual_raw_used=5714098884702.0, target_bytes=0 raw_used_rate=3.0 pool_id 19 - actual_raw_used=256550760.0, target_bytes=0 raw_used_rate=3.0 pool_id 21 - actual_raw_used=71433.0, target_bytes=0 raw_used_r
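For anyone hitting the same warning, the quickest way to see what the autoscaler thinks it is working with, and to clear a target again, looks like this (pool name as in our cluster):
# ceph osd pool autoscale-status
# ceph osd pool get redacted.rgw.buckets.data target_size_bytes
# ceph osd pool set redacted.rgw.buckets.data target_size_bytes 0     (0 removes the target again)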
[ceph-users] Re: Eccessive occupation of small OSDs
With failure domain host your max usable cluster capacity is essentially constrained by the total capacity of the smallest host which is 8TB if I read the output correctly. You need to balance your hosts better by swapping drives. On Fri, 31 Mar 2023 at 03:34, Nicola Mori wrote: > Dear Ceph users, > > my cluster is made up of 10 old machines, with uneven number of disks and > disk size. Essentially I have just one big data pool (6+2 erasure code, > with host failure domain) for which I am currently experiencing a very poor > available space (88 TB of which 40 TB occupied, as reported by df -h on > hosts mounting the cephfs) compared to the raw one (196.5 TB). I have a > total of 104 OSDs and 512 PGs for the pool; I cannot increment the PG > number since the machines are old and with very low amount of RAM, and some > of them are already overloaded. > > In this situation I'm seeing a high occupation of small OSDs (500 MB) with > respect to bigger ones (2 and 4 TB) even if the weight is set equal to disk > capacity (see below for ceph osd tree). For example OSD 9 is at 62% > occupancy even with weight 0.5 and reweight 0.75, while the highest > occupancy for 2 TB OSDs is 41% (OSD 18) and 4 TB OSDs is 23% (OSD 79). I > guess this high occupancy for 500 MB OSDs combined with erasure code size > and host failure domain might be the cause of the poor available space, > could this be true? The upmap balancer is currently running but I don't > know if and how much it could improve the situation. > Any hint is greatly appreciated, thanks. > > Nicola > > # ceph osd tree > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF > -1 196.47754 root default > -7 14.55518 host aka > 4hdd1.81940 osd.4 up 1.0 1.0 > 11hdd1.81940 osd.11up 1.0 1.0 > 18hdd1.81940 osd.18up 1.0 1.0 > 26hdd1.81940 osd.26up 1.0 1.0 > 32hdd1.81940 osd.32up 1.0 1.0 > 41hdd1.81940 osd.41up 1.0 1.0 > 48hdd1.81940 osd.48up 1.0 1.0 > 55hdd1.81940 osd.55up 1.0 1.0 > -3 14.55518 host balin > 0hdd1.81940 osd.0 up 1.0 1.0 > 8hdd1.81940 osd.8 up 1.0 1.0 > 15hdd1.81940 osd.15up 1.0 1.0 > 22hdd1.81940 osd.22up 1.0 1.0 > 29hdd1.81940 osd.29up 1.0 1.0 > 34hdd1.81940 osd.34up 1.0 1.0 > 43hdd1.81940 osd.43up 1.0 1.0 > 49hdd1.81940 osd.49up 1.0 1.0 > -13 29.10950 host bifur > 3hdd3.63869 osd.3 up 1.0 1.0 > 14hdd3.63869 osd.14up 1.0 1.0 > 27hdd3.63869 osd.27up 1.0 1.0 > 37hdd3.63869 osd.37up 1.0 1.0 > 50hdd3.63869 osd.50up 1.0 1.0 > 59hdd3.63869 osd.59up 1.0 1.0 > 64hdd3.63869 osd.64up 1.0 1.0 > 69hdd3.63869 osd.69up 1.0 1.0 > -17 29.10950 host bofur > 2hdd3.63869 osd.2 up 1.0 1.0 > 21hdd3.63869 osd.21up 1.0 1.0 > 39hdd3.63869 osd.39up 1.0 1.0 > 57hdd3.63869 osd.57up 1.0 1.0 > 66hdd3.63869 osd.66up 1.0 1.0 > 72hdd3.63869 osd.72up 1.0 1.0 > 76hdd3.63869 osd.76up 1.0 1.0 > 79hdd3.63869 osd.79up 1.0 1.0 > -21 29.10376 host dwalin > 88hdd1.81898 osd.88up 1.0 1.0 > 89hdd1.81898 osd.89up 1.0 1.0 > 90hdd1.81898 osd.90up 1.0 1.0 > 91hdd1.81898 osd.91up 1.0 1.0 > 92hdd1.81898 osd.92up 1.0 1.0 > 93hdd1.81898 osd.93up 1.0 1.0 > 94hdd1.81898 osd.94up 1.0 1.0 > 95hdd1.81898 osd.95up 1.0 1.0 > 96hdd1.81898 osd.96up 1.0 1.0 > 97hdd1.81898 osd.97up 1.0 1.0 > 98hdd1.81898 osd.98up 1.0 1.0 > 99hdd1.81898 osd.99up 1.0 1.0 > 100hdd1.81898 osd.100 up 1.0
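A quick way to see the per-host totals that end up being the constraint is the SIZE column of the host rows in:
# ceph osd df tree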
[ceph-users] External Auth (AssumeRoleWithWebIdentity) , STS by default, generic policies and isolation by ownership
Hello ceph-users, unhappy with the capabilities in regards to bucket access policies when using the Keystone authentication module I posted to this ML a while back - https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/S2TV7GVFJTWPYA6NVRXDL2JXYUIQGMIN/ In general I'd still like to hear how others are making use of external authentication and STS and what your experiences are in replacing e.g. Keystone authentication In the meantime we looked into OIDC authentication (via Keycloak) and the potentials there. While this works in general, AssumeRoleWithWebIdentity comes back with an STS token and that can be used to access S3 buckets, I am wondering about a few things: 1) How to enable STS for everyone (without user-individual policy to AssumeRole) In the documentation on STS (https://docs.ceph.com/en/quincy/radosgw/STS/#sts-in-ceph) and also STS-Lite (https://docs.ceph.com/en/quincy/radosgw/STSLite/#sts-lite) it's implied at one has to attach an dedicated policy to allow for STS to each user individually. This does not scale well with thousands of users. Also when using a federated / external authentication, there is no explicit user creation "A shadow user is created corresponding to every federated user. The user id is derived from the ‘sub’ field of the incoming web token." Is there a way to automatically have a role corresponding to each user that can be assumed via a OIDC token? So an implicit role that would allow for an externally authenticated user to have full access to S3 and all buckets owned? Looking at STS Lite documentation, it seems all the more natural to be able to allow keystone users to make use of STS. Is there any way to apply such an AssumeRole policy "globally" or for a whole set of users at the same time? I just found PR https://github.com/ceph/ceph/pull/44434 aiming to add policy variables such as ${aws:username} to allow for generic policies. But this is more about restricting bucket names or granting access to certain pattern of names. 2) Isolation in S3 Multi-Tenancy with external IdP (AssumeRoleWithWebIdentity), how does bucket ownership come into play? Following the question about generic policies for STS I am wondering about the role (no pun intended) that the bucket ownership or tenant play here? If one creates a role policy of e.g. {"Version":"2012-10-17","Statement":{"Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::*"}} Would this allow someone assuming this role access to all, "*", buckets, or just those owned by the user that created this role policy? In case of Keystone auth the owner of a bucket is the project, not the individual (human) user. So this creates somewhat of a tenant which I'd want to isolate. 3) Allowing users to create their own roles and policies by default Is there a way to allow users to create their own roles and policies to use them by default? All the examples talk about the requirement for admin caps and individual setting of '--caps="user-policy=*'. If there was a default role + policy (question #1) that could be applied to externally authenticated users, I'd like for them to be able to create new roles and policies to grant access to their buckets to other users. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
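For context, this is the per-user dance I mean, which works but does not scale to thousands of shadow users (realm host, uid, role and policy names are placeholders, the JSON is condensed):
# radosgw-admin caps add --uid="testuser" --caps="roles=*"
# radosgw-admin role create --role-name=s3-full --assume-role-policy-doc='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Federated":["arn:aws:iam:::oidc-provider/keycloak.example.com/realms/demo"]},"Action":["sts:AssumeRoleWithWebIdentity"],"Condition":{"StringEquals":{"keycloak.example.com/realms/demo:app_id":"account"}}}]}'
# radosgw-admin role-policy put --role-name=s3-full --policy-name=allow-all-s3 --policy-doc='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"s3:*","Resource":"arn:aws:s3:::*"}]}'
And question 2 above is exactly about what that Resource "*" ends up meaning in terms of bucket ownership and tenant isolation.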
[ceph-users] Re: Trying to throttle global backfill
I received a few suggestions, and resolved my issue. Anthony D'Atri suggested mclock (newer than my nautilus version), adding "--osd_recovery_max_single_start 1” (didn’t seem to take), “osd_op_queue_cut_off=high” (which I didn’t get to checking), and pgremapper (from github). Pgremapper did the trick to cancel the backfill which had been initiated by an unfortunate OSD name-changing sequence. Big winner, achieved EXACTLY what I needed, which was to undo an unfortunate recalculation of placement groups. Before: 310842802/17308319325 objects misplaced (1.796%) Ran: pgremapper cancel-backfill --yes After: 421709/17308356309 objects misplaced (0.002%) The “before” scenario was causing over 10GiB/s of backfill traffic. The “after” scenario was a very cool 300-400MiB/s, entirely within the realm of sanity. The cluster is temporarily split between two datacenters, being physically lifted and shifted over a period of a month. Alex Gorbachev also suggested setting osd-recovery-sleep. That was probably the solution I was looking for to throttle backfill operations at the beginning, and I’ll be keeping that in my toolbox, as well. As always, I’m HUGELY appreciative of the community response. I learned a lot in the process, had an outage-inducing scenario rectified very quickly, and got back to work. Thanks so much! Happy to answer any followup questions and return the favor when I can. From: Rice, Christian Date: Wednesday, March 8, 2023 at 3:57 PM To: ceph-users Subject: [EXTERNAL] [ceph-users] Trying to throttle global backfill I have a large number of misplaced objects, and I have all osd settings to “1” already: sudo ceph tell osd.\* injectargs '--osd_max_backfills=1 --osd_recovery_max_active=1 --osd_recovery_op_priority=1' How can I slow it down even more? The cluster is too large, it’s impacting other network traffic ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
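For anyone landing here looking for the throttling knob Alex suggested, a minimal sketch (the sleep values are illustrative only, not what was used here):

  # Slow down recovery/backfill by adding a per-op sleep (runtime change)
  ceph tell osd.\* injectargs '--osd_recovery_sleep=0.1'
  # On HDD-backed OSDs the HDD-specific variant may be the one that applies
  ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd=0.1'
  # And the pgremapper call that undid the unwanted backfill in this thread
  pgremapper cancel-backfill --yes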
[ceph-users] Trying to throttle global backfill
I have a large number of misplaced objects, and I have all osd settings to “1” already: sudo ceph tell osd.\* injectargs '--osd_max_backfills=1 --osd_recovery_max_active=1 --osd_recovery_op_priority=1' How can I slow it down even more? The cluster is too large, it’s impacting other network traffic ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [EXTERNAL] Re: Renaming a ceph node
Hi all, so I used the rename-bucket option this morning for OSD node renames, and it was a success. Works great even on Luminous. I looked at the swap-bucket command and I felt it was leaning toward real data migration from old OSDs to new OSDs, and I was a bit timid because there wasn't a second host, just a name change. So when I looked at rename-bucket, it just seemed too simple not to try first. And I did, and it was. I renamed two host buckets (they housed discrete storage classes, so no dangerous loss of data redundancy), and even some rack buckets. sudo ceph osd crush rename-bucket <oldname> <newname> and no data moved. I first thought I'd wait til the hosts were shut down, but after I stopped the OSDs on the nodes, it seemed safe enough, and it was. In my particular case, I was migrating nodes to a new datacenter, just new names and IPs. I also moved a mon/mgr/rgw; and I merely had to delete the mon first, then reprovision it later. The rgw and mgr worked fine. I pre-edited ceph.conf to add the new networks, remove the old mon name, add the new mon name, so on startup it worked. I'm not a ceph admin but I play one on the tele. From: Eugen Block Date: Wednesday, February 15, 2023 at 12:44 AM To: ceph-users@ceph.io Subject: [EXTERNAL] [ceph-users] Re: Renaming a ceph node Hi, I haven't done this in a production cluster yet, only in small test clusters without data. But there's a rename-bucket command: ceph osd crush rename-bucket <srcname> <dstname> (rename bucket <srcname> to <dstname>). It should do exactly that, just rename the bucket within the crushmap without changing the ID. That command also exists in Luminous, I believe. To get an impression of the impact I'd recommend testing in a test cluster first. Regards, Eugen Quoting Manuel Lausch: > Hi, > > yes you can rename a node without massive rebalancing. > > The following I tested with pacific. But I think this should work with > older versions as well. > You need to rename the node in the crushmap between shutting down the > node with the old name and starting it with the new name. > You only have to keep the ID of the node in the crushmap! > > Regards > Manuel > > > On Mon, 13 Feb 2023 22:22:35 + > "Rice, Christian" wrote: > >> Can anyone please point me at a doc that explains the most >> efficient procedure to rename a ceph node WITHOUT causing a massive >> misplaced objects churn? >> >> When my node came up with a new name, it properly joined the >> cluster and owned the OSDs, but the original node with no devices >> remained. I expect this affected the crush map such that a large >> qty of objects got reshuffled. I want no object movement, if >> possible. >> >> BTW this old cluster is on luminous. ☹ >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
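A condensed sketch of the sequence described above, assuming a systemd-based deployment (host names are placeholders, not the ones from this cluster):

  # On the host being renamed: stop its OSDs first
  systemctl stop ceph-osd.target

  # Rename the CRUSH bucket in place - the bucket ID stays the same,
  # so no data movement is triggered
  ceph osd crush rename-bucket oldhostname newhostname

  # Bring the host back up under its new name/IP and start the OSDs again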
[ceph-users] Renaming a ceph node
Can anyone please point me at a doc that explains the most efficient procedure to rename a ceph node WITHOUT causing a massive misplaced objects churn? When my node came up with a new name, it properly joined the cluster and owned the OSDs, but the original node with no devices remained. I expect this affected the crush map such that a large qty of objects got reshuffled. I want no object movement, if possible. BTW this old cluster is on luminous. ☹ ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Status of Quincy 17.2.5 ?
Hey everyone, On 20/10/2022 10:12, Christian Rohmann wrote: 1) May I bring up again my remarks about the timing: On 19/10/2022 11:46, Christian Rohmann wrote: I believe the upload of a new release to the repo prior to the announcement happens quite regularly - it might just be due to the technical process of releasing. But I agree it would be nice to have a more "bit flip" approach to new releases in the repo and not have the packages appear as updates prior to the announcement and final release and update notes. By my observations sometimes there are packages available on the download servers via the "last stable" folders such as https://download.ceph.com/debian-quincy/ quite some time before the announcement of a release is out. I know it's hard to time this right with mirrors requiring some time to sync files, but would be nice to not see the packages or have people install them before there are the release notes and potential pointers to changes out. Today's 16.2.11 release shows the exact issue I described above: 1) 16.2.11 packages are already available via e.g. https://download.ceph.com/debian-pacific 2) the release notes are not yet merged (https://github.com/ceph/ceph/pull/49839), thus https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ shows a 404 :-) 3) No announcement like https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3D3ZTB5C5BJT5WRSJL5CVSD/ has gone to the ML yet. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD slow ops warning not clearing after OSD down
Hello, On 04/05/2021 09:49, Frank Schilder wrote: I created a ticket: https://tracker.ceph.com/issues/50637 We just observed this very issue on Pacific (16.2.10), which I also commented on in the ticket. I wonder if this case really is so seldom: first having some issues causing slow ops, and then a total failure of an OSD? It would be nice to fix this though, so as not to "block" the warning status with something that's not actually a warning. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 16.2.11 branch
On 15/12/2022 10:31, Christian Rohmann wrote: May I kindly ask for an update on how things are progressing? Mostly I am interested on the (persisting) implications for testing new point releases (e.g. 16.2.11) with more and more bugfixes in them. I guess I just had not looked at the right ML, it's being worked on already ... https://lists.ceph.io/hyperkitty/list/d...@ceph.io/thread/CQPQJXD6OVTZUH43I4U3GGOP2PKYOREJ/ Sorry for the nagging, Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 16.2.11 branch
Hey Laura, Greg, all, On 31/10/2022 17:15, Gregory Farnum wrote: If you don't mind me asking Laura, have those issues regarding the testing lab been resolved yet? There are currently a lot of folks working to fix the testing lab issues. Essentially, disk corruption affected our ability to reach quay.ceph.io. We've made progress this morning, but we are still working to understand the root cause of the corruption. We expect to re-deploy affected services soon so we can resume testing for v16.2.11. We got a note about this today, so I wanted to clarify: For Reasons, the sepia lab we run teuthology in currently uses a Red Hat Enterprise Virtualization stack — meaning, mostly KVM with a lot of fancy orchestration all packaged up, backed by Gluster. (Yes, really — a full Ceph integration was never built and at one point this was deemed the most straightforward solution compared to running all-up OpenStack backed by Ceph, which would have been the available alternative.) The disk images stored in Gluster started reporting corruption last week (though Gluster was claiming to be healthy), and with David's departure and his backup on vacation it took a while for the remaining team members to figure out what was going on and identify strategies to resolve or work around it. The relevant people have figured out a lot more of what was going on, and Adam (David's backup) is back now so we're expecting things to resolve more quickly at this point. And indeed the team's looking at other options for providing this infrastructure going forward. -Greg May I kindly ask for an update on how things are progressing? Mostly I am interested on the (persisting) implications for testing new point releases (e.g. 16.2.11) with more and more bugfixes in them. Thanks a bunch! Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW Forcing buckets to be encrypted (SSE-S3) by default (via a global bucket encryption policy)?
Hey ceph-users, loosely related to my question about client-side encryption in the Cloud Sync module (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/I366AIAGWGXG3YQZXP6GDQT4ZX2Y6BXM/) I am wondering if there are other options to ensure data is encrypted at rest and also only replicated as encrypted data ... My thoughts / findings so far: AWS S3 supports setting a bucket encryption policy (https://docs.aws.amazon.com/AmazonS3/latest/userguide/default-bucket-encryption.html) to "ApplyServerSideEncryptionByDefault" - so automatically apply SSE to all objects without the clients to explicitly request this per object. Ceph RGW has received support for such policy via the bucket encryption API with https://github.com/ceph/ceph/commit/95acefb2f5e5b1a930b263bbc7d18857d476653c. I am now just wondering if there is any way to not only allow bucket creators to apply such a policy themselves, but to apply this as a global default in RGW, forcing all buckets to have SSE enabled - transparently. If there is no way to achieve this just yet, what are your thoughts about adding such an option to RGW? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
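For illustration, this is roughly what the existing per-bucket mechanism looks like from the client side (AWS CLI against RGW; endpoint and bucket name are placeholders) - what I am after is a server-side default in RGW that would make this per-bucket step unnecessary:

  # Set a default SSE-S3 (AES256) encryption policy on a single bucket
  aws --endpoint-url https://rgw.example.com s3api put-bucket-encryption \
    --bucket mybucket \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'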
[ceph-users] Re: Cloud sync to minio fails after creating the bucket
On 21/11/2022 12:50, ma...@roterruler.de wrote: Could this "just" be the bug https://tracker.ceph.com/issues/55310 (duplicate https://tracker.ceph.com/issues/57807) about Cloud Sync being broken since Pacific? Wow - yes, the issue seems to be exactly the same that I'm facing -.- But there is a fix committed, pending backports to Quincy / Pacific: https://tracker.ceph.com/issues/57306 Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Cloud sync to minio fails after creating the bucket
On 21/11/2022 11:04, ma...@roterruler.de wrote: Hi list, I'm currently implementing a sync between ceph and a minio cluster to continuously sync the buckets and objects to an offsite location. I followed the guide on https://croit.io/blog/setting-up-ceph-cloud-sync-module After the sync starts it successfully creates the first bucket, but somehow tries over and over again to create the bucket instead of adding the objects themselves. This is from the minio logs: Could this "just" be the bug https://tracker.ceph.com/issues/55310 (duplicate https://tracker.ceph.com/issues/57807) about Cloud Sync being broken since Pacific? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: RGW replication and multiple endpoints
Hey Kamil, On 14/11/2022 13:54, Kamil Madac wrote: Hello, I'm trying to create an RGW zonegroup with two zones, and to have data replicated between the zones. Each zone is a separate Ceph cluster. There is the possibility to use a list of endpoints in the zone definitions (not just a single endpoint) which will then be used for the replication between zones, so I tried to use that instead of an LB in front of the clusters for the replication. [...] When the node is back again, replication continues to work. What is the reason for the possibility of multiple endpoints in the zone configuration when an outage of one of them makes replication stop working? We are running a similar setup and ran into similar issues before when doing rolling restarts of the RGWs. 1) Mostly it's a single metadata shard never syncing up and requiring a complete "metadata init". But this issue will likely be addressed via https://tracker.ceph.com/issues/39657 2) But we also observed issues with one RGW being unavailable or just slow and as a result influencing the whole sync process. I suppose the HTTP client used within the rgw syncer does not do a good job of tracking which remote RGW is healthy, or a slow-reading RGW could just be locking all the shards ... 3) But as far as "cooperating" goes there are improvements being worked on, see https://tracker.ceph.com/issues/41230 or https://github.com/ceph/ceph/pull/45958 which then makes better use of having multiple distinct RGWs in both zones. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
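For reference, the per-zone endpoint list is set roughly like this (zone name and URLs are placeholders, not the ones from either of our setups):

  # List several RGWs as replication endpoints for a zone instead of a
  # single load-balancer address
  radosgw-admin zone modify --rgw-zone=secondary \
    --endpoints=http://rgw1.example.com:8080,http://rgw2.example.com:8080
  # Commit the change to the period so the peer zone picks it up
  radosgw-admin period update --commit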
[ceph-users] Upgrade/migrate host operating system for ceph nodes (CentOS/Rocky)
Hi all, we're running a ceph cluster with v15.2.17 and cephadm on various CentOS hosts. Since CentOS 8.x is EOL, we'd like to upgrade/migrate/reinstall the OS, possibly migrating to Rocky or CentOS stream:

host | CentOS   | Podman
-----|----------|---------
osd* | 7.9.2009 | 1.6.4  x5
osd* | 8.4.2105 | 3.0.1  x2
mon0 | 8.4.2105 | 3.2.3
mon1 | 8.4.2105 | 3.0.1
mon2 | 8.4.2105 | 3.0.1
mds* | 7.9.2009 | 1.6.4  x2

We have a few specific questions: 1) Does anyone have experience using Rocky Linux 8 or 9 or CentOS stream with ceph? Rocky is not mentioned specifically in the cephadm docs [2]. 2) Is the Podman compatibility list [1] still up to date? CentOS Stream 8 as of 2022-10-19 appears to have Podman version 4.x, IIRC. 4.x does not appear in the compatibility table. Anyone using Podman 4.x successfully (with which ceph version)? Thanks in advance, Chris [1]: https://docs.ceph.com/en/quincy/cephadm/compatibility/#compatibility-with-podman-versions [2]: https://docs.ceph.com/en/quincy/cephadm/install/#cephadm-install-distros ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 16.2.11 branch
On 28/10/2022 00:25, Laura Flores wrote: Hi Oleksiy, The Pacific RC has not been declared yet since there have been problems in our upstream testing lab. There is no ETA yet for v16.2.11 for that reason, but the full diff of all the patches that were included will be published to ceph.io when v16.2.11 is released. There will also be a diff published in the documentation on this page: https://docs.ceph.com/en/latest/releases/pacific/ In the meantime, here is a link to the diff in commits between v16.2.10 and the Pacific branch: https://github.com/ceph/ceph/compare/v16.2.10...pacific There also is https://tracker.ceph.com/versions/656 which seems to be tracking the open issues tagged for this particular point release. If you don't mind me asking Laura, have those issues regarding the testing lab been resolved yet? There are quite a few bugfixes in the pending release 16.2.11 which we are waiting for. TBH I was about to ask if it would not be sensible to do an intermediate release and not let it grow bigger and bigger (with even more changes / fixes) going out at once. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Using multiple SSDs as DB
Thank you! Robert Sander wrote on Fri, 21 Oct 2022: > This is a bug in certain versions of ceph-volume: > > https://tracker.ceph.com/issues/56031 > > It should be fixed in the latest releases. For completeness's sake: The cluster is on 16.2.10. The issue is resolved and marked as backported. 16.2.10 was released shortly before the backport, so the fixed version for Pacific should be 16.2.11. A partial workaround I found was limiting data_devices to 8 and db_devices to 1. This resulted in correct db usage for one db device. I then tried 16 data / 2 db: this did not work; it (would have) resulted in 8 extra OSDs with no db device. Best, Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Using multiple SSDs as DB
Hi, I have a problem fully utilizing some disks with a cephadm service spec. The host I have has the following disks: 4x SSD 900GB and 32x HDD 10TB. I would like to use the SSDs as DB devices and the HDDs as block devices. With 8 HDDs per SSD the available size for each DB would be about 111GB (900GB/8). The spec I used does not fully utilize the SSDs though. Instead of 1/8th of the SSD, it uses about 28GB, so 1/32nd of the SSD. The spec I use:

spec:
  objectstore: bluestore
  filter_logic: AND
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

I saw "limit" in the docs but it sounds like it would limit the number of SSDs used as DB devices. How can I use all of the SSDs' capacity? Best, Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
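One way to pin the DB size explicitly is the block_db_size field of the OSD service spec. A sketch only - the 111G value just follows the 900GB/8 math above, and whether your ceph-volume honours the sizing correctly is exactly what the bug referenced in the follow-up is about:

  cat > osd_spec.yaml <<'EOF'
  service_type: osd
  service_id: hdd-with-ssd-db
  placement:
    host_pattern: '*'
  spec:
    objectstore: bluestore
    filter_logic: AND
    data_devices:
      rotational: 1
    db_devices:
      rotational: 0
    block_db_size: '111G'
  EOF
  # Preview what cephadm would do before actually applying the spec
  ceph orch apply -i osd_spec.yaml --dry-run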
[ceph-users] Re: Status of Quincy 17.2.5 ?
On 19/10/2022 16:30, Laura Flores wrote: Dan is correct that 17.2.5 is a hotfix release. There was a flaw in the release process for 17.2.4 in which five commits were not included in the release. The users mailing list will hear an official announcement about this hotfix release later this week. Thanks for the info. 1) May I bring up again my remarks about the timing: On 19/10/2022 11:46, Christian Rohmann wrote: I believe the upload of a new release to the repo prior to the announcement happens quite regularly - it might just be due to the technical process of releasing. But I agree it would be nice to have a more "bit flip" approach to new releases in the repo and not have the packages appear as updates prior to the announcement and final release and update notes. By my observations sometimes there are packages available on the download servers via the "last stable" folders such as https://download.ceph.com/debian-quincy/ quite some time before the announcement of a release is out. I know it's hard to time this right with mirrors requiring some time to sync files, but would be nice to not see the packages or have people install them before there are the release notes and potential pointers to changes out. 2) Also, in cases such as the 17.2.4 release containing a regression, it would be great to have both the N and N-1 releases there to allow users to downgrade to a previous point release quickly in case they run into issues. Otherwise one needs to configure the N-1 repo manually to still have access to the N-1 release. And with this just being links in the filesystem, this should not even take up much space on the download servers or their mirrors. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Mirror de.ceph.com broken?
Hey ceph-users, it seems that the German ceph mirror http://de.ceph.com/ listed at https://docs.ceph.com/en/latest/install/mirrors/#locations does not hold any data. The index page shows some Plesk default page and deeper links like http://de.ceph.com/debian-17.2.4/ return 404. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Status of Quincy 17.2.5 ?
On 19/10/2022 11:26, Chris Palmer wrote: I've noticed that packages for Quincy 17.2.5 appeared in the debian 11 repo a few days ago. However I haven't seen any mention of it anywhere, can't find any release notes, and the documentation still shows 17.2.4 as the latest version. Is 17.2.5 documented and ready for use yet? It's a bit risky having it sitting undocumented in the repo for any length of time when it might inadvertently be applied when doing routine patching... (I spotted it, but one day someone might not). I believe the upload of a new release to the repo prior to the announcement happens quite regularly - it might just be due to the technical process of releasing. But I agree it would be nice to have a more "bit flip" approach to new releases in the repo and not have the packages appear as updates prior to the announcement and final release and update notes. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process
Hey Boris, On 07/10/2022 11:30, Boris Behrens wrote: I just wanted to reshard a bucket but mistyped the number of shards. In a reflex I hit ctrl-c and waited. It looked like the resharding did not finish so I canceled it, and now the bucket is in this state. How can I fix it? It does not show up in the stale-instances list. It's also a multisite environment (we only sync metadata). I believe resharding is not supported with rgw multisite (https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#multisite) but is being worked on / implemented for the Quincy release, see https://tracker.ceph.com/projects/rgw/issues?query_id=247 But you are not syncing the data in your deployment? Maybe that's a different case then? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW multisite Cloud Sync module with support for client side encryption?
Hello Ceph-Users, I have a question regarding support for any client side encryption in the Cloud Sync Module for RGW (https://docs.ceph.com/en/latest/radosgw/cloud-sync-module/). While a "regular" multi-site setup (https://docs.ceph.com/en/latest/radosgw/multisite/) is usually syncing data between Ceph clusters, RGWs and other supporting infrastructure in the same administrative domain this might be different when looking at cloud sync. One could setup a sync to e.g. AWS S3 or any other compatible S3 implementation that is provided as a service and by another provider. 1) I was wondering if there is any transparent way to apply client side encryption to those objects that are sent to the remote service? Even something the likes of a single static key (see https://github.com/ceph/ceph/blob/1c9e84a447bb628f2235134f8d54928f7d6b7796/doc/radosgw/encryption.rst#automatic-encryption-for-testing-only) would protect against the remote provider being able to look at the data. 2) What happens to objects that are encrypted on the source RGW and via SSE-S3? (https://docs.ceph.com/en/quincy/radosgw/encryption/#sse-s3) I suppose those remain encrypted? But this does require users to actively make use of SSE-S3, right? Thanks again with kind regards, Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Suggestion to build ceph storage
On Sun, 19 Jun 2022 at 02:29, Satish Patel wrote: > Greeting folks, > > We are planning to build Ceph storage for mostly cephFS for HPC workload > and in future we are planning to expand to S3 style but that is yet to be > decided. Because we need mass storage, we bought the following HW. > > 15 Total servers and each server has a 12x18TB HDD (spinning disk) . We > understand SSD/NvME would be best fit but it's way out of budget. > > I hope you have extra HW on hand for Monitor and MDS servers > Ceph recommends using a faster disk for wal/db if the data disk is slow and > in my case I do have a slower disk for data. > > Question: > 1. Let's say if i want to put a NvME disk for wal/db then what size i > should buy. > The official recommendation is to budget 4% of OSD size for WAL/DB - so in your case that would be 720GB per OSD. Especially if you want to go to S3 later you should stick closer to that limit since RGW is a heavy meta data user. Also with 12 OSD per node you should have at least 2 NVME - so 2x4TB might do or maybe 3x3TB The WAL/DB device is a Single Point of Failure for all OSDs attached (in other words - if the WAL/DB device fails then all OSDs that have their WAL/DB located there need to be rebuilt) Make sure you budget for good number of DWPD (I assume in HPC scenario you'll have a lot of scratch data) and test it with O_DIRECT and F_SYNC and QD=1 and BS=4K to find one that can reliably handle high IOPS under that condition > 2. Do I need to partition wal/db for each OSD or just a single > partition can share for all OSDs? > You need one partition per OSD > 3. Can I put the OS on the same disk where the wal/db is going to sit ? > (This way i don't need to spend extra money for extra disk) > Yes you can but in your case that would mean putting the WAL/DB on the HDD - I would predict your HPC users not being very impressed with the resulting performance but YMMV > Any suggestions you have for this kind of storage would be much > appreciated. > Budget plenty of RAM to deal with recovery scenarios - I'd say in your case 256GB minimum. Normally you build a POC and test the heck out of it to cover your usage scenarios but you already bought the HW so not a lot you can change now - but you should test and tune your setup before you put production data on it to ensure that you have a good idea how the system is going to behave when it get s under load. Make sure you test failure scenarios (failing OSDs, failing nodes, network cuts, failing MDS etc.) so you know what to expect and how to handle them Another bottleneck in CephFS setups tends to be the MDS - again in your setup you probably want at least 2 MDS in active-active (i.e. shared load) plus 1 or 2 on standby as failover but others on this list have more experience with that. ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
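A sketch of the kind of latency test described above (the device path is a placeholder; this writes directly to the device and destroys data on it, so only run it against an empty disk):

  # Single job, queue depth 1, 4K synced direct writes - roughly the IO
  # pattern the WAL/DB device will see from RocksDB
  fio --name=wal-latency --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
      --iodepth=1 --numjobs=1 --direct=1 --fsync=1 --runtime=60 --time_based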
[ceph-users] Re: [EXTERNAL] Laggy OSDs
we had issues with slow ops on ssd AND nvme; mostly fixed by raising aio-max-nr from 64K to 1M, eg "fs.aio-max-nr=1048576" if I remember correctly. On 3/29/22, 2:13 PM, "Alex Closs" wrote: Hey folks, We have a 16.2.7 cephadm cluster that's had slow ops and several (constantly changing) laggy PGs. The set of OSDs with slow ops seems to change at random, among all 6 OSD hosts in the cluster. All drives are enterprise SATA SSDs, by either Intel or Micron. We're still not ruling out a network issue, but wanted to troubleshoot from the Ceph side in case something broke there. ceph -s: health: HEALTH_WARN 3 slow ops, oldest one blocked for 246 sec, daemons [osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops. services: mon: 5 daemons, quorum ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 (age 28h) mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh, ceph-mon1.iogajr osd: 143 osds: 143 up (since 92m), 143 in (since 2w) rgw: 3 daemons active (3 hosts, 1 zones) data: pools: 26 pools, 3936 pgs objects: 33.14M objects, 144 TiB usage: 338 TiB used, 162 TiB / 500 TiB avail pgs: 3916 active+clean 19 active+clean+laggy 1 active+clean+scrubbing+deep io: client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr This is actually much faster than it's been for much of the past hour, it's been as low as 50 kb/s and dozens of iops in both directions (where the cluster typically does 300MB to a few gigs, and ~4k iops) The cluster has been on 16.2.7 since a few days after release without issue. The only recent change was an apt upgrade and reboot on the hosts (which was last Friday and didn't show signs of problems). Happy to provide logs, let me know what would be useful. Thanks for reading this wall :) -Alex MIT CSAIL he/they ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
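For reference, the change described above can be applied roughly like this (the 1048576 value is the one quoted; persisting it via sysctl.d is an assumption about your setup):

  # Raise the async IO limit at runtime
  sysctl -w fs.aio-max-nr=1048576
  # Persist it across reboots
  echo 'fs.aio-max-nr = 1048576' > /etc/sysctl.d/90-ceph-aio.conf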
[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)
I would not host multiple OSDs on a spinning drive (unless it's one of those Seagate MACH.2 drives that have two independent heads) - head seek time will most likely kill performance. The main reason to host multiple OSDs on a single SSD or NVMe is typically to make use of the large IOPS capacity, which ceph can struggle to fully utilize on a single drive. With spinners you usually don't have that "problem" (quite the opposite usually) On Wed, 23 Mar 2022 at 19:29, Boris Behrens wrote:
> Good morning Istvan,
> those are rotating disks and we don't use EC. Splitting up the 16TB disks into two 8TB partitions and having two OSDs on one disk also sounds interesting, but would it solve the problem?
>
> I also thought to adjust the PGs for the data pool from 4096 to 8192. But I am not sure if this will solve the problem or make it worse.
>
> Until now, everything I've tried didn't work.
>
> On Wed, 23 Mar 2022 at 05:10, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
>>
>> Hi,
>>
>> I think you are having a similar issue as me in the past.
>>
>> I have 1.6B objects on a cluster, average 40k, and all my osds had spilled over.
>>
>> Also slow ops, wrongly marked down…
>>
>> My osds are 15.3TB ssds, so my solution was to store block+db together on the ssds, put 4 osd/ssd and go up to 100pg/osd so 1 disk holds 400pg approx. Also turned on the balancer with upmap and max deviation 1.
>>
>> I'm using ec 4:2, let's see how long it lasts. My bottleneck is always the pg number, too small a pg number for too many objects.
>>
>> Istvan Szabo
>> Senior Infrastructure Engineer
>> ---
>> Agoda Services Co., Ltd.
>> e: istvan.sz...@agoda.com
>> ---
>>
>> On 2022. Mar 22., at 23:34, Boris Behrens wrote:
>>
>> Email received from the internet. If in doubt, don't click any link nor open any attachment !
>>
>> The number 180 PGs is because of the 16TB disks. 3/4 of our OSDs had cache SSDs (not nvme though, and most of them are 10 OSDs to one SSD) but this problem only came in with octopus.
>>
>> We also thought this might be the db compaction, but it doesn't match up. It might happen when the compaction runs, but it also looks like it happens when there are other operations like table_file_deletion, and it happens on OSDs that have SSD-backed block.db devices (like 5 OSDs sharing one SAMSUNG MZ7KM1T9HAJM-5), and the IOPS/throughput on the SSD is not huge (100 IOPS r/s, 300 IOPS w/s when compacting an OSD on it, and around 50mb/s r/w throughput).
>>
>> I also can not reproduce it via "ceph tell osd.NN compact", so I am not 100% sure it is the compaction.
>>
>> What do you mean with "grep for latency string"?
>>
>> Cheers
>> Boris
>>
>> On Tue, 22 Mar 2022 at 15:53, Konstantin Shalygin <k0...@k0ste.ru> wrote:
>>
>> 180PG per OSD is usually overhead, also 40k obj per PG is not much, but I don't think this will work without a block.db NVMe. I think your "wrongly marked out" events occur at the time of rocksdb compaction.
With default log settings > you > > > > can try to grep 'latency' strings > > > > > > Also, https://tracker.ceph.com/issues/50297 > > > > > > > > k > > > > Sent from my iPhone > > > > > > On 22 Mar 2022, at 14:29, Boris Behrens wrote: > > > > > > * the 8TB disks hold around 80-90 PGs (16TB around 160-180) > > > > * per PG we've around 40k objects 170m objects in 1.2PiB of storage > > > > > > > > > > -- > > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im > > groüen Saal. > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > -- > > This message is confidential and is for the sole use of the intended > > recipient(s). It may also be privileged or otherwise protected by > copyright > > or other legal rules. If you have received it by mistake please let us > know > > by reply email and delete it from your system. It is prohibited to copy > > this message or disclose its content to anyone. Any confidentiality or > > privilege is not waived or lost by any mistaken delivery or unauthorized > > disclosure of the message. All messages sent to and from Agoda may be > > monitored to ensure compliance with company policies, to protect the > > company's interests and to remove potential malware. Electronic messages > > may be intercepted, amended, lost or deleted, or contain viruses. > > > > > -- > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im > groüen Saal. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___
[ceph-users] Re: How to clear "Too many repaired reads on 1 OSDs" on pacific
On 28/02/2022 20:54, Sascha Vogt wrote: Is there a way to clear the error counter on pacific? If so, how? No, not anymore. See https://tracker.ceph.com/issues/54182 Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Hey Stefan, thanks for getting back to me! On 10/02/2022 10:05, Stefan Schueffler wrote: since my last mail in December, we changed our ceph setup like this: we added one SSD OSD on each ceph host (which were pure HDD before). Then, we moved the problematic pool "de-dus5.rgw.buckets.index" to those dedicated SSDs (by adding a corresponding crush rule). Since then, no further PG corruptions have occurred. This has a two-sided result: on the one side, we no longer observe the problematic behavior; on the other side, it means that with just spinning HDDs something is buggy in ceph. Even if the HDDs cannot fulfill the IO requirements, that should not lead to data/PG corruption... And, just a blind guess, we only have a few IO requests per second in our RGW gateway - even with spinning HDDs there should not be a problem storing / updating the index pool. I would guess that it correlates with our setup having 7001 shards in the problematic bucket, and the implementation of the "multisite" feature, which will do 7001 "status" requests per second to check and synchronize between the different rgw sites. And _this_ amount of random IO cannot be satisfied by HDDs... Anyway, it should not lead to corrupted PGs. We also have a multi-site setup with one HDD-only cluster and one cluster (primary) with NVMe SSDs for the OSD journaling. There are more inconsistencies on the HDD-only cluster, but we do observe those on the other cluster as well. If you follow the issue at https://tracker.ceph.com/issues/53663 there is even another user (Dieter Roels) observing this issue now. He is talking about RADOSGW crashes potentially causing the inconsistencies. We already guessed it could be rolling restarts. But we cannot put our finger on it yet. And yes, no amount of IO contention should ever cause data corruption. In this case I believe there might be a correlation to the multisite feature hitting OMAP and stored metadata much harder than regular RADOSGW usage does. And if there is a race condition or a missing lock / semaphore or something along this line, this certainly is affected by the latency of the underlying storage. Could you maybe manually trigger a deep-scrub on all your OSDs, just to see if that does anything? Thanks again for keeping in touch! Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
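For anyone wanting to replicate the change Stefan describes (moving the bucket index pool onto SSD-only OSDs), a rough sketch with a generic rule name - not the exact commands used in his cluster:

  # Create a replicated CRUSH rule restricted to the ssd device class
  ceph osd crush rule create-replicated rgw-index-ssd default host ssd
  # Point the bucket index pool at that rule (pool name as in the thread)
  ceph osd pool set de-dus5.rgw.buckets.index crush_rule rgw-index-ssd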
[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Hey there again, there now was a question from Neha Ojha in https://tracker.ceph.com/issues/53663 about providing OSD debug logs for a manual deep-scrub on (inconsistent) PGs. I did provide the logs of two of those deep-scrubs via ceph-post-file already. But since data inconsistencies are the worse of bugs and adding some unpredictability to their occurrence we likely need more evidence to have a chance to narrow this down. And since you seem to observe something similar, could you gather and post debug info about them to the ticket as well maybe? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Hello Ceph-Users! On 22/12/2021 00:38, Stefan Schueffler wrote: The other Problem, regarding the OSD scrub errors, we have this: ceph health detail shows „PG_DAMAGED: Possible data damage: x pgs inconsistent.“ Every now and then new pgs get inconsistent. All inconsistent pgs belong to the buckets-index-pool de-dus5.rgw.buckets.index ceph health detail pg 136.1 is active+clean+inconsistent, acting [8,3,0] rados -p de-dus5.rgw.buckets.index list-inconsistent-obj 136.1 No scrub information available for pg 136.1 error 2: (2) No such file or directory rados list-inconsistent-obj 136.1 No scrub information available for pg 136.1 error 2: (2) No such file or directory ceph pg deep-scrub 136.1 instructing pg 136.1 on osd.8 to deep-scrub … until now nothing changed, the list-inconsistent-obj does not show any information (did i miss some cli arguments?) Ususally, we simply do a ceph pg repair 136.1 which most of the time silently does whatever it is supposed to do, and the error disappears. Shortly after, it reappears at random, with some other (or the same) pg out of the rgw.buckets.index - pool… Strange you don't see any actual inconsistent objects ... 1) For me it's usually looking at which pool actually has inconsistencies via e.g. : $ for pool in $(rados lspools); do echo "${pool} $(rados list-inconsistent-pg ${pool})"; done device_health_metrics [] .rgw.root [] zone.rgw.control [] zone.rgw.meta [] zone.rgw.log ["5.3","5.5","5.a","5.b","5.10","5.11","5.19","5.1a","5.1d","5.1e"] zone.rgw.otp [] zone.rgw.buckets.index ["7.4","7.5","7.6","7.9","7.b","7.11","7.13","7.14","7.18","7.1e"] zone.rgw.buckets.data [] zone.rgw.buckets.non-ec [] (This is from now) and you can see how only metadata pools are actually affected. 2) I then simply looped over the pgs with "rados list-inconsistent-obj $pg" and this is the object.name, errors and last_reqid: "data_log.14","omap_digest_mismatch","client.4349063.0:12045734" "data_log.59","omap_digest_mismatch","client.4364800.0:11773451" "data_log.30","omap_digest_mismatch","client.4349063.0:10935030" "data_log.42","omap_digest_mismatch","client.4348139.0:112695680" "data_log.63","omap_digest_mismatch","client.4348139.0:116876563" "data_log.44","omap_digest_mismatch","client.4349063.0:11358410" "data_log.11","omap_digest_mismatch","client.4349063.0:10259566" "data_log.61","omap_digest_mismatch","client.4349063.0:10259594" "data_log.28","omap_digest_mismatch","client.4349063.0:11358396" "data_log.39","omap_digest_mismatch","client.4349063.0:11364174" "data_log.55","omap_digest_mismatch","client.4349063.0:11358415" "data_log.15","omap_digest_mismatch","client.4364800.0:9518143" "data_log.27","omap_digest_mismatch","client.4349063.0:11473205" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.6","omap_digest_mismatch","client.4349063.0:11274164" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2217176.214.1","omap_digest_mismatch","client.4349063.0:12168097" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2217176.214.10","omap_digest_mismatch","client.4348139.0:112993744" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2202949.678.0","omap_digest_mismatch","client.4349063.0:10289913" ".dir.9cba42a3-dd1c-46d4-bdd2-ef634d12c0a5.56337947.1562","omap_digest_mismatch","client.4364800.0:10934595" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.9","omap_digest_mismatch","client.4349063.0:10431941" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.0","omap_digest_mismatch","client.4349063.0:10431932" 
".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2202949.678.10","omap_digest_mismatch","client.4349063.0:10460106" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.1163207.114.8","omap_digest_mismatch","client.4349063.0:11696943" ".dir.06f9b7c7-6326-4a41-9115-d4d092cf74ce.2217176.214.0","omap_digest_mismatch","client.4349063.0:9845513" ".dir.9cba42a3-dd1c-46d4-bdd2-ef634d12c0a5.61963196.333.1","omap_digest_mismatch","client.4364800.0:9593089" As you can see, it's always some omap data that suffers from inconsistencies. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
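Roughly how the list above was produced, in case someone wants to compare on their own cluster (a sketch; the exact jq filter is an assumption, but the field names follow the list-inconsistent-obj JSON format shown further down in this thread):

  # Dump object name, error types and last_reqid for each inconsistent PG
  for pg in 5.3 5.5 7.4 7.5; do
    rados list-inconsistent-obj "$pg" | \
      jq -r '.inconsistents[] | [.object.name, (.errors | join(",")), .selected_object_info.last_reqid] | @csv'
  done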
[ceph-users] pg_autoscaler using uncompressed bytes as pool current total_bytes triggering false POOL_TARGET_SIZE_BYTES_OVERCOMMITTED warnings?
raw_used=10035209699328.0, target_bytes=5497558138880 raw_used_rate=3.0
pool_id 28 - actual_raw_used=0.0, target_bytes=0 raw_used_rate=3.0
--- cut ---
All values but those of pool_id 1 (backups) make sense. For backups it's just reporting a MUCH larger actual_raw_used value than what is shown via ceph df. The only difference of that pool compared to the others is the enabled compression:
--- cut ---
# ceph osd pool get backups compression_mode
compression_mode: aggressive
--- cut ---
Apparently there already was a similar issue (https://tracker.ceph.com/issues/41567) with a resulting commit (https://github.com/ceph/ceph/commit/dd6e752826bc762095be4d276e3c1b8d31293eb0) changing the source of "pool_logical_used" from "bytes_used" to the "stored" field. But how does that take compressed (away) data into account? Does "bytes_used" count all the "stored" bytes, summing up all uncompressed bytes for pools with compression? This surely must be a bug then, as those bytes are not really "actual_raw_used". I was about to raise a bug, but I wanted to ask here on the ML first whether I misunderstood the mechanisms at play here. Thanks and with kind regards, Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Thanks for your response Stefan, On 21/12/2021 10:07, Stefan Schueffler wrote: Even without adding a lot of rgw objects (only a few PUTs per minute), we have thousands and thousands of rgw bucket.sync log entries in the rgw log pool (this seems to be a separate problem), and as such we accumulate "large omap objects" over time. Since you are doing RADOSGW as well, those OMAP objects are usually bucket index files (https://docs.ceph.com/en/latest/rados/operations/health-checks/#large-omap-objects). Since there is no dynamic resharding (https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#rgw-dynamic-bucket-index-resharding) until Quincy (https://tracker.ceph.com/projects/rgw/issues?utf8=%E2%9C%93_filter=1%5B%5D=cf_3%5Bcf_3%5D=%3D%5Bcf_3%5D%5B%5D=multisite-reshard%5B%5D=%5B%5D=project%5B%5D=tracker%5B%5D=status%5B%5D=priority%5B%5D=subject%5B%5D=assigned_to%5B%5D=updated_on%5B%5D=category%5B%5D=fixed_version%5B%5D=cf_3_by=%5B%5D=) you need to have enough shards created for each bucket by default. At about 200k objects (~ keys) per shard you should otherwise receive this warning (used to be 2 million, see https://github.com/ceph/ceph/pull/29175/files). we also face the same or at least a very similar problem. We are running pacific (16.2.6 and 16.2.7, upgraded from 16.2.x to y to z) on both sides of the rgw multisite. In our case, the scrub errors occur on the secondary side only Regarding your scrub errors: do you still have those coming up at random? Could you check with "list-inconsistent-obj" whether yours are within the OMAP data and in the metadata pools only? Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
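To see how close each bucket is to the per-shard object limit, something along these lines should work (a sketch; the bucket name and shard count are placeholders, and manual resharding is not safe in a multisite setup on these releases, as noted elsewhere in this list):

  # Shows objects per shard and the fill status against the warning threshold
  radosgw-admin bucket limit check

  # Manual reshard of a single bucket (example shard count)
  radosgw-admin bucket reshard --bucket=mybucket --num-shards=101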
[ceph-users] Re: Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Hello Eugen, On 20/12/2021 22:02, Eugen Block wrote: you wrote that this cluster was initially installed with Octopus, so no upgrade ceph wise? Are all RGW daemons on the exact same ceph (minor) versions? I remember one of our customers reporting inconsistent objects on a regular basis although no hardware issues were detectable. They replicate between two sites, too. A couple of months ago both sites were updated to the same exact ceph minor version (also Octopus), they haven't faced inconsistencies since then. I don't have details about the ceph version(s) though, only that both sites were initially installed with Octopus. Maybe it's worth checking your versions? Yes, everything has the same version: { [...] "overall": { "ceph version 15.2.15 (2dfb18841cfecc2f7eb7eb2afd65986ca4d95985) octopus (stable)": 34 } } I just observed another 3 scrub errors. Strangely they never seem to occur on the same pgs again. I shall be running another deep scrub on those OSDs to narrow this down. But I am somewhat suspecting this to be a potential issue with the OMAP validation part of the scrubbing. For RADOSGW there are large OMAP structures with lots of movement. And the issues are only with the metadata pools. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Random scrub errors (omap_digest_mismatch) on pgs of RADOSGW metadata pools (bug 53663)
Hello Ceph-Users, for about 3 weeks now I see batches of scrub errors on a 4 node Octopus cluster:

# ceph health detail
HEALTH_ERR 7 scrub errors; Possible data damage: 6 pgs inconsistent
[ERR] OSD_SCRUB_ERRORS: 7 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 6 pgs inconsistent
    pg 5.3 is active+clean+inconsistent, acting [9,12,6]
    pg 5.4 is active+clean+inconsistent, acting [15,17,18]
    pg 7.2 is active+clean+inconsistent, acting [13,15,10]
    pg 7.9 is active+clean+inconsistent, acting [5,19,4]
    pg 7.e is active+clean+inconsistent, acting [1,15,20]
    pg 7.18 is active+clean+inconsistent, acting [5,10,0]

this cluster only serves RADOSGW and it's a multisite master. I already found another thread (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/LXMQSRNSCPS5YJMFXIS3K5NMROHZKDJU/), but with no recent comments about such an issue. In my case I am still seeing more scrub errors every few days. All those inconsistencies are "omap_digest_mismatch" in the "zone.rgw.log" or "zone.rgw.buckets.index" pool and are spread all across nodes and OSDs. I already raised a bug ticket (https://tracker.ceph.com/issues/53663), but am wondering if any of you have ever observed something similar? Traffic to and from the object storage seems totally fine and I can even run a manual deep-scrub with no errors and then receive 3-4 errors the next day. Is there anything I could look into / collect when the next inconsistency occurs? Could there be any misconfiguration causing this? Thanks and with kind regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus
Hello Tomasz, I observe a strange accumulation of inconsistencies for an RGW-only (+multisite) setup, with errors just like those you reported. I collected some info and raised a bug ticket: https://tracker.ceph.com/issues/53663 Two more inconsistencies have just shown up hours after repairing the other, adding to the theory of something really odd going on. Did you upgrade to Octopus in the end then? Any more issues with such inconsistencies on your side Tomasz? Regards Christian On 20/10/2021 10:33, Tomasz Płaza wrote: As the upgrade process states, rgw are the last one to be upgraded, so they are still on nautilus (centos7). Those logs showed up after upgrade of the first osd host. It is a multisite setup so I am a little afraid of upgrading rgw now. Etienne: Sorry for answering in this thread, but somehow I do not get messages directed only to ceph-users list. I did "rados list-inconsistent-pg" and got many entries like: { "object": { "name": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7", "nspace": "", "locator": "", "snap": "head", "version": 82561410 }, "errors": [ "omap_digest_mismatch" ], "union_shard_errors": [], "selected_object_info": { "oid": { "oid": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7", "key": "", "snapid": -2, "hash": 3316145293, "max": 0, "pool": 230, "namespace": "" }, "version": "107760'82561410", "prior_version": "106468'82554595", "last_reqid": "client.392341383.0:2027385771", "user_version": 82561410, "size": 0, "mtime": "2021-10-19T16:32:25.699134+0200", "local_mtime": "2021-10-19T16:32:25.699073+0200", "lost": 0, "flags": [ "dirty", "omap", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x", "omap_digest": "0x", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 56, "primary": true, "errors": [], "size": 0, "omap_digest": "0xf4cf0e1c", "data_digest": "0x" }, { "osd": 58, "primary": false, "errors": [], "size": 0, "omap_digest": "0xf4cf0e1c", "data_digest": "0x" }, { "osd": 62, "primary": false, "errors": [], "size": 0, "omap_digest": "0x4bd5703a", "data_digest": "0x" } ] } ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [EXTERNAL] Re: Why you might want packages not containers for Ceph deployments
I think Marc uses containers - but they've chosen Apache Mesos as orchestrator and ceph-adm doesn't work with that. Currently essentially two ceph container orchestrators exist - rook which is a ceph orch or kubernetes and ceph-adm which is an orchestrator expecting docker or podman Admittedly I don't fully understand the nuanced differences between rook (which can be added as a module to the ceph orchestrator cli) and cephadm (no idea how this is related to the ceph orch cli) - they kinda seem to do the same thing but slightly differently (or not?). On Fri, 19 Nov 2021 at 16:51, Tony Liu wrote: > Instead of complaining, take some time to learn more about container would > help. > > Tony > > From: Marc > Sent: November 18, 2021 10:50 AM > To: Pickett, Neale T; Hans van den Bogert; ceph-users@ceph.io > Subject: [ceph-users] Re: [EXTERNAL] Re: Why you might want packages not > containers for Ceph deployments > > > We also use containers for ceph and love it. If for some reason we > > couldn't run ceph this way any longer, we would probably migrate > > everything to a different solution. We are absolutely committed to > > containerization. > > I wonder if you are really using containers. Are you not just using > ceph-adm? If you would be using containers you would have selected your OC > already, and would be pissed about how the current containers are being > developed and have to use a 2nd system. > > > > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Question if WAL/block.db partition will benefit us
In addition to what the others said - generally there is little point in splitting DB and WAL into separate partitions; just stick to one block.db partition for both. What model are your SSDs and how well do they handle small direct writes? Because that's what you'll be getting on them, and the wrong type of SSD can make things worse rather than better. On Tue, 9 Nov 2021 at 00:08, Boris Behrens wrote:
> Hi,
> we run a larger octopus s3 cluster with only rotating disks.
> 1.3 PiB with 177 OSDs, some with an SSD block.db and some without.
>
> We have a ton of spare 2TB disks and we just wondered if we can bring them to good use.
> For every 10 spinning disks we could add one 2TB SSD and we would create two partitions per OSD (130GB for block.db and 20GB for block.wal). This would leave some empty space on the SSD for wear leveling.
>
> The question now is: would we benefit from this? Most of the data that is written to the cluster is very large (50GB and above). This would take a lot of work to restructure the cluster and also two other clusters.
>
> And does it make a difference to have only a block.db partition or a block.db and a block.wal partition?
>
> Cheers
> Boris
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [Ceph] Recovery is very Slow
Yes, just expose each disk as an individual OSD and you'll already be better off. Depending what type of SSD they are - if they can sustain high random write IOPS you may even want to consider partitioning each disk and create 2 OSDs per SSD to make better use of the available IO capacity. For all-flash storage CPU utilization is also a factor - generally fewer cores with a higher clock speed would be preferred over a cpu with more cores but lower clock speeds in such a setup. On Thu, 28 Oct 2021 at 21:25, Lokendra Rathour wrote: > > Hey Janne, > Thanks for the feedback, we only wanted to have huge space to test more with > more data. do you advise some other way to plan this out? > So I have 15 disks with 1 TB each. Creating multiple OSD would help or > please advise. > > thanks, > Lokendra > > > On Thu, Oct 28, 2021 at 1:52 PM Janne Johansson wrote: >> >> Den tors 28 okt. 2021 kl 10:18 skrev Lokendra Rathour >> : >> > >> > Hi Christian, >> > Thanks for the update. >> > I have 5 SSD on each node i.e. a total of 15 SSD using which I have >> > created this RAID 0 Disk, which in Ceph becomes three OSD. Each OSD with >> > around 4.4 TB of disk. and in total it is coming around 13.3 TB. >> > Do you feel local RAID is an issue here? Keeping independent disks can >> > help recovery fast or increase the performance? please advice. >> >> >> That is a very poor way to set up ceph storage. >> >> >> -- >> May the most significant bit of your life be positive. > > > > -- > ~ Lokendra > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Open discussing: Designing 50GB/s CephFS or S3 ceph cluster
- What is the expected file/object size distribution and count? - Is it write-once or modify-often data? - What's your overall required storage capacity? - 18 OSDs per WAL/DB drive seems a lot - the recommendation is ~6-8 - With 12TB OSDs the recommended WAL/DB size is 120-480GB (1-4%) per OSD to avoid spillover - if you go RGW then you may want to aim more towards 4%, since RGW can use quite a bit of OMAP data (especially when you store many small objects). Not sure about CephFS - So you may want to look at 4x NVMe of probably 3.2TB instead of 1.6TB - Rule of thumb is 1 thread per HDD OSD - so if you want to give yourself some extra wiggle room a 7402 might be better, especially since EC is a bit heavier on CPU - Running EC 8+3 with failure domain host means you should have at least 12 nodes, which means you'd need to push 4GB/sec/node. That seems theoretically possible but is quite close to the network interface capacity, and whether you could actually push 4GB/sec into a node in this config I don't know. Overall, 12 nodes seems like the minimum (a rough sketch of the matching EC profile and pool setup follows after the quoted message) - With 12 nodes you have a raw storage capacity of around 5PB - assuming you don't run your cluster more than 80% full, EC 8+3 gives a maximum of about 3PB usable data capacity (again assuming your objects are large enough not to cause significant space amplification wrt. the bluestore min alloc size) - You will probably run more nodes than that, so if you don't need the actual capacity then consider going replicated instead, which generally performs better than EC On Fri, 22 Oct 2021 at 05:24, huxia...@horebdata.cn wrote: > Dear Cephers, > > I am thinking of designing a cephfs or S3 cluster, with a target to > achieve a minimum of 50GB/s (write) bandwidth. For each node, I prefer 4U > 36x 3.5" Supermicro server with 36x 12TB 7200 RPM HDDs, 2x Intel P4610 > 1.6TB NVMe SSD as DB/WAL, a single CPU socket with AMD 7302, and 256GB DDR4 > memory. Each node comes with 2x 25Gb networking in mode 4 bonded. 8+3 EC > will be used. > > My questions are the following: > > 1 How many nodes should be deployed in order to achieve a minimum of > 50GB/s, if possible, with the above hardware setting? > > 2 How many CephFS MDS are required? (suppose 1MB request size), and how > many clients are needed to reach a total of 50GB/s? > > 3 From the perspective of getting the maximum bandwidth, which one > should I choose, CephFS or Ceph S3? > > Any comments, suggestions, or improvement tips are warmly welcome > > best regards, > > Samuel > > > > huxia...@horebdata.cn > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
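If you do end up going the 8+3 route, the profile and pool creation would look roughly like this (pool name, PG numbers and the allow_ec_overwrites step are illustrative only - size the PG count for your actual OSD count):

  ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
  ceph osd pool create cephfs_data 2048 2048 erasure ec-8-3
  ceph osd pool set cephfs_data allow_ec_overwrites true   # needed if CephFS/RBD writes to this pool

With crush-failure-domain=host such a pool cannot even place all 11 shards with fewer than 11 hosts, which is another way of arriving at the 12-node minimum above.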
[ceph-users] Re: Metrics for object sizes
On 23/04/2021 03:53, Szabo, Istvan (Agoda) wrote: Objects inside RGW buckets - like in Couchbase software, they have their own metrics and have this information. Not as detailed as you would like, but how about using the bucket stats for bucket size and number of objects? $ radosgw-admin bucket stats --bucket mybucket Doing bucket_size / number_of_objects gives you an average object size per bucket, and that certainly is an indication of buckets with rather small objects. Regards Christian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
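If you want that number across all buckets in one go, something along these lines should work - treat it as a sketch, since the exact layout of the usage section in the JSON can differ a bit between releases:

  radosgw-admin bucket stats | jq -r '.[]
      | select((.usage."rgw.main".num_objects // 0) > 0)
      | "\(.bucket) \(.usage."rgw.main".size_actual / .usage."rgw.main".num_objects)"'

That prints the average object size in bytes per bucket, which makes it easy to spot the ones dominated by small objects.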
[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance
Hm, generally Ceph is mostly latency-sensitive, which translates more into IOPS limits than into bandwidth limits. In a single-threaded write scenario your throughput is limited by the latency of the write path, which is generally network + OSD write path + disk. People have managed to get write latencies under 1ms on all-flash setups, but around 0.8ms seems to be the best you can achieve, which puts an upper limit of roughly 1200 IOPS on a single-threaded client doing direct synchronized IO. But there shouldn't really be much in the path that artificially limits bandwidth. Bluestore does deferred writes only for small writes - those are the writes that hit the WAL; writes larger than that hit the backing store (i.e. the HDD) directly. I think the default cutoff is 32KB, but I could be wrong. Obviously even for small writes the WAL eventually has to be flushed, so your longer-term performance is still bounded by your HDD speed. That might be why throughput suffers at larger block sizes, since those writes hit the drives directly. It's been pointed out in the past that disabling the HDD write cache can actually improve latency quite substantially (e.g. https://ceph-users.ceph.narkive.com/UU9QMu9W/disabling-write-cache-on-sata-hdds-reduces-write-latency-7-times) - might be worth a try. On Wed, 6 Oct 2021 at 10:07, Zakhar Kirpichenko wrote: > I'm not sure, fio might be showing some bogus values in the summary, I'll > check the readings again tomorrow. > > Another thing I noticed is that writes seem bandwidth-limited and don't > scale well with block size and/or number of threads. I.e. one client > writes at about the same speed regardless of the benchmark settings. A > person on reddit, where I asked this question as well, suggested that in a > replicated pool writes and reads are handled by the primary PG, which would > explain this write bandwidth limit. > > /Z > > On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, > wrote: > >> Maybe some info is missing but 7k write IOPS at 4k block size seems fairly >> decent (as you also state) - the bandwidth automatically follows from that, >> so not sure what you're expecting? >> I am a bit puzzled though - by my math 7k IOPS at 4k should only be >> 27MiB/sec - not sure how the 120MiB/sec was achieved >> The read benchmark seems in line with 13k IOPS at 4k making around >> 52MiB/sec bandwidth which again is expected. >> >> >> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko wrote: >> >>> Hi, >>> >>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware, >>> and >>> its performance is kind of disappointing. I would very much appreciate >>> advice and/or pointers :-) >>> >>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with: >>> >>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs >>> 384 GB RAM >>> 2 x boot drives >>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL) >>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier) >>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier) >>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches >>> >>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel, >>> apparmor is disabled, energy-saving features are disabled. The network >>> between the CEPH nodes is 40G, CEPH access network is 40G, the average >>> latencies are < 0.15 ms. I've personally tested the network for >>> throughput, >>> latency and loss, and can tell that it's operating as expected and >>> doesn't >>> exhibit any issues at idle or under load. 
>>> >>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2 >>> smaller NVME drives in each node used as DB/WAL and each HDD allocated . >>> ceph osd tree output: >>> >>> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF >>> -1 288.37488 root default >>> -13 288.37488 datacenter ste >>> -14 288.37488 rack rack01 >>> -7 96.12495 host ceph01 >>> 0 hdd 9.38680 osd.0 up 1.0 1.0 >>> 1 hdd 9.38680 osd.1 up 1.0 1.0 >>> 2 hdd 9.38680 osd.2 up 1.0 1.0 >>> 3 hdd 9.38680 osd.3 up 1.0 1.0 >>> 4 hdd 9.38680 osd.4 up 1.0 1.0 >>> 5 hdd 9.38680 osd.5 up 1.0 1.0 >>> 6 hdd 9.38680 osd.6
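Two things that are easy to check on a setup like this - the device name below is a placeholder, and whether hdparm or sdparm applies depends on SATA vs SAS:

  # what the deferred-write cutoff actually is on your version
  ceph config get osd bluestore_prefer_deferred_size_hdd

  # current volatile write cache state of one of the HDDs
  smartctl -g wcache /dev/sdc

  # disable the write cache - hdparm for SATA, sdparm for SAS drives
  hdparm -W 0 /dev/sdc
  sdparm --set WCE=0 --save /dev/sdc

If disabling the cache helps, make sure the setting is applied persistently (udev rule or similar), since some drives re-enable it after a power cycle.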
[ceph-users] Re: CEPH 16.2.x: disappointing I/O performance
Maybe some info is missing but 7k write IOPS at 4k block size seems fairly decent (as you also state) - the bandwidth automatically follows from that, so not sure what you're expecting? I am a bit puzzled though - by my math 7k IOPS at 4k should only be 27MiB/sec - not sure how the 120MiB/sec was achieved. The read benchmark seems in line with 13k IOPS at 4k making around 52MiB/sec bandwidth, which again is expected. On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko wrote: > Hi, > > I built a CEPH 16.2.x cluster with relatively fast and modern hardware, and > its performance is kind of disappointing. I would very much appreciate > advice and/or pointers :-) > > The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with: > > 2 x Intel(R) Xeon(R) Gold 5220R CPUs > 384 GB RAM > 2 x boot drives > 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL) > 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier) > 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier) > 2 x Intel XL710 NICs connected to a pair of 40/100GE switches > > All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel, > apparmor is disabled, energy-saving features are disabled. The network > between the CEPH nodes is 40G, CEPH access network is 40G, the average > latencies are < 0.15 ms. I've personally tested the network for throughput, > latency and loss, and can tell that it's operating as expected and doesn't > exhibit any issues at idle or under load. > > The CEPH cluster is set up with 2 storage classes, NVME and HDD, with 2 > smaller NVME drives in each node used as DB/WAL and each HDD allocated . > ceph osd tree output: > > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF > -1 288.37488 root default > -13 288.37488 datacenter ste > -14 288.37488 rack rack01 > -7 96.12495 host ceph01 > 0 hdd 9.38680 osd.0 up 1.0 1.0 > 1 hdd 9.38680 osd.1 up 1.0 1.0 > 2 hdd 9.38680 osd.2 up 1.0 1.0 > 3 hdd 9.38680 osd.3 up 1.0 1.0 > 4 hdd 9.38680 osd.4 up 1.0 1.0 > 5 hdd 9.38680 osd.5 up 1.0 1.0 > 6 hdd 9.38680 osd.6 up 1.0 1.0 > 7 hdd 9.38680 osd.7 up 1.0 1.0 > 8 hdd 9.38680 osd.8 up 1.0 1.0 > 9 nvme 5.82190 osd.9 up 1.0 1.0 > 10 nvme 5.82190 osd.10 up 1.0 1.0 > -10 96.12495 host ceph02 > 11 hdd 9.38680 osd.11 up 1.0 1.0 > 12 hdd 9.38680 osd.12 up 1.0 1.0 > 13 hdd 9.38680 osd.13 up 1.0 1.0 > 14 hdd 9.38680 osd.14 up 1.0 1.0 > 15 hdd 9.38680 osd.15 up 1.0 1.0 > 16 hdd 9.38680 osd.16 up 1.0 1.0 > 17 hdd 9.38680 osd.17 up 1.0 1.0 > 18 hdd 9.38680 osd.18 up 1.0 1.0 > 19 hdd 9.38680 osd.19 up 1.0 1.0 > 20 nvme 5.82190 osd.20 up 1.0 1.0 > 21 nvme 5.82190 osd.21 up 1.0 1.0 > -3 96.12495 host ceph03 > 22 hdd 9.38680 osd.22 up 1.0 1.0 > 23 hdd 9.38680 osd.23 up 1.0 1.0 > 24 hdd 9.38680 osd.24 up 1.0 1.0 > 25 hdd 9.38680 osd.25 up 1.0 1.0 > 26 hdd 9.38680 osd.26 up 1.0 1.0 > 27 hdd 9.38680 osd.27 up 1.0 1.0 > 28 hdd 9.38680 osd.28 up 1.0 1.0 > 29 hdd 9.38680 osd.29 up 1.0 1.0 > 30 hdd 9.38680 osd.30 up 1.0 1.0 > 31 nvme 5.82190 osd.31 up 1.0 1.0 > 32 nvme 5.82190 osd.32 up 1.0 1.0 > > ceph df: > > --- RAW STORAGE --- > CLASS SIZE AVAIL USED RAW USED %RAW USED > hdd 253 TiB 241 TiB 13 TiB 13 TiB 5.00 > nvme 35 TiB 35 TiB 82 GiB 82 GiB 0.23 > TOTAL 288 TiB 276 TiB 13 TiB 13 TiB 4.42 > > --- POOLS --- > POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL > images 12 256 24 GiB 3.15k 73 GiB 0.03 76 TiB > volumes 13 256 839 GiB 232.16k 2.5
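For reference, a quick way to sanity-check 4k write numbers outside of fio is a rados bench against a scratch pool (pool name is a placeholder; -t 1 forces a single outstanding op so you see the raw latency limit):

  ceph osd pool create benchpool 32
  rados bench -p benchpool 30 write -b 4096 -t 1

7000 IOPS x 4 KiB works out to roughly 27 MiB/s, so if a tool reports 120 MiB/s at 4k it is either using a larger effective block size or running more parallelism than the settings suggest. Remember to delete the scratch pool afterwards.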
[ceph-users] Re: Erasure coded pool chunk count k
A couple of notes on this: Ideally you should have at least 2 more failure domains than your base resilience (K+M for EC, or size=N for replicated). Reasoning: maintenance needs to be performed, so chances are that every now and then you take a host down for a few hours or possibly days to do an upgrade, fix something broken, etc. This means you're running in a degraded state, since only K+M-1 shards are available. While in that state a drive in another host dies on you. Now recovery for that is blocked because you have insufficient failure domains available, and things start getting a bit uncomfortable depending on how large M is. Or a whole host dies on you in that state ... Generally, planning your cluster resources right along the fault lines is going to bite you and cause high levels of stress and anxiety. I know - budgets have a limit - but still, there is plenty of history on this list of desperate calls for help simply because clusters were only planned for the happy-day case. Unlike with replicated pools, you cannot change the profile of an EC pool after it has been created - so deciding to change the EC profile means creating a new pool and migrating the data. Just something to keep in mind (a short example follows below the quoted messages). On Tue, 5 Oct 2021 at 14:58, Anthony D'Atri wrote: > > The larger the value of K relative to M, the more efficient the raw :: > usable ratio ends up. > > There are tradeoffs and caveats. Here are some of my thoughts; if I'm > off-base here, I welcome enlightenment. > > > > When possible, it's ideal to have at least K+M failure domains — often > racks, sometimes hosts, chassis, etc. Thus smaller clusters, say with 5-6 > nodes, aren't good fits for larger sums of K+M if your data is valuable. > > Larger sums of K+M also mean that more drives will be touched by each read > or write, especially during recovery. This could be a factor if one is > IOPS-limited. Same with scrubs. > > When using a pool for, e.g., RGW buckets, larger sums of K+M may result in > greater overhead when storing small objects, since Ceph / RGW only AIUI > writes full stripes. So say you have an EC pool of 17,3 on drives with the > default 4kB bluestore_min_alloc size. A 1kB S3 object would thus allocate > 17+3=20 x 4kB == 80kB of storage, which is 7900% overhead. This is an > extreme example to illustrate the point. > > Larger sums of K+M may present more IOPS to each storage drive, dependent > on workload and the EC plugin selected. > > With larger objects (including RBD) the modulo factor is dramatically > smaller. One's use-case and dataset per-pool may thus inform the EC > profiles that make sense; workloads that are predominantly smaller objects > might opt for replication instead. > > There was a post ….. a year ago? suggesting that values with small prime > factors are advantageous, but I never saw a discussion of why that might be. > > In some cases where one might be pressured to use replication with only 2 > copies of data, a 2,2 EC profile might achieve the same efficiency with > greater safety. > > Geo / stretch clusters or ones in challenging environments are a special > case; they might choose values of M equal to or even larger than K. > > That said, I think 4,2 is a reasonable place to *start*, adjusted by one's > specific needs. You get a raw :: usable ratio of 1.5 without getting too > complicated. > > ymmv > > > > > > > > > > Hi, > > > > It depends on hardware, failure domain, use case, overhead. > > > > I don't see an easy way to choose k and m values. 
> > > > - > > Etienne Menguy > > etienne.men...@croit.io > > > > > >> On 4 Oct 2021, at 16:57, Golasowski Martin > wrote: > >> > >> Hello guys, > >> how does one estimate number of chunks for erasure coded pool ( k = ? ) > ? I see that number of m chunks determines the pool’s resiliency, however I > did not find clear guideline how to determine k. > >> > >> Red Hat states that they support only the following combinations: > >> > >> k=8, m=3 > >> k=8, m=4 > >> k=4, m=2 > >> > >> without any rationale behind them. > >> The table is taken from > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/storage_strategies_guide/erasure_code_pools > . > >> > >> Thanks! > >> > >> Regards, > >> Martin > >> > >> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
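To make the "you cannot change it later" point concrete, this is roughly what checking and replacing a layout looks like (pool and profile names are placeholders):

  # what an existing pool was created with
  ceph osd pool get mypool erasure_code_profile
  ceph osd erasure-code-profile get myprofile

  # a different k/m means a new profile and a new pool to migrate into
  ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create mypool-ec42 128 128 erasure ec-4-2

The data migration itself (copying objects over and repointing clients) is the expensive part, which is why it pays to get k and m right up front.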