[ceph-users] Re: Upgraded 16.2.14 to 16.2.15
Hi,

> 1. RocksDB options, which I provided to each mon via their configuration
> files, got overwritten during mon redeployment and I had to re-add
> mon_rocksdb_options back.

IIRC, you didn't use extra_entrypoint_args for that option but added it directly to the container's unit.run file, so it's expected that it's removed after an update. If you want it to persist across a container update, you should consider using extra_entrypoint_args:

cat mon.yaml
service_type: mon
service_name: mon
placement:
  hosts:
    - host1
    - host2
    - host3
extra_entrypoint_args:
  - '--mon-rocksdb-options=write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true,bottommost_compression=kLZ4HCCompression,max_background_jobs=4,max_subcompactions=2'

Regards,
Eugen

Zitat von Zakhar Kirpichenko:

> Hi,
>
> I have upgraded my test and production cephadm-managed clusters from
> 16.2.14 to 16.2.15. The upgrade was smooth and completed without issues.
>
> [rest of the original message trimmed]

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Help with deep scrub warnings
Dear Ceph users,

in order to reduce the deep scrub load on my cluster, I set the deep scrub interval to 2 weeks and tuned other parameters as follows:

# ceph config get osd osd_deep_scrub_interval
1209600.00
# ceph config get osd osd_scrub_sleep
0.10
# ceph config get osd osd_scrub_load_threshold
0.30
# ceph config get osd osd_deep_scrub_randomize_ratio
0.10
# ceph config get osd osd_scrub_min_interval
259200.00
# ceph config get osd osd_scrub_max_interval
1209600.00

To my admittedly limited knowledge of Ceph's deep scrub procedures, these settings should spread the deep scrub operations over two weeks instead of the default one week, lowering the scrub frequency and the related load. But I'm currently getting warnings like:

[WRN] PG_NOT_DEEP_SCRUBBED: 56 pgs not deep-scrubbed in time
    pg 3.1e1 not deep-scrubbed since 2024-02-22T00:22:55.296213+
    pg 3.1d9 not deep-scrubbed since 2024-02-20T03:41:25.461002+
    pg 3.1d5 not deep-scrubbed since 2024-02-20T09:52:57.334058+
    pg 3.1cb not deep-scrubbed since 2024-02-20T03:30:40.510979+
    . . .

I don't understand these warnings: since the deep scrub interval should be two weeks, I don't expect warnings for PGs which have been deep-scrubbed less than 14 days ago (at the moment I'm writing it's Tue Mar 5 07:39:07 UTC 2024). Moreover, I don't understand why the deep scrub of so many PGs is lagging behind. Is there something wrong with my settings?

Thanks in advance for any help,
Nicola
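The timing question above can be made concrete with a small sketch. It assumes (and this is an assumption about the mon's behavior, not taken from the thread) that the warning only fires after osd_deep_scrub_interval plus the extra slack from mon_warn_pg_not_deep_scrubbed_ratio (default 0.75); under that model, PGs deep-scrubbed 12 to 14 days ago should not warn yet, which is exactly what makes the reported warnings surprising:

```python
# Hypothetical model of when PG_NOT_DEEP_SCRUBBED fires. Assumes the mon
# warns once: now - last_deep_scrub > interval * (1 + warn_ratio),
# with mon_warn_pg_not_deep_scrubbed_ratio defaulting to 0.75.

DAY = 86400.0

def deep_scrub_warn_deadline(last_deep_scrub,
                             deep_scrub_interval=1209600.0,  # 14 days
                             warn_ratio=0.75):
    """Timestamp after which the mon would raise the warning (under this model)."""
    return last_deep_scrub + deep_scrub_interval * (1.0 + warn_ratio)

# A PG deep-scrubbed 13 days ago should not warn yet under this model:
now = 13 * DAY
print(now > deep_scrub_warn_deadline(0.0))  # False
```

If the warnings appear well before that deadline, either this model is wrong for the running release or some other interval setting is overriding the one shown above.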
[ceph-users] Upgraded 16.2.14 to 16.2.15
Hi,

I have upgraded my test and production cephadm-managed clusters from 16.2.14 to 16.2.15. The upgrade was smooth and completed without issues. There were a few things I noticed after each upgrade:

1. RocksDB options, which I provided to each mon via their configuration files, got overwritten during mon redeployment, and I had to re-add mon_rocksdb_options back.

2. The monitor debug_rocksdb option got silently reset back to the default 4/5; I had to set it back to 1/5.

3. For roughly 2 hours after the upgrade, despite the clusters being healthy and operating normally, all monitors ran manual compactions very often and wrote to disk at very high rates. For example, the production monitors' rocksdb:low0 thread wrote to store.db at ~8 GB/5 min (~96 GB/hour) on monitors without RocksDB compression, and ~1.5 GB/5 min (~18 GB/hour) on monitors with RocksDB compression. After roughly 2 hours, with no changes to the cluster, the write rates dropped to ~0.4-0.6 GB/5 min and ~120 MB/5 min respectively. The reason for the frequent manual compactions and high write rates wasn't immediately apparent.

4. Crash deployment broke the ownership of /var/lib/ceph/FSID/crash and /var/lib/ceph/FSID/crash/posted, even though I had already fixed it manually after the upgrade to 16.2.14, which had broken it as well.

5. Mgr RAM usage appears to be increasing at a slower rate than it did with 16.2.14, although it's too early to tell whether the issue with mgrs randomly consuming all RAM and getting OOM-killed has been fixed; with 16.2.14 this would normally take several days.

Overall, things look good. Thanks to the Ceph team for this release!

Zakhar
[ceph-users] [RGW] Restrict a subuser to access only one specific bucket
Hi community,

I have a user that owns some buckets. I want to create a subuser that has permission to access only one bucket. What can I do to achieve this?

Thanks
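One commonly suggested approach is an S3 bucket policy applied by the bucket owner. The sketch below only builds the policy JSON; the principal ARN format for an RGW subuser ("user/USER:SUBUSER"), the user/bucket names, and the action list are assumptions to verify against your RGW version, not something confirmed in this thread:

```python
import json

# Sketch: build an S3 bucket policy that allows one principal to use one
# bucket. The subuser principal ARN format below is an ASSUMPTION; check
# the RGW bucket-policy docs for your release before relying on it.
def one_bucket_policy(user, subuser, bucket):
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [f"arn:aws:iam:::user/{user}:{subuser}"]},
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",       # the bucket itself (for ListBucket)
                f"arn:aws:s3:::{bucket}/*",     # the objects inside it
            ],
        }],
    }
    return json.dumps(policy, indent=2)

print(one_bucket_policy("alice", "app1", "reports"))
```

The resulting JSON could then be attached with the owner's credentials, e.g. via `s3cmd setpolicy`. Note that bucket policies restrict what the policy grants on that bucket; the subuser's own key permissions still apply on top.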
[ceph-users] debian-reef_OLD?
I likely missed an announcement; if so, please forgive me. I'm seeing some failures when running apt on a cluster of Ubuntu machines. It looks like a directory has changed on https://download.ceph.com/

Was: debian-reef/
Now appears to be: debian-reef_OLD/

Was reef pulled?
[ceph-users] Re: OSDs not balanced
The balancer will operate on all pools unless otherwise specified.

Josh

On Mon, Mar 4, 2024 at 1:12 PM Cedric wrote:
>
> Does the balancer have pools enabled? "ceph balancer pool ls"
>
> Actually, I am wondering whether the balancer does anything when no pools
> are added.
>
> On Mon, Mar 4, 2024, 11:30 Ml Ml wrote:
> > Hello,
> >
> > i wonder why my autobalancer is not working here:
> >
> > root@ceph01:~# ceph -s
> >   cluster:
> >     id:     5436dd5d-83d4-4dc8-a93b-60ab5db145df
> >     health: HEALTH_ERR
> >             1 backfillfull osd(s)
> >             1 full osd(s)
> >             1 nearfull osd(s)
> >             4 pool(s) full
> >
> > => osd.17 was too full (92% or something like that)
> >
> > [quoted "ceph osd df tree" output trimmed]
[ceph-users] Re: OSDs not balanced
Does the balancer have pools enabled? "ceph balancer pool ls"

Actually, I am wondering whether the balancer does anything when no pools are added.

On Mon, Mar 4, 2024, 11:30 Ml Ml wrote:
> Hello,
>
> i wonder why my autobalancer is not working here:
>
> root@ceph01:~# ceph -s
>   cluster:
>     id:     5436dd5d-83d4-4dc8-a93b-60ab5db145df
>     health: HEALTH_ERR
>             1 backfillfull osd(s)
>             1 full osd(s)
>             1 nearfull osd(s)
>             4 pool(s) full
>
> => osd.17 was too full (92% or something like that)
>
> [quoted "ceph osd df tree" output trimmed]
[ceph-users] Re: Performance improvement suggestion
On 3/4/24 08:40, Maged Mokhtar wrote:
> [earlier discussion of the proposed "fast write" option trimmed]
>
> i think this is something the rados devs need to say. it does sound worth
> investigating. it is not just for cases with db compaction but more
> importantly the normal (happy) io path, as it will have the most impact.

Typically, an L0->L1 compaction has two primary effects:

1) It causes large read/write IO traffic to the disk, potentially impacting other IO taking place if the disk is already saturated.

2) It blocks memtable flushes until the compaction finishes. This means that more and more data will accumulate in the memtables/WAL, which can trigger throttling and eventually stalls if you run out of buffer space.

By default, we allow up to 1 GB of writes to the WAL/memtables before writes are fully stalled, but RocksDB will typically throttle writes before you get to that point. It's possible a larger buffer may allow you to absorb traffic spikes for longer, at the expense of more disk and memory usage. Ultimately, though, if you are hitting throttling, it means that the DB can't keep up with the WAL ingestion rate.

Mark

--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com
We are hiring: https://www.clyso.com/jobs/
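The 1 GB stall budget described above translates into a simple absorption-time estimate; a minimal sketch (the ingest rate is a made-up example, not a measured value from this thread):

```python
# Back-of-envelope: how long a stalled memtable/WAL budget can absorb
# incoming writes before RocksDB hard-stalls. 1 GiB is the default budget
# mentioned above; the 50 MiB/s ingest rate is an illustrative assumption.

GiB = 1024 ** 3
MiB = 1024 ** 2

def seconds_until_stall(buffer_bytes, ingest_bytes_per_s):
    """Seconds of ingest a buffer of this size can absorb while flushes are blocked."""
    return buffer_bytes / ingest_bytes_per_s

# At 50 MiB/s of WAL ingest, a 1 GiB budget lasts roughly 20 seconds:
print(round(seconds_until_stall(1 * GiB, 50 * MiB), 1))  # 20.5
```

In other words, a larger buffer only buys time proportional to its size; if the compaction-blocked window regularly exceeds that, throttling is unavoidable.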
[ceph-users] Re: OSDs not balanced
> I think the short answer is "because you have so wildly varying sizes
> both for drives and hosts".

Arguably the OP's OSDs *are* balanced, in that their PG counts are roughly in line with their sizes, but indeed the size disparity is problematic in some ways. Notably, the 500 GB OSD should just be removed: I think balancing doesn't account for WAL/DB and other overhead, so such a small OSD won't be accurately accounted for, and it can't hold much data anyway.

This cluster also shows evidence of reweight-by-utilization having been run, but only on two of the hosts. If the balancer module is active, those override weights will confound it.

> If your drive sizes span from 0.5 to 9.5, there will naturally be
> skewed data, and it is not a huge surprise that the automation has
> some troubles getting it "good". When the balancer places a PG on a
> 0.5-sized drive compared to a 9.5-sized one, it eats up 19x more of
> the "free space" on the smaller one, so there are very few good
> options when the sizes are so different. Even if you placed all PGs
> correctly due to size, the 9.5-sized disk would end up getting 19x
> more IO than the small drive, and for hdd it seldom is possible to
> gracefully handle a 19-fold increase in IO; most of the time will
> probably be spent on seeks.
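The 19x figure in the quoted reply is just the ratio of the two crush weights; a tiny sketch makes the arithmetic explicit (the helper name is mine, the weights are the 0.5 and 9.5 from the thread):

```python
# Why extreme size disparity hurts: one PG of a given size consumes a
# fraction of each OSD inversely proportional to the OSD's capacity, and a
# capacity-proportional PG count concentrates IO on the big drive.

def pg_footprint_ratio(small_weight, big_weight):
    """How many times more of its capacity one PG costs the small OSD
    compared to the big one (also the IO multiplier the big drive sees
    if PGs are distributed proportionally to capacity)."""
    return big_weight / small_weight

print(pg_footprint_ratio(0.5, 9.5))  # 19.0
```

The same ratio applies to IO: with PGs spread proportionally to capacity, the 9.5-weight spinner serves ~19x the requests of the 0.5-weight one, which an HDD cannot absorb gracefully.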
[ceph-users] Re: v16.2.15 Pacific released
This is great news! Many thanks!

/Z

On Mon, 4 Mar 2024 at 17:25, Yuri Weinstein wrote:

> We're happy to announce the 15th, and expected to be the last,
> backport release in the Pacific series.
>
> https://ceph.io/en/news/blog/2024/v16-2-15-pacific-released/
>
> [release notes and download details trimmed]
[ceph-users] v16.2.15 Pacific released
We're happy to announce the 15th, and expected to be the last, backport release in the Pacific series.

https://ceph.io/en/news/blog/2024/v16-2-15-pacific-released/

Notable Changes
---------------

* `ceph config dump --format ` output will display the localized option names instead of their normalized version. For example, "mgr/prometheus/x/server_port" will be displayed instead of "mgr/prometheus/server_port". This matches the output of the non pretty-print formatted version of the command.

* CephFS: MDS evicts clients who are not advancing their request tids, which causes a large buildup of session metadata, resulting in the MDS going read-only due to the RADOS operation exceeding the size threshold. The `mds_session_metadata_threshold` config controls the maximum size that an (encoded) session metadata can grow.

* RADOS: The `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated due to its susceptibility to false negative results. Its safer replacement is `pool_is_in_selfmanaged_snaps_mode`.

* RBD: When diffing against the beginning of time (`fromsnapname == NULL`) in fast-diff mode (`whole_object == true` with `fast-diff` image feature enabled and valid), diff-iterate is now guaranteed to execute locally if exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases.

Getting Ceph
------------

* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-16.2.15.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: 618f440892089921c3e944a991122ddc44e60516
[ceph-users] Re: Performance improvement suggestion
On 04/03/2024 15:37, Frank Schilder wrote:
> >>> Fast write enabled would mean that the primary OSD sends #size copies
> >>> to the entire active set (including itself) in parallel and sends an
> >>> ACK to the client as soon as min_size ACKs have been received from the
> >>> peers (including itself). In this way, one can tolerate
> >>> (size-min_size) slow(er) OSDs (slow for whatever reason) without
> >>> suffering performance penalties immediately (only after too many
> >>> requests started piling up, which will show as a slow requests
> >>> warning).
> >>
> >> What happens if there occurs an error on the slowest osd after the
> >> min_size ACK has already been sent to the client?
> >
> > This should not be different than what exists today, unless of course
> > the error happens on the local/primary osd.
>
> Can this be addressed with reasonable effort? I don't expect this to be a
> quick fix, and it should be tested. However, beating the tail-latency
> statistics with the extra redundancy should be worth it. I observe
> fluctuations of latencies; OSDs become randomly slow for whatever reason
> for short time intervals and then return to normal. A reason for this
> could be DB compaction; I think latency tends to spike during compaction.
> A fast-write option would effectively remove the impact of this.
>
> Best regards and thanks for considering this!

i think this is something the rados devs need to say. it does sound worth investigating. it is not just for cases with db compaction but more importantly the normal (happy) io path, as it will have the most impact.
[ceph-users] Re: Performance improvement suggestion
>>> Fast write enabled would mean that the primary OSD sends #size copies to
>>> the entire active set (including itself) in parallel and sends an ACK to
>>> the client as soon as min_size ACKs have been received from the peers
>>> (including itself). In this way, one can tolerate (size-min_size)
>>> slow(er) OSDs (slow for whatever reason) without suffering performance
>>> penalties immediately (only after too many requests started piling up,
>>> which will show as a slow requests warning).
>>
>> What happens if there occurs an error on the slowest osd after the
>> min_size ACK has already been sent to the client?
>
> This should not be different than what exists today, unless of course the
> error happens on the local/primary osd.

Can this be addressed with reasonable effort? I don't expect this to be a quick fix, and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies; OSDs become randomly slow for whatever reason for short time intervals and then return to normal. A reason for this could be DB compaction; I think latency tends to spike during compaction. A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this!
[ceph-users] [Quincy] cannot configure dashboard to listen on all ports
Hi,

the Ceph dashboard fails to listen on all IPs:

log_channel(cluster) log [ERR] : Unhandled exception from module 'dashboard' while running on mgr.controllera: OSError("No socket could be created -- (('0.0.0.0', 8443): [Errno -2] Name or service not known) -- (('::', 8443, 0, 0):

ceph version 17.2.7 quincy (stable)

Regards.
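The "[Errno -2] Name or service not known" in the error above looks like a getaddrinfo() failure while the dashboard creates its listening socket. A small diagnostic sketch, run on the mgr host (or inside the mgr container), can show whether the resolver there can even look up the wildcard addresses; this assumes the failure is resolver-side, which is worth verifying:

```python
import socket

# Diagnostic: mimic the address lookup a server does before binding.
# If this returns False for "0.0.0.0" on the mgr host, the host/container
# resolver configuration (e.g. /etc/hosts, nsswitch) is likely broken.
def can_resolve(host, port):
    try:
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        return False

print(can_resolve("0.0.0.0", 8443))  # True on a working resolver
```

Numeric literals like 0.0.0.0 normally resolve without any DNS involvement, so a False here points at local name-service configuration rather than the dashboard itself.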
[ceph-users] Re: Performance improvement suggestion
On 04/03/2024 13:35, Marc wrote:
> > Fast write enabled would mean that the primary OSD sends #size copies to
> > the entire active set (including itself) in parallel and sends an ACK to
> > the client as soon as min_size ACKs have been received from the peers
> > (including itself). [...]
>
> What happens if there occurs an error on the slowest osd after the
> min_size ACK has already been sent to the client?

This should not be different than what exists today, unless of course the error happens on the local/primary osd.
[ceph-users] Re: Performance improvement suggestion
> Fast write enabled would mean that the primary OSD sends #size copies to
> the entire active set (including itself) in parallel and sends an ACK to
> the client as soon as min_size ACKs have been received from the peers
> (including itself). In this way, one can tolerate (size-min_size) slow(er)
> OSDs (slow for whatever reason) without suffering performance penalties
> immediately (only after too many requests started piling up, which will
> show as a slow requests warning).

What happens if there occurs an error on the slowest osd after the min_size ACK has already been sent to the client?
[ceph-users] Re: OSDs not balanced
Den mån 4 mars 2024 kl 11:30 skrev Ml Ml:
>
> Hello,
>
> i wonder why my autobalancer is not working here:

I think the short answer is "because you have so wildly varying sizes, both for drives and hosts".

If your drive sizes span from 0.5 to 9.5, there will naturally be skewed data, and it is not a huge surprise that the automation has some trouble getting it "good". When the balancer places a PG on a 0.5-sized drive compared to a 9.5-sized one, it eats up 19x more of the "free space" on the smaller one, so there are very few good options when the sizes are so different. Even if you placed all PGs correctly according to size, the 9.5-sized disk would end up getting 19x more IO than the small drive, and for hdd it seldom is possible to gracefully handle a 19-fold increase in IO; most of the time will probably be spent on seeks.

> root@ceph01:~# ceph -s
>   cluster:
>     id:     5436dd5d-83d4-4dc8-a93b-60ab5db145df
>     health: HEALTH_ERR
>             1 backfillfull osd(s)
>             1 full osd(s)
>             1 nearfull osd(s)
>             4 pool(s) full
>
> => osd.17 was too full (92% or something like that)
>
> [quoted "ceph osd df tree" output trimmed]
[ceph-users] OSDs not balanced
Hello,

i wonder why my autobalancer is not working here:

root@ceph01:~# ceph -s
  cluster:
    id:     5436dd5d-83d4-4dc8-a93b-60ab5db145df
    health: HEALTH_ERR
            1 backfillfull osd(s)
            1 full osd(s)
            1 nearfull osd(s)
            4 pool(s) full

=> osd.17 was too full (92% or something like that)

root@ceph01:~# ceph osd df tree
 ID  CLASS  WEIGHT     REWEIGHT  SIZE     ...  %USE   ...  PGS  TYPE NAME
-25         209.50084         -  213 TiB  ...  69.56  ...    -  datacenter xxx-dc-root
-19          84.59369         -   86 TiB  ...  56.97  ...    -      rack RZ1.Reihe4.R10
 -3          35.49313         -   37 TiB  ...  57.88  ...    -          host ceph02
  2  hdd       1.7      1.0      1.7 TiB  ...  58.77  ...   44              osd.2
  3  hdd       1.0      1.0      2.7 TiB  ...  22.14  ...   25              osd.3
  7  hdd       2.5      1.0      2.7 TiB  ...  58.84  ...   70              osd.7
  9  hdd       9.5      1.0      9.5 TiB  ...  63.07  ...  268              osd.9
 13  hdd       2.67029  1.0      2.7 TiB  ...  53.59  ...   65              osd.13
 16  hdd       2.8      1.0      2.7 TiB  ...  59.35  ...   71              osd.16
 19  hdd       1.7      1.0      1.7 TiB  ...  48.98  ...   37              osd.19
 23  hdd       2.38419  1.0      2.4 TiB  ...  59.33  ...   64              osd.23
 24  hdd       1.3      1.0      1.7 TiB  ...  51.23  ...   39              osd.24
 28  hdd       3.63869  1.0      3.6 TiB  ...  64.17  ...  104              osd.28
 31  hdd       2.7      1.0      2.7 TiB  ...  64.73  ...   76              osd.31
 32  hdd       3.3      1.0      3.3 TiB  ...  67.28  ...  101              osd.32
 -9          22.88817         -   23 TiB  ...  56.96  ...    -          host ceph06
 35  hdd       7.15259  1.0      7.2 TiB  ...  55.71  ...  182              osd.35
 36  hdd       5.24519  1.0      5.2 TiB  ...  53.75  ...  128              osd.36
 45  hdd       5.24519  1.0      5.2 TiB  ...  60.91  ...  144              osd.45
 48  hdd       5.24519  1.0      5.2 TiB  ...  57.94  ...  139              osd.48
-17          26.21239         -   26 TiB  ...  55.67  ...    -          host ceph08
 37  hdd       6.67569  1.0      6.7 TiB  ...  58.17  ...  174              osd.37
 40  hdd       9.53670  1.0      9.5 TiB  ...  58.54  ...  250              osd.40
 46  hdd       5.0      1.0      5.0 TiB  ...  52.39  ...  116              osd.46
 47  hdd       5.0      1.0      5.0 TiB  ...  50.05  ...  112              osd.47
-20          59.11053         -   60 TiB  ...  82.47  ...    -      rack RZ1.Reihe4.R9
 -4          23.09996         -   24 TiB  ...  79.92  ...    -          host ceph03
  5  hdd       1.7      0.75006  1.7 TiB  ...  87.24  ...   66              osd.5
  6  hdd       1.7      0.44998  1.7 TiB  ...  47.30  ...   36              osd.6
 10  hdd       2.7      0.85004  2.7 TiB  ...  83.23  ...  100              osd.10
 15  hdd       2.7      0.75006  2.7 TiB  ...  74.26  ...   88              osd.15
 17  hdd       0.5      0.85004  1.6 TiB  ...  91.44  ...   67              osd.17
 20  hdd       2.0      0.85004  1.7 TiB  ...  88.41  ...   68              osd.20
 21  hdd       2.7      0.75006  2.7 TiB  ...  77.25  ...   91              osd.21
 25  hdd       1.7      0.90002  1.7 TiB  ...  78.31  ...   60              osd.25
 26  hdd       2.7      1.0      2.7 TiB  ...  82.75  ...   99              osd.26
 27  hdd       2.7      0.90002  2.7 TiB  ...  84.26  ...  101              osd.27
 63  hdd       1.8      0.90002  1.7 TiB  ...  84.15  ...   65              osd.63
-13          36.01057         -   36 TiB  ...  84.12  ...    -          host ceph05
 11  hdd       7.15259  0.90002  7.2 TiB  ...  85.45  ...  273              osd.11
 39  hdd       7.2      0.85004  7.2 TiB  ...  80.90  ...  257              osd.39
 41  hdd       7.2      0.75006  7.2 TiB  ...  74.95  ...  239              osd.41
 42  hdd       9.0      1.0      9.5 TiB  ...  92.00  ...  392              osd.42
 43  hdd       5.45799  1.0      5.5 TiB  ...  84.84  ...  207              osd.43
-21          65.79662         -   66 TiB  ...  74.29  ...    -      rack RZ3.Reihe3.R10
 -2          28.49664         -   29 TiB  ...  74.79  ...    -          host ceph01
  0  hdd       2.7      1.0      2.7 TiB  ...  73.82  ...   88              osd.0
  1  hdd       3.63869  1.0      3.6 TiB  ...  73.47  ...  121              osd.1
  4  hdd       2.7      1.0      2.7 TiB  ...  74.63  ...   89              osd.4
  8  hdd       2.7      1.0      2.7 TiB  ...  77.10  ...   92              osd.8
 12  hdd       2.7      1.0      2.7 TiB  ...  78.76  ...   94              osd.12
 14  hdd       5.45799  1.0      5.5 TiB  ...  78.86  ...  193              osd.14
 18  hdd       1.8      1.0      2.7 TiB  ...  63.79  ...   76              osd.18
 22  hdd       1.7      1.0      1.7 TiB  ...  74.85  ...   57              osd.22
 30  hdd       1.7      1.0      1.7 TiB  ...  76.34  ...   59              osd.30
 64  hdd       3.2      1.0      3.3 TiB  ...  73.48  ...  110              osd.64
-11          12.3              -   12 TiB  ...  73.40  ...    -          host ceph04
 34  hdd       5.2      1.0      5.2 TiB
[ceph-users] Re: Performance improvement suggestion
Hi all,

Coming late to the party, but I want to chip in as well with some experience. The problem of tail latencies of individual OSDs is a real pain for any redundant storage system. However, there is an elegant way to deal with this when using large replication factors. The idea is to use the counterpart of the "fast read" option that exists for EC pools and:

1) make this option available to replicated pools as well (this is on the roadmap as far as I know), but also
2) implement an option "fast write" for all pool types.

With fast write enabled, the primary OSD would send #size copies to the entire active set (including itself) in parallel and send an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size - min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests start piling up, which will show up as a slow-requests warning).

I have fast read enabled on all EC pools. This does increase the cluster-internal network traffic, which is nowadays absolutely no problem (in the good old 1G times it potentially would have been). In return, the read latencies on the client side are lower and much more predictable. In effect, the user experience improved dramatically.

I would really wish for such an option to be added. We use wide replication profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs), and exploiting large replication factors (more precisely, large (size - min_size)) to mitigate the impact of slow OSDs would be awesome. It would also add some incentive to stop the ridiculous size=2 min_size=1 habit, because one gets an extra gain from replication on top of redundancy.
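For readers following along, the proposed "fast write" semantics can be illustrated with a toy Python sketch (this is purely illustrative, not Ceph code; the thread-per-replica model and function names are my own invention): the primary dispatches the write to all size replicas in parallel and acknowledges the client as soon as min_size of them have completed.

```python
import threading
import time

def fast_write(replicas, min_size, payload):
    """Toy model of 'fast write': dispatch to all replicas in parallel,
    return once min_size ACKs arrive; the remaining (size - min_size)
    slow replicas finish in the background."""
    acks = threading.Semaphore(0)

    def send(replica):
        replica(payload)   # simulated replica write (may be slow)
        acks.release()     # ACK back to the primary

    for r in replicas:
        threading.Thread(target=send, args=(r,), daemon=True).start()

    for _ in range(min_size):  # block until min_size ACKs received
        acks.acquire()
    return "ack"

# toy active set: two fast OSDs, one pathologically slow one
def fast_osd(data): time.sleep(0.01)
def slow_osd(data): time.sleep(2.0)

start = time.time()
result = fast_write([fast_osd, fast_osd, slow_osd], min_size=2, payload=b"A")
elapsed = time.time() - start
print(result, elapsed)  # the client ACK does not wait for the slow OSD
```

The point of the sketch is that the client-visible latency becomes the min_size-th fastest replica's latency rather than the slowest one's.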
In the long run, the Ceph write path should try to deal with a-priori known different-latency connections (fast local ACK with asynchronous remote completion has been asked for a couple of times), for example for stretched clusters where one has an internal connection for the local part and external connections for the remote parts. It would be great to have similar ways of mitigating some of the penalties of the slow write paths to remote sites.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Peter Grandi
Sent: Wednesday, February 21, 2024 1:10 PM
To: list Linux fs Ceph
Subject: [ceph-users] Re: Performance improvement suggestion

> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion, this proposal is the opposite of the current policy, which is to wait for all replicas to be written before writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

"After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully."

A more revolutionary option would be for 'librados' to write in parallel to all the "active set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is, 'k' synchronous (the write completes to the client only when at least 'k' replicas, including the primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2.
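The latency gain from such a k-synchronous commit rule is an order-statistic effect, which a tiny back-of-the-envelope calculation makes concrete (the numbers below are made up for illustration):

```python
# Compare client-visible commit latency under the current policy
# (wait for ALL replicas) versus a hypothetical k-synchronous policy
# (wait only for the k fastest replicas, the remaining m complete async).
latencies_ms = [3.1, 2.9, 3.0, 47.0]  # size=4 active set, one slow OSD

current = max(latencies_ms)            # all-replica commit: dominated by the straggler
k = 2
k_sync = sorted(latencies_ms)[k - 1]   # k-of-n commit: k-th order statistic

print(f"all-replica ack: {current} ms, k={k} ack: {k_sync} ms")
```

A single slow replica dictates the client latency under the current policy, while the k-synchronous rule hides up to m stragglers.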
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)
Hi,

On 2/3/24 at 18:00, Tyler Stachecki wrote:
> On 23.02.24 16:18, Christian Rohmann wrote:
>> I just noticed issues with ceph-crash using the Debian/Ubuntu packages (package: ceph-base): While the /var/lib/ceph/crash/posted folder is created by the package install, it's not properly chowned to ceph:ceph by the postinst script. ... You might want to check if you might be affected as well. Failing to post crashes to the local cluster results in them not being reported back via telemetry.
>>
>> Sorry to bluntly bump this again, but did nobody else notice this on your clusters? Call me egoistic, but the more clusters return crash reports, the more stable my Ceph likely becomes ;-)
>
> I do observe that the ownership does not match ceph:ceph on Debian with v17.2.7.
>
> $ sudo ls -l /var/lib/ceph/crash | grep posted
> drwxr-xr-x 2 root root 4096 Feb 10 19:23 posted
>
> The issue seems to be that the postinst script does not recursively chown and only chowns subdirectories directly under /var/lib/ceph:
> https://github.com/ceph/ceph/blob/91e8cea0d31775de0e59936b3608a9a453353a45/debian/ceph-base.postinst#L40
>
> The rpm spec looks to do subdirectories under /var/lib/ceph as well, but explicitly lists everything out instead of using globs, and it also includes posted:
> https://github.com/ceph/ceph/blob/91e8cea0d31775de0e59936b3608a9a453353a45/ceph.spec.in#L1643

This seems to have been fixed in Proxmox recently:
* master (reef?): https://lists.proxmox.com/pipermail/pve-devel/2024-February/061803.html
* quincy: https://lists.proxmox.com/pipermail/pve-devel/2024-February/061798.html

Not sure this has been upstreamed.

Cheers

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
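[Editor's note: until the packaging fix lands, a manual workaround along these lines should help; this is a hedged sketch that assumes the stock /var/lib/ceph/crash layout and that the ceph user and group exist on the host.]

```shell
# Workaround sketch: recursively restore ceph:ceph ownership on the crash
# tree so ceph-crash can post reports again. Run as root; the guards make
# this a no-op on hosts without the stock layout or without a ceph user.
CRASH_DIR="${CRASH_DIR:-/var/lib/ceph/crash}"
if [ -d "$CRASH_DIR" ] && id ceph >/dev/null 2>&1; then
    # show entries not currently owned by ceph:ceph before fixing them
    find "$CRASH_DIR" \( ! -user ceph -o ! -group ceph \) -ls
    chown -R ceph:ceph "$CRASH_DIR"
fi
```

The recursive chown is the key difference from what the Debian postinst currently does, which only touches the first level of subdirectories under /var/lib/ceph.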