[ceph-users] Re: data corruption after rbd migration
Hello Jaroslav,

thank you for your reply.

> I found your info a bit confusing. The first command suggests that the VM
> is shut down and later you are talking about live migration. So how are you
> migrating data, online or offline?

online, the VM is started after the migration prepare command:

> > rbd migration prepare ssd/D1 sata/D1Z
> > virsh create xml_new.xml

> In the case of live migration, I would suggest looking at the
> fsfreeze (proxmox uses it) command.

I don't think this is related. I'm not concerned about FS consistency during
snapshotting, but about the fact that checksums of the snapshot, which should
be the same, differ after migration..

to clarify the use case: we've noticed backup corruption for volumes which
were migrated between pools during backups (using snapshots and moving them
to another (SATA) pool)

BR

nik

> Hope it helps!
>
> Best Regards,
>
> Jaroslav Shejbal
>
> On Fri, Nov 3, 2023 at 9:08, Nikola Ciprich wrote:
>
> > Dear ceph users and developers,
> >
> > we're struggling with a strange issue which I think might be a bug
> > causing snapshot data corruption while migrating an RBD image
> >
> > we've tracked it down to a minimal set of steps to reproduce, using a VM
> > with one 32G drive:
> >
> > rbd create --size 32768 sata/D2
> > virsh create xml_orig.xml
> > rbd snap create ssd/D1@snap1
> > rbd export-diff ssd/D1@snap1 - | rbd import-diff - sata/D2
> > rbd export --export-format 1 --no-progress ssd/D1@snap1 - | xxh64sum
> > 505dde3c49775773
> > rbd export --export-format 1 --no-progress sata/D2@snap1 - | xxh64sum
> > 505dde3c49775773   # <- checksums match - OK
> >
> > virsh shutdown VM
> > rbd migration prepare ssd/D1 sata/D1Z
> > virsh create xml_new.xml
> > rbd snap create sata/D1Z@snap2
> > rbd export-diff --from-snap snap1 sata/D1Z@snap2 - | rbd import-diff - sata/D2
> > rbd migration execute sata/D1Z
> > rbd migration commit sata/D1Z
> > rbd export --export-format 1 --no-progress sata/D1Z@snap2 - | xxh64sum
> > 19892545c742c1de
> > rbd export --export-format 1 --no-progress sata/D2@snap2 - | xxh64sum
> > cc045975baf74ba8   # <- snapshots differ
> >
> > OS is alma 9 based, kernel 5.15.105, CEPH 17.2.6, qemu-8.0.3
> > we tried disabling VM disk caches as well as discard, to no avail.
> >
> > my first question is: is it correct to assume creating snapshots while
> > live-migrating data is safe? if so, any ideas on where the problem could be?
> >
> > If I could provide more info, please let me know
> >
> > with regards
> >
> > nikola ciprich

--
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] data corruption after rbd migration
Dear ceph users and developers,

we're struggling with a strange issue which I think might be a bug causing
snapshot data corruption while migrating an RBD image

we've tracked it down to a minimal set of steps to reproduce, using a VM with
one 32G drive:

rbd create --size 32768 sata/D2
virsh create xml_orig.xml
rbd snap create ssd/D1@snap1
rbd export-diff ssd/D1@snap1 - | rbd import-diff - sata/D2
rbd export --export-format 1 --no-progress ssd/D1@snap1 - | xxh64sum
505dde3c49775773
rbd export --export-format 1 --no-progress sata/D2@snap1 - | xxh64sum
505dde3c49775773   # <- checksums match - OK

virsh shutdown VM
rbd migration prepare ssd/D1 sata/D1Z
virsh create xml_new.xml
rbd snap create sata/D1Z@snap2
rbd export-diff --from-snap snap1 sata/D1Z@snap2 - | rbd import-diff - sata/D2
rbd migration execute sata/D1Z
rbd migration commit sata/D1Z
rbd export --export-format 1 --no-progress sata/D1Z@snap2 - | xxh64sum
19892545c742c1de
rbd export --export-format 1 --no-progress sata/D2@snap2 - | xxh64sum
cc045975baf74ba8   # <- snapshots differ

OS is alma 9 based, kernel 5.15.105, CEPH 17.2.6, qemu-8.0.3
we tried disabling VM disk caches as well as discard, to no avail.

my first question is: is it correct to assume creating snapshots while
live-migrating data is safe? if so, any ideas on where the problem could be?

If I could provide more info, please let me know

with regards

nikola ciprich
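When chasing a mismatch like this, it can help to know *where* the two exports diverge, not just that the xxh64 sums differ. A small sketch (the snapshot names and the default 4 MiB RBD object size are assumptions taken from the report above; `cmp` comes from diffutils):

```shell
# first_diff FILE_A FILE_B - print the first byte offset at which two
# exported snapshot images differ, plus the 4 MiB object index it falls in.
first_diff() {
    # 'cmp -l' lists each differing byte as "<1-based offset> <octal> <octal>";
    # we only need the first line.  stderr is silenced to hide the EOF
    # message cmp prints when the files have different lengths.
    off=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR==1 {print $1; exit}')
    if [ -z "$off" ]; then
        echo "identical"
    else
        echo "first diff at byte $off (object $(( (off - 1) / 4194304 )))"
    fi
}

# Hypothetical usage against the snapshots from the report:
#   rbd export --export-format 1 --no-progress sata/D1Z@snap2 - > /tmp/a.img
#   rbd export --export-format 1 --no-progress sata/D2@snap2  - > /tmp/b.img
#   first_diff /tmp/a.img /tmp/b.img
```

Knowing the object index of the first difference would let you inspect the corresponding RADOS objects in both pools directly.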
[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed
Hello Igor,

just reporting that since the last restart (after reverting the changed
values to their defaults) the performance hasn't decreased (and it's been
over two weeks now). So either it helped after all, or the drop is caused by
something else I'll yet have to figure out.. we've automated the test, so
once the performance drops below the threshold, I'll know it and investigate
further (and report)

cheers

with regards

nik

On Wed, May 10, 2023 at 07:36:06AM +0200, Nikola Ciprich wrote:
> Hello Igor,
>
> > You didn't reset the counters every hour, did you? So having average
> > subop_w_latency growing that way means the current values were much higher
> > than before.
>
> bummer, I didn't.. I've updated the gather script to reset stats, wait 10m
> and then gather perf data, each hour. It's been running since yesterday, so
> now we'll have to wait about one week for the problem to appear again..
>
> > Curious if subop latencies were growing for every OSD or just a subset
> > (may be even just a single one) of them?
>
> since I only have long-term averages, it's not easy to say, but based on
> what we have:
>
> only two OSDs got avg sub_w_lat > 0.0006, with no clear relation between
> them. 19 OSDs got avg sub_w_lat > 0.0005 - this is more interesting - 15 of
> them are on those later-installed nodes (note that those nodes have almost
> no VMs running, so they are much less used!), 4 are on other nodes. but also
> note that not all OSDs on the suspicious nodes are over the threshold: it's
> 6, 6 and 3 out of 7 OSDs per node. but still it's strange..
>
> > Next time you reach the bad state, please do the following if possible:
> >
> > - reset perf counters for every OSD
> >
> > - leave the cluster running for 10 mins and collect perf counters again.
> >
> > - Then start restarting OSDs one-by-one, starting with the worst OSD (in
> > terms of subop_w_lat from the prev step). Wouldn't it be sufficient to
> > restart just a few OSDs before the cluster is back to normal?
>
> will do once it slows down again.
>
> > > I see a very similar crash reported here: https://tracker.ceph.com/issues/56346
> > > so I'm not reporting it..
> > >
> > > Do you think this might somehow be the cause of the problem? Anything
> > > else I should check in perf dumps or elsewhere?
> >
> > Hmm... don't know yet. Could you please send the last 20K log lines prior
> > to the crash from e.g. two sample OSDs?
>
> https://storage.linuxbox.cz/index.php/s/o5bMaGMiZQxWadi
>
> > And the crash isn't permanent, OSDs are able to start after the second(?)
> > shot, aren't they?
>
> yes, actually they start after issuing systemctl restart ceph-osd@xx, it
> just takes a long time performing log recovery..
>
> If I can provide more info, please let me know
>
> BR
>
> nik
[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed
Hello Igor,

> You didn't reset the counters every hour, did you? So having average
> subop_w_latency growing that way means the current values were much higher
> than before.

bummer, I didn't.. I've updated the gather script to reset stats, wait 10m
and then gather perf data, each hour. It's been running since yesterday, so
now we'll have to wait about one week for the problem to appear again..

> Curious if subop latencies were growing for every OSD or just a subset (may
> be even just a single one) of them?

since I only have long-term averages, it's not easy to say, but based on what
we have:

only two OSDs got avg sub_w_lat > 0.0006, with no clear relation between
them. 19 OSDs got avg sub_w_lat > 0.0005 - this is more interesting - 15 of
them are on those later-installed nodes (note that those nodes have almost no
VMs running, so they are much less used!), 4 are on other nodes. but also
note that not all OSDs on the suspicious nodes are over the threshold: it's
6, 6 and 3 out of 7 OSDs per node. but still it's strange..

> Next time you reach the bad state, please do the following if possible:
>
> - reset perf counters for every OSD
>
> - leave the cluster running for 10 mins and collect perf counters again.
>
> - Then start restarting OSDs one-by-one, starting with the worst OSD (in
> terms of subop_w_lat from the prev step). Wouldn't it be sufficient to
> restart just a few OSDs before the cluster is back to normal?

will do once it slows down again.

> > I see a very similar crash reported here: https://tracker.ceph.com/issues/56346
> > so I'm not reporting it..
> >
> > Do you think this might somehow be the cause of the problem? Anything else
> > I should check in perf dumps or elsewhere?
>
> Hmm... don't know yet. Could you please send the last 20K log lines prior to
> the crash from e.g. two sample OSDs?

https://storage.linuxbox.cz/index.php/s/o5bMaGMiZQxWadi

> And the crash isn't permanent, OSDs are able to start after the second(?)
> shot, aren't they?

yes, actually they start after issuing systemctl restart ceph-osd@xx, it just
takes a long time performing log recovery..

If I can provide more info, please let me know

BR

nik
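The reset-then-collect cycle discussed in this thread could be scripted roughly like this. This is a sketch, not the poster's actual gather script; the `osd` section / `subop_w_latency` counter path matches what quincy prints in `perf dump` but may differ on other releases, and the `ceph daemon` commands talk to the admin socket, so they have to run on each OSD's host:

```shell
# collect_osd OSD_ID OUTFILE - reset counters, wait 10 minutes, save a dump.
collect_osd() {
    osd="$1"; out="$2"
    ceph daemon "osd.$osd" perf reset all
    sleep 600
    ceph daemon "osd.$osd" perf dump > "$out"
}

# subop_w_lat FILE - extract subop_w_latency.avgtime from a saved dump,
# using python3 for the JSON parsing.
subop_w_lat() {
    python3 -c 'import json, sys; print(json.load(sys.stdin)["osd"]["subop_w_latency"]["avgtime"])' < "$1"
}
```

Sorting the saved dumps by `subop_w_lat` would give the worst-first restart order Igor suggested.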
[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed
Hello Igor,

so I was checking the performance every day since Tuesday.. every day it
seemed to be the same - ~60-70 kOPS on random write from a single VM.
yesterday it finally dropped to 20 kOPS, today to 10 kOPS. I also tried with
a newly created volume; the result (after prefill) is the same, so it doesn't
make any difference..

so I reverted all the mentioned options to their defaults and restarted all
OSDs. performance immediately returned to better values (I suppose this is
again caused by the restart only)

good news is that setting osd_fast_shutdown_timeout to 0 really helped with
OSD crashes during restarts, which speeds it up a lot.. but I have some new
crashes, more on this later..

> > I'd suggest to start monitoring perf counters for your osds.
> > op_w_lat/subop_w_lat ones specifically. I presume they rise eventually,
> > don't they?
>
> OK, starting to collect those for all OSDs..

I have hourly samples of all OSDs' perf dumps loaded in a DB, so I can easily
examine, sort, whatever..

> currently values for avgtime are around 0.0003 for subop_w_lat and
> 0.001-0.002 for op_w_lat

OK, so there is no visible trend on op_w_lat, still between 0.001 and 0.002.
subop_w_lat seems to have increased since yesterday though! I see values from
0.0004 to as high as 0.001

If some other perf data might be interesting, please let me know..

During OSD restarts, I noticed a strange thing - restarts on the first 6
machines went smoothly, but then on another 3, I saw rocksdb log recovery on
all SSD OSDs. at first I didn't see any mention of a daemon crash in ceph -s;
later, crash info appeared, but only for 3 daemons (in total, at least 20 of
them crashed though)

the crash report was similar for all three OSDs:

[root@nrbphav4a ~]# ceph crash info 2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54d90) [0x7f64a6323d90]",
        "(BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list >*, boost::intrusive_ptr)+0x413) [0x55a1c9d07c43]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr&, std::vector >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x22b) [0x55a1c9d27e9b]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t const&, std::vector >&&, std::optional&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr)+0x8ad) [0x55a1c9bbcfdd]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x38f) [0x55a1c99d1cbf]",
        "(PrimaryLogPG::simple_opc_submit(std::unique_ptr >)+0x57) [0x55a1c99d6777]",
        "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr)+0xb73) [0x55a1c99da883]",
        "/usr/bin/ceph-osd(+0x58794e) [0x55a1c992994e]",
        "(CommonSafeTimer::timer_thread()+0x11a) [0x55a1c9e226aa]",
        "/usr/bin/ceph-osd(+0xa80eb1) [0x55a1c9e22eb1]",
        "/lib64/libc.so.6(+0x9f802) [0x7f64a636e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f64a630e450]"
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3",
    "entity_name": "osd.98",
    "os_id": "almalinux",
    "os_name": "AlmaLinux",
    "os_version": "9.0 (Emerald Puma)",
    "os_version_id": "9.0",
    "process_name": "ceph-osd",
    "stack_sig": "b1a1c5bd45e23382497312202e16cfd7a62df018c6ebf9ded0f3b3ca3c1dfa66",
    "timestamp": "2023-05-08T17:45:47.056675Z",
    "utsname_hostname": "nrbphav4h",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.90lb9.01",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Fri Jan 27 15:52:13 CET 2023"
}

I was trying to figure out why these particular 3 nodes could behave
differently and found out from colleagues that those 3 nodes were added to
the cluster later, with a direct install of 17.2.5 (the others were installed
with 15.2.16 and upgraded later). not sure whether this is related to our
problem though..

I see a very similar crash reported here: https://tracker.ceph.com/issues/56346
so I'm not reporting it..

Do you think this might somehow be the cause of the problem? Anything else I
should check in perf dumps or elsewhere?

with best regards

nik
[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed
Hello Igor,

On Tue, May 02, 2023 at 05:41:04PM +0300, Igor Fedotov wrote:
> Hi Nikola,
>
> I'd suggest to start monitoring perf counters for your osds.
> op_w_lat/subop_w_lat ones specifically. I presume they rise eventually,
> don't they?

OK, starting to collect those for all OSDs..

currently values for avgtime are around 0.0003 for subop_w_lat and
0.001-0.002 for op_w_lat. I guess it'll need some time to show a trend, so
I'll check tomorrow

> Does subop_w_lat grow for every OSD or just a subset of them? How large is
> the delta between the best and the worst OSDs after a one week period? How
> many "bad" OSDs are there at that point?

I'll see and report

> And some more questions:
>
> How large are space utilization/fragmentation for your OSDs?

OSD usage is around 16-18%. fragmentation should not be very bad, this
cluster has only been deployed for a few months

> Is the same performance drop observed for artificial benchmarks, e.g. 4k
> random writes to a fresh RBD image using fio?

will check again when the slowdown occurs and report

> Is there any RAM utilization growth for OSD processes over time? Or maybe
> any suspicious growth in mempool stats?

nope, RAM usage seems to be pretty constant.

however, probably worth noting, historically we're using the following OSD
options:

ceph config set osd bluestore_rocksdb_options compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
ceph config set osd bluestore_cache_autotune 0
ceph config set osd bluestore_cache_size_ssd 2G
ceph config set osd bluestore_cache_kv_ratio 0.2
ceph config set osd bluestore_cache_meta_ratio 0.8
ceph config set osd osd_min_pg_log_entries 10
ceph config set osd osd_max_pg_log_entries 10
ceph config set osd osd_pg_log_dups_tracked 10
ceph config set osd osd_pg_log_trim_min 10

so maybe I'll start resetting those to defaults (i.e. enabling cache autotune
etc.) as a first step..

> As a blind and brute-force approach you might also want to compact RocksDB
> through ceph-kvstore-tool and switch the bluestore allocator to bitmap
> (presuming the default hybrid one is in effect right now). Please do one
> modification at a time to realize which action is actually helpful, if any.

will do.. thanks again for your hints

BR

nik

> On 5/2/2023 11:32 AM, Nikola Ciprich wrote:
> > Hello dear CEPH users and developers,
> >
> > [original report quoted in full - snipped]
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https:/
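The "reset to defaults" step mentioned above amounts to dropping the overrides from the MON config database. A sketch of that cleanup (assumes all the options were set at the `osd` level exactly as listed, and a release new enough to have `ceph config rm`; some of these options only take effect after an OSD restart):

```shell
# Drop centralized-config overrides so the built-in defaults apply again.
# Note: 'ceph config rm' only removes MON config-db entries; it does not
# touch values set in per-host ceph.conf files.
for opt in bluestore_rocksdb_options bluestore_cache_autotune \
           bluestore_cache_size_ssd bluestore_cache_kv_ratio \
           bluestore_cache_meta_ratio osd_min_pg_log_entries \
           osd_max_pg_log_entries osd_pg_log_dups_tracked \
           osd_pg_log_trim_min; do
    ceph config rm osd "$opt"
done

# Verify no override survived before restarting the OSDs:
ceph config dump | grep -E 'bluestore|pg_log'
```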
[ceph-users] quincy 17.2.6 - write performance continuously slowing down until OSD restart needed
Hello dear CEPH users and developers,

we're dealing with strange problems.. we have a 12 node alma linux 9 cluster,
initially installed with CEPH 15.2.16, then upgraded to 17.2.5. It's running
a bunch of KVM virtual machines accessing volumes using RBD.

everything is working well, but there is a strange and, for us, quite serious
issue - the speed of write operations (both sequential and random) is
constantly degrading, drastically, to almost unusable numbers (in ~1 week it
drops from ~70k 4k writes/s from 1 VM to ~7k writes/s)

When I restart all OSD daemons, the numbers immediately return to normal..

volumes are stored on a replicated pool of 4 replicas, on top of 7*12 = 84
INTEL SSDPE2KX080T8 NVMes.

I updated the cluster to 17.2.6 some time ago, but the problem persists. This
is especially annoying in connection with https://tracker.ceph.com/issues/56896
as restarting OSDs is quite painful when half of them crash..

I don't see anything suspicious: node load is quite low, no log errors,
network latency and throughput are OK too

Anyone having a similar issue?

I'd like to ask for hints on what I should check further..

we're running lots of 14.2.x and 15.2.x clusters, none showing a similar
issue, so I suspect this is something related to quincy

thanks a lot in advance

with best regards

nikola ciprich
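The "4k random writes from 1 VM" measurement above could be reproduced outside the VM with a fio job along these lines. This is a hypothetical job file, not the poster's actual test: the pool and image names are placeholders, and it requires fio built with the rbd ioengine so the test bypasses qemu entirely and isolates the cluster side:

```ini
; 4k random writes against a pre-created RBD image (placeholder names)
[rbd-4k-randwrite]
ioengine=rbd
clientname=admin
pool=ssd
rbdname=fio-test
rw=randwrite
bs=4k
iodepth=64
runtime=60
time_based=1
group_reporting=1
```

Running the same job weekly against the same image would show whether the degradation tracks the cluster (fio IOPS drop too) or the VM stack (fio stays flat while the guest slows down).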
[ceph-users] Re: EC pool OSDs getting erroneously "full" (15.2.15)
thanks for the tip on the alternative balancer, I'll have a look at it.
however I don't think the root of the problem is improper balancing - those 3
OSDs simply should not be that full. I'll see how it looks when the snaptrims
finish; usage seems to be going down by 0.01%/minute now.. I'll report the
results later..

> If your clients allow (understand upmaps) you might yield better results
> with the balancer in upmap mode. Jonas Jelten made a nice balancer as well
> [1].
>
> Gr. Stefan
>
> [1]: https://github.com/TheJJ/ceph-balancer

nik
[ceph-users] Re: EC pool OSDs getting erroneously "full" (15.2.15)
Hi Stefan,

all daemons are 15.2.15 (I'm considering updating to 15.2.16 today)

> What do you have set as nearfull ratio? ceph osd dump | grep nearfull.

nearfull is 0.87

> Do you have the ceph balancer enabled?

ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000538",
    "last_optimize_started": "Wed Apr 20 13:02:26 2022",
    "mode": "crush-compat",
    "optimize_result": "Some objects (0.130412) are degraded; try again later",
    "plans": []
}

> What kind of maintenance was going on?

we were replacing a failing memory module (according to the IPMI log, all
errors were corrected though..)

> Are the PGs on those OSDs *way* bigger than on those of the other nodes?
> ceph pg ls-by-osd $osd-id and check for bytes (and OMAP bytes). Only
> accurate information when PGs have been recently deep-scrubbed.

sizes seem to be ~similar (each pg is between 65-75GB); if I sum them, the
total is almost twice as big for osd.5 as for osd.53-osd.55. they haven't
been scrubbed due to the ongoing recovery though.. but the OMAP sizes
shouldn't make such a difference..

> In this case the PG backfilltoofull warning(s) might have been correct.
> Yesterday though, I had no OSDs close to the nearfull ratio and was getting
> the same PG backfilltoofull message ... previously seen due to this bug [1].
> I could fix that by setting upmaps for the affected PGs to another OSD.

the warning is correct, but the usage value seems to be wrong.. one thing
that comes to mind: there seem to be a lot of PGs waiting for snaptrims..
I'll keep it snaptrimming for some time and see if usage lowers...

> > any idea on why this could be happening or what to check?
>
> It helps to know what kind of maintenance was going on. Sometimes Ceph PG
> mappings are not what you want. There are ways to do maintenance in a more
> controlled fashion.

the maintenance itself wasn't ceph related, it shouldn't have caused any PG
movement..

one thing to note: I added an SSD volume for all OSD DBs to speed up
recovery, but we had this problem before that, so I don't think this is the
culprit..

BR

nik

> > thanks a lot in advance for hints..
>
> Gr. Stefan
>
> [1]: https://tracker.ceph.com/issues/39555
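Stefan's per-PG byte check can be scripted so the comparison between the "full" and normal OSDs is a single number per OSD. A sketch - the `pg_stats[].stat_sum.num_bytes` layout is what octopus-era `ceph pg ls-by-osd <id> -f json` prints, but the field names may differ between releases:

```shell
# sum_pg_bytes FILE - sum num_bytes over all PGs in a saved
# 'ceph pg ls-by-osd <id> -f json' dump, using python3 for JSON parsing.
sum_pg_bytes() {
    python3 -c '
import json, sys
d = json.load(sys.stdin)
print(sum(p["stat_sum"]["num_bytes"] for p in d.get("pg_stats", [])))
' < "$1"
}

# Hypothetical usage, comparing a suspiciously full OSD with a normal one:
#   ceph pg ls-by-osd 53 -f json > /tmp/osd53.json; sum_pg_bytes /tmp/osd53.json
#   ceph pg ls-by-osd 5  -f json > /tmp/osd5.json;  sum_pg_bytes /tmp/osd5.json
```

If the summed PG bytes are far below the raw usage `osd df tree` reports, the extra space is outside the PGs (e.g. pending snaptrim garbage), which would match the observation that usage slowly shrinks while snaptrims run.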
[ceph-users] EC pool OSDs getting erroneously "full" (15.2.15)
Hi fellow ceph users and developers,

we've got into a quite strange situation which I'm not sure isn't a ceph bug..

we have a 4 node CEPH cluster with multiple pools. one of them is a SATA EC
2+2 pool containing 4x3 10TB drives (one of them is actually 12TB)

one day, after a planned downtime of the fourth node, we got into a strange
state where there seemed to be a large amount of degraded PGs to recover (we
had noout set for the duration of the downtime though). the weird thing was
that the OSDs of that node seemed to be almost full (ie 80%) while there were
almost no PGs on them according to osd df tree, leading to backfilltoofull..

after some experimenting, I dropped those OSDs and recreated them, but during
the recovery we got into the same state:

-31      120.0  -  112 TiB   81 TiB   80 TiB   36 GiB  456 GiB   31 TiB  72.58  1.06    -  root sata-archive
-32       30.0  -   29 TiB   20 TiB   20 TiB   10 GiB  133 GiB  9.5 TiB  67.48  0.99    -  host v1a-sata-archive
  5  hdd  10.0  1.0  9.2 TiB  6.2 TiB  6.1 TiB  3.5 GiB   47 GiB  3.0 TiB  67.78  0.99  171  up  osd.5
 10  hdd  10.0  1.0  9.2 TiB  6.2 TiB  6.2 TiB  3.6 GiB   48 GiB  2.9 TiB  68.06  1.00  171  up  osd.10
 13  hdd  10.0  1.0   11 TiB  7.3 TiB  7.3 TiB  3.2 GiB   38 GiB  3.6 TiB  66.73  0.98  170  up  osd.13
-33       30.0  -   27 TiB   19 TiB   18 TiB   11 GiB  139 GiB  9.0 TiB  67.39  0.99    -  host v1b-sata-archive
 19  hdd  10.0  1.0  9.2 TiB  6.1 TiB  6.1 TiB  3.5 GiB   46 GiB  3.0 TiB  67.11  0.98  171  up  osd.19
 28  hdd  10.0  1.0  9.2 TiB  6.1 TiB  6.0 TiB  3.5 GiB   46 GiB  3.1 TiB  66.44  0.97  170  up  osd.28
 29  hdd  10.0  1.0  9.2 TiB  6.3 TiB  6.2 TiB  3.6 GiB   48 GiB  2.9 TiB  68.61  1.00  171  up  osd.29
-34       30.0  -   27 TiB   19 TiB   19 TiB   11 GiB  143 GiB  8.6 TiB  68.65  1.00    -  host v1c-sata-archive
 30  hdd  10.0  1.0  9.2 TiB  6.3 TiB  6.2 TiB  3.8 GiB   48 GiB  2.8 TiB  68.91  1.01  171  up  osd.30
 31  hdd  10.0  1.0  9.1 TiB  6.3 TiB  6.3 TiB  3.6 GiB   48 GiB  2.8 TiB  69.20  1.01  171  up  osd.31
 52  hdd  10.0  1.0  9.1 TiB  6.2 TiB  6.1 TiB  3.4 GiB   46 GiB  2.9 TiB  67.84  0.99  170  up  osd.52
-35       30.0  -   27 TiB   24 TiB   24 TiB  4.0 GiB   41 GiB  3.5 TiB  87.13  1.27    -  host v1d-sata-archive
 53  hdd  10.0  1.0  9.2 TiB  8.1 TiB  8.0 TiB  1.3 GiB   14 GiB  1.0 TiB  88.54  1.29   81  up  osd.53
 54  hdd  10.0  1.0  9.2 TiB  8.3 TiB  8.2 TiB  1.4 GiB   14 GiB  897 GiB  90.44  1.32   79  up  osd.54
 55  hdd  10.0  1.0  9.1 TiB  7.5 TiB  7.5 TiB  1.3 GiB   13 GiB  1.6 TiB  82.39  1.21   62  up  osd.55

the count of PGs on osd.53..55 is less than 1/2 of the other OSDs, but they
are almost full. according to the weights, this should not happen..

any idea on why this could be happening or what to check?

thanks a lot in advance for hints..

with best regards

nikola ciprich
[ceph-users] Re: osd daemons still reading disks at full speed while there is no pool activity
Hello Josh,

just wanted to confirm that setting bluefs_buffered_io immediately helped to
hotfix the problem. I've also updated to 14.2.22, and we'll discuss adding
more NVMe modules to move OSD databases off the spinners to prevent further
occurrences.

thanks a lot for your time!

with best regards

nikola ciprich

On Wed, Nov 03, 2021 at 09:11:20AM -0600, Josh Baergen wrote:
> Hi Nikola,
>
> > yes, some nodes have stray pgs (1..5), shall I do something about those?
>
> No need to do anything - Ceph will clean those up itself (and is doing so
> right now). I just wanted to confirm my hunch.
>
> Enabling buffered I/O should have an immediate effect on the read rate to
> your disks. I would recommend upgrading to 14.2.17+, though, as the
> improvements to PG cleaning are pretty substantial.
>
> Josh
>
> [earlier exchange quoted in full - snipped]
[ceph-users] Re: osd daemons still reading disks at full speed while there is no pool activity
Hello Josh,

> Was there PG movement (backfill) happening in this cluster recently?
> Do the OSDs have stray PGs (e.g. 'ceph daemon osd.NN perf dump | grep
> numpg_stray' - run this against an affected OSD from the housing node)?

yes, some nodes have stray pgs (1..5), shall I do something about those?

> I'm wondering if you're running into
> https://tracker.ceph.com/issues/45765, where cleaning of PGs from OSDs

hmm, yes, this seems very familiar - problems started with using the
balancer, forgot to mention that!

> leads to a high read rate from disk due to a combination of rocksdb
> behaviour and caching issues. Turning on bluefs_buffered_io (on by default
> in 14.2.22) is a mitigation for this problem, but has some side effects to
> watch out for (write IOPS amplification, for one). Fixes for that linked
> issue went into 14.2.17, 14.2.22, and then Pacific; we found the 14.2.17
> changes to be quite effective by themselves.
>
> Even if you don't have stray PGs, trying bluefs_buffered_io might be an
> interesting experiment. An alternative would be to compact rocksdb for
> each of your OSDs and see if that helps; compacting eliminates the
> tombstoned data that can cause problems during iteration, but if you have
> a workload that generates a lot of rocksdb tombstones (like PG cleaning
> does), then the problem will return a while after compaction.

hmm, I'll try enabling bluefs_buffered_io (it was indeed false) and do the
compaction as well anyway..

I'll report back, thanks for the hints!

BR

nik

> Josh
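The two mitigations Josh suggests boil down to a handful of commands. A rough sketch only - the OSD id and data directory below are examples (ceph-0 is the default path for osd.0), and offline compaction requires the OSD to be stopped first:

```shell
# 1) Flip bluefs_buffered_io cluster-wide via the MON config database:
ceph config set osd bluefs_buffered_io true

# 2) Offline RocksDB compaction for one OSD (repeat per OSD, one at a time,
#    waiting for the cluster to become healthy between OSDs):
systemctl stop ceph-osd@0
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
systemctl start ceph-osd@0
```

As noted in the thread, compaction only removes the tombstones accumulated so far; a tombstone-heavy workload (like PG cleaning) will bring the problem back over time, whereas the buffered-IO switch addresses the read-amplification side directly.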
[ceph-users] Re: osd daemons still reading disks at full speed while there is no pool activity
Hello Eugen, thank you for your reply. Yes, I tried restarting all OSDs and monitors, and also increasing osd_map_cache_size to 5000 (that helped us before with a problem of OSD maps not being pruned). None of this helped.. with best regards nik On Wed, Nov 03, 2021 at 11:41:28AM +, Eugen Block wrote: > Hi, > > I don't have an explanation but I remember having a similar issue a > year ago or so. IIRC a simple OSD restart fixed that, so I never got > to the bottom of it. Have you tried to restart OSD daemons? > > > Zitat von Nikola Ciprich : > > >Hello fellow ceph users, > > > >I'm trying to catch a ghost here.. On one of our clusters, 6 nodes, > >14.2.15, EC pool 4+2, 6*32 SATA bluestore OSDs, we got into a very strange > >state. > > > >The cluster is clean (except for a "pgs not deep-scrubbed in time" warning, > >since we've disabled scrubbing while investigating), there is absolutely > >no activity on the EC pool, but according to atop, all OSDs are still reading > >furiously, without any apparent reason. Even when increasing the osd loglevel, > >I don't see anything interesting, except for the occasional > >2021-11-03 12:04:52.664 7fb8652e3700 5 osd.0 9347 heartbeat > >osd_stat(store_statfs(0xb80056c/0x26b57/0xe8d7fc0, > >data 0x2f0ddd813e8/0x30b0ee6, compress 0x0/0x0/0x0, omap > >0x98b706, meta 0x26abe48fa), peers > >[1,26,27,34,36,40,44,49,52,55,57,65,69,75,76,78,82,83,87,93,96,97,104,105,107,108,111,112,114,120,121,122,123,135,136,137,143,147,154,156,157,169,171,187,192,196,200,204,208,212,217,218,220,222,224,226,227] > >op hist []) > >and also compaction stats. > > > >Trying to sequentially read data from the pool leads to very poor > >performance (i.e. 8 MB/s). > > > >We've had a very similar problem on a different cluster (replicated, no EC), when > >osdmaps were not pruned correctly, but I checked and those seem to > >be OK; it's just that the > >OSDs are still reading something and I'm unable to find out what. 
> > > >here's output of crush for one node, others are pretty similar: > > > > -1 2803.19824- 2.7 PiB 609 TiB 607 TiB 1.9 GiB > >1.9 TiB 2.1 PiB 21.78 1.01 -root sata > > -2467.19971- 466 TiB 102 TiB 101 TiB 320 MiB > >328 GiB 364 TiB 21.83 1.01 -host spbstdv1a-sata > > 0 hdd 14.5 1.0 15 TiB 3.1 TiB 3.0 TiB 9.5 MiB > >9.7 GiB 12 TiB 20.98 0.97 51 up osd.0 > > 1 hdd 14.5 1.0 15 TiB 2.4 TiB 2.4 TiB 7.4 MiB > >7.7 GiB 12 TiB 16.34 0.76 50 up osd.1 > > 2 hdd 14.5 1.0 15 TiB 3.5 TiB 3.5 TiB 11 MiB > >11 GiB 11 TiB 24.33 1.13 51 up osd.2 > > 3 hdd 14.5 1.0 15 TiB 2.9 TiB 2.8 TiB 9.3 MiB > >9.1 GiB 12 TiB 19.58 0.91 48 up osd.3 > > 4 hdd 14.5 1.0 15 TiB 3.3 TiB 3.3 TiB 11 MiB > >11 GiB 11 TiB 22.94 1.06 51 up osd.4 > > 5 hdd 14.5 1.0 15 TiB 3.5 TiB 3.5 TiB 12 MiB > >12 GiB 11 TiB 23.94 1.11 50 up osd.5 > > 6 hdd 14.5 1.0 15 TiB 2.8 TiB 2.8 TiB 9.6 MiB > >9.6 GiB 12 TiB 19.11 0.89 49 up osd.6 > > 7 hdd 14.5 1.0 15 TiB 3.4 TiB 3.4 TiB 4.9 MiB > >11 GiB 11 TiB 23.68 1.10 50 up osd.7 > > 8 hdd 14.59998 1.0 15 TiB 3.2 TiB 3.2 TiB 10 MiB > >10 GiB 11 TiB 22.18 1.03 51 up osd.8 > > 9 hdd 14.5 1.0 15 TiB 3.4 TiB 3.4 TiB 4.9 MiB > >11 GiB 11 TiB 23.52 1.09 50 up osd.9 > > 10 hdd 14.5 1.0 15 TiB 2.7 TiB 2.6 TiB 8.5 MiB > >8.5 GiB 12 TiB 18.25 0.85 50 up osd.10 > > 11 hdd 14.5 1.0 15 TiB 3.4 TiB 3.3 TiB 10 MiB > >11 GiB 11 TiB 23.02 1.07 51 up osd.11 > > 12 hdd 14.5 1.0 15 TiB 2.8 TiB 2.8 TiB 10 MiB > >9.7 GiB 12 TiB 19.53 0.91 49 up osd.12 > > 13 hdd 14.5 1.0 15 TiB 3.7 TiB 3.7 TiB 11 MiB > >12 GiB 11 TiB 25.62 1.19 49 up osd.13 > > 14 hdd 14.5 1.0 15 TiB 2.6 TiB 2.6 TiB 8.2 MiB > >8.3 GiB 12 TiB 17.65 0.82 53 up osd.14 > > 15 hdd 14.5 1.0 15 TiB 2.5 TiB 2.5 TiB 7.6 MiB > >7.8 GiB 12 TiB 17.42 0.81 50 up osd.15 > > 16 hdd 14.5 1.0 15 TiB 3.5 TiB 3.5 TiB 11 MiB > >11 GiB 11 TiB 24.37 1.13 50 up osd.16 > > 17 hdd 14.5 1.0 15 TiB 3.5 TiB 3.5 TiB 12 MiB > >12 GiB 11 TiB 24.09 1.12 52 up osd.17 > > 18 hdd 14.5 1.0 15 TiB 2.4 TiB 2.4 TiB 6.9 M
[ceph-users] osd daemons still reading disks at full speed while there is no pool activity
GiB 11 TiB 23.04 1.07 50 up osd.24 25 hdd 14.5 1.0 15 TiB 3.1 TiB 3.1 TiB 10 MiB 9.9 GiB 11 TiB 21.61 1.00 50 up osd.25 162 hdd 14.5 1.0 15 TiB 3.2 TiB 3.2 TiB 10 MiB 10 GiB 11 TiB 21.76 1.01 50 up osd.162 163 hdd 14.5 1.0 15 TiB 3.4 TiB 3.4 TiB 11 MiB 11 GiB 11 TiB 23.60 1.09 50 up osd.163 164 hdd 14.5 1.0 15 TiB 3.5 TiB 3.5 TiB 12 MiB 11 GiB 11 TiB 24.38 1.13 51 up osd.164 165 hdd 14.5 1.0 15 TiB 2.9 TiB 2.9 TiB 9.1 MiB 9.5 GiB 12 TiB 20.18 0.94 50 up osd.165 166 hdd 14.5 1.0 15 TiB 3.3 TiB 3.3 TiB 11 MiB 11 GiB 11 TiB 22.62 1.05 50 up osd.166 167 hdd 14.5 1.0 15 TiB 3.5 TiB 3.5 TiB 12 MiB 12 GiB 11 TiB 24.36 1.13 52 up osd.167 Most OSD settings are defaults: cache autotune, memory_target 4GB, etc. There is absolutely no activity on this (or any related) pool; just on one replicated pool, on different drives, there are about 30 MB/s of writes. All boxes are almost idle and have enough RAM. Unfortunately the OSDs do not use any fast storage for WAL or DB. Has anyone met a similar problem? Or does somebody have a hint on how to debug what the OSDs are reading all the time? I'd be very grateful with best regards nikola ciprich -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
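[Editor's note] One crude way to quantify what the OSDs are reading is to sample a cumulative read-bytes counter from `ceph daemon osd.N perf dump` twice and compute the rate. The exact counter names vary by release, so treat this as a sketch of the arithmetic only:

```shell
# rate_per_sec BEFORE AFTER SECONDS -> integer bytes/s computed from two
# samples of a cumulative counter (e.g. a bluestore/bluefs read-bytes
# counter) taken SECONDS apart.
rate_per_sec() {
  echo $(( ($2 - $1) / $3 ))
}

rate_per_sec 1000000 97000000 60   # prints 1600000 (~1.6 MB/s)
```

Comparing this per-OSD figure with what atop reports would confirm whether the reads come from rocksdb/bluefs rather than client I/O.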
[ceph-users] Re: can't get healthy cluster to trim osdmaps (13.2.8)
Hi Jan, yes, I'm watching this TT as well, I'll post update there (together with quick & dirty patch to get more debugging info) BR nik On Mon, Mar 23, 2020 at 12:12:43PM +0100, Jan Fajerski wrote: > https://tracker.ceph.com/issues/44184 > Looks similar, maybe you're also seeing other symptoms listed there? > In any case would be good to track this in one place. > > On Mon, Mar 23, 2020 at 11:29:53AM +0100, Nikola Ciprich wrote: > >OK, so after some debugging, I've pinned the problem down to > >OSDMonitor::get_trim_to: > > > > std::lock_guard l(creating_pgs_lock); > > if (!creating_pgs.pgs.empty()) { > > return 0; > > } > > > >apparently creating_pgs.pgs.empty() is not true, do I understand it > >correctly that cluster thinks the list of creating pgs is not empty? > > > >all pgs are in clean+active state, so maybe there's something malformed > >in the db? How can I check? > > > >I tried dumping list of creating_pgs according to > >http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030297.html > >but to no avail > > > >On Tue, Mar 17, 2020 at 12:25:29PM +0100, Nikola Ciprich wrote: > >>Hello dear cephers, > >> > >>lately, there's been some discussion about slow requests hanging > >>in "wait for new map" status. At least in my case, it's being caused > >>by osdmaps not being properly trimmed. I tried all possible steps > >>to force osdmap pruning (restarting mons, restarting everyging, > >>poking crushmap), to no avail. Still all OSDs keep min osdmap version > >>1, while newest is 4734. Otherwise cluster is healthy, with no down > >>OSDs, network communication works flawlessly, all seems to be fine. > >>Just can't get old osdmaps to go away.. I's very small cluster and I've > >>moved all production traffic elsewhere, so I'm free to investigate > >>and debug, however I'm out of ideas on what to try or where to look. > >> > >>Any ideas somebody please? 
> >> > >>The cluster is running 13.2.8 > >> > >>I'd be very grateful for any tips > >> > >>with best regards > >> > >>nikola ciprich > >> > >>-- > >>- > >>Ing. Nikola CIPRICH > >>LinuxBox.cz, s.r.o. > >>28.rijna 168, 709 00 Ostrava > >> > >>tel.: +420 591 166 214 > >>fax:+420 596 621 273 > >>mobil: +420 777 093 799 > >>www.linuxbox.cz > >> > >>mobil servis: +420 737 238 656 > >>email servis: ser...@linuxbox.cz > >>- > >> > > > >-- > >- > >Ing. Nikola CIPRICH > >LinuxBox.cz, s.r.o. > >28.rijna 168, 709 00 Ostrava > > > >tel.: +420 591 166 214 > >fax:+420 596 621 273 > >mobil: +420 777 093 799 > >www.linuxbox.cz > > > >mobil servis: +420 737 238 656 > >email servis: ser...@linuxbox.cz > >- > >___ > >ceph-users mailing list -- ceph-users@ceph.io > >To unsubscribe send an email to ceph-users-le...@ceph.io > > -- > Jan Fajerski > Senior Software Engineer Enterprise Storage > SUSE Software Solutions Germany GmbH > Maxfeldstr. 5, 90409 Nürnberg, Germany > (HRB 36809, AG Nürnberg) > Geschäftsführer: Felix Imendörffer > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: can't get healthy cluster to trim osdmaps (13.2.8)
OK, to reply to myself :-) I wasn't very smart about decoding the output of "ceph-kvstore-tool get ..." so I added a dump of creating_pgs.pgs into the get_trim_to function. Now I have the list of PGs which seem to be stuck in the creating state in the monitors' DB. If I query them, they're active+clean as I wrote. I suppose I could remove them using ceph-kvstore-tool, right? However, I'd rather ask before I proceed: is it safe to remove them from the DB if they all seem to be already created? How do I do it? Stop all monitors, use the tool and start them again? (I've moved all the services to another cluster, so this won't cause any outage.) I'd be very grateful for guidance here.. thanks in advance BR nik On Mon, Mar 23, 2020 at 11:29:53AM +0100, Nikola Ciprich wrote: > OK, so after some debugging, I've pinned the problem down to > OSDMonitor::get_trim_to: > > std::lock_guard l(creating_pgs_lock); > if (!creating_pgs.pgs.empty()) { > return 0; > } > > apparently creating_pgs.pgs.empty() is not true, do I understand it > correctly that cluster thinks the list of creating pgs is not empty? > > all pgs are in clean+active state, so maybe there's something malformed > in the db? How can I check? > > I tried dumping the list of creating_pgs according to > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030297.html > but to no avail > > On Tue, Mar 17, 2020 at 12:25:29PM +0100, Nikola Ciprich wrote: > > Hello dear cephers, > > > > lately, there's been some discussion about slow requests hanging > > in "wait for new map" status. At least in my case, it's being caused > > by osdmaps not being properly trimmed. I tried all possible steps > > to force osdmap pruning (restarting mons, restarting everything, > > poking crushmap), to no avail. Still all OSDs keep min osdmap version > > 1, while newest is 4734. Otherwise cluster is healthy, with no down > > OSDs, network communication works flawlessly, all seems to be fine. > > Just can't get old osdmaps to go away.. 
I's very small cluster and I've > > moved all production traffic elsewhere, so I'm free to investigate > > and debug, however I'm out of ideas on what to try or where to look. > > > > Any ideas somebody please? > > > > The cluster is running 13.2.8 > > > > I'd be very grateful for any tips > > > > with best regards > > > > nikola ciprich > > > > -- > > - > > Ing. Nikola CIPRICH > > LinuxBox.cz, s.r.o. > > 28.rijna 168, 709 00 Ostrava > > > > tel.: +420 591 166 214 > > fax:+420 596 621 273 > > mobil: +420 777 093 799 > > www.linuxbox.cz > > > > mobil servis: +420 737 238 656 > > email servis: ser...@linuxbox.cz > > - > > > > -- > - > Ing. Nikola CIPRICH > LinuxBox.cz, s.r.o. > 28.rijna 168, 709 00 Ostrava > > tel.: +420 591 166 214 > fax:+420 596 621 273 > mobil: +420 777 093 799 > www.linuxbox.cz > > mobil servis: +420 737 238 656 > email servis: ser...@linuxbox.cz > - > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
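[Editor's note] The inspection procedure referenced above (from the 2018 thread) would look roughly like this. The `osd_pg_creating`/`creating` key names and the `creating_pgs_t` dencoder type are assumptions taken from that post, so verify them against your release, and back up the mon store before touching anything:

```shell
# Build the default mon store path for a given mon id (assumed layout).
mon_store() {
  echo "/var/lib/ceph/mon/ceph-$1/store.db"
}

# With the monitor STOPPED, dump and decode the creating-PGs structure:
#   systemctl stop ceph-mon@a
#   ceph-kvstore-tool rocksdb "$(mon_store a)" get osd_pg_creating creating out out.bin
#   ceph-dencoder type creating_pgs_t import out.bin decode dump_json
#   systemctl start ceph-mon@a

mon_store a   # prints /var/lib/ceph/mon/ceph-a/store.db
```

Decoding first (read-only) answers whether the stuck PGs are really in the structure before any destructive edit is considered.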
[ceph-users] Re: can't get healthy cluster to trim osdmaps (13.2.8)
OK, so after some debugging, I've pinned the problem down to OSDMonitor::get_trim_to: std::lock_guard l(creating_pgs_lock); if (!creating_pgs.pgs.empty()) { return 0; } apparently creating_pgs.pgs.empty() is not true, do I understand it correctly that cluster thinks the list of creating pgs is not empty? all pgs are in clean+active state, so maybe there's something malformed in the db? How can I check? I tried dumping list of creating_pgs according to http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030297.html but to no avail On Tue, Mar 17, 2020 at 12:25:29PM +0100, Nikola Ciprich wrote: > Hello dear cephers, > > lately, there's been some discussion about slow requests hanging > in "wait for new map" status. At least in my case, it's being caused > by osdmaps not being properly trimmed. I tried all possible steps > to force osdmap pruning (restarting mons, restarting everyging, > poking crushmap), to no avail. Still all OSDs keep min osdmap version > 1, while newest is 4734. Otherwise cluster is healthy, with no down > OSDs, network communication works flawlessly, all seems to be fine. > Just can't get old osdmaps to go away.. I's very small cluster and I've > moved all production traffic elsewhere, so I'm free to investigate > and debug, however I'm out of ideas on what to try or where to look. > > Any ideas somebody please? > > The cluster is running 13.2.8 > > I'd be very grateful for any tips > > with best regards > > nikola ciprich > > -- > - > Ing. Nikola CIPRICH > LinuxBox.cz, s.r.o. > 28.rijna 168, 709 00 Ostrava > > tel.: +420 591 166 214 > fax:+420 596 621 273 > mobil: +420 777 093 799 > www.linuxbox.cz > > mobil servis: +420 737 238 656 > email servis: ser...@linuxbox.cz > - > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 
28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] can't get healthy cluster to trim osdmaps (13.2.8)
Hello dear cephers, lately, there's been some discussion about slow requests hanging in "wait for new map" status. At least in my case, it's being caused by osdmaps not being properly trimmed. I tried all possible steps to force osdmap pruning (restarting mons, restarting everything, poking crushmap), to no avail. Still all OSDs keep min osdmap version 1, while newest is 4734. Otherwise the cluster is healthy, with no down OSDs, network communication works flawlessly, all seems to be fine. Just can't get old osdmaps to go away.. It's a very small cluster and I've moved all production traffic elsewhere, so I'm free to investigate and debug, however I'm out of ideas on what to try or where to look. Any ideas somebody please? The cluster is running 13.2.8 I'd be very grateful for any tips with best regards nikola ciprich -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax: +420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
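[Editor's note] Whether the OSDs really hold the full 1..4734 range can be read from the admin socket; a sketch, where the JSON sample abbreviates what `ceph daemon osd.N status` returns (the `oldest_map`/`newest_map` field names are from that output):

```shell
# Print the oldest and newest osdmap epochs from 'ceph daemon osd.N status'
# JSON read on stdin, one number per line.
map_range() {
  grep -Eo '"(oldest|newest)_map": *[0-9]+' | grep -Eo '[0-9]+$'
}

# Abbreviated sample of the status JSON:
echo '{ "state": "active", "oldest_map": 1, "newest_map": 4734 }' | map_range
# prints 1 then 4734
```

Running this against each OSD shows how many untrimmed epochs every daemon is carrying.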
[ceph-users] Re: osd_pg_create causing slow requests in Nautilus
Hi Dan, nope, osdmap_first_committed is still 1, it must be some different issue.. I'll report when I have something.. n. On Thu, Mar 12, 2020 at 04:07:26PM +0100, Dan van der Ster wrote: > You have to wait 5 minutes or so after restarting the mon before it > starts trimming. > Otherwise, hmm, I'm not sure. > > -- dan > > On Thu, Mar 12, 2020 at 3:55 PM Nikola Ciprich > wrote: > > > > Hi Dan, > > > > # ceph report 2>/dev/null | jq .osdmap_first_committed > > 1 > > # ceph report 2>/dev/null | jq .osdmap_last_committed > > 4646 > > > > seems like osdmap_first_committed doesn't change at all, restarting mons > > doesn't help.. I don't have any down OSD, everything seems to be healthy.. > > > > BR > > > > nik > > > > > > > > > > On Thu, Mar 12, 2020 at 03:23:25PM +0100, Dan van der Ster wrote: > > > If untrimmed osdmaps are related, then you should check: > > > https://tracker.ceph.com/issues/37875, particularly #note6 > > > > > > You can see what the mon thinks the valid range of osdmaps is: > > > > > > # ceph report | jq .osdmap_first_committed > > > 113300 > > > # ceph report | jq .osdmap_last_committed > > > 113938 > > > > > > Then the workaround to start trimming is to restart the leader. > > > This shrinks the range on the mon, which then starts telling the osds > > > to trim range. > > > Note that the OSDs will only trim 30 osdmaps for each new osdmap > > > generated -- so if you have a lot of osdmaps to trim, you need to > > > generate more. > > > > > > -- dan > > > > > > > > > On Thu, Mar 12, 2020 at 11:02 AM Nikola Ciprich > > > wrote: > > > > > > > > OK, > > > > > > > > so I can confirm that at least in my case, the problem is caused > > > > by old osd maps not being pruned for some reason, and thus not fitting > > > > into cache. When I increased osd map cache to 5000 the problem is gone. > > > > > > > > The question is why they're not being pruned, even though the cluster > > > > is in > > > > healthy state. 
But you can try checking: > > > > > > > > ceph daemon osd.X status to see how many maps are your OSDs storing > > > > and ceph daemon osd.X perf dump | grep osd_map_cache_miss > > > > > > > > to see if you're experiencing similar problem.. > > > > > > > > so I'm going to debug further.. > > > > > > > > BR > > > > > > > > nik > > > > > > > > On Thu, Mar 12, 2020 at 09:16:58AM +0100, Nikola Ciprich wrote: > > > > > Hi Paul and others, > > > > > > > > > > while digging deeper, I noticed that when the cluster gets into this > > > > > state, osd_map_cache_miss on OSDs starts growing rapidly.. even when > > > > > I increased osd map cache size to 500 (which was the default at least > > > > > for luminous) it behaves the same.. > > > > > > > > > > I think this could be related.. > > > > > > > > > > I'll try playing more with cache settings.. > > > > > > > > > > BR > > > > > > > > > > nik > > > > > > > > > > > > > > > > > > > > On Wed, Mar 11, 2020 at 03:40:04PM +0100, Paul Emmerich wrote: > > > > > > Encountered this one again today, I've updated the issue with new > > > > > > information: https://tracker.ceph.com/issues/44184 > > > > > > > > > > > > > > > > > > Paul > > > > > > > > > > > > -- > > > > > > Paul Emmerich > > > > > > > > > > > > Looking for help with your Ceph cluster? Contact us at > > > > > > https://croit.io > > > > > > > > > > > > croit GmbH > > > > > > Freseniusstr. 31h > > > > > > 81247 München > > > > > > www.croit.io > > > > > > Tel: +49 89 1896585 90 > > > > > > > > > > > > On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich > > > > > > wrote: > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > I just wa
[ceph-users] Re: osd_pg_create causing slow requests in Nautilus
Hi Dan, # ceph report 2>/dev/null | jq .osdmap_first_committed 1 # ceph report 2>/dev/null | jq .osdmap_last_committed 4646 seems like osdmap_first_committed doesn't change at all, restarting mons doesn't help.. I don't have any down OSD, everything seems to be healthy.. BR nik On Thu, Mar 12, 2020 at 03:23:25PM +0100, Dan van der Ster wrote: > If untrimmed osdmaps are related, then you should check: > https://tracker.ceph.com/issues/37875, particularly #note6 > > You can see what the mon thinks the valid range of osdmaps is: > > # ceph report | jq .osdmap_first_committed > 113300 > # ceph report | jq .osdmap_last_committed > 113938 > > Then the workaround to start trimming is to restart the leader. > This shrinks the range on the mon, which then starts telling the osds > to trim range. > Note that the OSDs will only trim 30 osdmaps for each new osdmap > generated -- so if you have a lot of osdmaps to trim, you need to > generate more. > > -- dan > > > On Thu, Mar 12, 2020 at 11:02 AM Nikola Ciprich > wrote: > > > > OK, > > > > so I can confirm that at least in my case, the problem is caused > > by old osd maps not being pruned for some reason, and thus not fitting > > into cache. When I increased osd map cache to 5000 the problem is gone. > > > > The question is why they're not being pruned, even though the cluster is in > > healthy state. But you can try checking: > > > > ceph daemon osd.X status to see how many maps your OSDs are storing > > and ceph daemon osd.X perf dump | grep osd_map_cache_miss > > > > to see if you're experiencing a similar problem.. > > > > so I'm going to debug further.. > > > > BR > > > > nik > > > > On Thu, Mar 12, 2020 at 09:16:58AM +0100, Nikola Ciprich wrote: > > > Hi Paul and others, > > > > > > while digging deeper, I noticed that when the cluster gets into this > > > state, osd_map_cache_miss on OSDs starts growing rapidly.. 
even when > > > I increased osd map cache size to 500 (which was the default at least > > > for luminous) it behaves the same.. > > > > > > I think this could be related.. > > > > > > I'll try playing more with cache settings.. > > > > > > BR > > > > > > nik > > > > > > > > > > > > On Wed, Mar 11, 2020 at 03:40:04PM +0100, Paul Emmerich wrote: > > > > Encountered this one again today, I've updated the issue with new > > > > information: https://tracker.ceph.com/issues/44184 > > > > > > > > > > > > Paul > > > > > > > > -- > > > > Paul Emmerich > > > > > > > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > > > > > > > croit GmbH > > > > Freseniusstr. 31h > > > > 81247 München > > > > www.croit.io > > > > Tel: +49 89 1896585 90 > > > > > > > > On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich > > > > wrote: > > > > > > > > > > Hi, > > > > > > > > > > I just wanted to report we've just hit very similar problem.. on mimic > > > > > (13.2.6). Any manipulation with OSD (ie restart) causes lot of slow > > > > > ops caused by waiting for new map. It seems those are slowed by SATA > > > > > OSDs which keep being 100% busy reading for long time until all ops > > > > > are gone, > > > > > blocking OPS on unrelated NVME pools - SATA pools are completely > > > > > unused now. > > > > > > > > > > is this possible that those maps are being requested from slow SATA > > > > > OSDs > > > > > and it takes such a long time for some reason? why could it take so > > > > > long? > > > > > the cluster is very small with very light load.. > > > > > > > > > > BR > > > > > > > > > > nik > > > > > > > > > > > > > > > > > > > > On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote: > > > > > > > > > > > > > > > > > > On 2/19/20 9:34 AM, Paul Emmerich wrote: > > > > > > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander > > > > > > > wrote: > > > > > > >> > > > > > > >> > > > > > >
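[Editor's note] Given Dan's note above that each newly generated osdmap lets the OSDs trim about 30 old ones, the 1..4646 backlog translates into roughly the following amount of map churn (a sketch; the factor 30 is taken from his mail):

```shell
# From first/last committed epochs: how many osdmaps are pending trim, and
# roughly how many new osdmaps must be generated to trim them all (ceil /30).
maps_pending()    { echo $(( $2 - $1 )); }
new_maps_needed() { echo $(( ($2 - $1 + 29) / 30 )); }

maps_pending 1 4646      # prints 4645
new_maps_needed 1 4646   # prints 155
```

So on this cluster, on the order of 150+ new epochs (e.g. from small crushmap or pool changes) would be needed before the backlog could clear, assuming trimming is working at all.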
[ceph-users] Re: osd_pg_create causing slow requests in Nautilus
OK, so I can confirm that at least in my case, the problem is caused by old osd maps not being pruned for some reason, and thus not fitting into cache. When I increased osd map cache to 5000 the problem is gone. The question is why they're not being pruned, even though the cluster is in healthy state. But you can try checking: ceph daemon osd.X status to see how many maps are your OSDs storing and ceph daemon osd.X perf dump | grep osd_map_cache_miss to see if you're experiencing similar problem.. so I'm going to debug further.. BR nik On Thu, Mar 12, 2020 at 09:16:58AM +0100, Nikola Ciprich wrote: > Hi Paul and others, > > while digging deeper, I noticed that when the cluster gets into this > state, osd_map_cache_miss on OSDs starts growing rapidly.. even when > I increased osd map cache size to 500 (which was the default at least > for luminous) it behaves the same.. > > I think this could be related.. > > I'll try playing more with cache settings.. > > BR > > nik > > > > On Wed, Mar 11, 2020 at 03:40:04PM +0100, Paul Emmerich wrote: > > Encountered this one again today, I've updated the issue with new > > information: https://tracker.ceph.com/issues/44184 > > > > > > Paul > > > > -- > > Paul Emmerich > > > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > > > croit GmbH > > Freseniusstr. 31h > > 81247 München > > www.croit.io > > Tel: +49 89 1896585 90 > > > > On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich > > wrote: > > > > > > Hi, > > > > > > I just wanted to report we've just hit very similar problem.. on mimic > > > (13.2.6). Any manipulation with OSD (ie restart) causes lot of slow > > > ops caused by waiting for new map. It seems those are slowed by SATA > > > OSDs which keep being 100% busy reading for long time until all ops are > > > gone, > > > blocking OPS on unrelated NVME pools - SATA pools are completely unused > > > now. 
> > > > > > is this possible that those maps are being requested from slow SATA OSDs > > > and it takes such a long time for some reason? why could it take so long? > > > the cluster is very small with very light load.. > > > > > > BR > > > > > > nik > > > > > > > > > > > > On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote: > > > > > > > > > > > > On 2/19/20 9:34 AM, Paul Emmerich wrote: > > > > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander > > > > > wrote: > > > > >> > > > > >> > > > > >> > > > > >> On 2/18/20 6:54 PM, Paul Emmerich wrote: > > > > >>> I've also seen this problem on Nautilus with no obvious reason for > > > > >>> the > > > > >>> slowness once. > > > > >> > > > > >> Did this resolve itself? Or did you remove the pool? > > > > > > > > > > I've seen this twice on the same cluster, it fixed itself the first > > > > > time (maybe with some OSD restarts?) and the other time I removed the > > > > > pool after a few minutes because the OSDs were running into heartbeat > > > > > timeouts. There unfortunately seems to be no way to reproduce this :( > > > > > > > > > > > > > Yes, that's the problem. I've been trying to reproduce it, but I can't. > > > > It works on all my Nautilus systems except for this one. > > > > > > > > As you saw it, Bryan saw it, I expect others to encounter this at some > > > > point as well. > > > > > > > > I don't have any extensive logging as this cluster is in production and > > > > I can't simply crank up the logging and try again. > > > > > > > > > In this case it wasn't a new pool that caused problems but a very old > > > > > one. > > > > > > > > > > > > > > > Paul > > > > > > > > > >> > > > > >>> In my case it was a rather old cluster that was upgraded all the way > > > > >>> from firefly > > > > >>> > > > > >>> > > > > >> > > > > >> This cluster has also been installed with Firefly. It was installed > > > > >> in > > > > >>
[ceph-users] Re: osd_pg_create causing slow requests in Nautilus
Hi Paul and others, while digging deeper, I noticed that when the cluster gets into this state, osd_map_cache_miss on OSDs starts growing rapidly.. even when I increased osd map cache size to 500 (which was the default at least for luminous) it behaves the same.. I think this could be related.. I'll try playing more with cache settings.. BR nik On Wed, Mar 11, 2020 at 03:40:04PM +0100, Paul Emmerich wrote: > Encountered this one again today, I've updated the issue with new > information: https://tracker.ceph.com/issues/44184 > > > Paul > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 31h > 81247 München > www.croit.io > Tel: +49 89 1896585 90 > > On Sat, Feb 29, 2020 at 10:21 PM Nikola Ciprich > wrote: > > > > Hi, > > > > I just wanted to report we've just hit very similar problem.. on mimic > > (13.2.6). Any manipulation with OSD (ie restart) causes lot of slow > > ops caused by waiting for new map. It seems those are slowed by SATA > > OSDs which keep being 100% busy reading for long time until all ops are > > gone, > > blocking OPS on unrelated NVME pools - SATA pools are completely unused now. > > > > is this possible that those maps are being requested from slow SATA OSDs > > and it takes such a long time for some reason? why could it take so long? > > the cluster is very small with very light load.. > > > > BR > > > > nik > > > > > > > > On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote: > > > > > > > > > On 2/19/20 9:34 AM, Paul Emmerich wrote: > > > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander > > > > wrote: > > > >> > > > >> > > > >> > > > >> On 2/18/20 6:54 PM, Paul Emmerich wrote: > > > >>> I've also seen this problem on Nautilus with no obvious reason for the > > > >>> slowness once. > > > >> > > > >> Did this resolve itself? Or did you remove the pool? 
> > > > > > > > I've seen this twice on the same cluster, it fixed itself the first > > > > time (maybe with some OSD restarts?) and the other time I removed the > > > > pool after a few minutes because the OSDs were running into heartbeat > > > > timeouts. There unfortunately seems to be no way to reproduce this :( > > > > > > > > > > Yes, that's the problem. I've been trying to reproduce it, but I can't. > > > It works on all my Nautilus systems except for this one. > > > > > > As you saw it, Bryan saw it, I expect others to encounter this at some > > > point as well. > > > > > > I don't have any extensive logging as this cluster is in production and > > > I can't simply crank up the logging and try again. > > > > > > > In this case it wasn't a new pool that caused problems but a very old > > > > one. > > > > > > > > > > > > Paul > > > > > > > >> > > > >>> In my case it was a rather old cluster that was upgraded all the way > > > >>> from firefly > > > >>> > > > >>> > > > >> > > > >> This cluster has also been installed with Firefly. It was installed in > > > >> 2015, so a while ago. > > > >> > > > >> Wido > > > ___ > > > ceph-users mailing list -- ceph-users@ceph.io > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > > -- > > - > > Ing. Nikola CIPRICH > > LinuxBox.cz, s.r.o. > > 28.rijna 168, 709 00 Ostrava > > > > tel.: +420 591 166 214 > > fax:+420 596 621 273 > > mobil: +420 777 093 799 > > www.linuxbox.cz > > > > mobil servis: +420 737 238 656 > > email servis: ser...@linuxbox.cz > > - > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: osd_pg_create causing slow requests in Nautilus
Hi, I just wanted to report we've just hit a very similar problem.. on mimic (13.2.6). Any manipulation with an OSD (i.e. restart) causes a lot of slow ops caused by waiting for new map. It seems those are slowed by SATA OSDs which keep being 100% busy reading for a long time until all ops are gone, blocking OPS on unrelated NVME pools - SATA pools are completely unused now. Is it possible that those maps are being requested from slow SATA OSDs and that it takes such a long time for some reason? Why could it take so long? The cluster is very small with a very light load.. BR nik On Wed, Feb 19, 2020 at 10:03:35AM +0100, Wido den Hollander wrote: > > > On 2/19/20 9:34 AM, Paul Emmerich wrote: > > On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander wrote: > >> > >> > >> > >> On 2/18/20 6:54 PM, Paul Emmerich wrote: > >>> I've also seen this problem on Nautilus with no obvious reason for the > >>> slowness once. > >> > >> Did this resolve itself? Or did you remove the pool? > > > > I've seen this twice on the same cluster, it fixed itself the first > > time (maybe with some OSD restarts?) and the other time I removed the > > pool after a few minutes because the OSDs were running into heartbeat > > timeouts. There unfortunately seems to be no way to reproduce this :( > > > > Yes, that's the problem. I've been trying to reproduce it, but I can't. > It works on all my Nautilus systems except for this one. > > As you saw it, Bryan saw it, I expect others to encounter this at some > point as well. > > I don't have any extensive logging as this cluster is in production and > I can't simply crank up the logging and try again. > > In this case it wasn't a new pool that caused problems but a very old one. > > > > > > Paul > > > >> > >>> In my case it was a rather old cluster that was upgraded all the way > >>> from firefly > >>> > >>> > >> > >> This cluster has also been installed with Firefly. It was installed in > >> 2015, so a while ago. 
> >> > >> Wido > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > -- - Ing. Nikola CIPRICH LinuxBox.cz, s.r.o. 28.rijna 168, 709 00 Ostrava tel.: +420 591 166 214 fax:+420 596 621 273 mobil: +420 777 093 799 www.linuxbox.cz mobil servis: +420 737 238 656 email servis: ser...@linuxbox.cz - ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io