[ceph-users] Re: OSD not created after replacing failed disk
Hello Vlad,

- add the following to your yaml and apply it: unmanaged: true
- go to the server that hosts the failed OSD
- fire up cephadm shell; if it's not there, install it and give the
  server the _admin label: ceph orch host label add servername _admin
- ceph orch osd rm 494
- ceph-volume lvm deactivate 494
- ceph-volume lvm zap --destroy --osd-id 494
- leave cephadm shell
- check if db, wal and osd were removed on the server (lsblk, vgs, lvs)
- if not, remove the volumes by hand with lvremove
- set unmanaged: false and apply the yaml

Best,
Malte

Am 07.07.22 um 20:55 schrieb Vladimir Brik:
> Hello
>
> I am running 17.2.1. We had a disk failure and I followed
> https://docs.ceph.com/en/quincy/cephadm/services/osd/ to replace the
> OSD, but it didn't work.
>
> I replaced the failed disk and ran "ceph orch osd rm 494 --replace
> --zap", which stopped and removed the daemon from "ceph orch ps" and
> deleted the WAL/DB LVM volume of the OSD from the NVMe device shared
> with other OSDs. "ceph status" says 710 OSDs total, 709 up. So far so
> good.
>
> BUT "ceph status" shows osd.494 as stray, even though it is not
> running on the host, its systemd files have been cleaned up, and
> "cephadm ls" doesn't show it.
>
> A new OSD is not being created. The logs have entries about osd
> claims for ID 494, but nothing is happening. Re-applying the drive
> group spec below didn't result in anything:
>
> service_type: osd
> service_id: r740xd2-mk2-hdd
> service_name: osd.r740xd2-mk2-hdd
> placement:
>   label: r740xd2-mk2
> spec:
>   data_devices:
>     rotational: 1
>   db_devices:
>     paths:
>     - /dev/nvme0n1
>
> Did I do something incorrectly? What do I need to do to re-create the
> failed OSD?
>
> Vlad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
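Malte's steps above can be sketched as a shell session. This is only a sketch against the details given in the thread (OSD id 494, host "servername"); the spec file name "osd-spec.yaml" is a placeholder for your own drive group spec, and the ceph-volume commands are run inside cephadm shell on the affected host:

```shell
# On an admin node: set "unmanaged: true" in your OSD service spec
# (placeholder file name) and apply it so cephadm doesn't interfere:
ceph orch apply -i osd-spec.yaml

# Give the host admin keys if cephadm shell isn't usable there yet:
ceph orch host label add servername _admin

# Remove the OSD from the orchestrator:
ceph orch osd rm 494

# On the host, inside "cephadm shell":
ceph-volume lvm deactivate 494
ceph-volume lvm zap --destroy --osd-id 494

# Back on the host, verify the data/db/wal volumes are really gone:
lsblk
vgs
lvs
# If a leftover logical volume remains, remove it by hand, e.g.:
# lvremove <vg>/<lv>

# Finally set "unmanaged: false" in the spec and re-apply it.
```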
[ceph-users] Re: [ext] Re: snap_schedule MGR module not available after upgrade to Quincy
Hey Andreas,

Indeed, we were also able to remove the legacy schedule DB, and the
scheduler is now picking up the work again. Wouldn't have known where
to look for it. Thanks for your help and all the details. I really
appreciate it.

Best, Mathias

On 7/7/2022 11:46 AM, Andreas Teuchert wrote:
> Hello Mathias,
>
> On 06.07.22 18:27, Kuhring, Mathias wrote:
>> Hey Andreas,
>>
>> thanks for the info.
>>
>> We also had our MGR reporting crashes related to the module.
>>
>> We have a second cluster as mirror which we also updated to Quincy.
>> But there the MGR is able to use the snap_schedule module (so "ceph
>> fs snap-schedule status" etc. are not complaining), and I'm able to
>> schedule snapshots. But we didn't have any schedules there before
>> the upgrade (due to being the mirror).
>
> I think in that case there is no RADOS object for the legacy schedule
> DB, which is handled gracefully by the code.
>
>> I also noticed that this particular part of the code you mentioned
>> hasn't been touched in a year and a half:
>> https://github.com/ceph/ceph/blame/ec95624474b1871a821a912b8c3af68f8f8e7aa1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>
> The relevant change was made 17 months ago, but it was not backported
> to Pacific and is only included in Quincy.
>
>> So I'm wondering if my previous schedule entries somehow became
>> incompatible with the new version.
>
> The schedule entries are still the same. What changed is that the
> sqlite DB they are stored in is no longer stored as a DB dump in a
> RADOS object in the FS's metadata pool. Instead, the sqlite Ceph VFS
> driver is now used to store the DB in the metadata pool.
>
>> Do you know if there is any way to reset/clean up the module's
>> config/database? That is, remove all the previously scheduled
>> snapshots, but without using "fs snap-schedule remove"? We only have
>> a handful of schedules, which can easily be recreated, so a clean
>> start would at least be a workaround.
>
> We could just solve the problem by deleting the legacy schedule DB
> after the upgrade:
>
> rados -p -N cephfs-snap-schedule rm snap_db_v0
>
> Afterwards the mgr has to be restarted/failed over.
>
> The schedules are still there afterwards because they have already
> been migrated to the new DB.
>
> Thanks to my colleague Chris Glaubitz for figuring out that the
> object is in a separate namespace. :-)
>
>> Otherwise we will keep simple cron jobs until these issues are
>> fixed. After all, you just need regularly executed mkdir and rmdir
>> to get you started.
>>
>> Best wishes,
>> Mathias
>
> Best regards,
>
> Andreas
>
>> On 7/6/2022 5:05 PM, Andreas Teuchert wrote:
>>> Hello Mathias and others,
>>>
>>> I also ran into this problem after upgrading from 16.2.9 to 17.2.1.
>>>
>>> Additionally I observed a health warning: "3 mgr modules have
>>> recently crashed".
>>>
>>> Those are actually two distinct crashes that are already in the
>>> tracker:
>>>
>>> https://tracker.ceph.com/issues/56269 and
>>> https://tracker.ceph.com/issues/56270
>>>
>>> Considering that the crashes are in the snap_schedule module, I
>>> assume they are the reason why the module is not available.
>>>
>>> I can reproduce the crash in 56270 by failing over the mgr.
>>>
>>> I believe the faulty code causing the error is this line:
>>> https://github.com/ceph/ceph/blob/v17.2.1/src/pybind/mgr/snap_schedule/fs/schedule_client.py#L193
>>>
>>> Instead of ioctx.remove(SNAP_DB_OBJECT_NAME) it should be
>>> ioctx.remove_object(SNAP_DB_OBJECT_NAME).
>>>
>>> (According to my understanding of
>>> https://docs.ceph.com/en/latest/rados/api/python/.)
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>> On 01.07.22 18:05, Kuhring, Mathias wrote:
>>>> Dear Ceph community,
>>>>
>>>> After upgrading our cluster to Quincy with cephadm (ceph orch
>>>> upgrade start --image quay.io/ceph/ceph:v17.2.1), I struggle to
>>>> re-activate the snapshot schedule module:
>>>>
>>>> 0|0[root@osd-1 ~]# ceph mgr module enable snap_schedule
>>>> 0|1[root@osd-1 ~]# ceph mgr module ls | grep snap
>>>> snap_schedule on
>>>> 0|0[root@osd-1 ~]# ceph fs snap-schedule list / --recursive
>>>> Error ENOENT: Module 'snap_schedule' is not available
>>>>
>>>> I tried restarting the MGR daemons and failing over to a restarted
>>>> one, but with no change.
>>>>
>>>> 0|0[root@osd-1 ~]# ceph orch restart mgr
>>>> Scheduled to restart mgr.osd-1 on host 'osd-1'
>>>> Scheduled to restart mgr.osd-2 on host 'osd-2'
>>>> Scheduled to restart mgr.osd-3 on host 'osd-3'
>>>> Scheduled to restart mgr.osd-4.oylrhe on host 'osd-4'
>>>> Scheduled to restart mgr.osd-5.jcfyqe on host 'osd-5'
>>>>
>>>> 0|0[root@osd-1 ~]# ceph orch ps --daemon_type mgr
>>>> NAME       HOST   PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID
>>>> mgr.osd-1  osd-1  *:8443,9283  running (61s)  35s ago    9M   402M     -        17.2.1   e5af760fa1c1
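Andreas's cleanup can be sketched as a short shell session. This is a sketch, not verified against your cluster: the metadata pool name below is a placeholder (the original mail elides it), so look it up first, and note the object lives in the separate cephfs-snap-schedule namespace:

```shell
# Find your FS's metadata pool name first:
ceph fs ls

# Delete the legacy schedule DB dump; "cephfs_metadata" is a
# placeholder for the metadata pool name found above. Note the
# separate RADOS namespace:
rados -p cephfs_metadata -N cephfs-snap-schedule rm snap_db_v0

# Fail over the active mgr so the snap_schedule module reloads:
ceph mgr fail
```

The schedules themselves survive this, since they were already migrated to the new VFS-backed DB during the upgrade.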
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
On 7 Jul 2022, at 15:41, Dan van der Ster wrote:
>
> How is one supposed to redeploy OSDs on a multi-PB cluster while the
> performance is degraded?

That is a very strong point! It's good that this case can be fixed by
setting bluestore_prefer_deferred_size_hdd to 128k. And I think we need
an answer from Igor on which it is:

* this is a bug, or
* bluestore_prefer_deferred_size_hdd should be increased by the
  operator until the migration to 4k min_alloc_size is finished

k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
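The workaround discussed above can be applied cluster-wide through the config database. A sketch (131072 bytes = 128 KiB; whether this is the right long-term value is exactly the open question for Igor):

```shell
# Raise the HDD deferred-write threshold to 128 KiB so that small
# writes on pre-Pacific OSDs (64k min_alloc_size) are deferred to the
# flash block.db again:
ceph config set osd bluestore_prefer_deferred_size_hdd 131072

# Verify what a running OSD actually sees (osd.0 as an example):
ceph config show osd.0 bluestore_prefer_deferred_size_hdd
```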
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi,

On Thu, Jul 7, 2022 at 2:37 PM Konstantin Shalygin wrote:
>
> Hi,
>
> On 7 Jul 2022, at 13:04, Dan van der Ster wrote:
>
> I'm not sure the html mail made it to the lists -- resending in plain
> text. I've also opened https://tracker.ceph.com/issues/56488
>
>
> I think with pacific you need to redeploy all OSDs to respect the new
> default bluestore_min_alloc_size_hdd = 4096 [1]
> Or not?

Understood, yes, that is another "solution". But it is incredibly
impractical, I would say impossible, for loaded production
installations. (How is one supposed to redeploy OSDs on a multi-PB
cluster while the performance is degraded?)

-- Dan

> [1] https://github.com/ceph/ceph/pull/34588
>
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Performance in Proof-of-Concept cluster
> Thanks for sharing. How many nodes/OSDs? I get the following tail for
> the same command and size=3 (3 nodes, 4 OSDs each):

4 nodes, 2x SSD per node.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds
Hi again,

I'm not sure the html mail made it to the lists -- resending in plain
text. I've also opened https://tracker.ceph.com/issues/56488

Cheers, Dan

On Wed, Jul 6, 2022 at 11:43 PM Dan van der Ster wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed
> something is very wrong with deferred writes in pacific. Here is an
> example cluster, upgraded today:
>
> [inline plot not preserved in the plain-text archive]
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash
> block.db.
>
> I found that the performance issue is because 4kB writes are no
> longer deferred from those pre-pacific hdds to flash in pacific with
> the default config!!! Here are example bench writes from both
> releases: https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific
> default; note the default was 32k in octopus).
>
> I think this is related to the fixes in
> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
> _do_alloc_write is comparing the prealloc size 0x10000 with
> bluestore_prefer_deferred_size_hdd (0x10000), and the "strictly less
> than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd
> mixed osds ... surely we must not be the only clusters impacted by
> this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up
> to 128kB, or is there in fact a bug here?
>
> Best Regards,
>
> Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
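The "strictly less than" condition Dan describes can be modeled in a few lines of shell, just to make the off-by-one-threshold behavior concrete (a simplified illustration, not the actual _do_alloc_write C++ code):

```shell
# Simplified model of the deferred-write decision: a write is deferred
# only if the preallocated extent size is STRICTLY less than
# bluestore_prefer_deferred_size_hdd.
check() {  # usage: check <prealloc_size> <prefer_deferred_size>
    if [ "$(( $1 ))" -lt "$(( $2 ))" ]; then
        echo "deferred"
    else
        echo "not deferred"
    fi
}

# Pacific default: 64 KiB prealloc vs 64 KiB threshold. 0x10000 is not
# strictly less than 0x10000, so 4 KiB client writes hit the HDD:
check 0x10000 0x10000    # prints "not deferred"

# With the 128 KiB workaround the same write is deferred again:
check 0x10000 0x20000    # prints "deferred"
```

This is why a threshold equal to the min_alloc_size-driven prealloc size silently disables deferral on pre-pacific OSDs.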
[ceph-users] Re: Performance in Proof-of-Concept cluster
Hi,

Run a close-to-the-metal benchmark on the disks first, just to see the
theoretical ceiling. Also, rerun your benchmarks with random writes,
just to get more honest numbers as well.

Based on the numbers so far, you seem to be getting 40k client IOPS @
512 threads. Due to 3x replication and 3 nodes, this translates 1:1 to
40k per node, so ~10k per SSD. Depending on the benchmark directly on a
disk (requested above), this can be either good or bad.

You might want to try 2 ceph-osd processes per SSD, just to see if the
Ceph process is the bottleneck.

Hope this gives you food for thought.

On 7/6/22 13:13, Eneko Lacunza wrote:
> Hi all,
>
> We have a proof-of-concept HCI cluster with Proxmox v7 and Ceph v15.
> We have 3 nodes:
>
> 2x Intel Xeon 5218 Gold (16 cores/32 threads per socket)
> Dell PERC H330 controller (SAS3)
> 4x Samsung PM1634 3.84TB SAS 12Gb SSD
>
> Network is LACP 2x10Gbps.
>
> This cluster is used for some VDI tests, with Windows 10 VMs. The
> pool has size=3/min=2 and is used for RBD (KVM/QEMU VMs).
>
> We are seeing Ceph performance reaching about 600MiB/s read and
> 500MiB/s write, and about 6,000 read IOPS and 2,000 write IOPS.
> Reads/writes are simultaneous (mixed IO), as reported by Ceph.
>
> Is this a reasonable performance for the hardware we have?
>
> We see about 25-30% CPU used in the nodes, and ceph-osd processes
> spiking between 600% and 1000% (I guess that's full 6-10 threads in
> use). I have checked cache for the disks, but they report cache as
> "Not applicable". The BIOS power profile is performance and C-states
> are disabled.
>
> Thanks
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda.
> Oficina 10-11, 20180 Oiartzun
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
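The close-to-the-metal baseline suggested in the reply above can be gathered with fio. A sketch only: /dev/sdX is a placeholder for an unused scratch SSD, and this test destroys data on the target device:

```shell
# Raw 4k random-write ceiling of one SSD, bypassing the page cache.
# WARNING: writes directly to the device; use a scratch disk only.
fio --name=raw-randwrite --filename=/dev/sdX --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
    --numjobs=1 --runtime=60 --time_based --group_reporting

# Repeat with --rw=randread for the read ceiling, then compare the
# per-disk IOPS with what the cluster delivers per OSD (~10k above).
```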