[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade
If it can help: I recently updated my Ceph cluster (composed of 3 mon-mgr nodes and n OSD nodes) from Nautilus on CentOS 7 to Pacific on CentOS 8 Stream.

First I reinstalled the mon-mgr nodes with CentOS 8 Stream (removing them from the cluster and then re-adding them with the new operating system). This was needed because the mgr on Octopus runs only on RHEL 8 and its forks.

Then I migrated the cluster to Octopus (so mon-mgr running C8 Stream and OSD nodes still running CentOS 7).

Then I reinstalled each OSD node with CentOS 8 Stream, without draining the node [*].

Then I migrated the cluster from Octopus to Pacific.

[*]
- ceph osd set noout
- reinstall the node with C8 Stream
- install ceph
- ceph-volume lvm activate --all

Cheers, Massimo

On Tue, Dec 6, 2022 at 3:58 PM David C wrote:
> Hi All
>
> I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> cluster is primarily used for CephFS, mix of Filestore and Bluestore
> OSDs, mons/osds collocated, running on CentOS 7 nodes
>
> My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
>
> I assume the cleanest way to update the node OS would be to drain the
> node and remove from the cluster, install Rocky 8, add back to cluster
> as effectively a new node
>
> I have a relatively short maintenance window and was hoping to speed
> up OS upgrade with the following approach on each node:
>
> - back up ceph config/systemd files etc.
> - set noout etc.
> - deploy Rocky 8, being careful not to touch OSD block devices
> - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> - copy ceph config back over
>
> In theory I could then start up the daemons and they wouldn't care
> that we're now running on a different OS
>
> Does anyone see any issues with that approach? I plan to test on a dev
> cluster anyway but would be grateful for any thoughts
>
> Thanks,
> David
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
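Massimo's per-node [*] sequence can be written down as a small script. This is only a sketch of the steps named above: the reinstall itself happens out of band, `ceph osd set noout`, `ceph-volume lvm activate --all` and `ceph osd unset noout` are the real commands involved, and the DRY_RUN guard (default on) just echoes what would run.

```shell
#!/usr/bin/env bash
# Sketch of the per-OSD-node reinstall sequence described above.
# DRY_RUN=1 (default) only prints the commands; adapt before real use.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

# 1. Stop rebalancing while the node is down
run ceph osd set noout

# 2. Reinstall the OS out of band (PXE/kickstart/etc.), keeping the OSD
#    data devices untouched, then install the matching ceph packages.

# 3. Bring the OSDs back: ceph-volume rescans the LVM metadata on the
#    untouched devices and recreates the systemd units / tmpfs mounts.
run ceph-volume lvm activate --all

# 4. Once the OSDs have rejoined and PGs are active+clean again
run ceph osd unset noout
```

Run once per node, waiting for recovery in between; nothing here drains the node, which is exactly the point of the procedure.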
[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade
We went on a couple of clusters from ceph-deploy+centos7+nautilus to cephadm+rocky8+pacific using ELevate as one of the steps, going through Octopus as well. ELevate wasn't perfect for us either, but it got the job done. We had to test it carefully on the test clusters multiple times to get the procedure just right, and had some bumps even then, but were able to get things finished up.

Thanks,
Kevin

From: Wolfpaw - Dale Corse
Sent: Tuesday, December 6, 2022 8:18 AM
To: 'David C'
Cc: 'ceph-users'
Subject: [ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade

Hi David,

> Good to hear you had success with the ELevate tool, I'd looked at that but
> seemed a bit risky. The tool supports Rocky so I may give it a look.

ELevate wasn't perfect - we had to manually upgrade some packages from outside repos (ceph, opennebula and salt, if memory serves). That said, it was certainly manageable.

> This one is surprising since in theory Pacific still supports Filestore,
> there is at least one thread on the list where someone upgraded to Pacific
> and is still running some Filestore OSDs -
> on the other hand, there's also a recent thread where someone ran into
> problems and was forced to upgrade to Bluestore - did you experience issues
> yourself or was this advice you picked up? I do ultimately want to get all
> my OSDs on Bluestore but was hoping to do that after the Ceph version upgrade.

Sorry - I was mistaken about RocksDB/LevelDB and Filestore upgrades being required for Pacific. Apologies! I do remember doing all of ours when we upgraded from Luminous -> Nautilus, but I can't remember why, to be honest. Might have been advice at the time, or something I read when looking into the upgrade :)

Cheers,
D.
-----Original Message-----
From: David C [mailto:dcsysengin...@gmail.com]
Sent: Tuesday, December 6, 2022 8:56 AM
To: Wolfpaw - Dale Corse
Cc: ceph-users
Subject: [SPAM] [ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

Hi Wolfpaw, thanks for the response

> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.

Good to hear you had success with the ELevate tool, I'd looked at that but it seemed a bit risky. The tool supports Rocky so I may give it a look.

> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.

This one is surprising since in theory Pacific still supports Filestore; there is at least one thread on the list where someone upgraded to Pacific and is still running some Filestore OSDs - on the other hand, there's also a recent thread where someone ran into problems and was forced to upgrade to Bluestore - did you experience issues yourself or was this advice you picked up? I do ultimately want to get all my OSDs on Bluestore but was hoping to do that after the Ceph version upgrade.

> - You may need to upgrade monitors to RocksDB too.

Thanks, I wasn't aware of this - I suppose I'll do that when I'm on Nautilus

On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse wrote:
> We did this (over a longer timespan).. it worked ok.
>
> A couple things I'd add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then
> used AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky
> has a similar path I think.
>
> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific
> > 16.2.10, cluster is primarily used for CephFS, mix of Filestore and
> > Bluestore OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain
> > the node and remove from the cluster, install Rocky 8, add back to
> > cluster as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care
> > that we're now running on a different OS
[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade
Hi David,

> Good to hear you had success with the ELevate tool, I'd looked at that but
> seemed a bit risky. The tool supports Rocky so I may give it a look.

ELevate wasn't perfect - we had to manually upgrade some packages from outside repos (ceph, opennebula and salt, if memory serves). That said, it was certainly manageable.

> This one is surprising since in theory Pacific still supports Filestore,
> there is at least one thread on the list where someone upgraded to Pacific
> and is still running some Filestore OSDs -
> on the other hand, there's also a recent thread where someone ran into
> problems and was forced to upgrade to Bluestore - did you experience issues
> yourself or was this advice you picked up? I do ultimately want to get all
> my OSDs on Bluestore but was hoping to do that after the Ceph version upgrade.

Sorry - I was mistaken about RocksDB/LevelDB and Filestore upgrades being required for Pacific. Apologies! I do remember doing all of ours when we upgraded from Luminous -> Nautilus, but I can't remember why, to be honest. Might have been advice at the time, or something I read when looking into the upgrade :)

Cheers,
D.

-----Original Message-----
From: David C [mailto:dcsysengin...@gmail.com]
Sent: Tuesday, December 6, 2022 8:56 AM
To: Wolfpaw - Dale Corse
Cc: ceph-users
Subject: [SPAM] [ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

Hi Wolfpaw, thanks for the response

> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.

Good to hear you had success with the ELevate tool, I'd looked at that but it seemed a bit risky. The tool supports Rocky so I may give it a look.

> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.

This one is surprising since in theory Pacific still supports Filestore; there is at least one thread on the list where someone upgraded to Pacific and is still running some Filestore OSDs - on the other hand, there's also a recent thread where someone ran into problems and was forced to upgrade to Bluestore - did you experience issues yourself or was this advice you picked up? I do ultimately want to get all my OSDs on Bluestore but was hoping to do that after the Ceph version upgrade.

> - You may need to upgrade monitors to RocksDB too.

Thanks, I wasn't aware of this - I suppose I'll do that when I'm on Nautilus

On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse wrote:
> We did this (over a longer timespan).. it worked ok.
>
> A couple things I'd add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then
> used AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky
> has a similar path I think.
>
> - you will need to move those Filestore OSDs to Bluestore before
> hitting Pacific, might even be part of the Nautilus upgrade. This
> takes some time if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific
> > 16.2.10, cluster is primarily used for CephFS, mix of Filestore and
> > Bluestore OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain
> > the node and remove from the cluster, install Rocky 8, add back to
> > cluster as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care
> > that we're now running on a different OS
> >
> > Does anyone see any issues with that approach? I plan to test on a
> > dev cluster anyway but would be grateful for any thoughts
> >
> > Thanks,
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade
> I don't think this is necessary. It _is_ necessary to convert all
> leveldb to rocksdb before upgrading to Pacific, on both mons and any
> filestore OSDs.

Thanks, Josh, I guess that explains why some people had issues with Filestore OSDs post-Pacific upgrade

On Tue, Dec 6, 2022 at 4:07 PM Josh Baergen wrote:
> > - you will need to move those Filestore OSDs to Bluestore before
> > hitting Pacific, might even be part of the Nautilus upgrade. This takes
> > some time if I remember correctly.
>
> I don't think this is necessary. It _is_ necessary to convert all
> leveldb to rocksdb before upgrading to Pacific, on both mons and any
> filestore OSDs.
>
> Quincy will warn you about filestore OSDs, and Reef will no longer
> support filestore.
>
> Josh

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade
> - you will need to move those Filestore OSDs to Bluestore before hitting
> Pacific, might even be part of the Nautilus upgrade. This takes some time if
> I remember correctly.

I don't think this is necessary. It _is_ necessary to convert all leveldb to rocksdb before upgrading to Pacific, on both mons and any filestore OSDs.

Quincy will warn you about filestore OSDs, and Reef will no longer support filestore.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
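Before the Pacific jump it is worth finding out which daemons are still on leveldb. A read-only sketch of how one might check - the `/var/lib/ceph` paths are the package-install defaults and the whole thing is an assumption to verify against your own layout (`ceph osd count-metadata` is a real command; the mon `kv_backend` file is how package-based mons record their backend):

```shell
#!/usr/bin/env bash
# Read-only sketch: look for daemons that may still be on leveldb
# before a Pacific upgrade. DRY_RUN=1 (default) echoes the ceph call.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

# Mons record their key/value backend in a plain-text file; it should
# say "rocksdb" -- "leveldb" means the mon still needs converting.
for f in /var/lib/ceph/mon/*/kv_backend; do
    if [ -e "$f" ]; then
        echo "$f: $(cat "$f")"
    fi
done

# Count OSDs per objectstore; any "filestore" OSDs are the ones whose
# omap backend may still be leveldb.
run ceph osd count-metadata osd_objectstore
```

Run the mon check on each mon host; the metadata count can be run from anywhere with admin credentials.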
[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade
Hi Wolfpaw, thanks for the response

> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.

Good to hear you had success with the ELevate tool, I'd looked at that but it seemed a bit risky. The tool supports Rocky so I may give it a look.

> - you will need to move those Filestore OSDs to Bluestore before hitting
> Pacific, might even be part of the Nautilus upgrade. This takes some time
> if I remember correctly.

This one is surprising since in theory Pacific still supports Filestore; there is at least one thread on the list where someone upgraded to Pacific and is still running some Filestore OSDs - on the other hand, there's also a recent thread where someone ran into problems and was forced to upgrade to Bluestore - did you experience issues yourself or was this advice you picked up? I do ultimately want to get all my OSDs on Bluestore but was hoping to do that after the Ceph version upgrade.

> - You may need to upgrade monitors to RocksDB too.

Thanks, I wasn't aware of this - I suppose I'll do that when I'm on Nautilus

On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse wrote:
> We did this (over a longer timespan).. it worked ok.
>
> A couple things I'd add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>
> - you will need to move those Filestore OSDs to Bluestore before hitting
> Pacific, might even be part of the Nautilus upgrade. This takes some time
> if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> > cluster is primarily used for CephFS, mix of Filestore and Bluestore
> > OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain the
> > node and remove from the cluster, install Rocky 8, add back to cluster
> > as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care
> > that we're now running on a different OS
> >
> > Does anyone see any issues with that approach? I plan to test on a dev
> > cluster anyway but would be grateful for any thoughts
> >
> > Thanks,
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph upgrade advice - Luminous to Pacific with OS upgrade
On 12/6/22 15:58, David C wrote:
> Hi All
>
> I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> cluster is primarily used for CephFS, mix of Filestore and Bluestore
> OSDs, mons/osds collocated, running on CentOS 7 nodes
>
> My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
>
> I assume the cleanest way to update the node OS would be to drain the
> node and remove from the cluster, install Rocky 8, add back to cluster
> as effectively a new node
>
> I have a relatively short maintenance window and was hoping to speed
> up OS upgrade with the following approach on each node:
>
> - back up ceph config/systemd files etc.
> - set noout etc.
> - deploy Rocky 8, being careful not to touch OSD block devices
> - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> - copy ceph config back over
>
> In theory I could then start up the daemons and they wouldn't care
> that we're now running on a different OS
>
> Does anyone see any issues with that approach? I plan to test on a dev
> cluster anyway but would be grateful for any thoughts

That would work. Just run:

  systemctl enable ceph-osd.target
  ceph-volume lvm activate --all

on them and you should be good to go. I have done a re-install from 16.04 to 20.04 this way and that just worked (TM).

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
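After those two commands bring the OSDs back, a few read-only sanity checks are cheap before moving on to the next node. A sketch only - these are standard ceph CLI calls, and DRY_RUN=1 (the default) just echoes them:

```shell
#!/usr/bin/env bash
# Post-reactivation sanity checks, sketched from the advice above.
# DRY_RUN=1 (default) only prints the commands.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

run ceph -s              # expect HEALTH_OK / PGs returning to active+clean
run ceph osd tree down   # should list nothing once the node has rejoined
run ceph versions        # confirm every daemon is on the expected release
```

The `ceph versions` check matters most here, since the whole point is to come back up on the same Ceph version as before the OS reinstall.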
[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade
We did this (over a longer timespan).. it worked ok.

A couple things I'd add:

- I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a similar path I think.

- you will need to move those Filestore OSDs to Bluestore before hitting Pacific, might even be part of the Nautilus upgrade. This takes some time if I remember correctly.

- You may need to upgrade monitors to RocksDB too.

Sent from my iPhone

> On Dec 6, 2022, at 7:59 AM, David C wrote:
>
> Hi All
>
> I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> cluster is primarily used for CephFS, mix of Filestore and Bluestore
> OSDs, mons/osds collocated, running on CentOS 7 nodes
>
> My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
>
> I assume the cleanest way to update the node OS would be to drain the
> node and remove from the cluster, install Rocky 8, add back to cluster
> as effectively a new node
>
> I have a relatively short maintenance window and was hoping to speed
> up OS upgrade with the following approach on each node:
>
> - back up ceph config/systemd files etc.
> - set noout etc.
> - deploy Rocky 8, being careful not to touch OSD block devices
> - install Nautilus binaries (ensuring I use same version as pre OS upgrade)
> - copy ceph config back over
>
> In theory I could then start up the daemons and they wouldn't care
> that we're now running on a different OS
>
> Does anyone see any issues with that approach? I plan to test on a dev
> cluster anyway but would be grateful for any thoughts
>
> Thanks,
> David
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Ceph upgrade advice - Luminous to Pacific with OS upgrade
Hi All

I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10, cluster is primarily used for CephFS, mix of Filestore and Bluestore OSDs, mons/osds collocated, running on CentOS 7 nodes

My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to EL8 on the nodes (probably Rocky) -> Upgrade to Pacific

I assume the cleanest way to update the node OS would be to drain the node and remove from the cluster, install Rocky 8, add back to cluster as effectively a new node

I have a relatively short maintenance window and was hoping to speed up OS upgrade with the following approach on each node:

- back up ceph config/systemd files etc.
- set noout etc.
- deploy Rocky 8, being careful not to touch OSD block devices
- install Nautilus binaries (ensuring I use same version as pre OS upgrade)
- copy ceph config back over

In theory I could then start up the daemons and they wouldn't care that we're now running on a different OS

Does anyone see any issues with that approach? I plan to test on a dev cluster anyway but would be grateful for any thoughts

Thanks,
David
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
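The "back up ceph config/systemd files" step in the list above can be sketched as follows. The backup path and the exact set of directories are illustrative assumptions (in particular, `/var/lib/ceph/bootstrap-osd` holds the bootstrap keyring on package-based installs, but check what your deployment actually uses); DRY_RUN=1 (default) only prints what would run:

```shell
#!/usr/bin/env bash
# Sketch of the pre-reinstall backup + noout steps described above.
# Paths are illustrative defaults; DRY_RUN=1 (default) echoes only.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

# Hypothetical backup destination -- store it off the node being wiped.
BACKUP=/root/ceph-preinstall-$(hostname)-$(date +%F).tar.gz

run tar czf "$BACKUP" \
    /etc/ceph \
    /etc/systemd/system/ceph* \
    /var/lib/ceph/bootstrap-osd

run ceph osd set noout
# ... reinstall the OS, install the *same* Nautilus version as before,
# restore the tarball, start the daemons, then `ceph osd unset noout`.
```

Copying the tarball to another host (or the mon) before reimaging is the part that is easy to forget in a short maintenance window.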
[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)
Hi Janne,

that is a really good idea, thank you. I just saw that our only Ubuntu 20.04 node has very high %util (all 8TB disks):

Device   r/s     rkB/s     rrqm/s  %rrqm  r_await  rareq-sz  w/s      wkB/s      wrqm/s   %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  aqu-sz  %util
sdc      19.00   112.00    0.00    0.00   0.32     5.89      1535.00  68768.00   1260.00  45.08  1.33     44.80     0.00  0.00   0.00    0.00   0.00     0.00      1.44    76.00
sdd      62.00   5892.00   43.00   40.95  2.82     95.03     1196.00  78708.00   1361.00  53.23  2.35     65.81     0.00  0.00   0.00    0.00   0.00     0.00      2.31    72.00
sde      33.00   184.00    0.00    0.00   0.33     5.58      1413.00  102592.00  1709.00  54.74  1.70     72.61     0.00  0.00   0.00    0.00   0.00     0.00      1.68    84.40
sdf      62.00   8200.00   63.00   50.40  9.32     132.26    1066.00  74372.00   1173.00  52.39  1.68     69.77     0.00  0.00   0.00    0.00   0.00     0.00      1.80    70.00
sdg      5.00    40.00     0.00    0.00   0.40     8.00      1936.00  128188.00  2172.00  52.87  2.18     66.21     0.00  0.00   0.00    0.00   0.00     0.00      3.21    92.80
sdh      133.00  8636.00   44.00   24.86  4.14     64.93     1505.00  87820.00   1646.00  52.24  0.95     58.35     0.00  0.00   0.00    0.00   0.00     0.00      1.09    78.80

I've cross-checked the other 8TB disks in our cluster, which are around 30-50% with roughly the same IOPS. Maybe I am missing some optimization that is done on the CentOS 7 nodes but not on the Ubuntu 20.04 node (if you know something off the top of your head, I am happy to hear it), or maybe %util is just measured differently on Ubuntu. But this was the first node where I restarted the OSDs, and it is where I waited the longest to see if anything got better. The problem nearly disappeared within a couple of seconds after the last OSD was restarted, so I would not blame that node in particular, but I will investigate in this direction.

Am Di., 6. Dez. 2022 um 10:08 Uhr schrieb Janne Johansson <icepic...@gmail.com>:
> Perhaps run "iostat -xtcy 5" on the OSD hosts to
> see if any of the drives have weirdly high utilization despite low
> iops/requests?
>
> Den tis 6 dec. 2022 kl 10:02 skrev Boris Behrens :
> >
> > Hi Sven,
> > I am searching really hard for defective hardware, but I am currently out of
> > ideas:
> > - checked prometheus stats, but in all that data I don't know what to look
> > for (osd apply latency is very low at the mentioned point and went up to
> > 40ms after all OSDs were restarted)
> > - smartctl shows nothing
> > - dmesg shows nothing
> > - network data shows nothing
> > - osd and cluster logs show nothing
> >
> > If anybody has a good tip for what I can check, that would be awesome: a string
> > in the logs (I made a copy of that day's logs), or a tool to fire against
> > the hardware. I am 100% out of ideas what it could be.
> > In a time frame of 20s, 2/3 of our OSDs went from "all fine" to "I am
> > waiting for the replicas to do their work" (log message 'waiting for sub
> > ops'). But there was no alert that any OSD had connection problems to other
> > OSDs. Additionally, the cluster_network is on the same interface, switch and
> > everything as the public_network; the only difference is the VLAN id (I plan
> > to remove the cluster_network because it does not provide anything for us).
> >
> > I am also planning to update all hosts from centos7 to ubuntu 20.04 (newer
> > kernel, standardized OS config and so on).
> >
> > Am Mo., 5. Dez. 2022 um 14:24 Uhr schrieb Sven Kieske <s.kie...@mittwald.de>:
> > > On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > > > hi,
> > > > maybe someone here can help me to debug an issue we faced today.
> > > >
> > > > Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> > > > reporting slow ops.
> > > > Only option to get it back to work fast was to restart all OSD daemons.
> > > >
> > > > The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> > > > on the cluster: synced in a node 4 days ago.
> > > >
> > > > The only health issue that was reported was the SLOW_OPS. No slow pings
> > > > on the networks. No restarting OSDs. Nothing.
> > > >
> > > > I was able to pin it to a 20s timeframe and I read ALL the logs in a 20
> > > > minute timeframe around this issue.
> > > >
> > > > I haven't found any clues.
> > > >
> > > > Maybe someone encountered this in the past?
> > >
> > > do you happen to run your rocksdb on a dedicated caching device (nvme ssd)?
> > >
> > > I observed slow ops in octopus after a faulty nvme ssd was inserted in one
> > > ceph server.
> > > as was said in other mails, try to isolate your root cause.
> > >
> > > maybe the node added 4 days ago was the culprit here?
> > >
> > > we were able to pinpoint the nvme by monitoring the slow osds
> > > and the commonality in this case was the same nvme cache
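Janne's iostat suggestion can be turned into a small filter that flags suspiciously busy drives. A sketch: the 80% threshold and the sd/nvme/vd device-name match are arbitrary choices (not anything iostat mandates), and it relies on %util being the last column of `iostat -x` output:

```shell
# Print devices whose %util (the last column of extended iostat output)
# exceeds a threshold. Skips header lines, which don't start with a
# device name. Default threshold: 80.
flag_busy() {
    awk -v limit="${1:-80}" '
        $1 ~ /^(sd|nvme|vd)/ && $NF + 0 > limit {
            printf "%s %.1f%%\n", $1, $NF
        }'
}

# e.g. one extended sample per host:
#   iostat -x 5 2 | flag_busy 80
```

With the sample above, `flag_busy 80` would flag sde, sdg and sdh - the drives worth cross-checking against the other nodes.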
[ceph-users] Orchestrator hanging on 'stuck' nodes
Dear all,

We're having an odd problem with a recently installed Quincy/cephadm cluster on CentOS 8 Stream with Podman, where the orchestrator appears to get wedged and just won't implement any changes.

The overall cluster was installed and working for a few weeks, then we added an NFS export which worked for a bit; then we had some problems with that, tried to restart/redeploy it, and found that the orchestrator wouldn't deploy new NFS server containers. We then made an attempt to restart the MGR process(es) by stopping one and having the orchestrator redeploy it, but it didn't.

The overall effect looks like the orchestrator won't try to start containers - it knows what it's supposed to be doing (and you can tell it to do new things, e.g. deploy a new NFS cluster, and that's reflected correctly in both CLI and web control panel), but it just doesn't actually deploy things. This looks a bit like this Reddit post:

https://www.reddit.com/r/ceph/comments/v3kdix/cephadm_not_deploying_new_mgr_daemons_to_match/

And this mailing list post:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YREK7HUBNIMTKR5GU5L5E5CNFI7FDKLF/

We've found that in our case this appears to be due to one node in the cluster being in a strange state; it's not always the same node, and it doesn't have to be either the node running the MGR or the node(s) being targetted to start the new containers on - *any* node in the system being in this state will wedge the orchestrator. A 'stuck' node can't start a new local container with './cephadm shell', and it sometimes but not always appears in the Cluster->Hosts section of the web interface with a blank machine model name and 'NaN' for its capacity (I'm guessing that these values are cached and then time out after a while?). Already-running containers on the node (e.g. OSDs) appear to carry on working.

As well as failing to start containers, while in this state the orchestrator will also fail to copy /etc/ceph/* to a new node with the '_admin' tag.

Rebooting the 'stuck' node instantly unwedges the orchestrator as soon as the node goes down - it doesn't have to come up working, it just has to stop. As soon as the 'stuck' node is down, the orchestrator catches up on outstanding requests, starts new containers and brings everything into line with the requested state.

My current best guess is that the first thing the orchestrator does when it needs to change something is enumerate the nodes to see where it can start things; it hangs trying to query the 'stuck' node, and can't recover on its own.

I've included more details and command outputs below, but:

- Does that sound feasible?
- Does this sound familiar to anyone?
- Does anyone know how to fix it?
- Or how to narrow down the root cause to turn this into a proper bug report?

Ewan

While in the wedged state the orchestrator thinks it's fine:

# ceph orch status --detail
Backend: cephadm
Available: Yes
Paused: No
Host Parallelism: 10

The overall cluster health is fine:

# ceph -s
  cluster:
    id:     58140ed2-4ed4-11ed-b4db-5c6f69756a60
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-r3n4,ceph-r1n4,ceph-r2n4,ceph-r1n5,ceph-r2n5 (age 2w)
    mgr: ceph-r1n4.mgqrwx(active, since 2d)
    mds: 1/1 daemons up, 3 standby
    osd: 294 osds: 294 up (since 2w), 294 in (since 5w)

  data:
    volumes: 1/1 healthy
    pools:   5 pools, 9281 pgs
    objects: 252.26M objects, 649 TiB
    usage:   2.1 PiB used, 2.2 PiB / 4.3 PiB avail
    pgs:     9269 active+clean
             12   active+clean+scrubbing+deep

This was tested by deploying a new NFS service in the existing cluster and then a whole new NFS cluster (nfs.cephnfstwo) and removing the original NFS cluster (nfs.cephnfsone) - the changes made from the CLI are reflected in the web dashboard, but not actioned; e.g. only one MGR running of a target of two, nothing running for NFS cluster 'cephnfstwo', and the original 'nfs.cephnfsone' cluster shown as 'deleting' but not actually gone:

# ceph orch ls
NAME            PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager    ?:9093,9094      1/1  2d ago     5w   count:1
crash                          21/21  4w ago     5w   *
grafana         ?:3000           1/1  2d ago     5w   count:1
mds.mds1                         4/4  4w ago     5w   count:4
mgr                              1/2  2w ago     2w   count:2
mon                              5/5  2w ago     5w   count:5
nfs.cephnfsone                   0/1             2w   ceph-r1n5;ceph-r2n5;count:1
nfs.cephnfstwo  ?:2049           0/1  -          2d   count:1
node-exporter   ?:9100         21/21  4w ago     5w   *
osd                              294  4w ago     -
prometheus      ?:9095           1/1  2d ago     5w   count:1

The 'stuck' node is continuing to run services including OSDs, and in cases where it's also had the active MGR then the web interface remains accessible, but SSHing in and trying to start a cephadm shell
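For triaging the "stuck node" theory, a sketch of the checks one might run (all standard cephadm/orchestrator commands, but which one reveals the problem on a given cluster is an open question; the host name is an example, and DRY_RUN=1, the default, just echoes the commands):

```shell
#!/usr/bin/env bash
# Triage sketch for a node suspected of wedging the cephadm orchestrator.
# DRY_RUN=1 (default) prints the commands instead of running them.
set -eu
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" -eq 1 ]; then echo "+ $*"; else "$@"; fi; }

# Which hosts does the orchestrator think it has, and are any offline?
run ceph orch host ls

# Per-host prerequisite check (podman, chrony, etc.) from the MGR's view.
run ceph cephadm check-host ceph-r1n4    # hostname is an example

# On the suspect node itself: can podman still start anything at all?
run podman ps
run cephadm shell -- true

# Last resort short of a reboot: fail over to a standby MGR to clear
# any wedged state the active MGR's cephadm module is holding.
run ceph mgr fail
```

If `cephadm shell` hangs locally on one node while working everywhere else, that is a strong hint it is the node stalling the orchestrator's host enumeration.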
[ceph-users] Re: cephfs snap-mirror stalled
On Tue, Dec 6, 2022 at 6:34 PM Holger Naundorf wrote: > > > > On 06.12.22 09:54, Venky Shankar wrote: > > Hi Holger, > > > > On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf > > wrote: > >> > >> Hello, > >> we have set up a snap-mirror for a directory on one of our clusters - > >> running ceph version > >> > >> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific > >> (stable) > >> > >> to get mirrorred our other cluster - running ceph version > >> > >> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific > >> (stable) > >> > >> The initial setup went ok, when the first snapshot was created data > >> started to flow at a decent (for our HW) rate of 100-200MB/s. As the > >> directory contains ~200TB this was expected to take some time - but now > >> the process has stalled completely after ~100TB were mirrored and ~7d > >> running. > >> > >> Up to now I do not have any hints why it has stopped - I do not see any > >> error messages from the cephfs-mirror daemon. Can the small version > >> mismatch be a problem? > >> > >> Any hints where to look to find out what has got stuck are welcome. > > > > I'd look at the mirror daemon logs for any errors to start with. You > > might want to crank up the log level for debugging (debug > > cephfs_mirror=20). > > > > Even on max debug I do not see anything which looks like an error - but > as this is the first time I try to dig into any cephfs-mirror logs I > might not notice (as long as it is not red and flashing). > > The Log basically this type of sequence, repeating forever: > > (...) 
> cephfs::mirror::MirrorWatcher handle_notify > cephfs::mirror::Mirror update_fs_mirrors > cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror > update (0x556fe3a7f130) after 2 seconds > cephfs::mirror::Watcher handle_notify: notify_id=751516198184655, > handle=93939050205568, notifier_id=25504530 > cephfs::mirror::MirrorWatcher handle_notify > cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run: > trying to pick from 1 directories > cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) > pick_directory > cephfs::mirror::Watcher handle_notify: notify_id=751516198184656, > handle=93939050205568, notifier_id=25504530 > cephfs::mirror::MirrorWatcher handle_notify > cephfs::mirror::Mirror update_fs_mirrors > cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror > update (0x556fe3a7fc70) after 2 seconds > cephfs::mirror::Watcher handle_notify: notify_id=751516198184657, > handle=93939050205568, notifier_id=25504530 > cephfs::mirror::MirrorWatcher handle_notify > (...) Basically, the interesting bit is not captured since it probably happened sometime back. Could you please set the following: debug cephfs_mirror = 20 debug client = 20 and restart the mirror daemon? The daemon would start synchronizing again. When synchronizing stalls, please share the daemon logs. If the log is huge, you could upload them via ceph-post-file. > > > > >> > >> Regards, > >> Holger > >> > >> -- > >> Dr. Holger Naundorf > >> Christian-Albrechts-Universität zu Kiel > >> Rechenzentrum / HPC / Server und Storage > >> Tel: +49 431 880-1990 > >> Fax: +49 431 880-1523 > >> naund...@rz.uni-kiel.de > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > > > -- > Dr. 
Holger Naundorf > Christian-Albrechts-Universität zu Kiel > Rechenzentrum / HPC / Server und Storage > Tel: +49 431 880-1990 > Fax: +49 431 880-1523 > naund...@rz.uni-kiel.de -- Cheers, Venky ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
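Venky's suggested settings can be applied from the command line. The config section name below is an assumption (the cephfs-mirror daemon registers as a client; check `ceph config dump` for the exact id your deployment uses), and the systemd unit name will vary with how the daemon was installed:

```shell
# Raise debug levels for the mirror daemon (section name is a placeholder)
ceph config set client.cephfs-mirror debug_cephfs_mirror 20
ceph config set client.cephfs-mirror debug_client 20

# Restart the mirror daemon so synchronization restarts with full logging
systemctl restart ceph-cephfs-mirror@$(hostname -s).service
```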
[ceph-users] Re: cephfs snap-mirror stalled
On 06.12.22 09:54, Venky Shankar wrote:
Hi Holger,
On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf wrote:
Hello,
we have set up a snap-mirror for a directory on one of our clusters - running ceph version

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

to be mirrored to our other cluster - running ceph version

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

The initial setup went ok; when the first snapshot was created, data started to flow at a decent (for our HW) rate of 100-200MB/s. As the directory contains ~200TB this was expected to take some time - but now the process has stalled completely after ~100TB were mirrored and ~7d of running.

Up to now I do not have any hints as to why it has stopped - I do not see any error messages from the cephfs-mirror daemon. Can the small version mismatch be a problem?

Any hints on where to look to find out what has got stuck are welcome.

I'd look at the mirror daemon logs for any errors to start with. You might want to crank up the log level for debugging (debug cephfs_mirror=20).

Even on max debug I do not see anything which looks like an error - but as this is the first time I try to dig into any cephfs-mirror logs I might not notice it (as long as it is not red and flashing).

The log basically shows this type of sequence, repeating forever:

(...)
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::Mirror update_fs_mirrors
cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x556fe3a7f130) after 2 seconds
cephfs::mirror::Watcher handle_notify: notify_id=751516198184655, handle=93939050205568, notifier_id=25504530
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run: trying to pick from 1 directories
cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) pick_directory
cephfs::mirror::Watcher handle_notify: notify_id=751516198184656, handle=93939050205568, notifier_id=25504530
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::Mirror update_fs_mirrors
cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x556fe3a7fc70) after 2 seconds
cephfs::mirror::Watcher handle_notify: notify_id=751516198184657, handle=93939050205568, notifier_id=25504530
cephfs::mirror::MirrorWatcher handle_notify
(...)

Regards,
Holger

--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax: +49 431 880-1523
naund...@rz.uni-kiel.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] pacific: ceph-mon services stopped after OSDs are out/down
Hi all,

I'm running Pacific with cephadm. After installation, Ceph automatically provisioned 5 monitor nodes across the cluster. After a few OSDs crashed due to a hardware issue with the SAS interface, 3 monitor services stopped and won't start again. Is this related to the OSD crash problem?

Thanks,
Mevludin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
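Whether or not it turns out to be related, a first step is usually to find out why the mon daemons exited. A hedged sketch, with host, daemon, and fsid names as placeholders:

```shell
# See which mon daemons cephadm thinks exist and their current state
ceph orch ps --daemon-type mon

# On an affected host, list cephadm-managed daemons and pull the mon's container logs
cephadm ls | grep mon
cephadm logs --name mon.<hostname>        # placeholder daemon name

# systemd status for the same unit (fsid and hostname are placeholders)
systemctl status ceph-<fsid>@mon.<hostname>.service
```

The container logs usually show whether the mon crashed (e.g. a corrupted store after a hard stop) or was stopped deliberately.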
[ceph-users] Re: What to expect on rejoining a host to cluster?
Hi Matt,

the fact that I'm using re-weights does not mean I would recommend them. There seems to be something seriously broken with reweights, see this message https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/E5BYQ27LRWFNT4M34OYKI2KM27Q3DUY6/ and the thread around it. I have to wait for a client update before considering changing over to upmaps.

I'm pretty sure the script https://github.com/TheJJ/ceph-balancer linked by Stefan can be configured/tweaked to look only at the fullest (and maybe the emptiest) OSDs to compute a moderately sized list of re-mappings that eliminates just the outliers.

What I would not recommend is to go fully balanced at 95% OSD utilisation. You will see serious performance loss after some OSDs reach 80%, and if you lose an OSD or host you will have to combat the fallout of deleted upmaps.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
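For readers who do want to move from reweights to upmaps, the built-in balancer's upmap mode can be enabled as sketched below. Note the first command is exactly the client-update caveat mentioned above: it refuses (or must not be forced) while pre-luminous clients are still connected.

```shell
# Upmap requires luminous-or-newer clients cluster-wide
ceph osd set-require-min-compat-client luminous

# Switch the built-in balancer to upmap mode and enable it
ceph balancer mode upmap
ceph balancer on

# Inspect what the balancer is planning / has done
ceph balancer status
```

Individual `ceph osd pg-upmap-items` mappings can also be set by hand (that is what scripts like ceph-balancer emit), which allows limiting remapping to just the outlier OSDs.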
[ceph-users] Fwd: [MGR] Only 60 trash removal tasks are processed per minute
Hi all,

Our cluster contains 12 nodes, 120 OSDs (all NVMe), and - currently - 4096 PGs in total. We're currently testing a scenario of having 20 thousand 10G volumes and then taking snapshots of each one of them. These 20k snapshots are created in just a bit under 2 hours. When we delete one snapshot of each volume - so again 20k - it usually takes more than 2 hours to move them to trash and create the deletion tasks.

Now the tasks to remove them from the trash are pretty slow. According to my calculations, it's around 1 removal per second. Doing the math, it's around 5 and a half hours to empty the trash at this pace... Looking at the https://github.com/ceph/ceph/blob/main/src/pybind/mgr/rbd_support/task.py module, it's clear that this is a sequential operation, but is there anything we could do to improve the speed here? Neither the MGR nor any other components are CPU/memory bound; ceph is basically just chilling :)

Any thoughts?
Doma
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
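The back-of-the-envelope math above checks out; at roughly one trash removal per second, 20k images take about five and a half hours:

```shell
# 20000 removals at ~1 per second, expressed in hours
awk 'BEGIN { printf "%.1f hours\n", 20000 / 1 / 3600 }'
# -> 5.6 hours
```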
[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)
Perhaps run "iostat -xtcy 5" on the OSD hosts to see if any of the drives have weirdly high utilization despite low iops/requests? Den tis 6 dec. 2022 kl 10:02 skrev Boris Behrens : > > Hi Sven, > I am searching really hard for defect hardware, but I am currently out of > ideas: > - checked prometheus stats, but in all that data I don't know what to look > for (osd apply latency if very low at the mentioned point and went up to > 40ms after all OSDs were restarted) > - smartctl shows nothing > - dmesg show nothing > - network data shows nothing > - osd and clusterlogs show nothing > > If anybody got a good tip what I can check, that would be awesome. A string > in the logs (I made a copy from that days logs), or a tool to fire against > the hardware. I am 100% out of ideas what it could be. > In a time frame of 20s 2/3 of our OSDs went from "all fine" to "I am > waiting for the replicas to do their work" (log message 'waiting for sub > ops'). But there was no alert that any OSD had connection problems to other > OSDs. Additional the cluster_network is the same interface, switch, > everything as public_network. Only difference is the VLAN id (I plan to > remove the cluster_network because it does not provide anything for us). > > I am also planning to update all hosts from centos7 to ubuntu 20.04 (newer > kernel, standardized OS config and so on). > > Am Mo., 5. Dez. 2022 um 14:24 Uhr schrieb Sven Kieske >: > > > On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote: > > > hi, > > > maybe someone here can help me to debug an issue we faced today. > > > > > > Today one of our clusters came to a grinding halt with 2/3 of our OSDs > > > reporting slow ops. > > > Only option to get it back to work fast, was to restart all OSDs daemons. > > > > > > The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work > > > on the cluster: synced in a node 4 days ago. > > > > > > The only health issue, that was reported, was the SLOW_OPS. 
No slow pings > > > on the networks. No restarting OSDs. Nothing. > > > > > > I was able to ping it to a 20s timeframe and I read ALL the logs in a 20 > > > minute timeframe around this issue. > > > > > > I haven't found any clues. > > > > > > Maybe someone encountered this in the past? > > > > do you happen to run your rocksdb on a dedicated caching device (nvme ssd)? > > > > I observed slow ops in octopus after a faulty nvme ssd was inserted in one > > ceph server. > > as was said in other mails, try to isolate your root cause. > > > > maybe the node added 4 days ago was the culprit here? > > > > we were able to pinpoint the nvme by monitoring the slow osds > > and the commonality in this case was the same nvme cache device. > > > > you should always benchmark new hardware/perform burn-in tests imho, which > > is not always possible due to environment constraints. > > > > -- > > Mit freundlichen Grüßen / Regards > > > > Sven Kieske > > Systementwickler / systems engineer > > > > > > Mittwald CM Service GmbH & Co. KG > > Königsberger Straße 4-6 > > 32339 Espelkamp > > > > Tel.: 05772 / 293-900 > > Fax: 05772 / 293-333 > > > > https://www.mittwald.de > > > > Geschäftsführer: Robert Meyer, Florian Jürgens > > > > St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen > > Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen > > > > Informationen zur Datenverarbeitung im Rahmen unserer Geschäftstätigkeit > > gemäß Art. 13-14 DSGVO sind unter www.mittwald.de/ds abrufbar. > > > > > > -- > Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im > groüen Saal. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
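Janne's iostat suggestion can be narrowed with a filter so only suspicious drives are printed. The assumption below is that %util is the last column of the extended device report, which holds for most sysstat versions; verify against your iostat header first:

```shell
# Print only device lines whose %util (last column) exceeds 90
# -x extended stats, -y skip the since-boot summary, 5s interval
iostat -xy 5 | awk '$NF+0 > 90 { print $1, $NF }'
```

The `+0` coerces the header's "%util" text to 0 so header lines are silently dropped.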
[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)
Hi Sven,

I am searching really hard for defective hardware, but I am currently out of ideas:
- checked prometheus stats, but in all that data I don't know what to look for (osd apply latency is very low at the mentioned point and went up to 40ms after all OSDs were restarted)
- smartctl shows nothing
- dmesg shows nothing
- network data shows nothing
- osd and cluster logs show nothing

If anybody has a good tip on what I can check, that would be awesome. A string in the logs (I made a copy of that day's logs), or a tool to fire against the hardware. I am 100% out of ideas what it could be. In a time frame of 20s, 2/3 of our OSDs went from "all fine" to "I am waiting for the replicas to do their work" (log message 'waiting for sub ops'). But there was no alert that any OSD had connection problems to other OSDs. Additionally, the cluster_network uses the same interface, switch, everything as the public_network. The only difference is the VLAN id (I plan to remove the cluster_network because it does not provide anything for us).

I am also planning to update all hosts from centos7 to ubuntu 20.04 (newer kernel, standardized OS config and so on).

On Mon, Dec 5, 2022 at 14:24, Sven Kieske wrote:
> On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote:
> > hi,
> > maybe someone here can help me to debug an issue we faced today.
> >
> > Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> > reporting slow ops.
> > Only option to get it back to work fast, was to restart all OSDs daemons.
> >
> > The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> > on the cluster: synced in a node 4 days ago.
> >
> > The only health issue, that was reported, was the SLOW_OPS. No slow pings
> > on the networks. No restarting OSDs. Nothing.
> >
> > I was able to ping it to a 20s timeframe and I read ALL the logs in a 20
> > minute timeframe around this issue.
> >
> > I haven't found any clues.
> >
> > Maybe someone encountered this in the past?
>
> do you happen to run your rocksdb on a dedicated caching device (nvme ssd)?
>
> I observed slow ops in octopus after a faulty nvme ssd was inserted in one
> ceph server.
> as was said in other mails, try to isolate your root cause.
>
> maybe the node added 4 days ago was the culprit here?
>
> we were able to pinpoint the nvme by monitoring the slow osds
> and the commonality in this case was the same nvme cache device.
>
> you should always benchmark new hardware/perform burn-in tests imho, which
> is not always possible due to environment constraints.
>
> --
> Mit freundlichen Grüßen / Regards
>
> Sven Kieske
> Systementwickler / systems engineer
>
> Mittwald CM Service GmbH & Co. KG
> Königsberger Straße 4-6
> 32339 Espelkamp
>
> Tel.: 05772 / 293-900
> Fax: 05772 / 293-333
>
> https://www.mittwald.de
>
> Geschäftsführer: Robert Meyer, Florian Jürgens
>
> St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
> Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen
>
> Informationen zur Datenverarbeitung im Rahmen unserer Geschäftstätigkeit
> gemäß Art. 13-14 DSGVO sind unter www.mittwald.de/ds abrufbar.

--
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
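Since the apply-latency jump was only visible in Prometheus after the fact, it may help to sample per-OSD latency and in-flight ops live while the problem is happening. A sketch; the OSD id is a placeholder and the `ceph daemon` commands must run on the host that carries that OSD:

```shell
# Per-OSD commit/apply latency as currently reported by the cluster
ceph osd perf

# On the OSD's host: dump current and recent slow operations via the admin socket
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_slow_ops
```

The historic-slow-ops dump records which sub-operations each slow request was waiting on, which can point at the replica OSD (and hence the host or device) that stalled.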
[ceph-users] Re: cephfs snap-mirror stalled
Hi Holger, On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf wrote: > > Hello, > we have set up a snap-mirror for a directory on one of our clusters - > running ceph version > > ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific > (stable) > > to get mirrorred our other cluster - running ceph version > > ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific > (stable) > > The initial setup went ok, when the first snapshot was created data > started to flow at a decent (for our HW) rate of 100-200MB/s. As the > directory contains ~200TB this was expected to take some time - but now > the process has stalled completely after ~100TB were mirrored and ~7d > running. > > Up to now I do not have any hints why it has stopped - I do not see any > error messages from the cephfs-mirror daemon. Can the small version > mismatch be a problem? > > Any hints where to look to find out what has got stuck are welcome. I'd look at the mirror daemon logs for any errors to start with. You might want to crank up the log level for debugging (debug cephfs_mirror=20). > > Regards, > Holger > > -- > Dr. Holger Naundorf > Christian-Albrechts-Universität zu Kiel > Rechenzentrum / HPC / Server und Storage > Tel: +49 431 880-1990 > Fax: +49 431 880-1523 > naund...@rz.uni-kiel.de > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- Cheers, Venky ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] cephfs snap-mirror stalled
Hello,
we have set up a snap-mirror for a directory on one of our clusters - running ceph version

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

to be mirrored to our other cluster - running ceph version

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

The initial setup went ok; when the first snapshot was created, data started to flow at a decent (for our HW) rate of 100-200MB/s. As the directory contains ~200TB this was expected to take some time - but now the process has stalled completely after ~100TB were mirrored and ~7d of running.

Up to now I do not have any hints as to why it has stopped - I do not see any error messages from the cephfs-mirror daemon. Can the small version mismatch be a problem?

Any hints on where to look to find out what has got stuck are welcome.

Regards,
Holger

--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax: +49 431 880-1523
naund...@rz.uni-kiel.de
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
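When a mirror sync appears stuck, the mirror daemon's admin socket can report per-peer state before touching any log levels. A sketch; socket name, filesystem name, cluster id, and peer UUID are all placeholders, and argument requirements vary slightly across pacific point releases:

```shell
# Overall mirror daemon status as seen by the cluster
ceph fs snapshot mirror daemon status <fs>

# On the mirror host, query the daemon's admin socket directly
ceph --admin-daemon /var/run/ceph/cephfs-mirror.<id>.asok fs mirror status <fs>@<cluster-id>
ceph --admin-daemon /var/run/ceph/cephfs-mirror.<id>.asok fs mirror peer status <fs>@<cluster-id> <peer-uuid>
```

The peer status output includes per-directory sync state (idle/syncing/failed) and counters, which distinguishes "stalled on one huge directory" from "stopped entirely".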