[ceph-users] MDS stuck in replay and continually crashing during replay
compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 fuse 2/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace 1/ 5 prioritycache 0/ 5 test 0/ 5 cephfs_mirror 0/ 5 cephsqlite 0/ 5 seastore 0/ 5 seastore_onode 0/ 5 seastore_odata 0/ 5 seastore_omap 0/ 5 seastore_tm 0/ 5 seastore_t 0/ 5 seastore_cleaner 0/ 5 seastore_epm 0/ 5 seastore_lba 0/ 5 seastore_fixedkv_tree 0/ 5 seastore_cache 0/ 5 seastore_journal 0/ 5 seastore_device 0/ 5 seastore_backref 0/ 5 alienstore 1/ 5 mclock 0/ 5 cyanstore 1/ 5 ceph_exporter 1/ 5 memstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  --- pthread ID / name mapping for recent threads ---
  7fa8b6d95640 / md_log_replay
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.pebbles-s3.log
  --- end dump of recent events ---

Our MDS then starts at the beginning of the replay process and continually re-replays the journal until it crashes again at the same point. From what I understand, our journal has become corrupted at this file, and the journal is (worryingly) exceptionally large: we've had to use a 2 TiB machine just to try to export it.

What is causing this issue? Can we make small modifications to the journal, or move the faulty object in the journal out of the bulk object store, so that the bad transaction fails (and is thus skipped)? We really do not want to go through disaster recovery again (https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/#disaster-recovery-experts), as this is the 2nd time this has happened to this cluster in the last 4 months and it took over a month to recover the data last time.

Kindest regards,

Ivan

--
Ivan Clayson - Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
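One cautious first step, before modifying anything, is to take a full copy of the journal and inspect it offline. The lines below are only a sketch: <fs_name> and <ino> are placeholders for the affected filesystem (rank 0 assumed here) and the inode of the damaged file, the export needs as much free space as the journal itself, and the exact --inode filter syntax is worth confirming against `cephfs-journal-tool --help` on your release.

  $ cephfs-journal-tool --rank=<fs_name>:0 journal export /big/scratch/journal.bin   # full backup before touching anything
  $ cephfs-journal-tool --rank=<fs_name>:0 journal inspect                           # reports whether the journal is readable or damaged
  $ cephfs-journal-tool --rank=<fs_name>:0 event get --inode=<ino> list              # list only the events touching the problem inode
  $ cephfs-journal-tool --rank=<fs_name>:0 event splice --inode=<ino> summary        # remove just those events (their metadata updates are lost)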
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Tim,

Alma8's active support ended in May this year and henceforth there are only security updates. But you make a good point and we are moving toward Alma9 very shortly! Whilst we're mentioning distributions, we've had quite a good experience with Alma (notwithstanding our current but unrelated troubles) and we would recommend it.

Kindest regards,

Ivan

On 09/07/2024 16:19, Tim Holloway wrote:

Ivan,

This may be a little off-topic, but if you're still running AlmaLinux 8.9, it's worth noting that CentOS 8 actually went end-of-life about 2 years ago, thanks to CentOS Stream. Up until this last week, however, I had several AlmaLinux 8 machines running myself, but apparently somewhere around May IBM Red Hat pulled all of its CentOS 8 enterprise sites offline, including Storage and Ceph, which broke my yum updates. While, as far as I'm aware, once you've installed cephadm (whether via yum/dnf or otherwise) there's no further need for the RPM repos, losing yum support is not helping at the very least.

On the upside, it's possible to upgrade in place from AlmaLinux 8.9 to AlmaLinux 9, although it may require temporarily disabling certain OS services to appease the upgrade process. Probably won't solve your problem, but at least you'll be able to move fairly painlessly to a better-supported platform.

Best Regards,

Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:

Hi Dhairya,

I would be more than happy to try and give as many details as possible, but the Slack channel is private and requires my email to have an account/access to it. Wouldn't taking the discussion about this error to a private channel also stop other users who experience this error from learning about how and why this happened, as well as possibly not being able to view the solution? Would it not be possible to discuss this more publicly for the benefit of the other users on the mailing list?

Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:

Hey Ivan,

This is a relatively new MDS crash, so this would require some investigation, but I was instructed to recommend the disaster-recovery steps [0] (except session reset) to you to get the FS up again. This crash is being discussed on the upstream CephFS Slack channel [1] with Venky Shankar and other CephFS devs. I'd encourage you to join the conversation; we can discuss this in detail and maybe go through the incident step by step, which should help analyse the crash better.

[0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson wrote:

Hi Dhairya,

Thank you ever so much for having another look at this so quickly.
I don't think I have any logs similar to the ones you referenced this time as my MDSs don't seem to enter the replay stage when they crash (or at least don't now after I've thrown the logs away) but those errors do crop up in the prior logs I shared when the system first crashed. Kindest regards, Ivan On 08/07/2024 14:08, Dhairya Parmar wrote: CAUTION: This email originated from outside of the LMB: *.-dpar...@redhat.com-.* Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phish...@mrc-lmb.cam.ac.uk -- Ugh, something went horribly wrong. I've downloaded the MDS logs that contain assertion failure and it looks relevant to this [0]. Do you have client logs for this? The other log that you shared is being downloaded right now, once that's done and I'm done going through it, I'll update you. [0] https://tracker.ceph.com/issues/54546 On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson wrote: Hi Dhairya, Sorry to resurrect this thread again, but we still unfortunately have an issue with our filesystem after we attempted to write new backups to it. We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Dhairya, I would be more than happy to try and give as many details as possible but the slack channel is private and requires my email to have an account/ access to it. Wouldn't taking the discussion about this error to a private channel also stop other users who experience this error from learning about how and why this happened as well as possibly not be able to view the solution? Would it not be possible to discuss this more publicly for the benefit of the other users on the mailing list? Kindest regards, Ivan On 09/07/2024 10:44, Dhairya Parmar wrote: CAUTION: This email originated from outside of the LMB: *.-dpar...@redhat.com-.* Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phish...@mrc-lmb.cam.ac.uk -- Hey Ivan, This is a relatively new MDS crash, so this would require some investigation but I was instructed to recommend disaster-recovery steps [0] (except session reset) to you to get the FS up again. This crash is being discussed on upstream CephFS slack channel [1] with @Venky Shankar <mailto:vshan...@redhat.com> and other CephFS devs. I'd encourage you to join the conversation, we can discuss this in detail and maybe go through the incident step by step which should help analyse the crash better. [0] https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts [1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519 On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson wrote: Hi Dhairya, Thank you ever so much for having another look at this so quickly. I don't think I have any logs similar to the ones you referenced this time as my MDSs don't seem to enter the replay stage when they crash (or at least don't now after I've thrown the logs away) but those errors do crop up in the prior logs I shared when the system first crashed. Kindest regards, Ivan On 08/07/2024 14:08, Dhairya Parmar wrote: CAUTION: This email originated from outside of the LMB: *.-dpar...@redhat.com-.* Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phish...@mrc-lmb.cam.ac.uk -- Ugh, something went horribly wrong. I've downloaded the MDS logs that contain assertion failure and it looks relevant to this [0]. Do you have client logs for this? The other log that you shared is being downloaded right now, once that's done and I'm done going through it, I'll update you. [0] https://tracker.ceph.com/issues/54546 On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson wrote: Hi Dhairya, Sorry to resurrect this thread again, but we still unfortunately have an issue with our filesystem after we attempted to write new backups to it. We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again. We re-started the backups to run over the weekend and unfortunately the filesystem crashed again where the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems.. I've tried resetting the journal again after recovering the dentries but unfortunately the filesystem is still in a failed state despite setting joinable to true. 
The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708. I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal before I reset it so I can roll back if necessary. Here are some details about the filesystem at present: root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status cluster: id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631 health: HEALTH_ERR 1 filesystem is degraded 1 large omap objects 1 filesystem is offline 1 mds daemon damaged nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosna
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Dhairya, Thank you ever so much for having another look at this so quickly. I don't think I have any logs similar to the ones you referenced this time as my MDSs don't seem to enter the replay stage when they crash (or at least don't now after I've thrown the logs away) but those errors do crop up in the prior logs I shared when the system first crashed. Kindest regards, Ivan On 08/07/2024 14:08, Dhairya Parmar wrote: CAUTION: This email originated from outside of the LMB: *.-dpar...@redhat.com-.* Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phish...@mrc-lmb.cam.ac.uk -- Ugh, something went horribly wrong. I've downloaded the MDS logs that contain assertion failure and it looks relevant to this [0]. Do you have client logs for this? The other log that you shared is being downloaded right now, once that's done and I'm done going through it, I'll update you. [0] https://tracker.ceph.com/issues/54546 On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson wrote: Hi Dhairya, Sorry to resurrect this thread again, but we still unfortunately have an issue with our filesystem after we attempted to write new backups to it. We finished the scrub of the filesystem on Friday and ran a repair scrub on the 1 directory which had metadata damage. After doing so and rebooting, the cluster reported no issues and data was accessible again. We re-started the backups to run over the weekend and unfortunately the filesystem crashed again where the log of the failure is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. We ran the backups on kernel mounts of the filesystem without the nowsync option this time to avoid the out-of-sync write problems.. I've tried resetting the journal again after recovering the dentries but unfortunately the filesystem is still in a failed state despite setting joinable to true. The log of this crash is here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708. I'm not sure how to proceed as I can't seem to get any MDS to take over the first rank. I would like to do a scrub of the filesystem and preferably overwrite the troublesome files with the originals on the live filesystem. Do you have any advice on how to make the filesystem leave its failed state? I have a backup of the journal before I reset it so I can roll back if necessary. 
Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
            1 filesystem is degraded
            1 large omap objects
            1 filesystem is offline
            1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            1750 pgs not deep-scrubbed in time
            1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs:     11785954/27384310061 objects misplaced (0.043%)
             2167 active+clean
                6 active+remapped+backfilling
                4 active+remapped+backfill_wait

ceph_backup - 0 clients
===
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
        POOL            TYPE     USED   AVAIL
   mds_backup_fs       metadata  1174G  3071G
ec82_primary_fs_data     data        0  3071G
      ec82pool           data    8085T  4738T
ceph_archive - 2 clients
RANK  STATE     MDS         ACTIVITY    DNS    INOS  DIRS  CAPS
 0    active  pebbles-s4  Reqs: 0 /s   13.4k   7105   118     2
        POOL            TYPE     USED   AVAIL
   mds_archive_fs      metadata  5184M  3071G
ec83_primary_fs_data     data        0  3071G
      ec83pool           data     138T  4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
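For other readers following along: a rank shown as "failed" alongside "1 mds daemon damaged" will not accept an MDS until that rank is explicitly marked repaired. Assuming the dentry recovery and journal reset discussed in this thread have already been done, getting ceph_backup out of this state would look roughly like the sketch below; this is an illustrative sequence, not advice specific to this cluster.

  $ ceph mds repaired ceph_backup:0                        # clear the damaged flag on rank 0
  $ ceph fs set ceph_backup joinable true                  # let a standby MDS take the rank
  $ ceph fs status ceph_backup                             # the standby should move through replay to active
  $ ceph tell mds.ceph_backup:0 scrub start / recursive    # then scrub for consistency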
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
On …/2024 15:17, Dhairya Parmar wrote:

On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson wrote:

Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the host key changed for drop.ceph.com? The fingerprint I'm being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which is different to the one in our /usr/share/ceph/known_hosts_drop.ceph.com.

Ah, strange. Let me get in touch with folks who might know about this, will revert back to you ASAP.

Thank you for your advice as well. We've reset our MDS' journal and are currently in the process of a full filesystem scrub, which understandably is taking quite a bit of time but seems to be progressing through the objects fine.

YAY!

Thank you ever so much for all your help and please do feel free to follow up with us if you would like any further details about our crash!

Glad to hear it went well; this bug is being worked on with high priority and once the patch is ready, it will be backported. The root cause of this issue is `nowsync` (async dirops) being enabled by default with the kclient [0]. This feature allows asynchronous creation and deletion of files, optimizing performance by avoiding round-trip latency for these system calls. However, in very rare cases (like yours :D) it can affect the system's consistency and stability, hence if this kind of optimization is not a priority for your workload, I recommend turning it off by switching the mount points to `wsync` and also setting the MDS config `mds_client_delegate_inos_pct` to `0` so that you don't end up in this situation again (until the bug fix arrives :)).

[0] https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902

Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:

Hi Ivan,

The solution (which has been successful for us in the past) is to reset the journal. This would bring the fs back online and return the MDSes to a stable state, but some data would be lost: the data in the journal that hasn't been flushed to the backing store would be gone. Therefore, you should try to flush out as much journal data as possible before resetting the journal. Here are the steps for this entire process:

1) Bring the FS offline
   $ ceph fs fail <fs_name>
2) Recover dentries from the journal (run it with every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<rank> event recover_dentries summary
3) Reset the journal (again with every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset
4) Bring the FS online
   $ ceph fs set <fs_name> joinable true
5) Restart the MDSes
6) Perform a scrub to ensure consistency of the fs
   $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
   # you could try a recursive scrub, maybe `ceph tell mds.<fs_name>:0 scrub start / recursive`

Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most time-consuming, but the rest should be quick).
* cephfs-journal-tool and metadata OSDs are bound to use a significant CPU percentage.
This is because cephfs-journal-tool has to swig the journal data and flush it out to the backing store, which also makes the metadata operations go rampant, resulting in OSDs taking a significant percentage of CPU. Do let me know how this goes. On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson wrote: Hi Dhairya, We can induce the crash by simply restarting the MDS and the crash seems to happen when an MDS goes from up:standby to up:replay. The MDS works through a few files in the log before eventually crashing where I've included the logs for this here (this is after I imported the backed up journal which I hope was successful but please let me know if you suspect it wasn't!): https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log With respect to the client logs, are you referring to the clients who are writing to the filesystem? We don't typically run them in any sort of debug mode and we have quite a few machines running our backup system but we can look an hour or
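To make the wsync suggestion above concrete: on a kernel mount it is just a mount option, and the delegation percentage is an MDS config. This is a sketch only; the monitor address, cephx user and mount point are placeholders, and existing mounts have to be unmounted and remounted for the option to take effect.

  # remount the kernel client with synchronous dirops
  $ umount /mnt/cephfs
  $ mount -t ceph <mon_host>:/ /mnt/cephfs -o name=<cephx_user>,secretfile=/etc/ceph/<cephx_user>.secret,wsync

  # stop the MDS from pre-delegating inode ranges to clients
  $ ceph config set mds mds_client_delegate_inos_pct 0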
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the host key changed for drop.ceph.com? The fingerprint I'm being sent is 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which is different to the one in our /usr/share/ceph/known_hosts_drop.ceph.com.

Thank you for your advice as well. We've reset our MDS' journal and are currently in the process of a full filesystem scrub, which understandably is taking quite a bit of time but seems to be progressing through the objects fine.

Thank you ever so much for all your help and please do feel free to follow up with us if you would like any further details about our crash!

Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:

Hi Ivan,

The solution (which has been successful for us in the past) is to reset the journal. This would bring the fs back online and return the MDSes to a stable state, but some data would be lost: the data in the journal that hasn't been flushed to the backing store would be gone. Therefore, you should try to flush out as much journal data as possible before resetting the journal. Here are the steps for this entire process:

1) Bring the FS offline
   $ ceph fs fail <fs_name>
2) Recover dentries from the journal (run it with every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<rank> event recover_dentries summary
3) Reset the journal (again with every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset
4) Bring the FS online
   $ ceph fs set <fs_name> joinable true
5) Restart the MDSes
6) Perform a scrub to ensure consistency of the fs
   $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
   # you could try a recursive scrub, maybe `ceph tell mds.<fs_name>:0 scrub start / recursive`

Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most time-consuming, but the rest should be quick).
* cephfs-journal-tool and metadata OSDs are bound to use a significant CPU percentage. This is because cephfs-journal-tool has to swig the journal data and flush it out to the backing store, which also makes the metadata operations go rampant, resulting in OSDs taking a significant percentage of CPU.

Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson wrote:

Hi Dhairya,

We can induce the crash by simply restarting the MDS, and the crash seems to happen when an MDS goes from up:standby to up:replay. The MDS works through a few files in the log before eventually crashing; I've included the logs for this here (this is after I imported the backed-up journal, which I hope was successful, but please let me know if you suspect it wasn't!): https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

With respect to the client logs, are you referring to the clients who are writing to the filesystem? We don't typically run them in any sort of debug mode and we have quite a few machines running our backup system, but we can look an hour or so before the first MDS crash (though I don't know if this is when the de-sync occurred).
Here are some MDS logs with regards to the initial crash on Saturday morning though which may be helpful:

-59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
-58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-22T05:41:13.091556+0100)
-57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
-56> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 6 not found in OSDMap
-55> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
-54> 2024-06-22T05:41:43.209+0100 7f184de84700  5 asok(0x5592e7968000) register_command objecter_requests hook 0x5592e78f8800
-53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _renew_subs
-52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
-51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
-50> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue operator
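Stepping back to the recovery steps quoted earlier in this message: filled in for this cluster's single-active-MDS ceph_backup filesystem (rank 0 only), the same sequence would look roughly like the sketch below. This is just the procedure above made concrete, with the recursive scrub example as the final step.

  $ ceph fs fail ceph_backup
  $ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
  $ cephfs-journal-tool --rank=ceph_backup:0 journal reset
  $ ceph fs set ceph_backup joinable true
  # restart the MDS daemons, then:
  $ ceph tell mds.ceph_backup:0 scrub start / recursive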
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
) [0x7f18568b6669]
 6: (interval_set::erase(inodeno_t, inodeno_t, std::function)+0x2e5) [0x5592e5027885]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
 12: clone()

We have a relatively low debug setting normally, so I don't think many details of the initial crash were captured unfortunately, and the MDS logs before the above (i.e. "-60" and older) are just beacon messages and _check_auth_rotating checks.

I was wondering whether you have any recommendations in terms of what actions we could take to bring our filesystem back into a working state, short of rebuilding the entire metadata pool? We are quite keen to bring our backup back into service urgently as we currently do not have any accessible backups for our Ceph clusters.

Kindest regards,

Ivan

On 25/06/2024 19:18, Dhairya Parmar wrote:

On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson wrote:

Hi Dhairya,

Thank you for your rapid reply. I tried recovering the dentries for the file just before the crash I mentioned before and then splicing the transactions from the journal, which seemed to remove that issue for that inode but resulted in the MDS crashing on the next inode in the journal when performing replay.

The MDS delegates a range of preallocated inodes (in the form of a set, interval_set<inodeno_t> preallocated_inos) to the clients, so it can be one inode that is untracked, or some inodes from the range, or in the worst-case scenario ALL of them, and this is something that even the `cephfs-journal-tool` would not be able to tell (since we're talking about MDS internals which aren't exposed to such tools). That is the reason why you see "MDS crashing on the next inode in the journal when performing replay". An option could be to expose the inode set to some tool or asok cmd to identify such inode ranges, which needs to be discussed. For now, we're trying to address this in [0]; you can follow the discussion there.

[0] https://tracker.ceph.com/issues/66251

Removing all the transactions involving the directory housing the files that seemed to cause these crashes from the journal only caused the MDS to fail to even start replay. I've rolled back our journal to our original version when the crash first happened and the entire MDS log for the crash can be found here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

Awesome, this would help us a ton. Apart from this, would it be possible to send us client logs?

Please let us know if you would like any other log files as we can easily induce this crash.

Since you can easily induce the crash, can you share the reproducer please, i.e. what actions you take in order to hit it?

Kindest regards,

Ivan

On 25/06/2024 09:58, Dhairya Parmar wrote:
Hi Ivan,

This looks to be similar to the issue [0] that we're already addressing at [1]. So basically there is some out-of-sync event that led the client to make use of inodes that the MDS wasn't aware of/isn't tracking, and hence the crash. It'd be really helpful if you can provide us more logs.

CC @Rishabh Dave @Venky Shankar @Patrick Donnelly @Xiubo Li

[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251

--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc.

On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson wrote:

Hello,

We have been experiencing a serious issue with our CephFS backup cluster running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are constantly in a "replay" or "replay(laggy)" state and keep crashing. We have a single MDS filesystem
[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards
Hi Dhairya, Thank you for your rapid reply. I tried recovering the dentries for the file just before the crash I mentioned before and then splicing the transactions from the journal which seemed to remove that issue for that inode but resulted in the MDS crashing on the next inode in the journal when performing replay. Removing all the transactions involving the directory housing the files that seemed to cause these crashes from the journal only caused the MDS to fail to even start replay. I've rolled back our journal to our original version when the crash first happened and the entire MDS log for the crash can be found here: https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24 Please let us know if you would like any other logs file as we can easily induce this crash. Kindest regards, Ivan On 25/06/2024 09:58, Dhairya Parmar wrote: CAUTION: This email originated from outside of the LMB: *.-dpar...@redhat.com-.* Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phish...@mrc-lmb.cam.ac.uk -- Hi Ivan, This looks to be similar to the issue [0] that we're already addressing at [1]. So basically there is some out-of-sync event that led the client to make use of the inodes that MDS wasn't aware of/isn't tracking and hence the crash. It'd be really helpful if you can provide us more logs. CC @Rishabh Dave <mailto:rid...@redhat.com> @Venky Shankar <mailto:vshan...@redhat.com> @Patrick Donnelly <mailto:pdonn...@redhat.com> @Xiubo Li <mailto:xiu...@redhat.com> [0] https://tracker.ceph.com/issues/61009 [1] https://tracker.ceph.com/issues/66251 -- ***Dhairya Parmar* Associate Software Engineer, CephFS <https://www.redhat.com/>IBM, Inc. On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson wrote: Hello, We have been experiencing a serious issue with our CephFS backup cluster running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are constantly in a "replay" or "replay(laggy)" state and keep crashing. We have a single MDS filesystem called "ceph_backup" with 2 standby MDSes along with a 2nd unused filesystem "ceph_archive" (this holds little to no data) where we are using our "ceph_backup" filesystem to backup our data and this is the one which is currently broken. 
The Ceph health outputs currently are:

root@pebbles-s1 14:05 [~]: ceph -s
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1319 pgs not deep-scrubbed in time
            1054 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
    mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
    mds: 2/2 daemons up
    osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   7 pools, 2177 pgs
    objects: 3.55G objects, 7.0 PiB
    usage:   8.9 PiB used, 14 PiB / 23 PiB avail
    pgs:     83133528/30006841533 objects misplaced (0.277%)
             2090 active+clean
               47 active+clean+scrubbing+deep
               29 active+remapped+backfilling
                8 active+remapped+backfill_wait
                2 active+clean+scrubbing
                1 active+clean+snaptrim

  io:
    recovery: 1.9 GiB/s, 719 objects/s

root@pebbles-s1 14:09 [~]: ceph fs status
ceph_backup - 0 clients
===
RANK      STATE         MDS     ACTIVITY  DNS  INOS  DIRS  CAPS
 0    replay(laggy)  pebbles-s3             0     0     0     0
        POOL            TYPE     USED   AVAIL
   mds_backup_fs       metadata  1255G  2780G
ec82_primary_fs_data     data        0  2780G
      ec82pool           data    8442T  3044T
ceph_archive - 2 clients
RANK  STATE     MDS         ACTIVITY    DNS    INOS  DIRS  CAPS
 0    active  pebbles-s2  Reqs: 0 /s   13.4k   7105   118     2
        POOL            TYPE     USED   AVAIL
   mds_archive_fs      metadata  5184M  2780G
ec83_primary_fs_data     data        0  2780G
      ec83pool           data     138T  2767T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c
[ceph-users] CephFS MDS crashing during replay with standby MDSes crashing afterwards
...") where, failing that, we could erase this problematic event with "cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357". Is this a good idea? We would rather not rebuild the entire metadata pool if we could avoid it (once was enough for us) as this cluster has ~9 PB of data on it.

Kindest regards,

Ivan Clayson

--
Ivan Clayson - Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine
Hello Gregory and Nathan,

Having a look at our resource utilization, there doesn't seem to be a CPU or memory bottleneck, as there is plenty of both available for the host which has the blocked OSD as well as for the MDS' host.

We've had a repeat of this problem today where the OSD logging slow ops did not have any ops in flight despite: (i) being in an active state, (ii) clients requesting I/O from this OSD, and (iii) the MDS reporting that it was unable to get an rdlock. The blocked op reported by the MDS was initially related to our backups (but is not always), where this takes a snapshot every night and then we back up the snapshot to another Ceph cluster. We then delete this snapshot after we've backed it up.

# the blocked op on the MDS is related to backing up a snapshot of the file $FILE:
~$ ceph tell mds.0 dump_blocked_ops
    "description": "client_request(client.90018803:265050 getattr AsLsXsFs #0x14e2452//1710815659/... ... caller_uid=..., caller_gid=...)"
    "initiated_at": "...",
    "age": ...,
    "duration": ...,
    "type_data": {
        "flag_point": "failed to rdlock, waiting",
    ...

~$ ls -lrt $FILE
# ls -lrt hangs on a statx syscall on the file, where this then comes up as another blocked op in the MDS op list
~$ ceph tell mds.0 dump_blocked_ops
    client_request(client.91265572:7 getattr AsLsFs #0x1002fe5d755 ... caller_uid=..., caller_gid=...)

root@client-whose-held-active-cap-for-1002fe5d755-the-longest ~$ grep 1002fe5d755 /sys/kernel/debug/ceph/*/osdc
1652 osd7 3.3519a4ff 3.4ffs0 [7,132,61,143,109,98,18,44,269,238]/7 [7,132,61,143,109,98,18,44,269,238]/7 e159072 1002fe5d755.0011 0x400024 1 write

~$ systemctl status --no-pager --full ceph-osd@7
ceph-osd[1184036]: osd.7 158945 get_health_metrics reporting 8 slow ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0 3:ff28a5a4:::1002feddfaa.:head [write 0~4194304 [1@-1] in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)
ceph-osd[1184036]: osd.7 158945 get_health_metrics reporting 6 slow ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0 3:ff28a5a4:::1002feddfaa.:head [write 0~4194304 [1@-1] in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)

There was nothing in dmesg or wrong with the HDD for osd.7 (or any drives for that matter), and osd.7 reported no blocked ops or any ops in flight from the daemon via `ceph tell`. However, when looking at the historic slow ops, the oldest one still saved related to this stuck $FILE object (1002fe5d755), and it seems that about half of the recorded historical slow ops are about this PG, with them all occurring around the same time the OSD slow ops started occurring:

~$ ceph tell osd.7 dump_historic_slow_ops
    "description": "osd_op(client.89624569.0:1151567 3.4ffs0 3:ff27ac5d:::1002fea2a90.000a:head [write 0~4194304] snapc 6ab9=[6ab9] ondisk+write+known_if_redirected e158919)",
    "initiated_at": "...",
    "age": ...,
    "duration": ...,
    "type_data": {
        "flag_point": "commit sent; apply or cleanup",
    ...
    {
        "event": "header_read",
        "time": "2024-03-18T19:43:41.875596+",
        "duration": 4294967295.967
    },

I've highlighted this header_read duration as it apparently took ~136 years(!), so there seems to be something off, maybe with the messenger layer. I would be eager to hear what your thoughts are on this, as it seems after a while the OSD "forgets" about this slow op and stops reporting it in the log.
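A side note on that duration: 4294967295 is 2^32 - 1, the largest unsigned 32-bit value, so the ~136-year figure is almost certainly a wrapped or underflowed counter rather than a real measurement. Quick shell arithmetic to sanity-check the conversion (31557600 being the seconds in a Julian year):

  $ echo $(( 4294967295 / 31557600 ))
  136
  $ printf '%x\n' 4294967295
  ffffffff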
I'm also curious about your thoughts on this being related to the number of snapshots we have as we get rid of the snapshot on this filesystem when we've copied over to the backup system but could this still cause problems and or are they issues with snaps? Kindest regards, Ivan On 15/03/2024 18:07, Gregory Farnum wrote: CAUTION: This email originated from outside of the LMB: *.-gfar...@redhat.com-.* Do not click links or open attachments unless you recognize the sender and know the content is safe. If you think this is a phishing email, please forward it to phish...@mrc-lmb.cam.ac.uk -- On Fri, Mar 15, 2024 at 6:15 AM Ivan Clayson wrote: Hello everyone, We've been experiencing on our quincy CephFS clusters (one 17.2.6 and another 17.2.7) repeated sl
[ceph-users] MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine
Hello everyone,

We've been experiencing on our quincy CephFS clusters (one 17.2.6 and another 17.2.7) repeated slow ops with our client kernel mounts (Ceph 17.2.7 and version 4 Linux kernels on all clients) that seem to originate from slow ops on OSDs despite the underlying hardware being fine. Our 2 clusters are similar and are both Alma8 systems, where more specifically:

* Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD OSDs storing the metadata and 432 spinning SATA disks storing the bulk data in an EC pool (8 data shards and 2 parity blocks) across 40 nodes. The whole cluster is used to support a single file system with 1 active MDS and 2 standby ones.
* Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD OSDs storing the metadata and 348 spinning SAS disks storing the bulk data in EC pools (8 data shards and 2 parity blocks). This cluster houses multiple filesystems, each with their own dedicated MDS, along with 3 communal standby ones.

Nearly daily we find that we get the following error messages: MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST. Along with these messages, certain files and directories cannot be stat-ed and any processes involving these files hang indefinitely. We have been fixing this by:

1. First, finding the oldest blocked MDS op and the inode listed there:

   ~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep -c description
   "description": "client_request(client.251247219:662 getattr AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+ caller_uid=26983, caller_gid=26983)",
   # inode/object of interest: 100922d1102

2. Second, finding all the current clients that have a cap for this blocked inode from the faulty MDS' session list (i.e. ceph tell mds.${my_mds} session ls --cap-dump) and then examining the client who's had the cap the longest:

   ~$ ceph tell mds.${my_mds} session ls --cap-dump
   ...
   2024-03-13T13:01:36: client.251247219
   2024-03-13T12:50:28: client.245466949

3. Then, on the aforementioned oldest client, get the current ops in flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files) and get the op corresponding to the blocked inode along with the OSD the I/O is going to:

   root@client245466949 $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc
   48366 osd79 2.249f8a51 2.a51s0 [79,351,232,179,107,195,323,14,128,167]/79 [79,351,232,179,107,195,323,14,128,167]/79 e374191 100922d1102.00f5 0x400024 1 write
   # osd causing errors is osd.79

4. Finally, we restart this "hanging" OSD, which results in ls and I/O on the previously "stuck" files no longer "hanging".

Once we get the OSD which the blocked inode is waiting for, we can see in the system logs that the OSD has slow ops:

~$ systemctl --no-pager --full status ceph-osd@79
...
2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0 2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
...

Files that these "hanging" inodes correspond to, as well as the directories housing these files, can't be opened or stat-ed (causing directories to hang), where we've found restarting this OSD with slow ops to be the least disruptive way of resolving this (compared with a forced umount and then re-mount on the client). There are no issues with the underlying hardware for either the OSD reporting these slow ops or any other drive within the acting PG, and there seems to be no correlation between what processes are involved or what type of files these are.
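Strung together, the four steps above look roughly like the sketch below. This is illustrative only: the MDS name, inode and OSD id are the ones from the example output above, and restarting the OSD with systemctl assumes a package-based (non-cephadm) deployment.

  # 1) oldest blocked MDS op -> note the inode (0x100922d1102 here)
  ~$ ceph tell mds.${my_mds} dump_blocked_ops 2> /dev/null | grep description

  # 2) which client has held a cap on that inode the longest
  ~$ ceph tell mds.${my_mds} session ls --cap-dump

  # 3) on that client, the in-flight OSD op for the inode and its target OSD
  root@client$ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc

  # 4) restart the OSD the op is stuck on (osd.79 in the example above)
  ~$ systemctl restart ceph-osd@79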
What could be causing these slow ops and certain files and directories to "hang"? There aren't workflows being performed that generate a large number of small files nor are there directories with a large number of files within them. This seems to happen with a wide range of hard-drives and we see this on SATA and SAS type drives where our nodes are interconnected with 25 Gb/s NICs so we can't see how the underlying hardware would be causing any I/O bottlenecks. Has anyone else seen this type of behaviour before and have any ideas? Is there a way to stop these from happening as we are having to solve these nearly daily now and we can't seem to find a way to reduce them. We do use snapshots to backup our cluster where we've been doing this for ~6 months and these issues have only been occurring on and off for a couple of months but much more frequently now. Kindest regards, Ivan Clayson -- Ivan Clayson - Scientific Computing Officer Room 2N249 Structural Studies MRC Laboratory of Molecular Biology
[ceph-users] Re: Clients failing to respond to capability release
..." which was similarly tackled by restarting the MDS that just took over. This finally resulted in only two clients failing to respond to cap releases on inodes they were holding (despite rebooting at the time), where performing a "ceph tell mds.N session kill CLIENT_ID" removed them from the session map and allowed the MDS' cache to become manageable again, thereby clearing all of these warning messages.

We've had this problem since the beginning of this year and upgrading from octopus to quincy has unfortunately not solved our problem. We've only really been able to solve this problem by undergoing an aggressive campaign of replacing hard drives which were reaching the end of their lives. This has substantially reduced the number of problems we've had in relation to this.

We would be very interested to hear about the rest of the community's experience in relation to this, and I would recommend looking at your underlying OSDs, Tim, to see whether there are any timeout or uncorrectable errors. We would also be very eager to hear if these approaches are sub-optimal and whether anyone else has any insight into our problems.

Sorry as well for resurrecting an old thread, but we thought our experiences may be helpful for others!

Kindest regards,

Ivan Clayson

On 19/09/2023 12:35, Tim Bishop wrote:

Hi,

I've seen this issue mentioned in the past, but with older releases. So I'm wondering if anybody has any pointers.

The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all clients are working fine, with the exception of our backup server. This is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1] (so I suspect a newer Ceph version?).

The backup server has multiple (12) CephFS mount points. One of them, the busiest, regularly causes this error on the cluster:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to respond to capability release client_id: 521306112

And occasionally, which may be unrelated, but occurs at the same time:

[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs

The second one clears itself, but the first sticks until I can unmount the filesystem on the client after the backup completes. It appears that whilst it's in this stuck state there may be one or more directory trees that are inaccessible to all clients. The backup server is walking the whole tree but never gets stuck itself, so either the inaccessible directory entry is caused after it has gone past, or it's not affected. Maybe the backup server is holding a directory when it shouldn't?

It may be that an upgrade to Quincy resolves this, since it's more likely to be in line with the kernel client version-wise, but I don't want to knee-jerk upgrade just to try and fix this problem.

Thanks for any advice.

Tim.

[1] The reason for the newer kernel is that the backup performance from CephFS was terrible with older kernels. This newer kernel does at least resolve that issue.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
-- Ivan Clayson - Scientific Computing Officer MRC Laboratory of Molecular Biology Francis Crick Ave, Cambridge CB2 0QH ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
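For anyone landing on this thread later, the session-kill step Ivan describes above looks roughly like this (client id 521306112 is the one from Tim's health warning, used purely as an example). Note that killing a session forcibly drops that client's caps and any dirty data it has not yet flushed, so it is very much a last resort:

  $ ceph tell mds.0 session ls                  # find the offending client id
  $ ceph tell mds.0 session kill 521306112      # evict it and release its caps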