[ceph-users] MDS stuck in replay and continually crashing during replay

2024-10-03 Thread Ivan Clayson
 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
   --- pthread ID / name mapping for recent threads ---
  7fa8b6d95640 / md_log_replay
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mds.pebbles-s3.log
   --- end dump of recent events ---

Our MDS then starts at the beginning of the replay process and 
continually re-replays the journal until it crashes again at the same point.


From what I understand, our journal has become corrupted at this file, 
and the journal is (worryingly) exceptionally large: we had to use a 
machine with 2 TiB of storage just to try and export it. What is 
causing this issue? Can we make small modifications to the journal to 
rectify this, or move the faulty journal object out of the bulk object 
store so that the transaction fails (and is thus skipped)? We really do 
not want to go through disaster recovery again 
(https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/#disaster-recovery-experts), 
as this is the 2nd time this has happened to this cluster in the last 4 
months and it took over a month to recover the data last time.
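
For reference, the sort of small modification we have in mind would be 
something along these lines (just a sketch; the rank and inode below are 
placeholders, and we would export a backup of the journal first):

   # back up the journal before touching anything
   $ cephfs-journal-tool --rank=<fs_name>:0 journal export /big/volume/journal.bin

   # check overall journal integrity and locate the damaged region
   $ cephfs-journal-tool --rank=<fs_name>:0 journal inspect

   # inspect, then splice out, the events touching the problematic inode
   $ cephfs-journal-tool --rank=<fs_name>:0 event get --inode <ino> summary
   $ cephfs-journal-tool --rank=<fs_name>:0 event splice --inode <ino> summary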


Kindest regards,

Ivan

--
Ivan Clayson
-
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH


[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-10 Thread Ivan Clayson

Hi Tim,

Alma8's active support ended in May this year and henceforth there are 
only security updates. But you make a good point and we are moving 
toward Alma9 very shortly!


Whilst we're mentioning distributions, we've had quite a good experience 
with Alma (notwithstanding our current but unrelated troubles) and we 
would recommend it.


Kindest regards,

Ivan

On 09/07/2024 16:19, Tim Holloway wrote:


Ivan,

This may be a little off-topic, but if you're still running AlmaLinux
8.9, it's worth noting that CentOS 8 actually end-of-lifed about 2
years ago, thanks to CentOS Stream.

Up until this last week, however, I had several AlmaLinux 8 machines
running myself, but apparently somewhere around May IBM/Red Hat pulled
all of its CentOS 8 enterprise sites offline, including Storage and
Ceph, which broke my yum updates.

As far as I'm aware, once you've installed cephadm (whether via
yum/dnf or otherwise) there's no further need for the RPM repos, but
losing yum support is not helping at the very least.

On the upside, it's possible to upgrade-in-place from AlmaLinux 8.9 to
AlmaLinux 9, although it may require temporarily disabling certain OS
services to appease the upgrade process.
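
If it helps, the route I'd look at is AlmaLinux's ELevate (leapp-based)
in-place upgrade. Very roughly, and assuming the package names/URL from
the ELevate documentation (double-check there before running anything):

   sudo yum install -y https://repo.almalinux.org/elevate/elevate-release-latest-el8.noarch.rpm
   sudo yum install -y leapp-upgrade leapp-data-almalinux
   sudo leapp preupgrade    # dry run; review /var/log/leapp/leapp-report.txt
   sudo leapp upgrade
   sudo reboot

The preupgrade report is where you'll see which services or settings
need to be temporarily disabled to appease the upgrade process.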

Probably won't solve your problem, but at least you'll be able to move
fairly painlessly to a better-supported platform.

   Best Regards,
      Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:

Hi Dhairya,

I would be more than happy to try and give as many details as
possible, but the Slack channel is private and requires my email to
have an account/access to it.

Wouldn't taking the discussion about this error to a private channel
also stop other users who experience this error from learning how and
why this happened, as well as prevent them from viewing the solution?
Would it not be possible to discuss this more publicly for the benefit
of the other users on the mailing list?

Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:


Hey Ivan,

This is a relatively new MDS crash, so this would require some
investigation but I was instructed to recommend disaster-recovery
steps [0] (except session reset) to you to get the FS up again.

This crash is being discussed on upstream CephFS slack channel [1]
with @Venky Shankar <mailto:vshan...@redhat.com> and other CephFS
devs. I'd encourage you to join the conversation, we can discuss
this
in detail and maybe go through the incident step by step which
should
help analyse the crash better.

[0]
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1]
https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson

wrote:

     Hi Dhairya,

     Thank you ever so much for having another look at this so
quickly.
     I don't think I have any logs similar to the ones you
referenced
     this time as my MDSs don't seem to enter the replay stage when
     they crash (or at least don't now after I've thrown the logs
away)
     but those errors do crop up in the prior logs I shared when the
     system first crashed.

     Kindest regards,

     Ivan

     On 08/07/2024 14:08, Dhairya Parmar wrote:


     Ugh, something went horribly wrong. I've downloaded the MDS
logs
     that contain assertion failure and it looks relevant to this
[0].
     Do you have client logs for this?

     The other log that you shared is being downloaded right now,
once
     that's done and I'm done going through it, I'll update you.

     [0] https://tracker.ceph.com/issues/54546

     On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
      wrote:

     Hi Dhairya,

     Sorry to resurrect this thread again, but we still
     unfortunately have an issue with our filesystem after we
     attempted to write new backups to it.

     We finished the scrub of the filesystem on Friday and ran
a
     repair scrub on the 1 directory which had metadata
damage.
     After doing so and rebooting, the cluster reported no

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-09 Thread Ivan Clayson

Hi Dhairya,

I would be more than happy to try and give as many details as possible, 
but the Slack channel is private and requires my email to have an 
account/access to it.


Wouldn't taking the discussion about this error to a private channel 
also stop other users who experience this error from learning how and 
why this happened, as well as prevent them from viewing the solution? 
Would it not be possible to discuss this more publicly for the benefit 
of the other users on the mailing list?


Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:


Hey Ivan,

This is a relatively new MDS crash, so this would require some 
investigation but I was instructed to recommend disaster-recovery 
steps [0] (except session reset) to you to get the FS up again.


This crash is being discussed on upstream CephFS slack channel [1] 
with @Venky Shankar <mailto:vshan...@redhat.com> and other CephFS 
devs. I'd encourage you to join the conversation, we can discuss this 
in detail and maybe go through the incident step by step which should 
help analyse the crash better.


[0] 
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts

[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson  
wrote:


Hi Dhairya,

Thank you ever so much for having another look at this so quickly.
I don't think I have any logs similar to the ones you referenced
this time as my MDSs don't seem to enter the replay stage when
they crash (or at least don't now after I've thrown the logs away)
but those errors do crop up in the prior logs I shared when the
system first crashed.

Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:


Ugh, something went horribly wrong. I've downloaded the MDS logs
that contain assertion failure and it looks relevant to this [0].
Do you have client logs for this?

The other log that you shared is being downloaded right now, once
that's done and I'm done going through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
 wrote:

Hi Dhairya,

Sorry to resurrect this thread again, but we still
unfortunately have an issue with our filesystem after we
attempted to write new backups to it.

We finished the scrub of the filesystem on Friday and ran a
repair scrub on the 1 directory which had metadata damage.
After doing so and rebooting, the cluster reported no issues
and data was accessible again.

We re-started the backups to run over the weekend and
unfortunately the filesystem crashed again where the log of
the failure is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without
the nowsync option this time to avoid the out-of-sync write
problems.

I've tried resetting the journal again after recovering the
dentries but unfortunately the filesystem is still in a
failed state despite setting joinable to true. The log of
this crash is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

I'm not sure how to proceed as I can't seem to get any MDS to
take over the first rank. I would like to do a scrub of the
filesystem and preferably overwrite the troublesome files
with the originals on the live filesystem. Do you have any
advice on how to make the filesystem leave its failed state?
I have a backup of the journal before I reset it so I can
roll back if necessary.

Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosna

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Ivan Clayson

Hi Dhairya,

Thank you ever so much for having another look at this so quickly. I 
don't think I have any logs similar to the ones you referenced this time 
as my MDSs don't seem to enter the replay stage when they crash (or at 
least don't now after I've thrown the logs away) but those errors do 
crop up in the prior logs I shared when the system first crashed.


Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:


Ugh, something went horribly wrong. I've downloaded the MDS logs that 
contain assertion failure and it looks relevant to this [0]. Do you 
have client logs for this?


The other log that you shared is being downloaded right now, once 
that's done and I'm done going through it, I'll update you.


[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson  
wrote:


Hi Dhairya,

Sorry to resurrect this thread again, but we still unfortunately
have an issue with our filesystem after we attempted to write new
backups to it.

We finished the scrub of the filesystem on Friday and ran a repair
scrub on the 1 directory which had metadata damage. After doing so
and rebooting, the cluster reported no issues and data was
accessible again.

We re-started the backups to run over the weekend and
unfortunately the filesystem crashed again where the log of the
failure is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without the
nowsync option this time to avoid the out-of-sync write problems.

I've tried resetting the journal again after recovering the
dentries but unfortunately the filesystem is still in a failed
state despite setting joinable to true. The log of this crash is
here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

I'm not sure how to proceed as I can't seem to get any MDS to take
over the first rank. I would like to do a scrub of the filesystem
and preferably overwrite the troublesome files with the originals
on the live filesystem. Do you have any advice on how to make the
filesystem leave its failed state? I have a backup of the journal
before I reset it so I can roll back if necessary.

Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
flag(s) set
    1750 pgs not deep-scrubbed in time
    1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum
pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1,
pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d);
10 remapped pgs
 flags
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs: 11785954/27384310061 objects misplaced (0.043%)
 2167 active+clean
 6    active+remapped+backfilling
 4    active+remapped+backfill_wait

ceph_backup - 0 clients
===
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
    POOL    TYPE USED  AVAIL
   mds_backup_fs  metadata  1174G  3071G
ec82_primary_fs_data    data   0   3071G
  ec82pool  data    8085T  4738T
ceph_archive - 2 clients

RANK  STATE  MDS ACTIVITY DNS    INOS DIRS   CAPS
 0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105 118  2
    POOL    TYPE USED  AVAIL
   mds_archive_fs metadata  5184M  3071G
ec83_primary_fs_data    data   0   3071G
  ec83pool  data 138T  4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Ivan Clayson
024 15:17, Dhairya Parmar wrote:




On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson  
wrote:


Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the
host key changed for drop.ceph.com? The fingerprint I'm being sent is
7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to
the one in our /usr/share/ceph/known_hosts_drop.ceph.com.

Ah, strange. Let me get in touch with folks who might know about this, 
will revert back to you ASAP


Thank you for your advice as well. We've reset our MDS' journal
and are currently in the process of a full filesystem scrub which
understandably is taking quite a bit of time but seems to be
progressing through the objects fine.

YAY!

Thank you ever so much for all your help and please do feel free
to follow up with us if you would like any further details about
our crash!

Glad to hear it went well, this bug is being worked on with high 
priority and once the patch is ready, it will be backported.


The root cause of this issue is `nowsync` (async dirops) being 
enabled by default with kclient [0]. This feature allows asynchronous 
creation and deletion of files, optimizing performance by avoiding 
round-trip latency for these system calls. However, in very rare cases 
(like yours :D), it can affect the system's consistency and stability. 
Hence, if this kind of optimization is not a priority for your 
workload, I recommend turning it off by switching the mount points to 
`wsync` and also setting the MDS config `mds_client_delegate_inos_pct` 
to `0` so that you don't end up in this situation again (until the bug 
fix arrives :)).


[0] 
https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
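
For example, something along these lines (a sketch only; the mount
source, client name and mount point here are illustrative, not taken
from your setup):

   # on each client: remount the kernel CephFS mount with synchronous dirops
   $ mount -t ceph <mon-host>:/ /mnt/cephfs -o name=<client>,wsync

   # on the cluster: stop delegating preallocated inode ranges to clients
   $ ceph config set mds mds_client_delegate_inos_pct 0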


Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:


Hi Ivan,

The solution (which has been successful for us in the past) is to
reset the journal. This would bring the fs back online and return
the MDSes to a stable state, but some data would be lost—the data
in the journal that hasn't been flushed to the backing store
would be gone. Therefore, you should try to flush out as much
journal data as possible before resetting the journal.

Here are the steps for this entire process:

1) Bring the FS offline
$ ceph fs fail <fs_name>

2) Recover dentries from journal (run it with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> event
recover_dentries summary

3) Reset the journal (again with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset

4) Bring the FS online
$ ceph fs set <fs_name> joinable true

5) Restart the MDSes

6) Perform scrub to ensure consistency of fs
$ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
# you could try a recursive scrub maybe `ceph tell
mds.<fs_name>:0 scrub start / recursive`

Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the
most time-consuming, but the rest should be quick).
* cephfs-journal-tool and metadata OSDs are bound to use a
significant CPU percentage. This is because cephfs-journal-tool
has to swig the journal data and flush it out to the backing
store, which also makes the metadata operations go rampant,
resulting in OSDs taking a significant percentage of CPU.

Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
 wrote:

Hi Dhairya,

We can induce the crash by simply restarting the MDS and the
crash seems to happen when an MDS goes from up:standby to
up:replay. The MDS works through a few files in the log
before eventually crashing where I've included the logs for
this here (this is after I imported the backed up journal
which I hope was successful but please let me know if you
suspect it wasn't!):

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

With respect to the client logs, are you referring to the
clients who are writing to the filesystem? We don't typically
run them in any sort of debug mode and we have quite a few
machines running our backup system but we can look an hour or
   

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-28 Thread Ivan Clayson

Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the host 
key changed for drop.ceph.com? The fingerprint I'm being sent is 
7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to the 
one in our /usr/share/ceph/known_hosts_drop.ceph.com.


Thank you for your advice as well. We've reset our MDS' journal and are 
currently in the process of a full filesystem scrub which understandably 
is taking quite a bit of time but seems to be progressing through the 
objects fine.


Thank you ever so much for all your help and please do feel free to 
follow up with us if you would like any further details about our crash!


Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:


Hi Ivan,

The solution (which has been successful for us in the past) is to 
reset the journal. This would bring the fs back online and return the 
MDSes to a stable state, but some data would be lost—the data in the 
journal that hasn't been flushed to the backing store would be gone. 
Therefore, you should try to flush out as much journal data as 
possible before resetting the journal.


Here are the steps for this entire process:

1) Bring the FS offline
$ ceph fs fail <fs_name>

2) Recover dentries from journal (run it with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> event 
recover_dentries summary


3) Reset the journal (again with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset

4) Bring the FS online
$ ceph fs set <fs_name> joinable true

5) Restart the MDSes

6) Perform scrub to ensure consistency of fs
$ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
# you could try a recursive scrub maybe `ceph tell mds.<fs_name>:0 
scrub start / recursive`
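
For example, for the single-active-MDS ceph_backup filesystem here
(rank 0), the whole sequence would look roughly like this (a sketch;
confirm the rank list with `ceph fs status` first):

$ ceph fs fail ceph_backup
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
$ cephfs-journal-tool --rank=ceph_backup:0 journal reset
$ ceph fs set ceph_backup joinable true
$ ceph tell mds.ceph_backup:0 scrub start / recursive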


Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most 
time-consuming, but the rest should be quick).
* cephfs-journal-tool and metadata OSDs are bound to use a significant 
CPU percentage. This is because cephfs-journal-tool has to swig the 
journal data and flush it out to the backing store, which also makes 
the metadata operations go rampant, resulting in OSDs taking a 
significant percentage of CPU.


Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson  
wrote:


Hi Dhairya,

We can induce the crash by simply restarting the MDS and the crash
seems to happen when an MDS goes from up:standby to up:replay. The
MDS works through a few files in the log before eventually
crashing where I've included the logs for this here (this is after
I imported the backed up journal which I hope was successful but
please let me know if you suspect it wasn't!):

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

With respect to the client logs, are you referring to the clients
who are writing to the filesystem? We don't typically run them in
any sort of debug mode and we have quite a few machines running
our backup system but we can look an hour or so before the first
MDS crash (though I don't know if this is when the de-sync
occurred). Here are some MDS logs with regards to the initial
crash on Saturday morning though which may be helpful:

   -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
monclient: tick
   -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
monclient: _check_auth_rotating have uptodate secrets (they
expire after 2024-06-22T05:41:13.091556+0100)
   -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1
mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
   -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4
mds.0.purge_queue operator():  data pool 6 not found in OSDMap
   -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4
mds.0.purge_queue operator():  data pool 3 not found in OSDMap
   -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
asok(0x5592e7968000) register_command objecter_requests hook
0x5592e78f8800
   -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _renew_subs
   -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _send_mon_message to mon.pebbles-s4 at
v2:10.1.5.134:3300/0 <http://10.1.5.134:3300/0>
   -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
log_channel(cluster) update_config to_monitors: true
to_syslog: false syslog_facility:  prio: info to_graylog:
false graylog_host: 127.0.0.1 graylog_port: 12201)
   -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4
mds.0.purge_queue operator

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-27 Thread Ivan Clayson
) [0x7f18568b6669]
 6: (interval_set::erase(inodeno_t, inodeno_t,
   std::function)+0x2e5) [0x5592e5027885]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
   MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
 12: clone()

We have a relatively low debug setting normally so I don't think many 
details of the initial crash were captured unfortunately and the MDS 
logs before the above (i.e. "-60" and older) are just beacon messages 
and _check_auth_rotating checks.


I was wondering whether you have any recommendations in terms of what 
actions we could take to bring our filesystem back into a working state 
short of rebuilding the entire metadata pool? We are quite keen to bring 
our backup back into service urgently as we currently do not have any 
accessible backups for our Ceph clusters.


Kindest regards,

Ivan

On 25/06/2024 19:18, Dhairya Parmar wrote:




On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson  
wrote:


Hi Dhairya,

Thank you for your rapid reply. I tried recovering the dentries
for the file just before the crash I mentioned before and then
splicing the transactions from the journal which seemed to remove
that issue for that inode but resulted in the MDS crashing on the
next inode in the journal when performing replay.

The MDS delegates a range of preallocated inodes (in the form of a set, 
interval_set<inodeno_t> preallocated_inos) to the clients, so it can be 
one untracked inode, some inodes from the range, or in the worst-case 
scenario ALL of them, and this is something that even 
`cephfs-journal-tool` would not be able to tell (since we're talking 
about MDS internals which aren't exposed to such tools). That is the 
reason why you see "MDS crashing on the next inode in the journal when 
performing replay".


An option could be to expose the inode set to some tool or asok cmd to 
identify such inode ranges, which needs to be discussed. For now, 
we're trying to address this in [0]; you can follow the discussion there.


[0] https://tracker.ceph.com/issues/66251

Removing all the transactions involving the directory housing the
files that seemed to cause these crashes from the journal only
caused the MDS to fail to even start replay.

I've rolled back our journal to our original version when the
crash first happened and the entire MDS log for the crash can be
found here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24

Awesome, this would help us a ton. Apart from this, would it be 
possible to send us client logs?


Please let us know if you would like any other logs file as we can
easily induce this crash.

Since you can easily induce the crash, can you please share the 
reproducer, i.e. what actions you take in order to hit this?


Kindest regards,

Ivan

On 25/06/2024 09:58, Dhairya Parmar wrote:


Hi Ivan,

This looks to be similar to the issue [0] that we're already
addressing at [1]. So basically there is some out-of-sync event
that led the client to make use of the inodes that MDS wasn't
aware of/isn't tracking and hence the crash. It'd be really
helpful if you can provide us more logs.

CC @Rishabh Dave <mailto:rid...@redhat.com> @Venky Shankar
<mailto:vshan...@redhat.com> @Patrick Donnelly
<mailto:pdonn...@redhat.com> @Xiubo Li <mailto:xiu...@redhat.com>

[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
Dhairya Parmar

Associate Software Engineer, CephFS

IBM, Inc.


On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
 wrote:

Hello,

We have been experiencing a serious issue with our CephFS
backup cluster
running quincy (version 17.2.7) on a RHEL8-derivative Linux
kernel
(Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our
filesystem are
constantly in a "replay" or "replay(laggy)" state and keep
crashing.

We have a single MDS filesyst

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-25 Thread Ivan Clayson

Hi Dhairya,

Thank you for your rapid reply. I tried recovering the dentries for the 
file just before the crash I mentioned before and then splicing the 
transactions from the journal which seemed to remove that issue for that 
inode but resulted in the MDS crashing on the next inode in the journal 
when performing replay. Removing all the transactions involving the 
directory housing the files that seemed to cause these crashes from the 
journal only caused the MDS to fail to even start replay.


I've rolled back our journal to our original version when the crash 
first happened and the entire MDS log for the crash can be found here: 
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24


Please let us know if you would like any other logs file as we can 
easily induce this crash.


Kindest regards,

Ivan

On 25/06/2024 09:58, Dhairya Parmar wrote:


Hi Ivan,

This looks to be similar to the issue [0] that we're already 
addressing at [1]. So basically there is some out-of-sync event that 
led the client to make use of the inodes that MDS wasn't aware 
of/isn't tracking and hence the crash. It'd be really helpful if you 
can provide us more logs.


CC @Rishabh Dave <mailto:rid...@redhat.com> @Venky Shankar 
<mailto:vshan...@redhat.com> @Patrick Donnelly 
<mailto:pdonn...@redhat.com> @Xiubo Li <mailto:xiu...@redhat.com>


[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
Dhairya Parmar

Associate Software Engineer, CephFS

IBM, Inc.


On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson  
wrote:


Hello,

We have been experiencing a serious issue with our CephFS backup
cluster
running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
(Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our
filesystem are
constantly in a "replay" or "replay(laggy)" state and keep crashing.

We have a single-active-MDS filesystem called "ceph_backup" with 2 standby
MDSes, along with a 2nd, barely used filesystem "ceph_archive" (this holds
little to no data). We use the "ceph_backup" filesystem to back up our
data, and this is the one which is currently broken. The Ceph
health outputs currently are:

    root@pebbles-s1 14:05 [~]: ceph -s
       cluster:
     id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
     health: HEALTH_WARN
     1 filesystem is degraded
     insufficient standby MDS daemons available
     1319 pgs not deep-scrubbed in time
     1054 pgs not scrubbed in time

       services:
     mon: 4 daemons, quorum
    pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
     mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
    pebbles-s3, pebbles-s1
     mds: 2/2 daemons up
     osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
    remapped pgs

       data:
     volumes: 1/2 healthy, 1 recovering
     pools:   7 pools, 2177 pgs
     objects: 3.55G objects, 7.0 PiB
     usage:   8.9 PiB used, 14 PiB / 23 PiB avail
     pgs: 83133528/30006841533 objects misplaced (0.277%)
      2090 active+clean
      47   active+clean+scrubbing+deep
      29   active+remapped+backfilling
      8    active+remapped+backfill_wait
      2    active+clean+scrubbing
      1    active+clean+snaptrim

       io:
     recovery: 1.9 GiB/s, 719 objects/s

    root@pebbles-s1 14:09 [~]: ceph fs status
    ceph_backup - 0 clients
    ===
    RANK  STATE MDS  ACTIVITY   DNS    INOS DIRS CAPS
      0    replay(laggy)  pebbles-s3   0  0 0  0
     POOL    TYPE USED  AVAIL
        mds_backup_fs  metadata  1255G  2780G
    ec82_primary_fs_data    data   0   2780G
       ec82pool  data    8442T  3044T
    ceph_archive - 2 clients
    
    RANK  STATE  MDS ACTIVITY DNS    INOS DIRS CAPS
      0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105 118 2
     POOL    TYPE USED  AVAIL
        mds_archive_fs metadata  5184M  2780G
    ec83_primary_fs_data    data   0   2780G
       ec83pool  data 138T  2767T
    MDS version: ceph version 17.2.7
    (b12291d110049b2f35e32e0de30d70e9a4c

[ceph-users] CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-24 Thread Ivan Clayson
[...] where, failing that, we could erase this problematic 
event with "cephfs-journal-tool --rank=ceph_backup:0 event splice 
--inode 1101069090357". Is this a good idea? We would rather not rebuild 
the entire metadata pool if we could avoid it (once was enough for us) 
as this cluster has ~9 PB of data on it.


Kindest regards,

Ivan Clayson

--
Ivan Clayson
-
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH


[ceph-users] Re: MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine

2024-03-19 Thread Ivan Clayson

Hello Gregory and Nathan,

Having a look at our resource utilization, there doesn't seem to be a 
CPU or memory bottleneck, as there is plenty of both available for the 
host which has the blocked OSD as well as for the MDS' host.


We've had a repeat of this problem today where the OSD logging slow 
ops did not have any ops in flight despite: (i) being in an active 
state, (ii) clients requesting I/O from this OSD, and (iii) the MDS 
reporting that it was unable to get an rdlock. The blocked op reported by 
the MDS was initially related to our backups (but is not always): the 
backup takes a snapshot every night, we back up the snapshot to another 
Ceph cluster, and we then delete the snapshot after it has been backed up.


   # the blocked op on the MDS is related to backing up a snapshot of
   the file $FILE:
   ~$ ceph tell mds.0 dump_blocked_ops
            "description": "client_request(client.90018803:265050
   getattr AsLsXsFs #0x14e2452//1710815659/... ... caller_uid=...,
   caller_gid=...)"
    "initiated_at": "...",
    "age": ...,
    "duration": ...,
    "type_data": {
    "flag_point": "failed to rdlock, waiting",
   ...

   ~$ ls -lrt $FILE
    # ls -lrt hangs as it hangs on a statx syscall on the file
   where this then comes up as another blocked op in the MDS op list

   ~$ ceph tell mds.0 dump_blocked_ops
   
   client_request(client.91265572:7 getattr AsLsFs #0x1002fe5d755 ...
   caller_uid=..., caller_gid=...)

   root@client-whose-held-active-cap-for-1002fe5d755-the-longest ~$
   grep 1002fe5d755 /sys/kernel/debug/ceph/*/osdc
   1652    osd7    3.3519a4ff  3.4ffs0
   [7,132,61,143,109,98,18,44,269,238]/7
   [7,132,61,143,109,98,18,44,269,238]/7   e159072
   1002fe5d755.0011    0x400024    1   write

   ~$ systemctl status --no-pager --full ceph-osd@7
   
   ceph-osd[1184036]:  osd.7 158945 get_health_metrics reporting 8 slow
   ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0
   3:ff28a5a4:::1002feddfaa.:head [write 0~4194304 [1@-1]
   in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)
   ceph-osd[1184036]:  osd.7 158945 get_health_metrics reporting 6 slow
   ops, oldest is osd_op(client.90099026.0:4068839 3.4ffs0
   3:ff28a5a4:::1002feddfaa.:head [write 0~4194304 [1@-1]
   in=4194304b] snapc 6af1=[] ondisk+write+known_if_redirected e158942)

There was nothing in dmesg or wrong with the HDD for osd.7 (or any 
drives for that matter), and osd.7 reported no blocked ops or any ops in 
flight from the daemon via `ceph tell`. However, when looking at the 
historic slow ops, the oldest one still saved related to this stuck 
$FILE object (1002fe5d755), and it seems that about half of the recorded 
historic slow ops are about this PG, with them all occurring around the 
same time the OSD slow ops started:


   ~$ ceph tell osd.7 dump_historic_slow_ops
   "description": "osd_op(client.89624569.0:1151567 3.4ffs0
   3:ff27ac5d:::1002fea2a90.000a:head [write 0~4194304] snapc
   6ab9=[6ab9] ondisk+write+known_if_redirected e158919)",
    "initiated_at": "...",
    "age": ...,
    "duration": ...,
    "type_data": {
    "flag_point": "commit sent; apply or cleanup",
            ...
    {
    "event": "header_read",
    "time": "2024-03-18T19:43:41.875596+",
    "duration": 4294967295.967
    },

I've highlighted this header_read duration as it apparently took ~136 
years(!): 4294967295 is 2^32 - 1, so this looks like an unsigned counter 
wrapping rather than a real duration, and suggests something is off, 
maybe in the messenger layer.
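
(A quick way to tally those, assuming the JSON field names shown in the
excerpt above, is something like:

   ~$ ceph tell osd.7 dump_historic_slow_ops | jq -r '.ops[].description' | sort | uniq -c | sort -rn

which makes it easy to see how many of the saved slow ops hit the same
object or PG.)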


I would be eager to hear your thoughts on this, as it seems that after 
a while the OSD "forgets" about this slow op and stops reporting it in 
the log. I'm also curious whether this could be related to the number of 
snapshots we have: we get rid of the snapshot on this filesystem once 
we've copied it over to the backup system, but could this still cause 
problems, or are there known issues with snaps?


Kindest regards,

Ivan

On 15/03/2024 18:07, Gregory Farnum wrote:


On Fri, Mar 15, 2024 at 6:15 AM Ivan Clayson  
wrote:


Hello everyone,

We've been experiencing on our quincy CephFS clusters (one 17.2.6 and
another 17.2.7) repeated sl

[ceph-users] MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine

2024-03-15 Thread Ivan Clayson

Hello everyone,

We've been experiencing on our quincy CephFS clusters (one 17.2.6 and 
another 17.2.7) repeated slow ops with our client kernel mounts 
(Ceph 17.2.7 and version 4 Linux kernels on all clients) that seem to 
originate from slow ops on osds despite the underlying hardware being 
fine. Our 2 clusters are similar and are both Alma8 systems where more 
specifically:


 * Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
   OSDs storing the metadata and 432 spinning SATA disks storing the
   bulk data in an EC pool (8 data shards and 2 parity blocks) across
   40 nodes. The whole cluster is used to support a single file system
   with 1 active MDS and 2 standby ones.
 * Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
   OSDs storing the metadata and 348 spinning SAS disks storing the
   bulk data in EC pools  (8 data shards and 2 parity blocks). This
   cluster houses multiple filesystems each with their own dedicated
   MDS along with 3 communal standby ones.

Nearly daily we find that we get the following error messages: 
MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST. 
Along with these messages, certain files and directories cannot be stat-ed 
and any processes involving these files hang indefinitely. We have been 
fixing this with the steps below (a condensed sketch of the commands 
follows the list):


   1. First, finding the oldest blocked MDS op and the inode listed there:

   ~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep
   -c description

   "description": "client_request(client.251247219:662 getattr
   AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+
   caller_uid=26983, caller_gid=26983)",

   # inode/ object of interest: 100922d1102

   2. Second, finding all the current clients that have a cap for this
   blocked inode from the faulty MDS' session list (i.e. ceph tell
   mds.${my_mds} session ls --cap-dump) and then examining the client
   who has had the cap the longest:

   ~$ ceph tell mds.${my_mds} session ls --cap-dump ...

   2024-03-13T13:01:36: client.251247219

   2024-03-13T12:50:28: client.245466949

   3. Then on the aforementioned oldest client, get the current ops in
   flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
   and get the op corresponding to the blocked inode along with the OSD
   the I/O is going to:

   root@client245466949 $ grep 100922d1102
   /sys/kernel/debug/ceph/*/osdc

   48366  osd79 2.249f8a51  2.a51s0
   [79,351,232,179,107,195,323,14,128,167]/79
   [79,351,232,179,107,195,323,14,128,167]/79  e374191
   100922d1102.00f5  0x400024  1 write

   # osd causing errors is osd.79

   4. Finally, we restart this "hanging" OSD, after which ls and I/O on
   the previously "stuck" files no longer hang.
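
Put together, the condensed sketch of those steps looks roughly like
this (the MDS name, inode and OSD id are just the examples from above):

   # 1. the oldest blocked op on the MDS gives us the inode
   ~$ ceph tell mds.${my_mds} dump_blocked_ops | grep -m1 description

   # 2. the session list (--cap-dump) tells us which client has held a
   #    cap on that inode the longest
   ~$ ceph tell mds.${my_mds} session ls --cap-dump

   # 3. on that client, the in-flight OSD ops name the OSD the write is
   #    stuck on (osd79 in the example above)
   root@client $ grep 100922d1102 /sys/kernel/debug/ceph/*/osdc

   # 4. restart that OSD to clear the stuck op
   ~$ systemctl restart ceph-osd@79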

Once we get the OSD which the blocked inode is waiting for, we can 
see in the system logs that the OSD has slow ops:


~$ systemctl --no-pager --full status ceph-osd@79

   ...
   2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
   slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
   2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
   ...

Files that these "hanging" inodes correspond to, as well as the 
directories housing them, can't be opened or stat-ed (causing 
directories to hang), and we've found restarting the OSD with slow ops 
to be the least disruptive way of resolving this (compared with a forced 
umount and then re-mount on the client). There are no issues with the 
underlying hardware for either the OSD reporting these slow ops or any 
other drive within the acting PG, and there seems to be no correlation 
between what processes are involved or what type of files these are.


What could be causing these slow ops and certain files and directories 
to "hang"? There aren't workflows being performed that generate a large 
number of small files, nor are there directories with a large number of 
files within them. This happens with a wide range of hard drives, both 
SATA and SAS, and our nodes are interconnected with 25 Gb/s NICs, so we 
can't see how the underlying hardware would be causing any I/O 
bottlenecks. Has anyone else seen this type of behaviour before and have 
any ideas? Is there a way to stop these from happening, as we are having 
to solve them nearly daily now and can't seem to find a way to reduce 
them? We do use snapshots to back up our cluster and have been doing so 
for ~6 months; these issues have only been occurring on and off for a 
couple of months, but much more frequently now.



Kindest regards,

Ivan Clayson

--
Ivan Clayson
-
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology

[ceph-users] Re: Clients failing to respond to capability release

2023-10-12 Thread Ivan Clayson
[...] which was similarly tackled 
by restarting the MDS that just took over. This finally resulted in only 
two clients failing to respond to caps releases on inodes they were 
holding (despite rebooting at the time), where performing a "ceph tell 
mds.N session kill CLIENT_ID" removed them from the session map and 
allowed the MDS' cache to become manageable again, thereby clearing all of 
these warning messages.
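
(Concretely, that was just a case of finding the stuck client in the
session map and then evicting it, along these lines, with the MDS name
and client id as placeholders:

   ~$ ceph tell mds.<name> session ls
   ~$ ceph tell mds.<name> session kill <client_id>

after which the cache became manageable again.)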


We've had this problem since the beginning of this year, and upgrading 
from octopus to quincy has unfortunately not solved it. We've only 
really been able to mitigate it through an aggressive campaign of 
replacing hard drives which were reaching the end of their lives. This 
has substantially reduced the number of problems we've had in relation 
to this.


We would be very interested to hear about the rest of the community's 
experience in relation to this, and I would recommend looking at your 
underlying OSDs, Tim, to see whether there are any timeout or 
uncorrectable errors. We would also be very eager to hear if these 
approaches are sub-optimal and whether anyone else has any insight into 
our problems. Sorry as well for resurrecting an old thread, but we 
thought our experiences may be helpful for others!


Kindest regards,

Ivan Clayson

On 19/09/2023 12:35, Tim Bishop wrote:

Hi,

I've seen this issue mentioned in the past, but with older releases. So
I'm wondering if anybody has any pointers.

The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
clients are working fine, with the exception of our backup server. This
is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1]
(so I suspect a newer Ceph version?).

The backup server has multiple (12) CephFS mount points. One of them,
the busiest, regularly causes this error on the cluster:

HEALTH_WARN 1 clients failing to respond to capability release
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability 
release
 mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing to 
respond to capability release client_id: 521306112

And occasionally, which may be unrelated, but occurs at the same time:

[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
 mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs

The second one clears itself, but the first sticks until I can unmount
the filesystem on the client after the backup completes.

It appears that whilst it's in this stuck state there may be one or more
directory trees that are inaccessible to all clients. The backup server
is walking the whole tree but never gets stuck itself, so either the
inaccessible directory entry is caused after it has gone past, or it's
not affected. Maybe the backup server is holding a directory when it
shouldn't?

It may be that an upgrade to Quincy resolves this, since it's more
likely to be inline with the kernel client version wise, but I don't
want to knee-jerk upgrade just to try and fix this problem.

Thanks for any advice.

Tim.

[1] The reason for the newer kernel is that the backup performance from
CephFS was terrible with older kernels. This newer kernel does at least
resolve that issue.




--
Ivan Clayson
-
Scientific Computing Officer
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH