[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-10 Thread Ivan Clayson

Hi Tim,

Alma8's active support ended in May this year and henceforth there are 
only security updates. But you make a good point and we are moving 
toward Alma9 very shortly!


Whilst we're mentioning distributions, we've had quite a good experience 
with Alma (notwithstanding our current but unrelated troubles) and we 
would recommend it.


Kindest regards,

Ivan

On 09/07/2024 16:19, Tim Holloway wrote:


Ivan,

This may be a little off-topic, but if you're still running AlmaLinux
8.9, it's worth noting that CentOS 8 actually reached end of life about
2 years ago, thanks to CentOS Stream.

Up until this last week, however, I had several AlmaLinux 8 machines
running myself, but apparently somewhere around May IBM/Red Hat pulled
all of its CentOS 8 enterprise sites offline, including Storage and
Ceph, which broke my yum updates.

While, as far as I'm aware, once you've installed cephadm (whether via
yum/dnf or otherwise) there's no further need for the RPM repos, losing
yum support doesn't help at the very least.

On the upside, it's possible to upgrade-in-place from AlmaLinux 8.9 to
AlmaLinux 9, although it may require temporarily disabling certain OS
services to appease the upgrade process.
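
For reference, the in-place route described above is normally done with
AlmaLinux's ELevate/leapp tooling. A rough sketch only (package names and
URL as per the ELevate quick-start at the time of writing, so verify
against the current docs before running anything):

$ sudo dnf install -y https://repo.almalinux.org/elevate/elevate-release-latest-el8.noarch.rpm
$ sudo dnf install -y leapp-upgrade leapp-data-almalinux
$ sudo leapp preupgrade   # reports inhibitors, e.g. services or modules to disable first
$ sudo leapp upgrade && sudo reboot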

Probably won't solve your problem, but at least you'll be able to move
fairly painlessly to a better-supported platform.

   Best Regards,
  Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:

Hi Dhairya,

I would be more than happy to try and give as many details as possible,
but the slack channel is private and requires my email to have an
account/access to it.

Wouldn't taking the discussion about this error to a private channel
also stop other users who experience this error from learning how and
why this happened, as well as prevent them from viewing the solution?
Would it not be possible to discuss this more publicly for the benefit
of the other users on the mailing list?

Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:


Hey Ivan,

This is a relatively new MDS crash, so this would require some
investigation but I was instructed to recommend disaster-recovery
steps [0] (except session reset) to you to get the FS up again.

This crash is being discussed on upstream CephFS slack channel [1]
with @Venky Shankar  and other CephFS
devs. I'd encourage you to join the conversation, we can discuss
this
in detail and maybe go through the incident step by step which
should
help analyse the crash better.

[0]
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1]
https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson

wrote:

     Hi Dhairya,

     Thank you ever so much for having another look at this so
quickly.
     I don't think I have any logs similar to the ones you
referenced
     this time as my MDSs don't seem to enter the replay stage when
     they crash (or at least don't now after I've thrown the logs
away)
     but those errors do crop up in the prior logs I shared when the
     system first crashed.

     Kindest regards,

     Ivan

     On 08/07/2024 14:08, Dhairya Parmar wrote:


     Ugh, something went horribly wrong. I've downloaded the MDS
logs
     that contain assertion failure and it looks relevant to this
[0].
     Do you have client logs for this?

     The other log that you shared is being downloaded right now,
once
     that's done and I'm done going through it, I'll update you.

     [0] https://tracker.ceph.com/issues/54546

     On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
      wrote:

     Hi Dhairya,

     Sorry to resurrect this thread again, but we still
     unfortunately have an issue with our filesystem after we
     attempted to write new backups to it.

     We finished the scrub of the filesystem on Friday and ran
a
     repair scrub on the 1 directory which had metadata
damage.
     After doing so and rebooting, the cluster reported no
issues
     and data was accessible again.

     We re-started the backups to run over the weekend and
 

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-09 Thread Tim Holloway
Ivan,

This may be a little off-topic, but if you're still running AlmaLinux
8.9, it's worth noting that CentOS 8 actually reached end of life about
2 years ago, thanks to CentOS Stream.

Up until this last week, however, I had several AlmaLinux 8 machines
running myself, but apparently somewhere around May IBM/Red Hat pulled
all of its CentOS 8 enterprise sites offline, including Storage and
Ceph, which broke my yum updates.

While, as far as I'm aware, once you've installed cephadm (whether via
yum/dnf or otherwise) there's no further need for the RPM repos, losing
yum support doesn't help at the very least.

On the upside, it's possible to upgrade-in-place from AlmaLinux 8.9 to
AlmaLinux 9, although it may require temporarily disabling certain OS
services to appease the upgrade process.

Probably won't solve your problem, but at least you'll be able to move
fairly painlessly to a better-supported platform.

  Best Regards,
 Tim

On Tue, 2024-07-09 at 11:14 +0100, Ivan Clayson wrote:
> Hi Dhairya,
> 
> I would be more than happy to try and give as many details as
> possible, but the slack channel is private and requires my email to
> have an account/access to it.
> 
> Wouldn't taking the discussion about this error to a private channel
> also stop other users who experience this error from learning how and
> why this happened, as well as prevent them from viewing the solution?
> Would it not be possible to discuss this more publicly for the benefit
> of the other users on the mailing list?
> 
> Kindest regards,
> 
> Ivan
> 
> On 09/07/2024 10:44, Dhairya Parmar wrote:
> > 
> > Hey Ivan,
> > 
> > This is a relatively new MDS crash, so this would require some 
> > investigation but I was instructed to recommend disaster-recovery 
> > steps [0] (except session reset) to you to get the FS up again.
> > 
> > This crash is being discussed on upstream CephFS slack channel [1] 
> > with @Venky Shankar  and other CephFS 
> > devs. I'd encourage you to join the conversation, we can discuss
> > this 
> > in detail and maybe go through the incident step by step which
> > should 
> > help analyse the crash better.
> > 
> > [0] 
> > https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> > [1]
> > https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519
> > 
> > On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson
> >  
> > wrote:
> > 
> >     Hi Dhairya,
> > 
> >     Thank you ever so much for having another look at this so
> > quickly.
> >     I don't think I have any logs similar to the ones you
> > referenced
> >     this time as my MDSs don't seem to enter the replay stage when
> >     they crash (or at least don't now after I've thrown the logs
> > away)
> >     but those errors do crop up in the prior logs I shared when the
> >     system first crashed.
> > 
> >     Kindest regards,
> > 
> >     Ivan
> > 
> >     On 08/07/2024 14:08, Dhairya Parmar wrote:
> > > 
> > >     Ugh, something went horribly wrong. I've downloaded the MDS
> > > logs
> > >     that contain assertion failure and it looks relevant to this
> > > [0].
> > >     Do you have client logs for this?
> > > 
> > >     The other log that you shared is being downloaded right now,
> > > once
> > >     that's done and I'm done going through it, I'll update you.
> > > 
> > >     [0] https://tracker.ceph.com/issues/54546
> > > 
> > >     On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
> > >      wrote:
> > > 
> > >     Hi Dhairya,
> > > 
> > >     Sorry to resurrect this thread again, but we still
> > >     unfortunately have an issue with our filesystem after we
> > >     attempted to write new backups to it.
> > > 
> > >     We finished the scrub of the filesystem on Friday and ran
> > > a
> > >     repair scrub on the 1 directory which had metadata
> > > damage.
> > >     After doing so and rebooting, the cluster reported no
> > > issues
> > >     and data was accessible again.
> > > 
> > >     We re-started the backups to run over the weekend and
> > >     unfortunately the filesystem crashed again where the log
> > > of
> > >     the failure is here:
> > >    
> > > https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-09 Thread Dhairya Parmar
On Tue, Jul 9, 2024 at 3:46 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> I would be more than happy to try and give as many details as possible but
> the slack channel is private and requires my email to have an
> account/access to it.
>
You're right that you need an account on Slack, but it isn't private at
all. The channel is open to everyone (it's the upstream Slack channel :D);
you just need an email address to access it, and again it's entirely your
choice, not mandatory. I'd ask @Venky Shankar @Patrick Donnelly to add
their input since they've been working on similar issues and can provide
better insights.

> Wouldn't taking the discussion about this error to a private channel also
> stop other users who experience this error from learning how and why this
> happened, as well as prevent them from viewing the solution? Would it not
> be possible to discuss this more publicly for the benefit of the other
> users on the mailing list?
>
> Kindest regards,
>
> Ivan
> On 09/07/2024 10:44, Dhairya Parmar wrote:
>
> Hey Ivan,
>
> This is a relatively new MDS crash, so this would require some
> investigation but I was instructed to recommend disaster-recovery steps [0]
> (except session reset) to you to get the FS up again.
>
> This crash is being discussed on upstream CephFS slack channel [1] with @Venky
> Shankar  and other CephFS devs. I'd encourage you to
> join the conversation, we can discuss this in detail and maybe go through
> the incident step by step which should help analyse the crash better.
>
> [0]
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> [1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519
>
> On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson 
> wrote:
>
>> Hi Dhairya,
>>
>> Thank you ever so much for having another look at this so quickly. I
>> don't think I have any logs similar to the ones you referenced this time as
>> my MDSs don't seem to enter the replay stage when they crash (or at least
>> don't now after I've thrown the logs away) but those errors do crop up in
>> the prior logs I shared when the system first crashed.
>>
>> Kindest regards,
>>
>> Ivan
>> On 08/07/2024 14:08, Dhairya Parmar wrote:
>>
>> Ugh, something went horribly wrong. I've downloaded the MDS logs that
>> contain assertion failure and it looks relevant to this [0]. Do you have
>> client logs for this?
>>
>> The other log that you shared is being downloaded right now, once that's
>> done and I'm done going through it, I'll update you.
>>
>> [0] https://tracker.ceph.com/issues/54546
>>
>> On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson 
>> wrote:
>>
>>> Hi Dhairya,
>>>
>>> Sorry to resurrect this thread again, but we still unfortunately have an
>>> issue with our filesystem after we attempted to write new backups to it.
>>>
>>> We finished the scrub of the filesystem on Friday and ran a repair scrub
>>> on the 1 directory which had metadata damage. After doing so and rebooting,
>>> the cluster reported no issues and data was accessible again.
>>>
>>> We re-started the backups to run over the weekend and unfortunately the
>>> filesystem crashed again where the log of the failure is here:
>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
>>> We ran the backups on kernel mounts of the filesystem without the nowsync
>>> option this time to avoid the out-of-sync write problems.
>>>
>>> I've tried resetting the journal again after recovering the dentries but
>>> unfortunately the filesystem is still in a failed state despite setting
>>> joinable to true. The log of this crash is here:
>>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
>>> .
>>>
>>> I'm not sure how to proceed as I can't seem to get any MDS to take over
>>> the first rank. I would like to do a scrub of the filesystem and preferably
>>> overwrite the troublesome files with the originals on the live filesystem.
>>> Do you have any advice on how to make the filesystem leave its failed
>>> state? I have a backup of the journal before I reset it so I can roll back
>>> if necessary.
>>>
>>> Here are some details about the filesystem at present:
>>>
>>> root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
>>>   cluster:
>>> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>>> health: HEALTH_ERR
>>>  

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-09 Thread Ivan Clayson

Hi Dhairya,

I would be more than happy to try and give as many details as possible,
but the slack channel is private and requires my email to have an
account/access to it.


Wouldn't taking the discussion about this error to a private channel
also stop other users who experience this error from learning how and
why this happened, as well as prevent them from viewing the solution?
Would it not be possible to discuss this more publicly for the benefit
of the other users on the mailing list?


Kindest regards,

Ivan

On 09/07/2024 10:44, Dhairya Parmar wrote:


Hey Ivan,

This is a relatively new MDS crash, so this would require some 
investigation but I was instructed to recommend disaster-recovery 
steps [0] (except session reset) to you to get the FS up again.


This crash is being discussed on upstream CephFS slack channel [1] 
with @Venky Shankar  and other CephFS 
devs. I'd encourage you to join the conversation, we can discuss this 
in detail and maybe go through the incident step by step which should 
help analyse the crash better.


[0] 
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts

[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson  
wrote:


Hi Dhairya,

Thank you ever so much for having another look at this so quickly.
I don't think I have any logs similar to the ones you referenced
this time as my MDSs don't seem to enter the replay stage when
they crash (or at least don't now after I've thrown the logs away)
but those errors do crop up in the prior logs I shared when the
system first crashed.

Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:


Ugh, something went horribly wrong. I've downloaded the MDS logs
that contain assertion failure and it looks relevant to this [0].
Do you have client logs for this?

The other log that you shared is being downloaded right now, once
that's done and I'm done going through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
 wrote:

Hi Dhairya,

Sorry to resurrect this thread again, but we still
unfortunately have an issue with our filesystem after we
attempted to write new backups to it.

We finished the scrub of the filesystem on Friday and ran a
repair scrub on the 1 directory which had metadata damage.
After doing so and rebooting, the cluster reported no issues
and data was accessible again.

We re-started the backups to run over the weekend and
unfortunately the filesystem crashed again where the log of
the failure is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without
the nowsync option this time to avoid the out-of-sync write
problems.

I've tried resetting the journal again after recovering the
dentries but unfortunately the filesystem is still in a
failed state despite setting joinable to true. The log of
this crash is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

I'm not sure how to proceed as I can't seem to get any MDS to
take over the first rank. I would like to do a scrub of the
filesystem and preferably overwrite the troublesome files
with the originals on the live filesystem. Do you have any
advice on how to make the filesystem leave its failed state?
I have a backup of the journal before I reset it so I can
roll back if necessary.

Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
flag(s) set
    1750 pgs not deep-scru

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-09 Thread Dhairya Parmar
Hey Ivan,

This is a relatively new MDS crash, so this would require some
investigation but I was instructed to recommend disaster-recovery steps [0]
(except session reset) to you to get the FS up again.

This crash is being discussed on upstream CephFS slack channel [1] with @Venky
Shankar  and other CephFS devs. I'd encourage you to
join the conversation, we can discuss this in detail and maybe go through
the incident step by step which should help analyse the crash better.

[0]
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
[1] https://ceph-storage.slack.com/archives/C04LVQMHM9B/p1720443057919519

On Mon, Jul 8, 2024 at 7:37 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> Thank you ever so much for having another look at this so quickly. I don't
> think I have any logs similar to the ones you referenced this time as my
> MDSs don't seem to enter the replay stage when they crash (or at least
> don't now after I've thrown the logs away) but those errors do crop up in
> the prior logs I shared when the system first crashed.
>
> Kindest regards,
>
> Ivan
> On 08/07/2024 14:08, Dhairya Parmar wrote:
>
> Ugh, something went horribly wrong. I've downloaded the MDS logs that
> contain assertion failure and it looks relevant to this [0]. Do you have
> client logs for this?
>
> The other log that you shared is being downloaded right now, once that's
> done and I'm done going through it, I'll update you.
>
> [0] https://tracker.ceph.com/issues/54546
>
> On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson 
> wrote:
>
>> Hi Dhairya,
>>
>> Sorry to resurrect this thread again, but we still unfortunately have an
>> issue with our filesystem after we attempted to write new backups to it.
>>
>> We finished the scrub of the filesystem on Friday and ran a repair scrub
>> on the 1 directory which had metadata damage. After doing so and rebooting,
>> the cluster reported no issues and data was accessible again.
>>
>> We re-started the backups to run over the weekend and unfortunately the
>> filesystem crashed again where the log of the failure is here:
>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
>> We ran the backups on kernel mounts of the filesystem without the nowsync
>> option this time to avoid the out-of-sync write problems.
>>
>> I've tried resetting the journal again after recovering the dentries but
>> unfortunately the filesystem is still in a failed state despite setting
>> joinable to true. The log of this crash is here:
>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
>> .
>>
>> I'm not sure how to proceed as I can't seem to get any MDS to take over
>> the first rank. I would like to do a scrub of the filesystem and preferably
>> overwrite the troublesome files with the originals on the live filesystem.
>> Do you have any advice on how to make the filesystem leave its failed
>> state? I have a backup of the journal before I reset it so I can roll back
>> if necessary.
>>
>> Here are some details about the filesystem at present:
>>
>> root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
>>   cluster:
>> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>> health: HEALTH_ERR
>> 1 filesystem is degraded
>> 1 large omap objects
>> 1 filesystem is offline
>> 1 mds daemon damaged
>>
>> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
>> 1750 pgs not deep-scrubbed in time
>> 1612 pgs not scrubbed in time
>>
>>   services:
>> mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4
>> (age 50m)
>> mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3,
>> pebbles-s4
>> mds: 1/2 daemons up, 3 standby
>> osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped
>> pgs
>>  flags
>> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
>>
>>   data:
>> volumes: 1/2 healthy, 1 recovering; 1 damaged
>> pools:   7 pools, 2177 pgs
>> objects: 3.24G objects, 6.7 PiB
>> usage:   8.6 PiB used, 14 PiB / 23 PiB avail
>> pgs: 11785954/27384310061 objects misplaced (0.043%)
>>  2167 active+clean
>>  6active+remapped+backfilling
>>  4active+remapped+backfill_wait
>>
>> ceph_backup - 0 clients
>> ===
>> RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
>>  0failed
>> POOLTYPE USED  AVAIL
>>mds_backup_fs  metadata  1174G  3071G
>> ec82_primary_fs_datadata   0   3071G
>>   ec82pool  data8085T  4738T
>> ceph_archive - 2 clien

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Ivan Clayson

Hi Dhairya,

Thank you ever so much for having another look at this so quickly. I 
don't think I have any logs similar to the ones you referenced this time 
as my MDSs don't seem to enter the replay stage when they crash (or at 
least don't now after I've thrown the logs away) but those errors do 
crop up in the prior logs I shared when the system first crashed.


Kindest regards,

Ivan

On 08/07/2024 14:08, Dhairya Parmar wrote:


Ugh, something went horribly wrong. I've downloaded the MDS logs that 
contain assertion failure and it looks relevant to this [0]. Do you 
have client logs for this?


The other log that you shared is being downloaded right now, once 
that's done and I'm done going through it, I'll update you.


[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson  
wrote:


Hi Dhairya,

Sorry to resurrect this thread again, but we still unfortunately
have an issue with our filesystem after we attempted to write new
backups to it.

We finished the scrub of the filesystem on Friday and ran a repair
scrub on the 1 directory which had metadata damage. After doing so
and rebooting, the cluster reported no issues and data was
accessible again.

We re-started the backups to run over the weekend and
unfortunately the filesystem crashed again where the log of the
failure is here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without the
nowsync option this time to avoid the out-of-sync write problems.

I've tried resetting the journal again after recovering the
dentries but unfortunately the filesystem is still in a failed
state despite setting joinable to true. The log of this crash is
here:

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.

I'm not sure how to proceed as I can't seem to get any MDS to take
over the first rank. I would like to do a scrub of the filesystem
and preferably overwrite the troublesome files with the originals
on the live filesystem. Do you have any advice on how to make the
filesystem leave its failed state? I have a backup of the journal
before I reset it so I can roll back if necessary.

Here are some details about the filesystem at present:

root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
flag(s) set
    1750 pgs not deep-scrubbed in time
    1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum
pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1,
pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d);
10 remapped pgs
 flags
nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs: 11785954/27384310061 objects misplaced (0.043%)
 2167 active+clean
 6    active+remapped+backfilling
 4    active+remapped+backfill_wait

ceph_backup - 0 clients
===
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
    POOL    TYPE USED  AVAIL
   mds_backup_fs  metadata  1174G  3071G
ec82_primary_fs_data    data   0   3071G
  ec82pool  data    8085T  4738T
ceph_archive - 2 clients

RANK  STATE  MDS ACTIVITY DNS    INOS DIRS   CAPS
 0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105 118  2
    POOL    TYPE USED  AVAIL
   mds_archive_fs metadata  5184M  3071G
ec83_primary_fs_data    data   0   3071G
  ec83pool  data 138T  4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7
(b12291d110049b2f35

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Dhairya Parmar
Ugh, something went horribly wrong. I've downloaded the MDS logs that
contain assertion failure and it looks relevant to this [0]. Do you have
client logs for this?

The other log that you shared is being downloaded right now, once that's
done and I'm done going through it, I'll update you.

[0] https://tracker.ceph.com/issues/54546

On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> Sorry to resurrect this thread again, but we still unfortunately have an
> issue with our filesystem after we attempted to write new backups to it.
>
> We finished the scrub of the filesystem on Friday and ran a repair scrub
> on the 1 directory which had metadata damage. After doing so and rebooting,
> the cluster reported no issues and data was accessible again.
>
> We re-started the backups to run over the weekend and unfortunately the
> filesystem crashed again where the log of the failure is here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
> We ran the backups on kernel mounts of the filesystem without the nowsync
> option this time to avoid the out-of-sync write problems.
>
> I've tried resetting the journal again after recovering the dentries but
> unfortunately the filesystem is still in a failed state despite setting
> joinable to true. The log of this crash is here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708
> .
>
> I'm not sure how to proceed as I can't seem to get any MDS to take over
> the first rank. I would like to do a scrub of the filesystem and preferably
> overwrite the troublesome files with the originals on the live filesystem.
> Do you have any advice on how to make the filesystem leave its failed
> state? I have a backup of the journal before I reset it so I can roll back
> if necessary.
>
> Here are some details about the filesystem at present:
>
> root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
>   cluster:
> id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 large omap objects
> 1 filesystem is offline
> 1 mds daemon damaged
>
> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
> 1750 pgs not deep-scrubbed in time
> 1612 pgs not scrubbed in time
>
>   services:
> mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4
> (age 50m)
> mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3,
> pebbles-s4
> mds: 1/2 daemons up, 3 standby
> osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped
> pgs
>  flags
> nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
>
>   data:
> volumes: 1/2 healthy, 1 recovering; 1 damaged
> pools:   7 pools, 2177 pgs
> objects: 3.24G objects, 6.7 PiB
> usage:   8.6 PiB used, 14 PiB / 23 PiB avail
> pgs: 11785954/27384310061 objects misplaced (0.043%)
>  2167 active+clean
>  6active+remapped+backfilling
>  4active+remapped+backfill_wait
>
> ceph_backup - 0 clients
> ===
> RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
>  0failed
> POOLTYPE USED  AVAIL
>mds_backup_fs  metadata  1174G  3071G
> ec82_primary_fs_datadata   0   3071G
>   ec82pool  data8085T  4738T
> ceph_archive - 2 clients
> 
> RANK  STATE  MDS ACTIVITY DNSINOS   DIRS   CAPS
>  0active  pebbles-s4  Reqs:0 /s  13.4k  7105118  2
> POOLTYPE USED  AVAIL
>mds_archive_fs metadata  5184M  3071G
> ec83_primary_fs_datadata   0   3071G
>   ec83pool  data 138T  4307T
> STANDBY MDS
>  pebbles-s2
>  pebbles-s3
>  pebbles-s1
> MDS version: ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> root@pebbles-s2 11:55 [~]: ceph fs dump
> e2643889
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 1
>
> Filesystem 'ceph_backup' (1)
> fs_nameceph_backup
> epoch2643888
> flags12 joinable allow_snaps allow_multimds_snaps
> created2023-05-19T12:52:36.302135+0100
> modified2024-07-08T11:17:55.437861+0100
> tableserver0
> root0
> session_timeout60
> session_autoclose300
> max_file_size10993418240
> required_client_features{}
> last_failure0
> last_failure_osd_epoch494515
> compatcompat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inlin

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-07-08 Thread Ivan Clayson

Hi Dhairya,

Sorry to resurrect this thread again, but we still unfortunately have an 
issue with our filesystem after we attempted to write new backups to it.


We finished the scrub of the filesystem on Friday and ran a repair scrub 
on the 1 directory which had metadata damage. After doing so and 
rebooting, the cluster reported no issues and data was accessible again.
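
For readers following along: a directory-level repair scrub of this kind is
typically something along these lines (fs name and path are placeholders;
"recursive,repair" are standard scrubopts):

$ ceph tell mds.<fs_name>:0 damage ls
$ ceph tell mds.<fs_name>:0 scrub start /path/to/damaged/dir recursive,repair
$ ceph tell mds.<fs_name>:0 scrub status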


We re-started the backups to run over the weekend and unfortunately the 
filesystem crashed again where the log of the failure is here: 
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz. 
We ran the backups on kernel mounts of the filesystem without the 
nowsync option this time to avoid the out-of-sync write problems.


I've tried resetting the journal again after recovering the dentries but 
unfortunately the filesystem is still in a failed state despite setting 
joinable to true. The log of this crash is here: 
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.


I'm not sure how to proceed as I can't seem to get any MDS to take over 
the first rank. I would like to do a scrub of the filesystem and 
preferably overwrite the troublesome files with the originals on the 
live filesystem. Do you have any advice on how to make the filesystem 
leave its failed state? I have a backup of the journal before I reset it 
so I can roll back if necessary.
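
For reference, the roll-back mentioned here would normally be a journal
import of the saved backup and, with rank 0 flagged as damaged as in the
status below, the rank usually also has to be marked repaired before an
MDS can claim it. A hedged sketch with placeholder names, not something
to run blindly:

$ cephfs-journal-tool --rank=<fs_name>:0 journal import <journal-backup-file>
$ ceph mds repaired <fs_name>:0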


Here are some details about the filesystem at present:

   root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
    1 filesystem is degraded
    1 large omap objects
    1 filesystem is offline
    1 mds daemon damaged
   nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
   flag(s) set
    1750 pgs not deep-scrubbed in time
    1612 pgs not scrubbed in time

  services:
    mon: 4 daemons, quorum
   pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1,
   pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10
   remapped pgs
 flags
   nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs: 11785954/27384310061 objects misplaced (0.043%)
 2167 active+clean
 6    active+remapped+backfilling
 4    active+remapped+backfill_wait

   ceph_backup - 0 clients
   ===
   RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
    POOL    TYPE USED  AVAIL
   mds_backup_fs  metadata  1174G  3071G
   ec82_primary_fs_data    data   0   3071G
  ec82pool  data    8085T  4738T
   ceph_archive - 2 clients
   
   RANK  STATE  MDS ACTIVITY DNS    INOS   DIRS CAPS
 0    active  pebbles-s4  Reqs:    0 /s  13.4k  7105    118 2
    POOL    TYPE USED  AVAIL
   mds_archive_fs metadata  5184M  3071G
   ec83_primary_fs_data    data   0   3071G
  ec83pool  data 138T  4307T
   STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
   MDS version: ceph version 17.2.7
   (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
   root@pebbles-s2 11:55 [~]: ceph fs dump
   e2643889
   enable_multiple, ever_enabled_multiple: 1,1
   default compat: compat={},rocompat={},incompat={1=base
   v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir
   inode in separate object,5=mds uses versioned encoding,6=dirfrag is
   stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
   legacy client fscid: 1

   Filesystem 'ceph_backup' (1)
   fs_name    ceph_backup
   epoch    2643888
   flags    12 joinable allow_snaps allow_multimds_snaps
   created    2023-05-19T12:52:36.302135+0100
   modified    2024-07-08T11:17:55.437861+0100
   tableserver    0
   root    0
   session_timeout    60
   session_autoclose    300
   max_file_size    10993418240
   required_client_features    {}
   last_failure    0
   last_failure_osd_epoch    494515
   compat    compat={},rocompat={},incompat={1=base v0.20,2=client
   writeable ranges,3=default file layouts on dirs,4=dir inode in
   separate object,5=mds uses versioned encoding,6=dirfrag is stored in
   omap,7=mds uses inline data,8=no anchor table,9=file layout
   v2,10=snaprealm v2}
   max_mds    1
   in    0
   up    {}
   failed
   damaged    0
   stopped
   data_pools    [6,3]
   metadata_pool    2
   inline_data    disabled
   balancer
   standby_count_wanted    1


Kindest regards,

Ivan

On 28/06/2024 15:17, Dhairya Parmar wrot

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-28 Thread Dhairya Parmar
On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> I would be more than happy to share our corrupted journal. Has the host
> key changed for drop.ceph.com? The fingerprint I'm being sent is
> 7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to the one
> in our /usr/share/ceph/known_hosts_drop.ceph.com.
>
Ah, strange. Let me get in touch with folks who might know about this, will
revert back to you ASAP

> Thank you for your advice as well. We've reset our MDS' journal and are
> currently in the process of a full filesystem scrub which understandably is
> taking quite a bit of time but seems to be progressing through the objects
> fine.
>
YAY!

> Thank you ever so much for all your help and please do feel free to follow
> up with us if you would like any further details about our crash!
>
Glad to hear it went well, this bug is being worked on with high priority
and once the patch is ready, it will be backported.

The root cause of this issue is the `nowsync` (async dirops) option being
enabled by default with kclient [0]. This feature allows asynchronous
creation and deletion of files, optimizing performance by avoiding
round-trip latency for these system calls. However, in very rare cases
(like yours :D) it can affect the system's consistency and stability, so
if this kind of optimization is not a priority for your workload, I
recommend turning it off by switching the mount points to `wsync` and
also setting the MDS config `mds_client_delegate_inos_pct` to `0` so that
you don't end up in this situation again (until the bug fix arrives :)).

[0]
https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
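
In practice the two changes suggested above look roughly like this; the
mount source and client name below are placeholders and the exact device
string depends on how your kernel mounts are set up:

$ mount -t ceph <mon-host>:/ /mnt/cephfs -o name=<client>,wsync   # synchronous dirops
$ ceph config set mds mds_client_delegate_inos_pct 0              # stop delegating inode ranges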

> Kindest regards,
>
> Ivan
> On 27/06/2024 12:39, Dhairya Parmar wrote:
>
> Hi Ivan,
>
> The solution (which has been successful for us in the past) is to reset
> the journal. This would bring the fs back online and return the MDSes to a
> stable state, but some data would be lost—the data in the journal that
> hasn't been flushed to the backing store would be gone. Therefore, you
> should try to flush out as much journal data as possible before resetting
> the journal.
>
> Here are the steps for this entire process:
>
> 1) Bring the FS offline
> $ ceph fs fail <fs_name>
>
> 2) Recover dentries from journal (run it with every MDS Rank)
> $ cephfs-journal-tool --rank=<fs_name>:<rank> event recover_dentries
> summary
>
> 3) Reset the journal (again with every MDS Rank)
> $ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset
>
> 4) Bring the FS online
> $ ceph fs set <fs_name> joinable true
>
> 5) Restart the MDSes
>
> 6) Perform scrub to ensure consistency of fs
> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
> # you could try a recursive scrub maybe `ceph tell mds.<fs_name>:0 scrub
> start / recursive`
>
> Some important notes to keep in mind:
> * Recovering dentries will take time (generally, rank 0 is the most
> time-consuming, but the rest should be quick).
> * cephfs-journal-tool and metadata OSDs are bound to use a significant CPU
> percentage. This is because cephfs-journal-tool has to swig the journal
> data and flush it out to the backing store, which also makes the metadata
> operations go rampant, resulting in OSDs taking a significant percentage of
> CPU.
>
> Do let me know how this goes.
>
> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson 
> wrote:
>
>> Hi Dhairya,
>>
>> We can induce the crash by simply restarting the MDS and the crash seems
>> to happen when an MDS goes from up:standby to up:replay. The MDS works
>> through a few files in the log before eventually crashing where I've
>> included the logs for this here (this is after I imported the backed up
>> journal which I hope was successful but please let me know if you suspect
>> it wasn't!):
>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
>>
>> With respect to the client logs, are you referring to the clients who are
>> writing to the filesystem? We don't typically run them in any sort of debug
>> mode and we have quite a few machines running our backup system but we can
>> look an hour or so before the first MDS crash (though I don't know if this
>> is when the de-sync occurred). Here are some MDS logs with regards to the
>> initial crash on Saturday morning though which may be helpful:
>>
>>-59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
>>-58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient:
>> _check_auth_rotating have uptodate secrets (they expire after
>> 2024-06-22T05:41:13.091556+0100)
>>-57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2
>> Updating MDS map to version 2529650 from mon.3
>>-56> 2024-06-22T05:41:43.208+0100 7f184de84700

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-28 Thread Ivan Clayson

Hi Dhairya,

I would be more than happy to share our corrupted journal. Has the host 
key changed for drop.ceph.com? The fingerprint I'm being sent is 
7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4 which is different to the 
one in our /usr/share/ceph/known_hosts_drop.ceph.com.


Thank you for your advice as well. We've reset our MDS' journal and are 
currently in the process of a full filesystem scrub which understandably 
is taking quite a bit of time but seems to be progressing through the 
objects fine.


Thank you ever so much for all your help and please do feel free to 
follow up with us if you would like any further details about our crash!


Kindest regards,

Ivan

On 27/06/2024 12:39, Dhairya Parmar wrote:


Hi Ivan,

The solution (which has been successful for us in the past) is to 
reset the journal. This would bring the fs back online and return the 
MDSes to a stable state, but some data would be lost—the data in the 
journal that hasn't been flushed to the backing store would be gone. 
Therefore, you should try to flush out as much journal data as 
possible before resetting the journal.


Here are the steps for this entire process:

1) Bring the FS offline
$ ceph fs fail <fs_name>

2) Recover dentries from journal (run it with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> event recover_dentries summary

3) Reset the journal (again with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset

4) Bring the FS online
$ ceph fs set <fs_name> joinable true

5) Restart the MDSes

6) Perform scrub to ensure consistency of fs
$ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
# you could try a recursive scrub, maybe `ceph tell mds.<fs_name>:0 scrub start / recursive`


Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most 
time-consuming, but the rest should be quick).
* cephfs-journal-tool and metadata OSDs are bound to use a significant 
CPU percentage. This is because cephfs-journal-tool has to swig the 
journal data and flush it out to the backing store, which also makes 
the metadata operations go rampant, resulting in OSDs taking a 
significant percentage of CPU.


Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson  
wrote:


Hi Dhairya,

We can induce the crash by simply restarting the MDS and the crash
seems to happen when an MDS goes from up:standby to up:replay. The
MDS works through a few files in the log before eventually
crashing where I've included the logs for this here (this is after
I imported the backed up journal which I hope was successful but
please let me know if you suspect it wasn't!):

https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log

With respect to the client logs, are you referring to the clients
who are writing to the filesystem? We don't typically run them in
any sort of debug mode and we have quite a few machines running
our backup system but we can look an hour or so before the first
MDS crash (though I don't know if this is when the de-sync
occurred). Here are some MDS logs with regards to the initial
crash on Saturday morning though which may be helpful:

   -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
monclient: tick
   -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10
monclient: _check_auth_rotating have uptodate secrets (they
expire after 2024-06-22T05:41:13.091556+0100)
   -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1
mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
   -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4
mds.0.purge_queue operator():  data pool 6 not found in OSDMap
   -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4
mds.0.purge_queue operator():  data pool 3 not found in OSDMap
   -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
asok(0x5592e7968000) register_command objecter_requests hook
0x5592e78f8800
   -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _renew_subs
   -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10
monclient: _send_mon_message to mon.pebbles-s4 at
v2:10.1.5.134:3300/0 
   -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
log_channel(cluster) update_config to_monitors: true
to_syslog: false syslog_facility:  prio: info to_graylog:
false graylog_host: 127.0.0.1 graylog_port: 12201)
   -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4
mds.0.purge_queue operator():  data pool 6 not found in OSDMap
   -49> 2024-06-22T05:41:43.209+01

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-27 Thread Dhairya Parmar
Ivan, before resetting the journal, could you take the backup of your
journal using `cephfs-journal-tool export` [0] and send it to us through
`ceph-post-file` [1] or any other means you're comfortable with?

[0]
https://docs.ceph.com/en/latest/cephfs/cephfs-journal-tool/#example-journal-import-export
[1] https://docs.ceph.com/en/latest/man/8/ceph-post-file
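
Concretely, the export and upload described in [0] and [1] are along these
lines, with placeholder file and fs names:

$ cephfs-journal-tool --rank=<fs_name>:0 journal export /root/journal-rank0.bin
$ ceph-post-file -d "ceph-users: MDS replay crash, rank 0 journal" /root/journal-rank0.bin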

On Thu, Jun 27, 2024 at 5:09 PM Dhairya Parmar  wrote:

> Hi Ivan,
>
> The solution (which has been successful for us in the past) is to reset
> the journal. This would bring the fs back online and return the MDSes to a
> stable state, but some data would be lost—the data in the journal that
> hasn't been flushed to the backing store would be gone. Therefore, you
> should try to flush out as much journal data as possible before resetting
> the journal.
>
> Here are the steps for this entire process:
>
> 1) Bring the FS offline
> $ ceph fs fail <fs_name>
>
> 2) Recover dentries from journal (run it with every MDS Rank)
> $ cephfs-journal-tool --rank=<fs_name>:<rank> event recover_dentries
> summary
>
> 3) Reset the journal (again with every MDS Rank)
> $ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset
>
> 4) Bring the FS online
> $ ceph fs set <fs_name> joinable true
>
> 5) Restart the MDSes
>
> 6) Perform scrub to ensure consistency of fs
> $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
> # you could try a recursive scrub maybe `ceph tell mds.<fs_name>:0 scrub
> start / recursive`
>
> Some important notes to keep in mind:
> * Recovering dentries will take time (generally, rank 0 is the most
> time-consuming, but the rest should be quick).
> * cephfs-journal-tool and metadata OSDs are bound to use a significant CPU
> percentage. This is because cephfs-journal-tool has to swig the journal
> data and flush it out to the backing store, which also makes the metadata
> operations go rampant, resulting in OSDs taking a significant percentage of
> CPU.
>
> Do let me know how this goes.
>
> On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson 
> wrote:
>
>> Hi Dhairya,
>>
>> We can induce the crash by simply restarting the MDS and the crash seems
>> to happen when an MDS goes from up:standby to up:replay. The MDS works
>> through a few files in the log before eventually crashing where I've
>> included the logs for this here (this is after I imported the backed up
>> journal which I hope was successful but please let me know if you suspect
>> it wasn't!):
>> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
>>
>> With respect to the client logs, are you referring to the clients who are
>> writing to the filesystem? We don't typically run them in any sort of debug
>> mode and we have quite a few machines running our backup system but we can
>> look an hour or so before the first MDS crash (though I don't know if this
>> is when the de-sync occurred). Here are some MDS logs with regards to the
>> initial crash on Saturday morning though which may be helpful:
>>
>>-59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
>>-58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient:
>> _check_auth_rotating have uptodate secrets (they expire after
>> 2024-06-22T05:41:13.091556+0100)
>>-57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2
>> Updating MDS map to version 2529650 from mon.3
>>-56> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue
>> operator():  data pool 6 not found in OSDMap
>>-55> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue
>> operator():  data pool 3 not found in OSDMap
>>-54> 2024-06-22T05:41:43.209+0100 7f184de84700  5 asok(0x5592e7968000)
>> register_command objecter_requests hook 0x5592e78f8800
>>-53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
>> _renew_subs
>>-52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
>> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
>>-51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster)
>> update_config to_monitors: true to_syslog: false syslog_facility:  prio:
>> info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
>>-50> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue
>> operator():  data pool 6 not found in OSDMap
>>-49> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue
>> operator():  data pool 3 not found in OSDMap
>>-48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0
>> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
>>-47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>> handle_mds_map i am now mds.0.2529650
>>-46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>> handle_mds_map state change up:standby --> up:replay
>>-45> 2024-06-22T05:41:43.209+0100 7f184de84700  5
>> mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
>>-44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
>> replay_start
>>-43> 2024-06-22T05:41:43.209+0100 

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-27 Thread Dhairya Parmar
Hi Ivan,

The solution (which has been successful for us in the past) is to reset the
journal. This would bring the fs back online and return the MDSes to a
stable state, but some data would be lost—the data in the journal that
hasn't been flushed to the backing store would be gone. Therefore, you
should try to flush out as much journal data as possible before resetting
the journal.

Here are the steps for this entire process:

1) Bring the FS offline
$ ceph fs fail <fs_name>

2) Recover dentries from journal (run it with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> event recover_dentries summary

3) Reset the journal (again with every MDS Rank)
$ cephfs-journal-tool --rank=<fs_name>:<rank> journal reset

4) Bring the FS online
$ ceph fs set <fs_name> joinable true

5) Restart the MDSes

6) Perform scrub to ensure consistency of fs
$ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
# you could try a recursive scrub, maybe `ceph tell mds.<fs_name>:0 scrub start / recursive`
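
As a concrete illustration only: for the ceph_backup filesystem in this
thread, which has a single active rank (0), the sequence above would look
roughly like this:

$ ceph fs fail ceph_backup
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
$ cephfs-journal-tool --rank=ceph_backup:0 journal reset
$ ceph fs set ceph_backup joinable true
# restart the MDS daemons, then:
$ ceph tell mds.ceph_backup:0 scrub start / recursive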

Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0 is the most
time-consuming, but the rest should be quick).
* cephfs-journal-tool and metadata OSDs are bound to use a significant CPU
percentage. This is because cephfs-journal-tool has to swig the journal
data and flush it out to the backing store, which also makes the metadata
operations go rampant, resulting in OSDs taking a significant percentage of
CPU.

Do let me know how this goes.

On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> We can induce the crash by simply restarting the MDS and the crash seems
> to happen when an MDS goes from up:standby to up:replay. The MDS works
> through a few files in the log before eventually crashing where I've
> included the logs for this here (this is after I imported the backed up
> journal which I hope was successful but please let me know if you suspect
> it wasn't!):
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
>
> With respect to the client logs, are you referring to the clients who are
> writing to the filesystem? We don't typically run them in any sort of debug
> mode and we have quite a few machines running our backup system but we can
> look an hour or so before the first MDS crash (though I don't know if this
> is when the de-sync occurred). Here are some MDS logs with regards to the
> initial crash on Saturday morning though which may be helpful:
>
>-59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
>-58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient:
> _check_auth_rotating have uptodate secrets (they expire after
> 2024-06-22T05:41:13.091556+0100)
>-57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2
> Updating MDS map to version 2529650 from mon.3
>-56> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue
> operator():  data pool 6 not found in OSDMap
>-55> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue
> operator():  data pool 3 not found in OSDMap
>-54> 2024-06-22T05:41:43.209+0100 7f184de84700  5 asok(0x5592e7968000)
> register_command objecter_requests hook 0x5592e78f8800
>-53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _renew_subs
>-52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
>-51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster)
> update_config to_monitors: true to_syslog: false syslog_facility:  prio:
> info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
>-50> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue
> operator():  data pool 6 not found in OSDMap
>-49> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue
> operator():  data pool 3 not found in OSDMap
>-48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0
> apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
>-47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
> handle_mds_map i am now mds.0.2529650
>-46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
> handle_mds_map state change up:standby --> up:replay
>-45> 2024-06-22T05:41:43.209+0100 7f184de84700  5 mds.beacon.pebbles-s2
> set_want_state: up:standby -> up:replay
>-44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
> replay_start
>-43> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
> waiting for osdmap 473739 (which blocklists prior instance)
>-42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
> _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
>-41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2 mds.0.cache Memory
> usage:  total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes
> have caps, 0 caps, 0 caps per inode
>-40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: _renew_subs
>-39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 moncl

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-27 Thread Ivan Clayson

Hi Dhairya,

We can induce the crash by simply restarting the MDS, and it seems to
happen when an MDS goes from up:standby to up:replay. The MDS works
through a few files in the log before eventually crashing; I've
included the logs for this here (this is after I imported the backed-up
journal, which I hope was successful, but please let me know if you
suspect it wasn't!):
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log


With respect to the client logs, are you referring to the clients that
are writing to the filesystem? We don't typically run them in any sort
of debug mode, and we have quite a few machines running our backup
system, but we can look an hour or so before the first MDS crash (though
I don't know if this is when the de-sync occurred). Here are some MDS
logs regarding the initial crash on Saturday morning, though, which may
be helpful:


   -59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
   -58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient:
   _check_auth_rotating have uptodate secrets (they expire after
   2024-06-22T05:41:13.091556+0100)
   -57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2
   Updating MDS map to version 2529650 from mon.3
   -56> 2024-06-22T05:41:43.208+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 6 not found in OSDMap
   -55> 2024-06-22T05:41:43.208+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 3 not found in OSDMap
   -54> 2024-06-22T05:41:43.209+0100 7f184de84700  5
   asok(0x5592e7968000) register_command objecter_requests hook
   0x5592e78f8800
   -53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
   _renew_subs
   -52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
   _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
   -51> 2024-06-22T05:41:43.209+0100 7f184de84700 10
   log_channel(cluster) update_config to_monitors: true to_syslog:
   false syslog_facility:  prio: info to_graylog: false graylog_host:
   127.0.0.1 graylog_port: 12201)
   -50> 2024-06-22T05:41:43.209+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 6 not found in OSDMap
   -49> 2024-06-22T05:41:43.209+0100 7f184de84700  4
   mds.0.purge_queue operator():  data pool 3 not found in OSDMap
   -48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0
   apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
   -47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
   handle_mds_map i am now mds.0.2529650
   -46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
   handle_mds_map state change up:standby --> up:replay
   -45> 2024-06-22T05:41:43.209+0100 7f184de84700  5
   mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
   -44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650
   replay_start
   -43> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 
   waiting for osdmap 473739 (which blocklists prior instance)
   -42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient:
   _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
   -41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2 mds.0.cache
   Memory usage:  total 299012, rss 37624, heap 182556, baseline
   182556, 0 / 0 inodes have caps, 0 caps, 0 caps per inode
   -40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
   _renew_subs
   -39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
   _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
   -38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient:
   handle_get_version_reply finishing 1 version 473739
   -37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening inotable
   -36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening sessionmap
   -35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening mds log
   -34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5 mds.0.log open
   discovering log bounds
   -33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening purge queue (async)
   -32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4
   mds.0.purge_queue open: opening
   -31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
   mds.0.journaler.pq(ro) recover start
   -30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1
   mds.0.journaler.pq(ro) read_head
   -29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: loading open file table (async)
   -28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650
   Booting: 0: opening snap table
   -27> 2024-06-22T05:41:43.224+0100 7f1847677700  4
   mds.0.journalpointer Reading journal pointer '400.'
   -26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient:
   get_auth_request con 

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-25 Thread Dhairya Parmar
On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson  wrote:

> Hi Dhairya,
>
> Thank you for your rapid reply. I tried recovering the dentries for the
> file just before the crash I mentioned, and then splicing those
> transactions out of the journal; that seemed to resolve the issue for that
> inode, but it resulted in the MDS crashing on the next inode in the journal
> when performing replay.
>
The MDS delegates a range of preallocated inodes (in the form of a set,
interval_set<inodeno_t> preallocated_inos) to the clients, so the untracked
entries could be a single inode, some inodes from the range, or in the worst
case all of them, and this is something that even `cephfs-journal-tool`
cannot tell you (since we're talking about MDS internals which aren't
exposed to such tools). That is why you see the "MDS crashing on the next
inode in the journal when performing replay".

An option could be to expose the inode set to some tool or asok command to
identify such inode ranges, but that needs to be discussed. For now, we're
trying to address this in [0]; you can follow the discussion there.

[0] https://tracker.ceph.com/issues/66251
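
In the meantime, a hedged way to see what the journal itself references around the suspect inodes is to dump the events to JSON and search the dump offline; the output path below is just an example, and the dump may show inode numbers in decimal rather than hex, so searching by a path fragment can be easier:

# dump all journal events for rank 0 to a JSON file for offline inspection
$ cephfs-journal-tool --rank=ceph_backup:0 event get json --path /root/ceph_backup.rank0.events.json
# then count entries that touch the affected directory (or grep for the inode number)
$ grep -c 'test_micrographs' /root/ceph_backup.rank0.events.json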

> Removing all the transactions involving the directory housing the files
> that seemed to cause these crashes from the journal only caused the MDS to
> fail to even start replay.
>
> I've rolled back our journal to our original version when the crash first
> happened and the entire MDS log for the crash can be found here:
> https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
>
Awesome, this would help us a ton. Apart from this, would it be possible to
send us client logs?

> Please let us know if you would like any other logs file as we can easily
> induce this crash.
>
Since you can easily induce the crash, could you please share the reproducer,
i.e. the exact actions you take in order to hit this?

> Kindest regards,
>
> Ivan
> On 25/06/2024 09:58, Dhairya Parmar wrote:
>
> Hi Ivan,
>
> This looks to be similar to the issue [0] that we're already addressing at
> [1]. Basically, some out-of-sync event led the client to make use of inodes
> that the MDS wasn't aware of or isn't tracking, hence the crash. It'd be
> really helpful if you could provide us with more logs.
>
> CC @Rishabh Dave  @Venky Shankar  
> @Patrick
> Donnelly  @Xiubo Li 
>
> [0] https://tracker.ceph.com/issues/61009
> [1] https://tracker.ceph.com/issues/66251
> --
> *Dhairya Parmar*
>
> Associate Software Engineer, CephFS
>
> IBM, Inc.
>
> On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson 
> wrote:
>
>> Hello,
>>
>> We have been experiencing a serious issue with our CephFS backup cluster
>> running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
>> (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are
>> constantly in a "replay" or "replay(laggy)" state and keep crashing.
>>
>> We have a single MDS filesystem called "ceph_backup" with 2 standby
>> MDSes along with a 2nd unused filesystem "ceph_archive" (this holds
>> little to no data) where we are using our "ceph_backup" filesystem to
>> backup our data and this is the one which is currently broken. The Ceph
>> health outputs currently are:
>>
>> root@pebbles-s1 14:05 [~]: ceph -s
>>cluster:
>>  id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>>  health: HEALTH_WARN
>>  1 filesystem is degraded
>>  insufficient standby MDS daemons available
>>  1319 pgs not deep-scrubbed in time
>>  1054 pgs not scrubbed in time
>>
>>services:
>>  mon: 4 daemons, quorum
>> pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
>>  mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
>> pebbles-s3, pebbles-s1
>>  mds: 2/2 daemons up
>>  osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
>> remapped pgs
>>
>>data:
>>  volumes: 1/2 healthy, 1 recovering
>>  pools:   7 pools, 2177 pgs
>>  objects: 3.55G objects, 7.0 PiB
>>  usage:   8.9 PiB used, 14 PiB / 23 PiB avail
>>  pgs: 83133528/30006841533 objects misplaced (0.277%)
>>   2090 active+clean
>>   47   active+clean+scrubbing+deep
>>   29   active+remapped+backfilling
>>   8active+remapped+backfill_wait
>>   2active+clean+scrubbing
>>   1active+clean+snaptrim
>>
>>io:
>>  recovery: 1.9 GiB/s, 719 objects/s
>>
>> root@pebbles-s1 14:09 [~]: ceph fs status
>> ceph_backup - 0 clients
>> ===
>> RANK  STATE 

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-25 Thread Ivan Clayson

Hi Dhairya,

Thank you for your rapid reply. I tried recovering the dentries for the
file just before the crash I mentioned, and then splicing those
transactions out of the journal; that seemed to resolve the issue for that
inode, but it resulted in the MDS crashing on the next inode in the journal
when performing replay. Removing from the journal all the transactions
involving the directory housing the files that seemed to cause these
crashes only caused the MDS to fail to even start replay.
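
For reference, a hedged sketch of the kind of per-inode journal surgery described above, using cephfs-journal-tool's event filters; the rank, backup path and decimal inode value (0x1005ba89481 from the crash log, i.e. 1101049402497) are illustrative, and exporting the journal first is strongly advisable:

# keep a backup of the journal before splicing anything out of it
$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup.rank0.journal.bin
# flush the dentries referenced by events touching one inode to the backing store
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries --inode=1101049402497 summary
# then cut those events out of the journal
$ cephfs-journal-tool --rank=ceph_backup:0 event splice --inode=1101049402497 summary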


I've rolled back our journal to our original version when the crash 
first happened and the entire MDS log for the crash can be found here: 
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24


Please let us know if you would like any other logs file as we can 
easily induce this crash.


Kindest regards,

Ivan

On 25/06/2024 09:58, Dhairya Parmar wrote:


Hi Ivan,

This looks to be similar to the issue [0] that we're already
addressing at [1]. Basically, some out-of-sync event led the client
to make use of inodes that the MDS wasn't aware of or isn't tracking,
hence the crash. It'd be really helpful if you could provide us with
more logs.


CC @Rishabh Dave  @Venky Shankar 
 @Patrick Donnelly 
 @Xiubo Li 


[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
*Dhairya Parmar*

Associate Software Engineer, CephFS

IBM, Inc.


On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson  
wrote:


Hello,

We have been experiencing a serious issue with our CephFS backup
cluster
running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
(Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our
filesystem are
constantly in a "replay" or "replay(laggy)" state and keep crashing.

We have a single MDS filesystem called "ceph_backup" with 2 standby
MDSes along with a 2nd unused filesystem "ceph_archive" (this holds
little to no data) where we are using our "ceph_backup" filesystem to
backup our data and this is the one which is currently broken. The
Ceph
health outputs currently are:

    root@pebbles-s1 14:05 [~]: ceph -s
       cluster:
     id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
     health: HEALTH_WARN
     1 filesystem is degraded
     insufficient standby MDS daemons available
     1319 pgs not deep-scrubbed in time
     1054 pgs not scrubbed in time

       services:
     mon: 4 daemons, quorum
    pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
     mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
    pebbles-s3, pebbles-s1
     mds: 2/2 daemons up
     osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
    remapped pgs

       data:
     volumes: 1/2 healthy, 1 recovering
     pools:   7 pools, 2177 pgs
     objects: 3.55G objects, 7.0 PiB
     usage:   8.9 PiB used, 14 PiB / 23 PiB avail
     pgs: 83133528/30006841533 objects misplaced (0.277%)
      2090 active+clean
      47   active+clean+scrubbing+deep
      29   active+remapped+backfilling
      8    active+remapped+backfill_wait
      2    active+clean+scrubbing
      1    active+clean+snaptrim

       io:
     recovery: 1.9 GiB/s, 719 objects/s

    root@pebbles-s1 14:09 [~]: ceph fs status
    ceph_backup - 0 clients
    ===
    RANK  STATE MDS  ACTIVITY   DNS    INOS DIRS CAPS
      0    replay(laggy)  pebbles-s3   0  0 0  0
     POOL    TYPE USED  AVAIL
        mds_backup_fs  metadata  1255G  2780G
    ec82_primary_fs_data    data   0   2780G
       ec82pool  data    8442T  3044T
    ceph_archive - 2 clients
    
    RANK  STATE  MDS ACTIVITY DNS    INOS DIRS CAPS
      0    active  pebbles-s2  Reqs:    0 /s  13.4k  7105 118 2
     POOL    TYPE USED  AVAIL
        mds_archive_fs metadata  5184M  2780G
    ec83_primary_fs_data    data   0   2780G
       ec83pool  data 138T  2767T
    MDS version: ceph version 17.2.7
    (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
    root@pebbles-s1 14:09 [~]: ceph health detail | head
    HEALTH_WARN 1 file

[ceph-users] Re: CephFS MDS crashing during replay with standby MDSes crashing afterwards

2024-06-25 Thread Dhairya Parmar
Hi Ivan,

This looks to be similar to the issue [0] that we're already addressing at
[1]. Basically, some out-of-sync event led the client to make use of inodes
that the MDS wasn't aware of or isn't tracking, hence the crash. It'd be
really helpful if you could provide us with more logs.
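
A hedged sketch of commands commonly used to gather this kind of evidence; the debug settings shown are the standard debug_mds/debug_ms options, and log locations may differ under cephadm:

# list recent daemon crashes and dump the full backtrace/metadata for one of them
$ ceph crash ls
$ ceph crash info <crash_id>

# raise MDS logging before reproducing the crash, then revert afterwards
$ ceph config set mds debug_mds 20
$ ceph config set mds debug_ms 1
# ... reproduce the replay crash and collect /var/log/ceph/ceph-mds.*.log ...
$ ceph config rm mds debug_mds
$ ceph config rm mds debug_ms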

CC @Rishabh Dave  @Venky Shankar
 @Patrick
Donnelly  @Xiubo Li 

[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
*Dhairya Parmar*

Associate Software Engineer, CephFS

IBM, Inc.

On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson  wrote:

> Hello,
>
> We have been experiencing a serious issue with our CephFS backup cluster
> running quincy (version 17.2.7) on a RHEL8-derivative Linux kernel
> (Alma8.9, 4.18.0-513.9.1 kernel) where our MDSes for our filesystem are
> constantly in a "replay" or "replay(laggy)" state and keep crashing.
>
> We have a single MDS filesystem called "ceph_backup" with 2 standby
> MDSes along with a 2nd unused filesystem "ceph_archive" (this holds
> little to no data) where we are using our "ceph_backup" filesystem to
> backup our data and this is the one which is currently broken. The Ceph
> health outputs currently are:
>
> root@pebbles-s1 14:05 [~]: ceph -s
>cluster:
>  id: e3f7535e-d35f-4a5d-88f0-a1e97abcd631
>  health: HEALTH_WARN
>  1 filesystem is degraded
>  insufficient standby MDS daemons available
>  1319 pgs not deep-scrubbed in time
>  1054 pgs not scrubbed in time
>
>services:
>  mon: 4 daemons, quorum
> pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
>  mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4,
> pebbles-s3, pebbles-s1
>  mds: 2/2 daemons up
>  osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37
> remapped pgs
>
>data:
>  volumes: 1/2 healthy, 1 recovering
>  pools:   7 pools, 2177 pgs
>  objects: 3.55G objects, 7.0 PiB
>  usage:   8.9 PiB used, 14 PiB / 23 PiB avail
>  pgs: 83133528/30006841533 objects misplaced (0.277%)
>   2090 active+clean
>   47   active+clean+scrubbing+deep
>   29   active+remapped+backfilling
>   8active+remapped+backfill_wait
>   2active+clean+scrubbing
>   1active+clean+snaptrim
>
>io:
>  recovery: 1.9 GiB/s, 719 objects/s
>
> root@pebbles-s1 14:09 [~]: ceph fs status
> ceph_backup - 0 clients
> ===
> RANK  STATE MDS  ACTIVITY   DNSINOS   DIRS CAPS
>   0replay(laggy)  pebbles-s3   0  0 0  0
>  POOLTYPE USED  AVAIL
> mds_backup_fs  metadata  1255G  2780G
> ec82_primary_fs_datadata   0   2780G
>ec82pool  data8442T  3044T
> ceph_archive - 2 clients
> 
> RANK  STATE  MDS ACTIVITY DNSINOS   DIRS CAPS
>   0active  pebbles-s2  Reqs:0 /s  13.4k  7105118 2
>  POOLTYPE USED  AVAIL
> mds_archive_fs metadata  5184M  2780G
> ec83_primary_fs_datadata   0   2780G
>ec83pool  data 138T  2767T
> MDS version: ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> root@pebbles-s1 14:09 [~]: ceph health detail | head
> HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS
> daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not
> scrubbed in time
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>  fs ceph_backup is degraded
> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons
> available
>  have 0; want 1 more
>
> When our cluster first ran after a reboot, Ceph ran through the 2
> standby MDSes, crashing them all, until it reached the final MDS and is
> now stuck in this "replay(laggy)" state. Putting our MDSes into
> debugging mode, we can see that this MDS crashed when replaying the
> journal for a particular inode (this is the same for all the MDSes and
> they all crash on the same object):
>
> ...
> 2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal
> EMetaBlob.replay for [521,head] had [inode 0x1005ba89481
> [...539,head]
>
> /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/
> auth fragtree_t(*^2 00*^3 0*^
> 4 1*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4
> 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4
> 0*^4 10*^3 1*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3
> 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22
> T05:41:10.213700+0100 1281276=1281276+0) n(v12
> rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=12812