[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-08 Thread Janek Bevendorff
Wow! Distributed epins :) Thanks for trying it. How many sub-directories are under the distributed epin'd directory? (There are a lot of stability problems associated with lots of subtrees that are to be fixed in Pacific, so if the directory is too large, things could get ugly!) Yay, beta
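For readers following along: distributed ephemeral pinning is enabled through an extended attribute on the directory itself. A minimal sketch, assuming a CephFS mounted at /mnt/cephfs; the directory name is purely illustrative:

# setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home
# getfattr -n ceph.dir.pin.distributed /mnt/cephfs/home

Setting the value back to 0 removes the policy again.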

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-07 Thread Patrick Donnelly
On Mon, Dec 7, 2020 at 1:28 PM Janek Bevendorff wrote:
> > This sounds like there is one or a few clients acquiring too many
> > caps. Have you checked this? Are there any messages about the OOM
> > killer? What config changes for the MDS have you made?
> Yes, it's individual clients
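A quick way to check for cap-hungry clients is to list the MDS sessions and compare the num_caps field per client. A sketch, assuming rank 0 answers the tell and that jq is available:

# ceph tell mds.0 session ls | jq '.[] | {id, num_caps}'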

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-07 Thread Janek Bevendorff
> This sounds like there is one or a few clients acquiring too many
> caps. Have you checked this? Are there any messages about the OOM
> killer? What config changes for the MDS have you made?
Yes, it's individual clients acquiring too many caps. I first ran the adjusted recall settings you
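Cap recall is tuned through the mds_recall_* options. A sketch of how such settings are applied; the values below are purely illustrative, not the ones discussed in this thread:

# ceph config set mds mds_recall_max_caps 30000
# ceph config set mds mds_recall_max_decay_rate 1.5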

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-07 Thread Patrick Donnelly
On Sat, Dec 5, 2020 at 5:41 AM Janek Bevendorff wrote:
>
> On 05/12/2020 09:26, Dan van der Ster wrote:
> > Hi Janek,
> >
> > I'd love to hear your standard maintenance procedures. Are you
> > cleaning up those open files outside of "rejoin" OOMs?
>
> No, of course not. But those rejoin problems

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-06 Thread Janek Bevendorff
> (Only one of our test clusters saw this happen so far, during mimic
> days, and this provoked us to move all MDSs to 64GB VMs, with mds
> cache mem limit = 4GB, so there is a large amount of RAM available in
> case it's needed.
Ours are running on machines with 128GB RAM. I tried limits between 4
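The limit referred to here is mds_cache_memory_limit, given in bytes. A sketch of setting a 4 GB limit for all MDS daemons:

# ceph config set mds mds_cache_memory_limit 4294967296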

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-05 Thread Dan van der Ster
On Sat, Dec 5, 2020 at 2:41 PM Janek Bevendorff wrote:
>
> On 05/12/2020 09:26, Dan van der Ster wrote:
> > Hi Janek,
> >
> > I'd love to hear your standard maintenance procedures. Are you
> > cleaning up those open files outside of "rejoin" OOMs?
>
> No, of course not. But those rejoin problems

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-05 Thread Janek Bevendorff
On 05/12/2020 09:26, Dan van der Ster wrote:
> Hi Janek,
>
> I'd love to hear your standard maintenance procedures. Are you
> cleaning up those open files outside of "rejoin" OOMs?
No, of course not. But those rejoin problems happen more often than I'd like them to. It has become much better with

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-05 Thread Dan van der Ster
Hi Janek, I'd love to hear your standard maintenance procedures. Are you cleaning up those open files outside of "rejoin" OOMs? I guess we're pretty lucky with our CephFS's because we have more than 1k clients and it is pretty solid (though the last upgrade had a hiccup decreasing down to

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Janek Bevendorff
This is a very common issue. Deleting mdsX_openfiles.Y has become part of my standard maintenance repertoire. As soon as you have a few more clients and one of them starts opening and closing files in rapid succession (or does other metadata-heavy things), it becomes very likely that the MDS
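The objects in question live in the CephFS metadata pool and are named mdsX_openfiles.Y, one set per MDS rank. A sketch of listing them, assuming the pool name used later in this thread:

# rados -p cephfs_metadata_pool ls | grep openfiles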

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Dan van der Ster
Excellent! For the record, this PR is the plan to fix this: https://github.com/ceph/ceph/pull/36089 (nautilus and octopus PRs here: https://github.com/ceph/ceph/pull/37382 and https://github.com/ceph/ceph/pull/37383). Cheers, Dan
On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov wrote:
>
> Thank you

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Anton Aleksandrov
Thank you very much! This solution helped: Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS. We are back online. Amazing!!! :)
On 04.12.2020 12:20, Dan van der Ster wrote:
> Please also make sure the mds_beacon_grace is high on the mon's too. it

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Dan van der Ster
Please also make sure the mds_beacon_grace is high on the mons too. It doesn't matter which MDS you select to be the running one. Is the process getting killed, restarted? If you're confident that the MDS is getting OOM killed during the rejoin step, then you might find this useful:
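A sketch of raising the grace period on the mons as well; the value is illustrative:

# ceph config set mon mds_beacon_grace 600

On releases without the central config database, injectargs can be used instead:

# ceph tell mon.\* injectargs '--mds_beacon_grace=600'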

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Anton Aleksandrov
Yes, the MDS eats all memory+swap, stays like this for a moment and then frees the memory. mds_beacon_grace was already set to 1800. Also, the other MDS logs this message: "Map has assigned me to become a standby". Does it matter which MDS we stop and which we leave running? Anton On 04.12.2020
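"Map has assigned me to become a standby" only means that daemon is acting as the standby, which is expected when there is a single active MDS. Which daemon is in which state (active, rejoin, standby) can be checked with:

# ceph fs status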

[ceph-users] Re: MDS lost, Filesystem degraded and wont mount

2020-12-04 Thread Dan van der Ster
How many active MDS's did you have? (max_mds == 1, right?) Stop the other two MDS's so you can focus on getting exactly one running. Tail the log file and see what it is reporting. Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS while it is rejoining. Is that single MDS
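A sketch of the checks described above, assuming a filesystem named cephfs and an MDS daemon id of mds2; both names are illustrative:

# ceph fs get cephfs | grep max_mds

and, on the hosts running the extra daemons:

# systemctl stop ceph-mds@mds2
# tail -f /var/log/ceph/ceph-mds.mds2.log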