Wow! Distributed epins :) Thanks for trying it. How many
sub-directories are under the distributed epin'd directory? (There are a
number of stability problems associated with large numbers of subtrees
that are due to be fixed in Pacific, so if the directory is too large,
things could get ugly!)
Yay, beta
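For anyone following along: distributed ephemeral pinning is enabled
with an extended attribute on the directory, and the subtree count asked
about above is the number of immediate sub-directories. A minimal
sketch, assuming a hypothetical CephFS mount at /mnt/cephfs:

  # enable distributed ephemeral pinning on a directory (Octopus and later)
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home

  # count the immediate sub-directories, i.e. the subtrees being spread
  # across the active MDS ranks
  find /mnt/cephfs/home -mindepth 1 -maxdepth 1 -type d | wc -l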
On Mon, Dec 7, 2020 at 1:28 PM Janek Bevendorff wrote:
>
>
> > This sounds like there is one or a few clients acquiring too many
> > caps. Have you checked this? Are there any messages about the OOM
> > killer? What config changes for the MDS have you made?
>
> Yes, it's individual clients

> This sounds like there is one or a few clients acquiring too many
> caps. Have you checked this? Are there any messages about the OOM
> killer? What config changes for the MDS have you made?

Yes, it's individual clients acquiring too many caps. I first ran the
adjusted recall settings you
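A quick way to see which clients hold the most caps is to dump the MDS
sessions; a sketch, assuming rank 0 and the JSON field names from
Nautilus-era releases:

  # list client sessions on rank 0, sorted by cap count (highest last)
  ceph tell mds.0 session ls | jq -r '.[] | "\(.num_caps) \(.id)"' | sort -n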
On Sat, Dec 5, 2020 at 5:41 AM Janek Bevendorff wrote:
>
> On 05/12/2020 09:26, Dan van der Ster wrote:
> > Hi Janek,
> >
> > I'd love to hear your standard maintenance procedures. Are you
> > cleaning up those open files outside of "rejoin" OOMs ?
>
> No, of course not. But those rejoin problems
(Only one of our test clusters saw this happen so far, during Mimic
days, and this provoked us to move all MDSs to 64GB VMs, with mds
cache mem limit = 4GB, so there is a large amount of RAM available in
case it's needed.)
Ours are running on machines with 128GB RAM. I tried limits between 4
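For reference, that 4GB cache limit corresponds to a setting like the
following (value in bytes); the MDS's resident memory routinely exceeds
this limit, which is why the hosts above keep so much headroom:

  # cap the MDS cache at 4 GiB
  ceph config set mds mds_cache_memory_limit 4294967296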
On 05/12/2020 09:26, Dan van der Ster wrote:
> Hi Janek,
>
> I'd love to hear your standard maintenance procedures. Are you
> cleaning up those open files outside of "rejoin" OOMs ?

No, of course not. But those rejoin problems happen more often than I'd
like them to. It has become much better with
Hi Janek,
I'd love to hear your standard maintenance procedures. Are you
cleaning up those open files outside of "rejoin" OOMs ?
I guess we're pretty lucky with our CephFS's because we have more than
1k clients and it is pretty solid (though the last upgrade had a
hiccup decreasing down to
This is a very common issue. Deleting mdsX_openfiles.Y has become part of
my standard maintenance repertoire. As soon as you have a few more
clients and one of them starts opening and closing files in rapid
succession (or does other metadata-heavy things), it becomes very likely
that the MDS
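For anyone wanting to try the same cleanup: the open file table objects
live in the metadata pool, one set per MDS rank (the X in
mdsX_openfiles.Y), split across numbered objects (the Y). A sketch,
using the pool name that appears later in this thread:

  # enumerate the open file table objects
  rados -p cephfs_metadata_pool ls | grep openfiles

  # with all MDSs stopped, remove the table for rank 0
  rados -p cephfs_metadata_pool rm mds0_openfiles.0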
Excellent!
For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)
Cheers, Dan
On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov wrote:
>
> Thank you
Thank you very much! This solution helped:
Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.
We are back online. Amazing!!! :)
On 04.12.2020 12:20, Dan van der Ster wrote:
Please also make sure the mds_beacon_grace is high on the mons too.
It doesn't matter which mds you select to be the running one.
Is the process getting killed, restarted?
If you're confident that the mds is getting OOM killed during rejoin
step, then you might find this useful:
Yes, the MDS eats all memory+swap, stays like this for a moment and
then frees the memory.
mds_beacon_grace was already set to 1800.
The other MDSs also show this message: "Map has assigned me to become a
standby."
Does it matter which MDS we stop and which we leave running?
Anton
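To confirm whether the kernel's OOM killer is involved (generic Linux
checks, nothing Ceph-specific):

  # look for OOM killer activity in the kernel log
  dmesg -T | grep -i 'out of memory'
  journalctl -k | grep -i oom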
On 04.12.2020
How many active MDS's did you have? (max_mds == 1, right?)
Stop the other two MDS's so you can focus on getting exactly one running.
Tail the log file and see what it is reporting.
Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
while it is rejoining.
Is that single MDS
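Putting those steps together, a rough sketch of the recovery sequence;
the filesystem name and daemon IDs here are hypothetical:

  # give the rejoining MDS time before the mon marks it failed
  ceph config set global mds_beacon_grace 600

  # make sure only one active rank is wanted
  ceph fs set cephfs max_mds 1

  # stop the other MDS daemons so exactly one keeps rejoining
  systemctl stop ceph-mds@mds2 ceph-mds@mds3

  # watch the surviving MDS's log while it rejoins
  tail -f /var/log/ceph/ceph-mds.mds1.log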