On Fri, Oct 23, 2020 at 9:02 AM David C wrote:
Success!
I remembered I had a server I'd taken out of the cluster to
investigate some issues, which had some good-quality 800 GB Intel DC
SSDs. I dedicated an entire drive to swap, tuned up min_free_kbytes,
added an MDS to that server, and let it run. It took 3 - 4 hours, but
the MDS eventually came back online.
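For reference, the swap-plus-min_free_kbytes setup described above can be sketched roughly as follows. The device name and the min_free_kbytes value are illustrative placeholders, not values from this thread:

```shell
# Dedicate an entire drive to swap (/dev/sdX is a placeholder; use the
# actual device of the spare SSD, and make sure it holds no data).
mkswap /dev/sdX
swapon /dev/sdX
swapon --show

# Raise the kernel's reserved free memory so the host stays responsive
# under heavy swapping. The 4 GB here is purely illustrative.
sysctl -w vm.min_free_kbytes=4194304
```

The idea is simply to give the replaying MDS far more virtual memory than physical RAM, at the cost of a much slower replay.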
The post was titled "mds behind on trimming - replay until memory exhausted".
> Load up with swap and try the up:replay route.
> Set the beacon to 10 until it finishes.
Good point! The MDS will not send beacons for a long time. The same was
necessary in the other case.
Good luck!
He could quickly add sufficient swap and the MDS managed to come up. It
took a long time, but that might still be faster than getting more RAM,
and it will not lose data.
> > Your clients will not be able to do much, if anything, during recovery though.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> ________________________________
> From: Dan van der Ster
> Sent: 22 October 2020 18:11:57
> To: David C
> Cc: ceph-devel; ceph-users
> Subject: [ceph-users] Re: Urgent help needed please - MDS offline
I assume you aren't able to quickly double the RAM on this MDS? Or fail
over to a new MDS with more RAM?
Failing that, you shouldn't reset the journal without recovering
dentries; otherwise the cephfs_data objects won't be consistent with
the metadata.
The full procedure to be used is here:
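For context, the dentry-recovery sequence from the CephFS disaster-recovery documentation looks roughly like this. This is a last-resort sketch, not something to run casually, and exact behaviour may differ across releases:

```shell
# 1. Take a backup of the journal before touching anything.
cephfs-journal-tool journal export backup.bin

# 2. Write the dentries held in the journal back into the metadata pool.
cephfs-journal-tool event recover_dentries summary

# 3. Only after dentries are recovered is it (relatively) safe to reset
#    the journal; resetting first is what makes data inconsistent.
cephfs-journal-tool journal reset
```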
I'm pretty sure it's replaying the same ops every time, the last
"EMetaBlob.replay updated dir" before it dies is always referring to
the same directory. Although interestingly that particular dir shows
up in the log thousands of times - the dir appears to be where a
desktop app is doing some
I wouldn't adjust it.
Do you have the impression that the mds is replaying the exact same ops every
time the mds is restarting? or is it progressing and trimming the
journal over time?
The only other advice I have is that 12.2.10 is quite old, and might
miss some important replay/mem fixes.
I've not touched the journal segments, current value of
mds_log_max_segments is 128. Would you recommend I increase (or
decrease) that value? And do you think I should change
mds_log_max_expiring to match that value?
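For anyone following along, these values can be read from the running daemon via the admin socket on the MDS host; something like the following, where mds.a is a placeholder for the actual daemon name:

```shell
# Query the running MDS for its journal/trimming settings.
ceph daemon mds.a config get mds_log_max_segments
ceph daemon mds.a config get mds_log_max_expiring
```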
On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster wrote:
You could decrease the mds_cache_memory_limit but I don't think this
will help here during replay.
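If you did want to try it, the cache limit can be changed at runtime through the admin socket; a sketch, with a placeholder daemon name and an illustrative 4 GB value:

```shell
# Lower the MDS cache target without restarting the daemon.
ceph daemon mds.a config set mds_cache_memory_limit 4294967296
ceph daemon mds.a config get mds_cache_memory_limit
```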
You can see a related tracker here: https://tracker.ceph.com/issues/47582
This is possibly caused by replaying a very large journal. Did you
increase the journal segments?
-- dan
Dan, many thanks for the response.
I was going down the route of looking at mds_beacon_grace, but I now
realise that when I start my MDS, it swallows up memory rapidly and it
looks like the oom-killer is eventually killing the mds. With debug
upped to 10, I can see it's doing EMetaBlob.replays on
You can disable that beacon by increasing mds_beacon_grace to 300 or
600. This will stop the mon from failing that mds over to a standby.
I don't know if that is set on the mon or mgr, so I usually set it on both.
(You might as well disable the standby too -- no sense in something
failing back and
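Setting it on both, as described, might look like the following. The 600-second value is from the thread; the mgr form of injectargs and the standby unit name are assumptions to adapt to your cluster:

```shell
# Stop the mon (and mgr, to be safe) from failing the replaying MDS
# over to a standby while it is silent.
ceph tell mon.\* injectargs '--mds_beacon_grace=600'
ceph tell mgr.\* injectargs '--mds_beacon_grace=600'

# Optionally stop the standby MDS so nothing fails back and forth
# mid-replay (unit name is a placeholder).
systemctl stop ceph-mds@standby-hostname
```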