On 1/21/21 16:51, Dan van der Ster wrote:
Hi all,
During rejoin, an MDS can sometimes go OOM if the openfiles table is too large.
The workaround has been described by Ceph devs as
"rados rm -p cephfs_metadata mds0_openfiles.0".
On our cluster we have several such objects for rank 0:
mds0_openfi
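
For anyone who wants to check which openfiles objects exist for a given rank before deciding whether to remove anything, here is a rough sketch. It only filters object names and removes nothing. The pool name cephfs_metadata comes from the command quoted above; the function name openfiles_objects_for_rank is made up for illustration, and the name pattern (mds<rank>_openfiles.<hex index>) is an assumption based on the object name quoted above, so check it against your own "rados ls" output.

```shell
#!/bin/sh
# Sketch only: print the openfiles table object names for one MDS rank.
# Reads object names on stdin, one per line, and removes nothing.
# Assumption: objects are named mds<rank>_openfiles.<hex index>.
openfiles_objects_for_rank() {
    rank="$1"
    # Refuse anything that is not a plain non-negative integer.
    case "$rank" in
        ''|*[!0-9]*) echo "usage: openfiles_objects_for_rank <rank>" >&2; return 2 ;;
    esac
    grep -E "^mds${rank}_openfiles\.[0-9a-f]+$"
}
```

Typical use would be to pipe the pool listing through it, e.g. "rados -p cephfs_metadata ls | openfiles_objects_for_rank 0", and to review the output by hand before running any "rados rm".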
Just to follow up with an anecdote -- I had asked the question because
we had to do a planned failover of one of our MDSs.
The intervention went well and we didn't need to remove the openfiles
table objects.
We stopped the active mds.0, then the standby took over -- the rejoin
step took around 5 mi