[ceph-users] CephFS HA: mgr finish mon failed to return metadata for mds

2024-05-29 Thread kvesligaj
Hi,

we have a stretch cluster (Reef 18.2.1) with 5 nodes (2 nodes on each side + a 
witness). You can see our daemon placement below.

[admin]
ceph-admin01 labels="['_admin', 'mon', 'mgr']"

[nodes]
[DC1]
ceph-node01 labels="['mon', 'mgr', 'mds', 'osd']"
ceph-node02 labels="['mon', 'rgw', 'mds', 'osd']"
[DC2]
ceph-node03 labels="['mon', 'mgr', 'mds', 'osd']"
ceph-node04 labels="['mon', 'rgw', 'mds', 'osd']"

We have been testing CephFS HA and noticed the following. We run two active MDS 
daemons at all times, and the active MGR is either on the admin node or in one of 
the DCs. When an active MDS and the active MGR are in the same DC and we shut down 
that site, the metadata of one of the MDS daemons can no longer be retrieved, which 
shows up in the logs as:

"mgr finish mon failed to return metadata for mds"

After we power that site back on, the problem persists and the metadata of the MDS 
in question still can't be retrieved with "ceph mds metadata".

Only after I manually fail the MDS daemon in question with "ceph mds fail" is the 
problem resolved and the MDS metadata retrievable again.
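
For reference, the sequence looks roughly like this (the daemon name below is just 
a placeholder, not one from our cluster):

  # metadata lookup fails for the affected daemon after the site outage
  ceph mds metadata mds.cephfs.ceph-node01.abcdef

  # manually failing the daemon lets a standby take over
  ceph mds fail mds.cephfs.ceph-node01.abcdef

  # afterwards the metadata can be retrieved again
  ceph mds metadata mds.cephfs.ceph-node01.abcdef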

My first question: could this be related to the following bug 
(https://tracker.ceph.com/issues/63166)? I can see that it is shown as backported 
to 18.2.1, but I can't find it in the release notes for Reef.

Second question: should this work in the current configuration at all, given that 
the MDS and the MGR are disconnected from the rest of the cluster at the same 
moment?

And the final question: what would be the solution here, and is there any loss of 
data when this happens?

Any help is appreciated.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [MDS] mds stuck in laggy state, CephFS unusable

2023-11-27 Thread kvesligaj
Hi,

we're having a peculiar issue that we discovered during HA/DR testing of our Ceph 
cluster.

Basic info about the cluster: 
Version: Quincy (17.2.6)
5 nodes configured as a stretch cluster (2 DCs plus one arbiter node, which is 
also the admin node for the cluster)
On every node except the admin node we have OSD and MON services. We have 3 MGR 
instances in the cluster.

The specific thing we wanted to test is multiple CephFS filesystems, each with 
multiple MDS daemons (with HA in mind). 
We deployed an MDS on every node, increased max_mds to 2 for every CephFS, and the 
other two MDS daemons are in standby-replay mode (they are automatically configured 
during CephFS creation to follow a specific CephFS via join_fscid).
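
To illustrate, the setup is roughly equivalent to the following commands 
(filesystem and daemon names here are examples, not our real ones):

  # two active MDS daemons per filesystem
  ceph fs set fs1 max_mds 2
  ceph fs set fs2 max_mds 2

  # let the remaining daemons act as standby-replay for their filesystem
  ceph fs set fs1 allow_standby_replay true
  ceph fs set fs2 allow_standby_replay true

  # pin a standby to a specific filesystem (this is what shows up as join_fscid)
  ceph config set mds.fs1.ceph-node02.xyz mds_join_fs fs1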

We ran multiple tests, and when we have only one CephFS it behaves as expected 
(both MDS daemons are in the up:active state and clients can connect and interact 
with the CephFS as if nothing had happened).

When we test with multiple CephFS filesystems (two, for example) and shut down two 
nodes, one of the MDS daemons gets stuck in the up:active (laggy) state. When this 
happens, the affected CephFS is unusable: clients hang, and it stays that way until 
we power the other DC back on. This happens even when there are no clients 
connected to that specific CephFS.
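
For completeness, this is how we observe the stuck state while the DC is down 
(nothing exotic, just the standard status commands):

  ceph fs status
  ceph mds stat
  ceph health detail

The affected rank keeps showing as active but laggy until the other DC is powered 
back on.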

We can provide additional logs and run any tests necessary. We checked the usual 
culprits, and our nodes don't show any excessive CPU or memory usage.

We would appreciate any help.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io