Re: [ceph-users] cephfs clients hanging multi mds to single mds

Burkhard Linke Mon, 01 Oct 2018 11:13:24 -0700

Hi,

we also experience hanging clients after MDS restarts; in our case we only use a single active MDS server, and the clients are actively blacklisted by the MDS server after restart. It usually happens if the clients are not responsive during MDS restart (e.g. being very busy).

You can check whether this is the case in your setup by inspecting the blacklist ('ceph osd blacklist ls'). It should print the connections which are currently blacklisted.
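To look for one particular client in a long blacklist, a small filter helps. This is only a sketch: the function name blacklisted_for_ip is made up for illustration, and it assumes each entry printed by 'ceph osd blacklist ls' starts with "<ip>:<port>/<nonce>" followed by the expiry time.

```shell
# Hypothetical helper: print blacklist entries belonging to one client IP.
# Reads the output of 'ceph osd blacklist ls' on stdin.
blacklisted_for_ip() {
    ip="$1"
    # Assumption: the first field looks like "<ip>:<port>/<nonce>".
    # Matching on "<ip>:" avoids confusing 10.0.0.5 with 10.0.0.55.
    awk -v ip="$ip" 'index($1, ip ":") == 1 { print }'
}
```

Usage would be along the lines of 'ceph osd blacklist ls | blacklisted_for_ip 10.0.0.5', with the client's own address in place of the example IP. No output means no entry for that address was found.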

You can also remove entries ('ceph osd blacklist rm ...'), but be warned that the mechanism is there for a reason. Removing a blacklisted entry might result in file corruption if client and MDS server disagree about the current state. Use at your own risk.

We were also trying a multi active setup after upgrading to luminous, but we were running into the same problem with the same error message. It was probably due to old kernel clients, so in case of kernel based cephfs I would recommend upgrading to the latest available kernel.

As another approach you can check the current state of the cephfs client, either by using the daemon socket in case of ceph-fuse, or the debug information in /sys/kernel/debug/ceph/... for the kernel client.
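For the kernel client, something like the following sketch can collect the pending requests of every mounted cephfs instance on a host. The function name dump_kclient_state is made up for illustration; the sketch assumes debugfs is mounted, that each client has its own directory below /sys/kernel/debug/ceph containing the files mdsc (pending MDS requests) and osdc (pending OSD requests), and that it is run as root.

```shell
# Hypothetical helper: print pending MDS/OSD requests of all kernel clients.
# The optional first argument overrides the debugfs base directory.
dump_kclient_state() {
    base="${1:-/sys/kernel/debug/ceph}"
    for dir in "$base"/*/; do
        # Skip the unexpanded glob if no client directory exists.
        [ -d "$dir" ] || continue
        for f in mdsc osdc; do
            [ -r "$dir$f" ] || continue
            echo "== $dir$f"
            cat "$dir$f"
        done
    done
}
```

Requests that stay listed in mdsc across several runs are the ones the client is stuck on, and the entries show which MDS they are waiting for. For ceph-fuse the comparable information should be available from the client's admin socket (e.g. 'ceph daemon <path to the client .asok> mds_requests'); the socket path depends on your configuration.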


Regards,

Burkhard


On 01.10.2018 18:34, Jaime Ibar wrote:

Hi all,
we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled multi mds and after a few hours these errors started showing up:
2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds old, received at 2018-09-28 09:40:16.155841: client_request(client.31059144:8544450 getattr Xs #0$100002e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) currently failed to authpin local pins
2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds old, received at 2018-09-28 10:53:03.203476: client_request(client.31059144:9080057 lookup #0x100000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 5 clients failing to respond to cache pressure; 1 MDSs report slow requests,
Due to this, we decided to go back to single mds (as it worked before). However, the clients pointing to mds.1 started hanging, while the ones pointing to mds.0 worked fine.
Then we tried to enable multi mds again and the clients pointing to mds.1 went back online, but the ones pointing to mds.0 stopped working.
Today we tried to go back to single mds, but this error was preventing ceph from disabling the second active mds (mds.1):
2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client XXXXX: (30108925), after 68213.084174 seconds
After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck in stopping state forever due to the above error), waited for it to become active again, unmounted the problematic clients, waited for the cluster to be healthy and tried to go back to single mds again.
Apparently this worked with some of the clients. We tried to enable multi mds again to bring the faulty clients back, however no luck this time, and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
Not sure what else we can try to bring the hanging clients back without rebooting, as they're in production and rebooting is not an option.
Does anyone know how we can deal with this, please?

Thanks

Jaime


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com