Hi, I have a Ceph 12.2.5 cluster running on four CentOS 7.3 servers with kernel 4.17.0, comprising 3 mons, 16 OSDs, and 2 MDS daemons (1 active + 1 standby). Several clients mount CephFS with the kernel client. Client A runs kernel 4.4.145, the others run kernel 4.12.8, and all of them use ceph client version 0.94. My mount command is something like 'mount -t ceph mon1:6789:/dir1 /mnt/dir1 -o name=user1,secretfile=user1.secret'. Client A authenticates with a different Ceph user than the other clients. Yesterday, while I was copying files on client A, its mount hung and could no longer be unmounted. I then restarted the MDS service, after which I found that all of the other clients were hung as well.
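In case it helps with diagnosis, I can still gather state from the stuck kernel client through debugfs, for example (assuming debugfs is mounted under /sys/kernel/debug; exact file names from memory):

    # pending MDS/OSD requests and caps currently held by the kernel client
    cat /sys/kernel/debug/ceph/*/mdsc
    cat /sys/kernel/debug/ceph/*/osdc
    cat /sys/kernel/debug/ceph/*/caps
    # kernel messages from the ceph/libceph modules
    dmesg | grep -i ceph

Let me know if any of that output would be useful.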
Here are the logs.

*ceph.audit.log:*
2018-08-06 10:04:14.345909 7f8a9fa27700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 32.978931 secs
2018-08-06 10:04:14.345936 7f8a9fa27700 0 log_channel(cluster) log [WRN] : slow request 32.978931 seconds old, received at 2018-08-06 10:03:41.366871: client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6 2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed to rdlock, waiting
2018-08-06 10:04:44.346568 7f8a9fa27700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 62.979643 secs
2018-08-06 10:04:44.346593 7f8a9fa27700 0 log_channel(cluster) log [WRN] : slow request 62.979643 seconds old, received at 2018-08-06 10:03:41.366871: client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6 2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed to rdlock, waiting
2018-08-06 10:04:44.347651 7f8a9fa27700 0 log_channel(cluster) log [WRN] : client.214486 isn't responding to mclientcaps(revoke), ino 0x10000259db6 pending pFc issued pFcb, sent 62.980452 seconds ago
…
2018-08-06 12:59:24.589157 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 257 : cluster [WRN] client.214486 isn't responding to mclientcaps(revoke), ino 0x1000025a20d pending pFc issued pFcb, sent 7683.252197 seconds ago
2018-08-06 13:00:00.000152 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8150 : cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests
2018-08-06 13:21:14.618192 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 258 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for > 11853.251231 secs
2018-08-06 13:21:14.618203 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 259 : cluster [WRN] slow request 7683.171918 seconds old, received at 2018-08-06 11:13:11.446184: client_request(client.213537:308353 setfilelock rule 2, type 2, owner 12714292720879014315, pid 19091, start 0, length 0, wait 1 #0x100000648cc 2018-08-06 11:13:11.445425 caller_uid=48, caller_gid=48{}) currently acquired locks
2018-08-06 13:24:59.623303 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 260 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for > 12078.256355 secs
2018-08-06 13:24:59.623316 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067 261 : cluster [WRN] slow request 7683.023058 seconds old, received at 2018-08-06 11:16:56.600168: client_request(client.213537:308354 setfilelock rule 2, type 2, owner 12714292687012008619, pid 19092, start 0, length 0, wait 1 #0x100000648cc 2018-08-06 11:16:56.599432 caller_uid=48, caller_gid=48{}) currently acquired locks

*ceph-mds.log:*
2018-08-06 15:09:57.700198 7f8a9ea25700 -1 received signal: Terminated from PID: 1 task name: /usr/lib/systemd/systemd --switched-root --system --deserialize 21 UID: 0
2018-08-06 15:09:57.700228 7f8a9ea25700 -1 mds.ceph-mon1 *** got signal Terminated ***
2018-08-06 15:09:57.700232 7f8a9ea25700 1 mds.ceph-mon1 suicide. wanted state up:active
2018-08-06 15:09:57.704117 7f8a9ea25700 1 mds.0.52 shutdown: shutting down rank 0
2018-08-06 15:10:48.244347 7fa6dee9e1c0 0 set uid:gid to 167:167 (ceph:ceph)
2018-08-06 15:10:48.244368 7fa6dee9e1c0 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 2683453
2018-08-06 15:10:48.246713 7fa6dee9e1c0 0 pidfile_write: ignore empty --pid-file
2018-08-06 15:10:52.753614 7fa6d7d62700 1 mds.ceph-mon1 handle_mds_map standby

*ceph.log:*
2018-08-06 15:09:57.792010 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8158 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-06 15:09:57.792151 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8159 : cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients failing to respond to capability release)
2018-08-06 15:09:57.792244 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8160 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2018-08-06 15:09:57.942937 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8163 : cluster [INF] Standby daemon mds.ceph-mds assigned to filesystem cephfs as rank 0
2018-08-06 15:09:57.943174 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8164 : cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
2018-08-06 15:10:51.601347 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8184 : cluster [INF] daemon mds.ceph-mds is now active in filesystem cephfs as rank 0
2018-08-06 15:10:52.563221 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8186 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2018-08-06 15:10:52.563320 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8187 : cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons available)
2018-08-06 15:10:52.563371 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8188 : cluster [INF] Cluster is now healthy
2018-08-06 15:10:49.574055 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 10 : cluster [WRN] evicting unresponsive client docker38 (213525), after waiting 45 seconds during MDS startup
2018-08-06 15:10:49.574168 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 11 : cluster [WRN] evicting unresponsive client docker74 (213534), after waiting 45 seconds during MDS startup
2018-08-06 15:10:49.574259 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127 12 : cluster [WRN] evicting unresponsive client docker73 (213537), after waiting 45 seconds during MDS startup

client.214486 (seen above in client.214486:3553922) is client A, and docker38, docker73, and docker74 are the other clients. All of the clients are hung: I cannot umount, ls, or cd into the mounted directories. I think docker38, docker73, and docker74 were evicted because I restarted the MDS without any barrier operations (see <http://docs.ceph.com/docs/luminous/cephfs/eviction/#background-blacklisting-and-osd-epoch-barrier>). But why did client A hang in the first place? And is there any way to recover a hung mounted directory other than rebooting the server?
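Based on the eviction documentation linked above, what I am considering trying next is roughly the following (command forms are from memory, so please correct me if any of them are wrong):

    # on the MDS host: look at sessions and in-flight requests
    ceph daemon mds.ceph-mon1 session ls
    ceph daemon mds.ceph-mon1 dump_ops_in_flight
    # evict client A explicitly, then check for and remove its blacklist entry
    ceph tell mds.0 client evict id=214486
    ceph osd blacklist ls
    ceph osd blacklist rm <addr-of-client-A>    # placeholder address
    # then on client A: force/lazy unmount and remount
    umount -f /mnt/dir1 || umount -l /mnt/dir1

Is that a sane approach, or would it make things worse? Thanks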