Hi,
  I have a Ceph 12.2.5 cluster running on 4 CentOS 7.3 servers with kernel
4.17.0, consisting of 3 mons, 16 osds and 2 mds (1 active + 1 standby).
Several clients mount CephFS with the kernel client: Client A runs kernel
4.4.145 and the others run kernel 4.12.8. All of them use ceph client
version 0.94.
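
For reference, the versions above can be cross-checked roughly like this
(run the ceph commands from an admin node and the rest on a client; the
package name assumes CentOS):

  ceph versions            # per-daemon ceph versions (luminous and later)
  ceph features            # release/feature bits of connected clients
  # On a kernel-mount client the version that matters is the kernel module,
  # not the ceph-common package:
  uname -r
  modinfo ceph | head
  rpm -q ceph-common       # userspace tools version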
  My mount command is something like 'mount -t ceph mon1:6789:/dir1
/mnt/dir1 -o name=user1,secretfile=user1.secret'. Client A mounts with a
different ceph user than the other clients do. While I was copying files on
Client A yesterday, its mount hung and could no longer be unmounted. I then
restarted the mds service and found that all the other clients were hung as
well. The mount setup is sketched below, followed by the logs.
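
A minimal sketch of that mount setup (the secretfile path is a placeholder;
the key itself comes from 'ceph auth get-key'):

  # user1.secret holds only the base64 key for client.user1, e.g.:
  ceph auth get-key client.user1 > /etc/ceph/user1.secret
  # then on the client:
  mount -t ceph mon1:6789:/dir1 /mnt/dir1 -o name=user1,secretfile=/etc/ceph/user1.secret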
*ceph.audit.log:*

2018-08-06 10:04:14.345909 7f8a9fa27700  0 log_channel(cluster) log [WRN] :
1 slow requests, 1 included below; oldest blocked for > 32.978931 secs

2018-08-06 10:04:14.345936 7f8a9fa27700  0 log_channel(cluster) log [WRN] :
slow request 32.978931 seconds old, received at 2018-08-06 10:03:41.366871:
client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6
2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed
to rdlock, waiting

2018-08-06 10:04:44.346568 7f8a9fa27700  0 log_channel(cluster) log [WRN] :
1 slow requests, 1 included below; oldest blocked for > 62.979643 secs

2018-08-06 10:04:44.346593 7f8a9fa27700  0 log_channel(cluster) log [WRN] :
slow request 62.979643 seconds old, received at 2018-08-06 10:03:41.366871:
client_request(client.214486:3553922 getattr pAsLsXsFs #0x10000259db6
2018-08-06 10:03:44.346116 caller_uid=0, caller_gid=99{}) currently failed
to rdlock, waiting

2018-08-06 10:04:44.347651 7f8a9fa27700  0 log_channel(cluster) log [WRN] :
client.214486 isn't responding to mclientcaps(revoke), ino 0x10000259db6
pending pFc issued pFcb, sent 62.980452 seconds ago

…

2018-08-06 12:59:24.589157 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067
257 : cluster [WRN] client.214486 isn't responding to mclientcaps(revoke),
ino 0x1000025a20d pending pFc issued pFcb, sent 7683.252197 seconds ago

2018-08-06 13:00:00.000152 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8150 :
cluster [WRN] overall HEALTH_WARN 1 clients failing to respond to
capability release; 1 MDSs report slow requests

2018-08-06 13:21:14.618192 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067
258 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for
> 11853.251231 secs

2018-08-06 13:21:14.618203 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067
259 : cluster [WRN] slow request 7683.171918 seconds old, received at
2018-08-06 11:13:11.446184: client_request(client.213537:308353
setfilelockrule 2, type 2, owner 12714292720879014315, pid 19091, start 0,
length 0, wait 1 #0x100000648cc 2018-08-06 11:13:11.445425 caller_uid=48,
caller_gid=48{}) currently acquired locks

2018-08-06 13:24:59.623303 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067
260 : cluster [WRN] 12 slow requests, 1 included below; oldest blocked for
> 12078.256355 secs

2018-08-06 13:24:59.623316 mds.ceph-mon1 mds.0 10.211.121.61:6804/4260949067
261 : cluster [WRN] slow request 7683.023058 seconds old, received at
2018-08-06 11:16:56.600168: client_request(client.213537:308354
setfilelockrule 2, type 2, owner 12714292687012008619, pid 19092, start 0,
length 0, wait 1 #0x100000648cc 2018-08-06 11:16:56.599432 caller_uid=48,
caller_gid=48{}) currently acquired locks

*ceph-mds.log:*

2018-08-06 15:09:57.700198 7f8a9ea25700 -1 received  signal: Terminated
from  PID: 1 task name: /usr/lib/systemd/systemd --switched-root --system
--deserialize 21  UID: 0

2018-08-06 15:09:57.700228 7f8a9ea25700 -1 mds.ceph-mon1 *** got signal
Terminated ***

2018-08-06 15:09:57.700232 7f8a9ea25700  1 mds.ceph-mon1 suicide.  wanted
state up:active

2018-08-06 15:09:57.704117 7f8a9ea25700  1 mds.0.52 shutdown: shutting down
rank 0

2018-08-06 15:10:48.244347 7fa6dee9e1c0  0 set uid:gid to 167:167
(ceph:ceph)

2018-08-06 15:10:48.244368 7fa6dee9e1c0  0 ceph version 12.2.5
(cad919881333ac92274171586c827e01f554a70a) luminous (stable), process
(unknown), pid 2683453

2018-08-06 15:10:48.246713 7fa6dee9e1c0  0 pidfile_write: ignore empty
--pid-file

2018-08-06 15:10:52.753614 7fa6d7d62700  1 mds.ceph-mon1 handle_mds_map
standby

*ceph.log:*

2018-08-06 15:09:57.792010 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8158 :
cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)

2018-08-06 15:09:57.792151 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8159 :
cluster [INF] Health check cleared: MDS_CLIENT_LATE_RELEASE (was: 1 clients
failing to respond to capability release)

2018-08-06 15:09:57.792244 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8160 :
cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report
slow requests)

2018-08-06 15:09:57.942937 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8163 :
cluster [INF] Standby daemon mds.ceph-mds assigned to filesystem cephfs as
rank 0

2018-08-06 15:09:57.943174 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8164 :
cluster [WRN] Health check failed: insufficient standby MDS daemons
available (MDS_INSUFFICIENT_STANDBY)

2018-08-06 15:10:51.601347 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8184 :
cluster [INF] daemon mds.ceph-mds is now active in filesystem cephfs as
rank 0

2018-08-06 15:10:52.563221 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8186 :
cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is
degraded)

2018-08-06 15:10:52.563320 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8187 :
cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was:
insufficient standby MDS daemons available)

2018-08-06 15:10:52.563371 mon.ceph-mon1 mon.0 10.211.121.61:6789/0 8188 :
cluster [INF] Cluster is now healthy

2018-08-06 15:10:49.574055 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127
10 : cluster [WRN] evicting unresponsive client docker38 (213525), after
waiting 45 seconds during MDS startup

2018-08-06 15:10:49.574168 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127
11 : cluster [WRN] evicting unresponsive client docker74 (213534), after
waiting 45 seconds during MDS startup

2018-08-06 15:10:49.574259 mds.ceph-mds mds.0 10.211.132.103:6804/3490526127
12 : cluster [WRN] evicting unresponsive client docker73 (213537), after
waiting 45 seconds during MDS startup


  'client.214486:3553922' is Client A, and docker38, 73 and 74 are the other
clients. All of the clients are hung: I cannot umount, ls or cd into the
mounted directory. I think docker38, 73 and 74 were evicted because I
restarted the MDS without any barrier operations (see this
<http://docs.ceph.com/docs/luminous/cephfs/eviction/#background-blacklisting-and-osd-epoch-barrier>).
But how did Client A end up hung in the first place? Is there any way to deal
with the hung mount other than rebooting the server? The non-reboot steps I
can think of are sketched below.
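
Sketch of the non-reboot steps I would try (I'm not sure any of them actually
help here; the eviction/blacklist commands are the ones from the luminous
docs linked above, and <addr:port/nonce> is a placeholder):

  # On the client with the hung mount:
  umount -f /mnt/dir1     # force umount; may still block on a dead session
  umount -l /mnt/dir1     # lazy umount; at least detaches the mount point

  # On an admin node, inspect and clean up the stuck client session:
  ceph tell mds.0 client ls                 # list client sessions
  ceph tell mds.0 client evict id=214486    # evict the stuck client by id
  ceph osd blacklist ls                     # check whether the client was blacklisted
  ceph osd blacklist rm <addr:port/nonce>   # remove the entry so it can reconnect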


Thanks
