Hi Ceph(FS)ers...
I am currently running in production the following environment:
- ceph/cephfs 10.2.2.
- All infrastructure is on the same version (RADOS cluster, mons,
mds and cephfs clients).
- We mount cephfs using ceph-fuse.
Since yesterday, our cluster has been in a warning state with the
message "_mds0: Many clients (X) failing to respond to cache pressure_".
X has been changing over time, from ~130 to ~70. I am able to
correlate the appearance of this message with bursts of jobs in our
cluster.
This subject has been discussed on the mailing list many times, and
normally the recipe is to look for something wrong in the clients.
So, I have tried to look at the clients first:
1) I've started to loop through all my clients, running 'ceph
--admin-daemon /var/run/ceph/ceph-client.mount_user.asok status' to
get the inode_count reported by each client, collected into all.txt
(a sketch of the loop follows this step). Summing them up:
    $ grep inode_count all.txt | awk '{print $2}' | sed 's/,//g' | awk '{s+=$1} END {print s}'
    2407659
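For completeness, all.txt was gathered with a loop roughly along
these lines (the clients.txt host list and passwordless ssh are
assumptions of this sketch; the asok path is the one from above):
    $ for h in $(cat clients.txt); do
    >     ssh $h 'ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok status'
    > done > all.txt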
2) I've then compared that with the number of inodes the mds holds
in its cache (obtained via a perf dump):
    "inode_max": 2000000 and "inodes": 2413826
3) I've tried to find out how many clients hold more inodes than the
default client cache size (16384), and got:
    $ for i in $(grep inode_count all.txt | awk '{print $2}' | sed 's/,//g'); do
    >     if [ $i -ge 16384 ]; then echo $i; fi
    > done | wc -l
    27
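For reference, the same count in a single awk pass (equivalent to
the loop above):
    $ grep inode_count all.txt | awk '{gsub(/,/, "", $2)} $2+0 >= 16384 {n++} END {print n}'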
4) My conclusion is that the bulk of the inodes is held by a handful
of machines. However, while the majority of them are running user
jobs, others are not doing anything at all. For example, an idle
machine (no users logged in, no jobs running, updatedb configured
not to scan the cephfs filesystem) reported more than 300000 inodes.
To reclaim those inodes, I had to unmount and remount cephfs on that
machine (commands below).
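The remount itself was nothing special; roughly (the /cephfs mount
point and mon address are placeholders for our real values):
    $ umount /cephfs
    $ ceph-fuse -m <mon-host>:6789 /cephfs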
5) Based on these observations, I suspect that there are still some
problems in the ceph-fuse client regarding the release of these
inodes (or it happens at a very slow rate).
However, I also do not completely understand what is happening on the
server side:
6) The current memory usage of my mds is the following:
      PID USER PR NI    VIRT    RES   SHR S %CPU %MEM    TIME+ COMMAND
    17831 ceph 20  0 13.667g 0.012t 10048 S 37.5 40.2  1068:47 ceph-mds
The mds cache size is set to 2000000. Running 'ceph daemon mds.<id>
perf dump', I get "inode_max": 2000000 and "inodes": 2413826.
Assuming ~4 KB per inode, one gets roughly 10 GB. So why is it using
much more than that?
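The back-of-the-envelope arithmetic behind that figure:
    $ echo $((2413826 * 4096))
    9887031296
i.e. ~9.2 GiB, against the ~12 GiB RES (0.012t) reported by top.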
7) I have been running cephfs for more than a year, and looking at
Ganglia, the mds memory never decreases but always increases (even
when we unmount almost all the clients). Why does that happen?
8) I am running 2 mds in active / standby-replay mode. The memory
usage of the standby-replay is much lower:
      PID USER PR NI    VIRT    RES  SHR S %CPU %MEM    TIME+ COMMAND
      716 ceph 20  0 6149424 5.115g 8524 S  1.2 43.6  53:19.74 ceph-mds
If I trigger a restart of my active mds, the standby-replay starts
acting as active but keeps roughly the same (much lower) amount of
memory. Why can the second mds become active and do the same job
while using far less memory than the first one did?
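For context, the failover is triggered simply by restarting the
active daemon (standard jewel systemd unit; <id> is a placeholder):
    $ systemctl restart ceph-mds@<id>    # on the active mds host
    $ top -b -n 1 -p $(pidof ceph-mds)   # on the new active: RES stays ~5g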
9) Finally, I am sending an extract of 'ceph daemon mds.<id> perf
dump' from my active and standby mdses. What exactly is the meaning
of inodes_pin_tail, inodes_expired and inodes_with_caps? Is the
standby mds supposed to show the same numbers? They don't...
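(The three counters in question can be grabbed from both daemons for
a quick side-by-side along these lines, run on each mds host:)
    $ ceph daemon mds.<id> perf dump | grep -E '"inodes_(pin_tail|expired|with_caps)"'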
Thanks in advance for your answers / suggestions
Cheers
Goncalo
*active:*
"mds": {
"request": 93941296,
"reply": 93940671,
"reply_latency": {
"avgcount": 93940671,
"sum": 188398.004552299
},
"forward": 0,
"dir_fetch": 309878,
"dir_commit": 1736194,
"dir_split": 0,
"inode_max": 2000000,
"inodes": 2413826,
"inodes_top": 201,
"inodes_bottom": 568,
"inodes_pin_tail": 2413057,
"inodes_pinned": 2413303,
"inodes_expired": 19693168,
"inodes_with_caps": 2409737,
"caps": 2440565,
"subtrees": 2,
"traverse": 113291068,
"traverse_hit": 57822611,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 154708,
"traverse_remote_ino": 1085,
"traverse_lock": 66063,
"load_cent": 9394314733,
"q": 22,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
*standby-replay:*
"mds": {
"request": 0,
"reply": 0,
"reply_latency": {
"avgcount": 0,
"sum": 0.000000000
},
"forward": 0,
"dir_fetch": 0,
"dir_commit": 0,
"dir_split": 0,
"inode_max": 2000000,
"inodes": 2000058,
"inodes_top": 0,
"inodes_bottom": 1993207,
"inodes_pin_tail": 6851,
"inodes_pinned": 124135,
"inodes_expired": 10651484,
"inodes_with_caps": 0,
"caps": 0,
"subtrees": 2,
"traverse": 0,
"traverse_hit": 0,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 0,
"traverse_remote_ino": 0,
"traverse_lock": 0,
"load_cent": 0,
"q": 0,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937