Re: [ceph-users] Revisiting: Many clients (X) failing to respond to cache pressure

2016-12-15 Thread Goncalo Borges

Hi John...

Regarding logs, we still do not have them available. We just realized 
that ceph-fuse tries to log to /var/log/ceph, which in our case did not 
exist on the clients. So we had to create that directory everywhere, 
and we are in the process of remounting every client so that they start 
logging. Since unmounting forces the client to free its inodes, we now 
have to wait for the situation to reappear.
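
In the meantime, a minimal sketch (the admin socket name below is an 
assumption based on the mounts discussed in this thread; adjust to your 
own) of how one could raise client-side verbosity at runtime once the 
log directory exists, so the next cache-pressure event actually lands 
in the client log:

  # Hedged sketch: bump ceph-fuse debug levels through the client admin socket.
  # The asok path/name is an assumption based on our mounts.
  ASOK=/var/run/ceph/ceph-client.mount_user.asok
  ceph --admin-daemon $ASOK config set debug_client 20
  ceph --admin-daemon $ASOK config set debug_ms 1
  # confirm the new value took effect
  ceph --admin-daemon $ASOK config get debug_client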


However, I have some further information; maybe it can shed more light 
on the topic.


- If I loop through all my clients now, I get a total of 29604 inodes.


 $ cat clients_inodes_20161216-0938.txt | grep inode_count | awk '{print $2}' | \
       sed 's/,//g' | awk '{s+=$1} END {print s}'
   29604
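
For reference, the per-client numbers come from a loop along these 
lines (just a sketch; the clients.txt host list and the asok name are 
assumptions specific to our setup):

  # Hedged sketch: ask every client's admin socket for its inode_count and sum.
  total=0
  for host in $(cat clients.txt); do
      count=$(ssh "$host" ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok status \
              | grep inode_count | awk '{print $2}' | sed 's/,//g')
      echo "$host: $count"
      total=$((total + ${count:-0}))
  done
  echo "total inode_count: $total"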

- However, the mds reports '"inodes": 1779521' and '"inodes_with_caps": 
32823'. Is there a need for the MDS to keep such a large number of 
inodes without associated caps in memory? I would also expect those to 
be the first ones trimmed once inodes > inode_max.


  "mds": {
(...)
"inode_max": 200,
"inodes": 1779521,
"inodes_top": 18119,
"inodes_bottom": 1594129,
"inodes_pin_tail": 167273,
"inodes_pinned": 182643,
"inodes_expired": 53391877,
"inodes_with_caps": 32823,
"caps": 35133,
   (...)
},
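
For completeness, the extract above comes from the MDS perf counters; 
something like the following, run on the MDS host, pulls out just the 
inode/caps fields. The mds id derived from the short hostname is an 
assumption; substitute your own.

  # Hedged sketch: dump the MDS perf counters and keep only the inode/caps lines.
  # "mds.$(hostname -s)" assumes the mds id matches the host's short name.
  ceph daemon mds.$(hostname -s) perf dump | grep -E '"(inode|caps)'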

- I am also seeing some false positives (I think). As I explained 
before, we have currently unmounted all clients except 2 (they are 
interactive machines where our users run tmux sessions and so on, so 
it is hard to kick them out :-) ). One of those two is still reported as 
problematic by the MDS although inodes < inode_max. Looking at the number 
of inodes on that machine, I get "inode_count": 13862. So it seems that 
the client is still flagged as problematic although it has an inode_count 
below 16384 and inodes < inode_max. Maybe a consequence of 
https://github.com/ceph/ceph/pull/11373 ? And that fix seems to be 
going only into Kraken?
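
As a side note, to cross-check which sessions the MDS actually 
considers heavy, the session list gives per-client cap counts; a sketch 
(run on the active MDS host; the mds id is again a placeholder and the 
exact fields may vary between releases):

  # Hedged sketch: list MDS sessions and show each client's id, hostname and
  # cap count, to compare against the per-client inode_count numbers.
  ceph daemon mds.$(hostname -s) session ls | grep -E '"(id|hostname|num_caps)"'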


Cheers

Goncalo








On 12/14/2016 10:16 AM, Goncalo Borges wrote:

Hi John.

Comments in line.



Hi Ceph(FS)ers...

I am currently running in production the following environment:

- ceph/cephfs in 10.2.2.
- All infrastructure is in the same version (rados cluster, mons, mds and
cephfs clients).
- We mount cephfs using ceph-fuse.

Since yesterday our cluster has been in warning state with the message
"mds0: Many clients (X) failing to respond to cache pressure". X has been
changing with time, from ~130 to ~70. I am able to correlate the appearance
of this message with bursts of jobs in our cluster.

This subject has been discussed in the mailing list a lot of times, and
normally the recipe is to look for something wrong on the clients, so I
have tried to look at the clients first:

1) I've started to loop through all my clients and run 'ceph --admin-daemon
/var/run/ceph/ceph-client.mount_user.asok status' to get the inode_count
reported by each client.

$ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | \
      awk '{s+=$1} END {print s}'
2407659

2) I've then compared with the number of inodes the mds had in its cache
(obtained by a perf dump):
  "inode_max": 200 and "inodes": 2413826

3) I've tried to understand how many clients had a number of inodes higher
than 16384 (the default) and got:

$ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'`; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
27

4) My conclusion is that the bulk of the inodes is held by a couple of
machines. However, while the majority are running user jobs, others are
not doing anything at all. For example, an idle machine (no users logged
in, no jobs running, and updatedb does not scan the cephfs filesystem)
reported more than 30 inodes. To reclaim those inodes, I had to umount
and remount cephfs on that machine.

5) Based on my previous observations, I suspect that there are still some
problems in the ceph-fuse client regarding the release of these inodes
(or it happens at a very slow rate).

Seems that way.  Can you come up with a reproducer for us, and/or
gather some client+mds debug logs where a client is failing to respond
to cache pressure?


I think I've nailed this down to a specific user workload. Every time 
this user runs, the client is left with a huge number of inodes, 
normally more than 10. The workload consists of generating a large 
number of analysis files spread over multiple directories. I am going 
to try to inject some debug parameters and see what we come up with. 
I will reply on this thread later on.




Also, what kernel is in use on the clients?  It's possible that the
issue is in FUSE itself (or the way that it responds to ceph-fuse's
attempts to ask for some inodes to be released).


All our clusters run SL6 because the CERN experiments' software is only 
certified for that OS flavour. Because of the SL6 restriction, to 
enable post-infernalis 

Re: [ceph-users] Revisiting: Many clients (X) failing to respond to cache pressure

2016-12-13 Thread Goncalo Borges

Hi John.

Comments in line.



Hi Ceph(FS)ers...

I am currently running in production the following environment:

- ceph/cephfs in 10.2.2.
- All infrastructure is in the same version (rados cluster, mons, mds and
cephfs clients).
- We mount cephfs using ceph-fuse.

Since yesterday our cluster has been in warning state with the message
"mds0: Many clients (X) failing to respond to cache pressure". X has been
changing with time, from ~130 to ~70. I am able to correlate the appearance
of this message with bursts of jobs in our cluster.

This subject has been discussed in the mailing list a lot of times, and
normally the recipe is to look for something wrong on the clients, so I
have tried to look at the clients first:

1) I've started to loop through all my clients and run 'ceph --admin-daemon
/var/run/ceph/ceph-client.mount_user.asok status' to get the inode_count
reported by each client.

$ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | \
      awk '{s+=$1} END {print s}'
2407659

2) I've then compared with the number of inodes the mds had in its cache
(obtained by a perf dump):
  "inode_max": 200 and "inodes": 2413826

3) I've tried to understand how many clients had a number of inodes higher
than 16384 (the default) and got

$ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'`; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
27

4) My conclusion is that the bulk of the inodes is held by a couple of
machines. However, while the majority are running user jobs, others are
not doing anything at all. For example, an idle machine (no users logged
in, no jobs running, and updatedb does not scan the cephfs filesystem)
reported more than 30 inodes. To reclaim those inodes, I had to umount
and remount cephfs on that machine.

5) Based on my previous observations, I suspect that there are still some
problems in the ceph-fuse client regarding the release of these inodes
(or it happens at a very slow rate).

Seems that way.  Can you come up with a reproducer for us, and/or
gather some client+mds debug logs where a client is failing to respond
to cache pressure?


I think I've nailed this down to a specific user workload. Every time 
this user runs, the client is left with a huge number of inodes, 
normally more than 10. The workload consists of generating a large 
number of analysis files spread over multiple directories. I am going 
to try to inject some debug parameters and see what we come up with. 
I will reply on this thread later on.
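
The kind of debug injection I have in mind, on the MDS side, would be 
along these lines (a sketch only; the mds rank and debug levels are 
placeholders, and these settings generate a lot of log volume):

  # Hedged sketch: temporarily raise MDS debug levels while the problematic
  # workload runs, then drop them back again afterwards.
  ceph tell mds.0 injectargs '--debug_mds 10 --debug_ms 1'
  # ... reproduce the workload, collect the mds log under /var/log/ceph ...
  ceph tell mds.0 injectargs '--debug_mds 1 --debug_ms 0'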




Also, what kernel is in use on the clients?  It's possible that the
issue is in FUSE itself (or the way that it responds to ceph-fuse's
attempts to ask for some inodes to be released).


All our clusters run SL6 because the CERN experiments' software is only 
certified for that OS flavour. Because of the SL6 restriction, to enable 
post-infernalis ceph clients on those machines, we have to recompile them, 
as well as some of their dependencies which are not available in SL6. In 
summary, we recompile ceph-fuse 10.2.2 with gcc 4.8.4 against boost-1.53.0-25 
and fuse-2.9.7. The kernel version on the clients is 2.6.32-642.6.2.el6.x86_64.


Thanks for the explanations about the mds memory usage. I am glad there 
is something on the way to enable more effective memory usage.


Cheers
Goncalo



However, I also do not completely understand what is happening on the server
side:

6) The current memory usage of my mds is the following:

   PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
17831 ceph  20   0 13.667g 0.012t  10048 S  37.5 40.2   1068:47 ceph-mds

The mds cache size is set to 200. Running 'ceph daemon mds. perf
dump', I get "inode_max": 200 and "inodes": 2413826. Assuming ~4k per
inode, one gets ~10G. So why is it taking much more than that?
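
For what it's worth, the back-of-envelope figure works out as below; 
just a sanity check of the ~10G estimate, assuming ~4 KiB per cached 
inode:

  # ~2.4M cached inodes at an assumed ~4 KiB apiece is on the order of 9-10 GiB;
  # the observed RES is well beyond that.
  $ awk 'BEGIN { printf "%.1f GiB\n", 2413826 * 4096 / (1024^3) }'
  9.2 GiB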


7) I have been running cephfs for more than a year, and looking at ganglia,
the mds memory never decreases but always increases (even in cases where we
umount almost all the clients). Why does that happen?

Coincidentally someone posted about this on ceph-devel just yesterday.
The answer is that the MDS uses memory pools for allocation, and it
doesn't (currently) ever bother releasing memory back to the operating
system because it's doing its own cache size enforcement.  However,
when the cache size limits aren't being enforced (for example because
of clients failing to release caps) this becomes a problem.  There's a
patch for master (https://github.com/ceph/ceph/pull/12443)


8) I am running 2 mds, in active / standby-replay mode. The memory of the
standby-replay is much lower

   PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
   716 ceph  20   0 6149424 5.115g   8524 S   1.2 43.6  53:19.74 ceph-mds

If I trigger a restart on my active mds, the standby replay will start
acting as active, but will continue with the same amount of memory. Why can
the second mds become active, and do 

Re: [ceph-users] Revisiting: Many clients (X) failing to respond to cache pressure

2016-12-13 Thread John Spray
On Tue, Dec 13, 2016 at 6:03 AM, Goncalo Borges wrote:
> Hi Ceph(FS)ers...
>
> I am currently running in production the following environment:
>
> - ceph/cephfs in 10.2.2.
> - All infrastructure is in the same version (rados cluster, mons, mds and
> cephfs clients).
> - We mount cephfs using ceph-fuse.
>
> Since yesterday our cluster has been in warning state with the message
> "mds0: Many clients (X) failing to respond to cache pressure". X has been
> changing with time, from ~130 to ~70. I am able to correlate the appearance
> of this message with bursts of jobs in our cluster.
>
> This subject has been discussed in the mailing list a lot of times, and
> normally the recipe is to look for something wrong on the clients, so I
> have tried to look at the clients first:
>
> 1) I've started to loop through all my clients and run 'ceph --admin-daemon
> /var/run/ceph/ceph-client.mount_user.asok status' to get the inode_count
> reported by each client.
>
> $ cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g' | awk
> '{s+=$1} END {print s}'
> 2407659
>
> 2) I've then compared with the number of inodes the mds had in its cache
> (obtained by a perf dump)
>  "inode_max": 200 and "inodes": 2413826
>
> 3) I've tried to understand how many clients had a number of inodes higher
> than 16384 (the default) and got
>
> $ for i in `cat all.txt | grep inode_count | awk '{print $2}' | sed 's/,//g'
> `; do if [ $i -ge 16384 ]; then echo $i; fi; done | wc -l
> 27
>
> 4) My conclusion is that the bulk of the inodes is held by a couple of
> machines. However, while the majority are running user jobs, others are
> not doing anything at all. For example, an idle machine (no users logged
> in, no jobs running, and updatedb does not scan the cephfs filesystem)
> reported more than 30 inodes. To reclaim those inodes, I had to umount
> and remount cephfs on that machine.
>
> 5) Based on my previous observations, I suspect that there are still some
> problems in the ceph-fuse client regarding the release of these inodes
> (or it happens at a very slow rate).

Seems that way.  Can you come up with a reproducer for us, and/or
gather some client+mds debug logs where a client is failing to respond
to cache pressure?

Also, what kernel is in use on the clients?  It's possible that the
issue is in FUSE itself (or the way that it responds to ceph-fuse's
attempts to ask for some inodes to be released).

> However, I also do not completely understand what is happening on the server
> side:
>
> 6) The current memory usage of my mds is the following:
>
>    PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
> 17831 ceph  20   0 13.667g 0.012t  10048 S  37.5 40.2   1068:47 ceph-mds
>
> The mds cache size is set to 200. Running 'ceph daemon mds. perf
> dump', I get "inode_max": 200 and "inodes": 2413826. Assuming ~4k per
> inode, one gets ~10G. So why is it taking much more than that?
>
>
> 7) I have been running cephfs for more than a year, and looking at ganglia,
> the mds memory never decreases but always increases (even in cases where we
> umount almost all the clients). Why does that happen?

Coincidentally someone posted about this on ceph-devel just yesterday.
The answer is that the MDS uses memory pools for allocation, and it
doesn't (currently) ever bother releasing memory back to the operating
system because it's doing its own cache size enforcement.  However,
when the cache size limits aren't being enforced (for example because
of clients failing to release caps) this becomes a problem.  There's a
patch for master (https://github.com/ceph/ceph/pull/12443)
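
If the MDS is linked against tcmalloc (the usual build), the allocator's 
own view can be inspected, and unused pages can at least be asked for 
back; a hedged sketch, with the mds rank as a placeholder:

  # Hedged sketch: show how much memory tcmalloc holds versus what is in use,
  # and ask it to return free pages to the OS. Only meaningful with tcmalloc.
  ceph tell mds.0 heap stats
  ceph tell mds.0 heap release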

>
> 8) I am running 2 mds, in active / standby-replay mode. The memory of the
> standby-replay is much lower
>
>    PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
>   716 ceph  20   0 6149424 5.115g   8524 S   1.2 43.6  53:19.74 ceph-mds
>
> If I trigger a restart on my active mds, the standby replay will start
> acting as active, but will continue with the same amount of memory. Why can
> the second mds become active and do the same job while using much less
> memory?

Presumably this also makes sense once you know about the allocator in use.

> 9) Finally, I am sending an extract of 'ceph daemon mds. perf dump' from
> my active and standby mdses. What exactly is the meaning of inodes_pin_tail,
> inodes_expired and inodes_with_caps? Is the standby mds supposed to show the
> same numbers? They don't...

It's not really possible to explain these counters without a
substantial explanation of MDS internals, sorry.  I will say though
that there is absolutely no guarantee of performance counters on the
standby replay daemon matching those on the active daemon.

John

> Thanks in advance for your answers /  suggestions
>
> Cheers
>
> Goncalo
>
>
>
> active:
>
> "mds": {
> "request": 93941296,
> "reply": 93940671,
> "reply_latency": {