Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-15 Thread John Spray



On 15/07/15 04:06, Eric Eastman wrote:

Hi John,

I cut the test down to a single client running only Ganesha NFS
without any ceph drivers loaded on the Ceph FS client.  After deleting
all the files in the Ceph file system and rebooting all the nodes, I
restarted the 5-million-file create test using 2 NFS clients to the
one Ceph file system node running Ganesha NFS. After a couple of hours I
am seeing the "Client ede-c2-gw01 failing to respond to cache pressure"
error:


Thanks -- that's a very useful datapoint.  I've created a ticket here:
http://tracker.ceph.com/issues/12334

Looking forward to seeing if samba has the same issue.

John



Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-14 Thread Eric Eastman
Hi John,

I cut the test down to a single client running only Ganesha NFS
without any ceph drivers loaded on the Ceph FS client.  After deleting
all the files in the Ceph file system and rebooting all the nodes, I
restarted the 5-million-file create test using 2 NFS clients to the
one Ceph file system node running Ganesha NFS. After a couple of hours I
am seeing the "Client ede-c2-gw01 failing to respond to cache pressure"
error:

$ ceph -s
cluster 6d8aae1e-1125-11e5-a708-001b78e265be
 health HEALTH_WARN
mds0: Client ede-c2-gw01 failing to respond to cache pressure
 monmap e1: 3 mons at
{ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
election epoch 22, quorum 0,1,2
ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
 mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby
 osdmap e323: 8 osds: 8 up, 8 in
  pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects
182 GB used, 78459 MB / 263 GB avail
 832 active+clean

Dumping the mds daemon shows inodes > inode_max:

# ceph daemon mds.ede-c2-mds02 perf dump mds
{
mds: {
request: 21862302,
reply: 21862302,
reply_latency: {
avgcount: 21862302,
sum: 16728.480772060
},
forward: 0,
dir_fetch: 13,
dir_commit: 50788,
dir_split: 0,
inode_max: 100000,
inodes: 100010,
inodes_top: 0,
inodes_bottom: 0,
inodes_pin_tail: 100010,
inodes_pinned: 100010,
inodes_expired: 4308279,
inodes_with_caps: 8,
caps: 8,
subtrees: 2,
traverse: 30802465,
traverse_hit: 26394836,
traverse_forward: 0,
traverse_discover: 0,
traverse_dir_fetch: 0,
traverse_remote_ino: 0,
traverse_lock: 0,
load_cent: 2186230200,
q: 0,
exported: 0,
exported_inodes: 0,
imported: 0,
imported_inodes: 0
}
}

Once this test finishes and I verify the files were all correctly
written, I will retest using the SAMBA VFS interface, followed by the
kernel test.

Please let me know if there is more info you need and if you want me
to open a ticket.

Best regards
Eric



On Mon, Jul 13, 2015 at 9:40 AM, Eric Eastman
eric.east...@keepertech.com wrote:
 Thanks John. I will back the test down to the simple case of 1 client
 without the kernel driver and only running NFS Ganesha, and work forward
 till I trip the problem and report my findings.

 Eric

 On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com wrote:



 On 13/07/2015 04:02, Eric Eastman wrote:

 Hi John,

 I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
 nodes.  This system is using 4 Ceph FS client systems. They all have
 the kernel driver version of CephFS loaded, but none are mounting the
 file system. All 4 clients are using the libcephfs VFS interface to
 Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
 share out the Ceph file system.

 # ceph -s
  cluster 6d8aae1e-1125-11e5-a708-001b78e265be
   health HEALTH_WARN
  4 near full osd(s)
  mds0: Client ede-c2-gw01 failing to respond to cache
 pressure
  mds0: Client ede-c2-gw02:cephfs failing to respond to cache
 pressure
  mds0: Client ede-c2-gw03:cephfs failing to respond to cache
 pressure
   monmap e1: 3 mons at

 {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
  election epoch 8, quorum 0,1,2
 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
   mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
   osdmap e272: 8 osds: 8 up, 8 in
pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
  212 GB used, 48715 MB / 263 GB avail
   832 active+clean
client io 1379 kB/s rd, 20653 B/s wr, 98 op/s


 It would help if we knew whether it's the kernel clients or the userspace
 clients that are generating the warnings here.  You've probably already done
 this, but I'd get rid of any unused kernel client mounts to simplify the
 situation.

 We haven't tested the cache limit enforcement with NFS Ganesha, so there
 is a decent chance that it is broken.  The Ganesha FSAL is doing
 ll_get/ll_put reference counting on inodes, so it seems quite possible that
 its cache is pinning things that we would otherwise be evicting in response
 to cache pressure.  You mention samba as well,

 You can see if the MDS cache is indeed exceeding its limit by looking at
 the output of:
 ceph daemon mds.<daemon id> perf dump mds

 ...where the inodes value tells you how many are in the cache, vs.
 inode_max.

 If you can, it would be useful to boil this down to a straightforward test
 case: if you start with a healthy cluster, mount a single ganesha client,
 and do your 5 million file procedure, do you get the warning?  

Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-14 Thread 谷枫
I changed mds_cache_size to 500000 from 100000 to get rid of the
WARN temporarily.
Now dumping the mds daemon shows this:
inode_max: 500000,
inodes: 124213,
But I have no idea: if inodes rises above 500000, do I just change
mds_cache_size again?
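(In the meantime, this is a quick way to keep an eye on it, just a minimal
sketch, assuming the admin socket for mds.tree01 is on the local host at its
default path:)

# value the running MDS is actually using
ceph daemon mds.tree01 config show | grep mds_cache_size
# live cache counters to compare against it
ceph daemon mds.tree01 perf dump mds | grep -E '"inode_max"|"inodes"'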
Thanks.

2015-07-15 13:34 GMT+08:00 谷枫 feiche...@gmail.com:

 I changed mds_cache_size to 500000 from 100000 to get rid of the
 WARN temporarily.
 Now dumping the mds daemon shows this:
 inode_max: 500000,
 inodes: 124213,
 But I have no idea: if inodes rises above 500000, do I just change
 mds_cache_size again?
 Thanks.

 2015-07-15 11:06 GMT+08:00 Eric Eastman eric.east...@keepertech.com:

 Hi John,

 I cut the test down to a single client running only Ganesha NFS
 without any ceph drivers loaded on the Ceph FS client.  After deleting
 all the files in the Ceph file system and rebooting all the nodes, I
 restarted the 5-million-file create test using 2 NFS clients to the
 one Ceph file system node running Ganesha NFS. After a couple of hours I
 am seeing the "Client ede-c2-gw01 failing to respond to cache pressure"
 error:

 $ ceph -s
 cluster 6d8aae1e-1125-11e5-a708-001b78e265be
  health HEALTH_WARN
 mds0: Client ede-c2-gw01 failing to respond to cache pressure
  monmap e1: 3 mons at
 {ede-c2-mon01=
 10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0
 }
 election epoch 22, quorum 0,1,2
 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
  mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby
  osdmap e323: 8 osds: 8 up, 8 in
   pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects
 182 GB used, 78459 MB / 263 GB avail
  832 active+clean

 Dumping the mds daemon shows inodes > inode_max:

 # ceph daemon mds.ede-c2-mds02 perf dump mds
 {
 mds: {
 request: 21862302,
 reply: 21862302,
 reply_latency: {
 avgcount: 21862302,
 sum: 16728.480772060
 },
 forward: 0,
 dir_fetch: 13,
 dir_commit: 50788,
 dir_split: 0,
 inode_max: 100000,
 inodes: 100010,
 inodes_top: 0,
 inodes_bottom: 0,
 inodes_pin_tail: 100010,
 inodes_pinned: 100010,
 inodes_expired: 4308279,
 inodes_with_caps: 8,
 caps: 8,
 subtrees: 2,
 traverse: 30802465,
 traverse_hit: 26394836,
 traverse_forward: 0,
 traverse_discover: 0,
 traverse_dir_fetch: 0,
 traverse_remote_ino: 0,
 traverse_lock: 0,
 load_cent: 2186230200,
 q: 0,
 exported: 0,
 exported_inodes: 0,
 imported: 0,
 imported_inodes: 0
 }
 }

 Once this test finishes and I verify the files were all correctly
 written, I will retest using the SAMBA VFS interface, followed by the
 kernel test.

 Please let me know if there is more info you need and if you want me
 to open a ticket.

 Best regards
 Eric



 On Mon, Jul 13, 2015 at 9:40 AM, Eric Eastman
 eric.east...@keepertech.com wrote:
  Thanks John. I will back the test down to the simple case of 1 client
  without the kernel driver and only running NFS Ganesha, and work forward
  till I trip the problem and report my findings.
 
  Eric
 
  On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com
 wrote:
 
 
 
  On 13/07/2015 04:02, Eric Eastman wrote:
 
  Hi John,
 
  I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
  nodes.  This system is using 4 Ceph FS client systems. They all have
  the kernel driver version of CephFS loaded, but none are mounting the
  file system. All 4 clients are using the libcephfs VFS interface to
  Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
  share out the Ceph file system.
 
  # ceph -s
   cluster 6d8aae1e-1125-11e5-a708-001b78e265be
health HEALTH_WARN
   4 near full osd(s)
   mds0: Client ede-c2-gw01 failing to respond to cache
  pressure
   mds0: Client ede-c2-gw02:cephfs failing to respond to
 cache
  pressure
   mds0: Client ede-c2-gw03:cephfs failing to respond to
 cache
  pressure
monmap e1: 3 mons at
 
  {ede-c2-mon01=
 10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0
 }
   election epoch 8, quorum 0,1,2
  ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
osdmap e272: 8 osds: 8 up, 8 in
 pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
   212 GB used, 48715 MB / 263 GB avail
832 active+clean
 client io 1379 kB/s rd, 20653 B/s wr, 98 op/s
 
 
  It would help if we knew whether it's the kernel clients or the
 userspace
  clients that are generating the warnings here.  You've probably
 already done
  this, but I'd get rid of any unused kernel 

Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-13 Thread Eric Eastman
Thanks John. I will back the test down to the simple case of 1 client
without the kernel driver and only running NFS Ganesha, and work forward
till I trip the problem and report my findings.

Eric

On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com wrote:



 On 13/07/2015 04:02, Eric Eastman wrote:

 Hi John,

 I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
 nodes.  This system is using 4 Ceph FS client systems. They all have
 the kernel driver version of CephFS loaded, but none are mounting the
 file system. All 4 clients are using the libcephfs VFS interface to
 Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
 share out the Ceph file system.

 # ceph -s
  cluster 6d8aae1e-1125-11e5-a708-001b78e265be
   health HEALTH_WARN
  4 near full osd(s)
  mds0: Client ede-c2-gw01 failing to respond to cache pressure
  mds0: Client ede-c2-gw02:cephfs failing to respond to cache
 pressure
  mds0: Client ede-c2-gw03:cephfs failing to respond to cache
 pressure
   monmap e1: 3 mons at
 {ede-c2-mon01=
 10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0
 }
  election epoch 8, quorum 0,1,2
 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
   mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
   osdmap e272: 8 osds: 8 up, 8 in
pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
  212 GB used, 48715 MB / 263 GB avail
   832 active+clean
client io 1379 kB/s rd, 20653 B/s wr, 98 op/s


 It would help if we knew whether it's the kernel clients or the userspace
 clients that are generating the warnings here.  You've probably already
 done this, but I'd get rid of any unused kernel client mounts to simplify
 the situation.

 We haven't tested the cache limit enforcement with NFS Ganesha, so there
 is a decent chance that it is broken.  The Ganesha FSAL is doing
 ll_get/ll_put reference counting on inodes, so it seems quite possible that
 its cache is pinning things that we would otherwise be evicting in response
 to cache pressure.  You mention samba as well,

 You can see if the MDS cache is indeed exceeding its limit by looking at
 the output of:
 ceph daemon mds.<daemon id> perf dump mds

 ...where the inodes value tells you how many are in the cache, vs.
 inode_max.

 If you can, it would be useful to boil this down to a straightforward test
 case: if you start with a healthy cluster, mount a single ganesha client,
 and do your 5 million file procedure, do you get the warning?  Same for
 samba/kernel mounts -- this is likely to be a client side issue, so we need
 to confirm which client is misbehaving.

 Cheers,
 John



 # cat /proc/version
 Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3
 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19
 UTC 2015

 # ceph -v
 ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)

 The systems are all running Ubuntu Trusty that has been upgraded to
 the 4.1 kernel. These are all physical machines, no VMs.  The test
 run that caused the problem was creating and verifying 5 million small
 files.

 We have some tools that flag when Ceph is in a WARN state so it would
 be nice to get rid of this warning.

 Please let me know what additional information you need.

 Thanks,

 Eric

 On Fri, Jul 10, 2015 at 4:19 AM, 谷枫 feiche...@gmail.com wrote:

 Thank you John,
 All my servers are Ubuntu 14.04 with the 3.16 kernel.
 Not all of the clients show this problem, and the cluster seems to be
 functioning well now.
 As you say, I will change mds_cache_size to 500000 from 100000 and run a
 test, thanks again!

 2015-07-10 17:00 GMT+08:00 John Spray john.sp...@redhat.com:


 This is usually caused by use of older kernel clients.  I don't remember
 exactly what version it was fixed in, but iirc we've seen the problem
 with
 3.14 and seen it go away with 3.18.

 If your system is otherwise functioning well, this is not a critical
 error
 -- it just means that the MDS might not be able to fully control its
 memory
 usage (i.e. it can exceed mds_cache_size).

 John





Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-13 Thread John Spray



On 13/07/2015 04:02, Eric Eastman wrote:

Hi John,

I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
nodes.  This system is using 4 Ceph FS client systems. They all have
the kernel driver version of CephFS loaded, but none are mounting the
file system. All 4 clients are using the libcephfs VFS interface to
Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
share out the Ceph file system.

# ceph -s
 cluster 6d8aae1e-1125-11e5-a708-001b78e265be
  health HEALTH_WARN
 4 near full osd(s)
 mds0: Client ede-c2-gw01 failing to respond to cache pressure
 mds0: Client ede-c2-gw02:cephfs failing to respond to cache 
pressure
 mds0: Client ede-c2-gw03:cephfs failing to respond to cache 
pressure
  monmap e1: 3 mons at
{ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
 election epoch 8, quorum 0,1,2
ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
  mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
  osdmap e272: 8 osds: 8 up, 8 in
   pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
 212 GB used, 48715 MB / 263 GB avail
  832 active+clean
   client io 1379 kB/s rd, 20653 B/s wr, 98 op/s


It would help if we knew whether it's the kernel clients or the 
userspace clients that are generating the warnings here.  You've 
probably already done this, but I'd get rid of any unused kernel client 
mounts to simplify the situation.


We haven't tested the cache limit enforcement with NFS Ganesha, so there 
is a decent chance that it is broken.  The Ganesha FSAL is doing
ll_get/ll_put reference counting on inodes, so it seems quite possible 
that its cache is pinning things that we would otherwise be evicting in 
response to cache pressure.  You mention samba as well,


You can see if the MDS cache is indeed exceeding its limit by looking at 
the output of:

ceph daemon mds.<daemon id> perf dump mds

...where the inodes value tells you how many are in the cache, vs. 
inode_max.
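
For example, something like this on the MDS host is enough to watch those two
numbers (a rough sketch: it assumes the admin socket is at its default path
and that jq is installed; plain grep works just as well):

# pull just the two counters that matter here
ceph daemon mds.<daemon id> perf dump mds | \
  jq '{inodes: .mds.inodes, inode_max: .mds.inode_max}'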


If you can, it would be useful to boil this down to a straightforward 
test case: if you start with a healthy cluster, mount a single ganesha 
client, and do your 5 million file procedure, do you get the warning?  
Same for samba/kernel mounts -- this is likely to be a client side 
issue, so we need to confirm which client is misbehaving.


Cheers,
John



# cat /proc/version
Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19
UTC 2015

# ceph -v
ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)

The systems are all running Ubuntu Trusty that has been upgraded to
the 4.1 kernel. These are all physical machines, no VMs.  The test
run that caused the problem was creating and verifying 5 million small
files.

We have some tools that flag when Ceph is in a WARN state so it would
be nice to get rid of this warning.

Please let me know what additional information you need.

Thanks,

Eric

On Fri, Jul 10, 2015 at 4:19 AM, 谷枫 feiche...@gmail.com wrote:

Thank you John,
All my servers are Ubuntu 14.04 with the 3.16 kernel.
Not all of the clients show this problem, and the cluster seems to be
functioning well now.
As you say, I will change mds_cache_size to 500000 from 100000 and run a
test, thanks again!

2015-07-10 17:00 GMT+08:00 John Spray john.sp...@redhat.com:


This is usually caused by use of older kernel clients.  I don't remember
exactly what version it was fixed in, but iirc we've seen the problem with
3.14 and seen it go away with 3.18.

If your system is otherwise functioning well, this is not a critical error
-- it just means that the MDS might not be able to fully control its memory
usage (i.e. it can exceed mds_cache_size).

John





Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-12 Thread Eric Eastman
Hi John,

I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
nodes.  This system is using 4 Ceph FS client systems. They all have
the kernel driver version of CephFS loaded, but none are mounting the
file system. All 4 clients are using the libcephfs VFS interface to
Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
share out the Ceph file system.

# ceph -s
cluster 6d8aae1e-1125-11e5-a708-001b78e265be
 health HEALTH_WARN
4 near full osd(s)
mds0: Client ede-c2-gw01 failing to respond to cache pressure
mds0: Client ede-c2-gw02:cephfs failing to respond to cache pressure
mds0: Client ede-c2-gw03:cephfs failing to respond to cache pressure
 monmap e1: 3 mons at
{ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
election epoch 8, quorum 0,1,2
ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
 mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
 osdmap e272: 8 osds: 8 up, 8 in
  pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
212 GB used, 48715 MB / 263 GB avail
 832 active+clean
  client io 1379 kB/s rd, 20653 B/s wr, 98 op/s

# cat /proc/version
Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19
UTC 2015

# ceph -v
ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)

The systems are all running Ubuntu Trusty that has been upgraded to
the 4.1 kernel. These are all physical machines, no VMs.  The test
run that caused the problem was creating and verifying 5 million small
files.

We have some tools that flag when Ceph is in a WARN state so it would
be nice to get rid of this warning.
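
(For context, the check is nothing elaborate; conceptually it is something
like the sketch below, which simply alarms on anything other than HEALTH_OK.
This is an illustration, not the exact script.)

#!/bin/bash
# alert when the cluster is not healthy
status=$(ceph health)
case "$status" in
  HEALTH_OK*)   exit 0 ;;
  HEALTH_WARN*) echo "ceph warning: $status"; exit 1 ;;
  *)            echo "ceph error: $status";   exit 2 ;;
esac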

Please let me know what additional information you need.

Thanks,

Eric

On Fri, Jul 10, 2015 at 4:19 AM, 谷枫 feiche...@gmail.com wrote:
 Thank you John,
 All my servers are Ubuntu 14.04 with the 3.16 kernel.
 Not all of the clients show this problem, and the cluster seems to be
 functioning well now.
 As you say, I will change mds_cache_size to 500000 from 100000 and run a
 test, thanks again!

 2015-07-10 17:00 GMT+08:00 John Spray john.sp...@redhat.com:


 This is usually caused by use of older kernel clients.  I don't remember
 exactly what version it was fixed in, but iirc we've seen the problem with
 3.14 and seen it go away with 3.18.

 If your system is otherwise functioning well, this is not a critical error
 -- it just means that the MDS might not be able to fully control its memory
 usage (i.e. it can exceed mds_cache_size).

 John



Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-12 Thread Eric Eastman
In the last email, I stated the clients were not mounted using the
ceph file system kernel driver. Re-checking the client systems, the
file systems are mounted, but all the IO is going through Ganesha NFS
using the ceph file system library interface.
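
(In case it helps anyone reproducing this, here is a quick way to check a node
for kernel CephFS mounts and for the kernel client module; just a generic
sketch, nothing specific to this setup:)

# list any kernel cephfs mounts on this node
mount -t ceph
# check whether the kernel client modules are loaded
lsmod | grep ceph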

On Sun, Jul 12, 2015 at 9:02 PM, Eric Eastman
eric.east...@keepertech.com wrote:
 Hi John,

 I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
 nodes.  This system is using 4 Ceph FS client systems. They all have
 the kernel driver version of CephFS loaded, but none are mounting the
 file system. All 4 clients are using the libcephfs VFS interface to
 Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
 share out the Ceph file system.

 # ceph -s
 cluster 6d8aae1e-1125-11e5-a708-001b78e265be
  health HEALTH_WARN
 4 near full osd(s)
 mds0: Client ede-c2-gw01 failing to respond to cache pressure
 mds0: Client ede-c2-gw02:cephfs failing to respond to cache 
 pressure
 mds0: Client ede-c2-gw03:cephfs failing to respond to cache 
 pressure
  monmap e1: 3 mons at
 {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
 election epoch 8, quorum 0,1,2
 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
  mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
  osdmap e272: 8 osds: 8 up, 8 in
   pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
 212 GB used, 48715 MB / 263 GB avail
  832 active+clean
   client io 1379 kB/s rd, 20653 B/s wr, 98 op/s

 # cat /proc/version
 Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3
 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19
 UTC 2015

 # ceph -v
 ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)

 The systems are all running Ubuntu Trusty that has been upgraded to
 the 4.1 kernel. These are all physical machines, no VMs.  The test
 run that caused the problem was creating and verifying 5 million small
 files.

 We have some tools that flag when Ceph is in a WARN state so it would
 be nice to get rid of this warning.

 Please let me know what additional information you need.

 Thanks,

 Eric

 On Fri, Jul 10, 2015 at 4:19 AM, 谷枫 feiche...@gmail.com wrote:
 Thank you John,
 All my servers are Ubuntu 14.04 with the 3.16 kernel.
 Not all of the clients show this problem, and the cluster seems to be
 functioning well now.
 As you say, I will change mds_cache_size to 500000 from 100000 and run a
 test, thanks again!

 2015-07-10 17:00 GMT+08:00 John Spray john.sp...@redhat.com:


 This is usually caused by use of older kernel clients.  I don't remember
 exactly what version it was fixed in, but iirc we've seen the problem with
 3.14 and seen it go away with 3.18.

 If your system is otherwise functioning well, this is not a critical error
 -- it just means that the MDS might not be able to fully control its memory
 usage (i.e. it can exceed mds_cache_size).

 John



Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-10 Thread John Spray


This is usually caused by use of older kernel clients.  I don't remember 
exactly what version it was fixed in, but iirc we've seen the problem 
with 3.14 and seen it go away with 3.18.


If your system is otherwise functioning well, this is not a critical 
error -- it just means that the MDS might not be able to fully control 
its memory usage (i.e. it can exceed mds_cache_size).
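
If you do want to give the MDS more headroom, mds_cache_size is the knob to
raise. A minimal sketch of a runtime change (assuming you run it on the MDS
host and the admin socket is at its default path; 500000 is only an example
value):

# apply to the running MDS without a restart
ceph daemon mds.tree01 config set mds_cache_size 500000

To make it persistent, also set mds_cache_size under [mds] in ceph.conf, since
the runtime change is lost on the next MDS restart.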


John

On 10/07/2015 05:25, 谷枫 wrote:

hi,
I use CephFS in a production environment with 7 OSDs, 1 MDS, 3 MONs now.
So far so good, but I have a problem with it today.
The ceph status reports this:
cluster ad3421a43-9fd4-4b7a-92ba-09asde3b1a228
  health HEALTH_WARN
 mds0: Client 34271 failing to respond to cache pressure
 mds0: Client 74175 failing to respond to cache pressure
 mds0: Client 74181 failing to respond to cache pressure
 mds0: Client 34247 failing to respond to cache pressure
 mds0: Client 64162 failing to respond to cache pressure
 mds0: Client 136744 failing to respond to cache pressure
  monmap e2: 3 mons at 
{node01=10.3.1.2:6789/0,node02=10.3.1.3:6789/0,node03=10.3.1.4:6789/0}
 election epoch 186, quorum 0,1,2 node01,node02,node03
  mdsmap e46: 1/1/1 up {0=tree01=up:active}
  osdmap e717: 7 osds: 7 up, 7 in
   pgmap v995836: 264 pgs, 3 pools, 51544 MB data, 118 kobjects
 138 GB used, 1364 GB / 1502 GB avail
  264 active+clean
   client io 1018 B/s rd, 1273 B/s wr, 0 op/s

Yesterday I added two OSDs with version 0.94.2; the other, older OSDs are 0.94.1.
So the question is: does this matter?
What does the warning mean, and how can I solve this problem? Thanks!
This is my mds config from the cluster:
 name: mds.tree01,
 debug_mds: 1\/5,
 debug_mds_balancer: 1\/5,
 debug_mds_locker: 1\/5,
 debug_mds_log: 1\/5,
 debug_mds_log_expire: 1\/5,
 debug_mds_migrator: 1\/5,
 admin_socket: \/var\/run\/ceph\/ceph-mds.tree01.asok,
 log_file: \/var\/log\/ceph\/ceph-mds.tree01.log,
 keyring: \/var\/lib\/ceph\/mds\/ceph-tree01\/keyring,
 mon_max_mdsmap_epochs: 500,
 mon_mds_force_trim_to: 0,
 mon_debug_dump_location: \/var\/log\/ceph\/ceph-mds.tree01.tdump,
 client_use_random_mds: false,
 mds_data: \/var\/lib\/ceph\/mds\/ceph-tree01,
 mds_max_file_size: 1099511627776,
 mds_cache_size: 100000,
 mds_cache_mid: 0.7,
 mds_max_file_recover: 32,
 mds_mem_max: 1048576,
 mds_dir_max_commit_size: 10,
 mds_decay_halflife: 5,
 mds_beacon_interval: 4,
 mds_beacon_grace: 15,
 mds_enforce_unique_name: true,
 mds_blacklist_interval: 1440,
 mds_session_timeout: 120,
 mds_revoke_cap_timeout: 60,
 mds_recall_state_timeout: 60,
 mds_freeze_tree_timeout: 30,
 mds_session_autoclose: 600,
 mds_health_summarize_threshold: 10,
 mds_reconnect_timeout: 45,
 mds_tick_interval: 5,
 mds_dirstat_min_interval: 1,
 mds_scatter_nudge_interval: 5,
 mds_client_prealloc_inos: 1000,
 mds_early_reply: true,
 mds_default_dir_hash: 2,
 mds_log: true,
 mds_log_skip_corrupt_events: false,
 mds_log_max_events: -1,
 mds_log_events_per_segment: 1024,
 mds_log_segment_size: 0,
 mds_log_max_segments: 30,
 mds_log_max_expiring: 20,
 mds_bal_sample_interval: 3,
 mds_bal_replicate_threshold: 8000,
 mds_bal_unreplicate_threshold: 0,
 mds_bal_frag: false,
 mds_bal_split_size: 10000,
 mds_bal_split_rd: 25000,
 mds_bal_split_wr: 10000,
 mds_bal_split_bits: 3,
 mds_bal_merge_size: 50,
 mds_bal_merge_rd: 1000,
 mds_bal_merge_wr: 1000,
 mds_bal_interval: 10,
 mds_bal_fragment_interval: 5,
 mds_bal_idle_threshold: 0,
 mds_bal_max: -1,
 mds_bal_max_until: -1,
 mds_bal_mode: 0,
 mds_bal_min_rebalance: 0.1,
 mds_bal_min_start: 0.2,
 mds_bal_need_min: 0.8,
 mds_bal_need_max: 1.2,
 mds_bal_midchunk: 0.3,
 mds_bal_minchunk: 0.001,
 mds_bal_target_removal_min: 5,
 mds_bal_target_removal_max: 10,
 mds_replay_interval: 1,
 mds_shutdown_check: 0,
 mds_thrash_exports: 0,
 mds_thrash_fragments: 0,
 mds_dump_cache_on_map: false,
 mds_dump_cache_after_rejoin: false,
 mds_verify_scatter: false,
 mds_debug_scatterstat: false,
 mds_debug_frag: false,
 mds_debug_auth_pins: false,
 mds_debug_subtrees: false,
 mds_kill_mdstable_at: 0,
 mds_kill_export_at: 0,
 mds_kill_import_at: 0,
 mds_kill_link_at: 0,
 mds_kill_rename_at: 0,
 mds_kill_openc_at: 0,
 mds_kill_journal_at: 0,
 mds_kill_journal_expire_at: 0,
 mds_kill_journal_replay_at: 0,
 mds_journal_format: 1,
 mds_kill_create_at: 0,
 mds_inject_traceless_reply_probability: 0,
 mds_wipe_sessions: false,
 mds_wipe_ino_prealloc: false,
 mds_skip_ino: 0,
 max_mds: 1,
 

Re: [ceph-users] mds0: Client failing to respond to cache pressure

2015-07-10 Thread 谷枫
Thank you John,
All my servers are Ubuntu 14.04 with the 3.16 kernel.
Not all of the clients show this problem, and the cluster seems to be
functioning well now.
As you say, I will change mds_cache_size to 500000 from 100000 and run a
test, thanks again!

2015-07-10 17:00 GMT+08:00 John Spray john.sp...@redhat.com:


 This is usually caused by use of older kernel clients.  I don't remember
 exactly what version it was fixed in, but iirc we've seen the problem with
 3.14 and seen it go away with 3.18.

 If your system is otherwise functioning well, this is not a critical error
 -- it just means that the MDS might not be able to fully control its memory
 usage (i.e. it can exceed mds_cache_size).

 John

 On 10/07/2015 05:25, 谷枫 wrote:

 hi,
 I use CephFS in a production environment with 7 OSDs, 1 MDS, 3 MONs now.
 So far so good, but I have a problem with it today.
 The ceph status reports this:
 cluster ad3421a43-9fd4-4b7a-92ba-09asde3b1a228
   health HEALTH_WARN
  mds0: Client 34271 failing to respond to cache pressure
  mds0: Client 74175 failing to respond to cache pressure
  mds0: Client 74181 failing to respond to cache pressure
  mds0: Client 34247 failing to respond to cache pressure
  mds0: Client 64162 failing to respond to cache pressure
  mds0: Client 136744 failing to respond to cache pressure
   monmap e2: 3 mons at {node01=
  10.3.1.2:6789/0,node02=10.3.1.3:6789/0,node03=10.3.1.4:6789/0}

  election epoch 186, quorum 0,1,2 node01,node02,node03
   mdsmap e46: 1/1/1 up {0=tree01=up:active}
   osdmap e717: 7 osds: 7 up, 7 in
pgmap v995836: 264 pgs, 3 pools, 51544 MB data, 118 kobjects
  138 GB used, 1364 GB / 1502 GB avail
   264 active+clean
client io 1018 B/s rd, 1273 B/s wr, 0 op/s

 Yesterday I added two OSDs with version 0.94.2; the other, older OSDs are
 0.94.1.
 So the question is: does this matter?
 What does the warning mean, and how can I solve this problem? Thanks!
 This is my mds config from the cluster:
  name: mds.tree01,
  debug_mds: 1\/5,
  debug_mds_balancer: 1\/5,
  debug_mds_locker: 1\/5,
  debug_mds_log: 1\/5,
  debug_mds_log_expire: 1\/5,
  debug_mds_migrator: 1\/5,
  admin_socket: \/var\/run\/ceph\/ceph-mds.tree01.asok,
  log_file: \/var\/log\/ceph\/ceph-mds.tree01.log,
  keyring: \/var\/lib\/ceph\/mds\/ceph-tree01\/keyring,
  mon_max_mdsmap_epochs: 500,
  mon_mds_force_trim_to: 0,
  mon_debug_dump_location: \/var\/log\/ceph\/ceph-mds.tree01.tdump,
  client_use_random_mds: false,
  mds_data: \/var\/lib\/ceph\/mds\/ceph-tree01,
  mds_max_file_size: 1099511627776,
  mds_cache_size: 100000,
  mds_cache_mid: 0.7,
  mds_max_file_recover: 32,
  mds_mem_max: 1048576,
  mds_dir_max_commit_size: 10,
  mds_decay_halflife: 5,
  mds_beacon_interval: 4,
  mds_beacon_grace: 15,
  mds_enforce_unique_name: true,
  mds_blacklist_interval: 1440,
  mds_session_timeout: 120,
  mds_revoke_cap_timeout: 60,
  mds_recall_state_timeout: 60,
  mds_freeze_tree_timeout: 30,
  mds_session_autoclose: 600,
  mds_health_summarize_threshold: 10,
  mds_reconnect_timeout: 45,
  mds_tick_interval: 5,
  mds_dirstat_min_interval: 1,
  mds_scatter_nudge_interval: 5,
  mds_client_prealloc_inos: 1000,
  mds_early_reply: true,
  mds_default_dir_hash: 2,
  mds_log: true,
  mds_log_skip_corrupt_events: false,
  mds_log_max_events: -1,
  mds_log_events_per_segment: 1024,
  mds_log_segment_size: 0,
  mds_log_max_segments: 30,
  mds_log_max_expiring: 20,
  mds_bal_sample_interval: 3,
  mds_bal_replicate_threshold: 8000,
  mds_bal_unreplicate_threshold: 0,
  mds_bal_frag: false,
  mds_bal_split_size: 10000,
  mds_bal_split_rd: 25000,
  mds_bal_split_wr: 10000,
  mds_bal_split_bits: 3,
  mds_bal_merge_size: 50,
  mds_bal_merge_rd: 1000,
  mds_bal_merge_wr: 1000,
  mds_bal_interval: 10,
  mds_bal_fragment_interval: 5,
  mds_bal_idle_threshold: 0,
  mds_bal_max: -1,
  mds_bal_max_until: -1,
  mds_bal_mode: 0,
  mds_bal_min_rebalance: 0.1,
  mds_bal_min_start: 0.2,
  mds_bal_need_min: 0.8,
  mds_bal_need_max: 1.2,
  mds_bal_midchunk: 0.3,
  mds_bal_minchunk: 0.001,
  mds_bal_target_removal_min: 5,
  mds_bal_target_removal_max: 10,
  mds_replay_interval: 1,
  mds_shutdown_check: 0,
  mds_thrash_exports: 0,
  mds_thrash_fragments: 0,
  mds_dump_cache_on_map: false,
  mds_dump_cache_after_rejoin: false,
  mds_verify_scatter: false,
  mds_debug_scatterstat: false,
  mds_debug_frag: false,
  mds_debug_auth_pins: false,
  mds_debug_subtrees: false,
  mds_kill_mdstable_at: 0,
  mds_kill_export_at: 0,
  

[ceph-users] mds0: Client failing to respond to cache pressure

2015-07-09 Thread 谷枫
hi,

I use CephFS in a production environment with 7 OSDs, 1 MDS, 3 MONs now.

So far so good, but I have a problem with it today.

The ceph status reports this:

cluster ad3421a43-9fd4-4b7a-92ba-09asde3b1a228
 health HEALTH_WARN
mds0: Client 34271 failing to respond to cache pressure
mds0: Client 74175 failing to respond to cache pressure
mds0: Client 74181 failing to respond to cache pressure
mds0: Client 34247 failing to respond to cache pressure
mds0: Client 64162 failing to respond to cache pressure
mds0: Client 136744 failing to respond to cache pressure
 monmap e2: 3 mons at
{node01=10.3.1.2:6789/0,node02=10.3.1.3:6789/0,node03=10.3.1.4:6789/0}
election epoch 186, quorum 0,1,2 node01,node02,node03
 mdsmap e46: 1/1/1 up {0=tree01=up:active}
 osdmap e717: 7 osds: 7 up, 7 in
  pgmap v995836: 264 pgs, 3 pools, 51544 MB data, 118 kobjects
138 GB used, 1364 GB / 1502 GB avail
 264 active+clean
  client io 1018 B/s rd, 1273 B/s wr, 0 op/s


Yesterday I added two OSDs with version 0.94.2; the other, older OSDs are 0.94.1.

So the question is: does this matter?

What does the warning mean, and how can I solve this problem? Thanks!

This is my mds config from the cluster:

name: mds.tree01,
debug_mds: 1\/5,
debug_mds_balancer: 1\/5,
debug_mds_locker: 1\/5,
debug_mds_log: 1\/5,
debug_mds_log_expire: 1\/5,
debug_mds_migrator: 1\/5,
admin_socket: \/var\/run\/ceph\/ceph-mds.tree01.asok,
log_file: \/var\/log\/ceph\/ceph-mds.tree01.log,
keyring: \/var\/lib\/ceph\/mds\/ceph-tree01\/keyring,
mon_max_mdsmap_epochs: 500,
mon_mds_force_trim_to: 0,
mon_debug_dump_location: \/var\/log\/ceph\/ceph-mds.tree01.tdump,
client_use_random_mds: false,
mds_data: \/var\/lib\/ceph\/mds\/ceph-tree01,
mds_max_file_size: 1099511627776,
mds_cache_size: 100000,
mds_cache_mid: 0.7,
mds_max_file_recover: 32,
mds_mem_max: 1048576,
mds_dir_max_commit_size: 10,
mds_decay_halflife: 5,
mds_beacon_interval: 4,
mds_beacon_grace: 15,
mds_enforce_unique_name: true,
mds_blacklist_interval: 1440,
mds_session_timeout: 120,
mds_revoke_cap_timeout: 60,
mds_recall_state_timeout: 60,
mds_freeze_tree_timeout: 30,
mds_session_autoclose: 600,
mds_health_summarize_threshold: 10,
mds_reconnect_timeout: 45,
mds_tick_interval: 5,
mds_dirstat_min_interval: 1,
mds_scatter_nudge_interval: 5,
mds_client_prealloc_inos: 1000,
mds_early_reply: true,
mds_default_dir_hash: 2,
mds_log: true,
mds_log_skip_corrupt_events: false,
mds_log_max_events: -1,
mds_log_events_per_segment: 1024,
mds_log_segment_size: 0,
mds_log_max_segments: 30,
mds_log_max_expiring: 20,
mds_bal_sample_interval: 3,
mds_bal_replicate_threshold: 8000,
mds_bal_unreplicate_threshold: 0,
mds_bal_frag: false,
mds_bal_split_size: 10000,
mds_bal_split_rd: 25000,
mds_bal_split_wr: 10000,
mds_bal_split_bits: 3,
mds_bal_merge_size: 50,
mds_bal_merge_rd: 1000,
mds_bal_merge_wr: 1000,
mds_bal_interval: 10,
mds_bal_fragment_interval: 5,
mds_bal_idle_threshold: 0,
mds_bal_max: -1,
mds_bal_max_until: -1,
mds_bal_mode: 0,
mds_bal_min_rebalance: 0.1,
mds_bal_min_start: 0.2,
mds_bal_need_min: 0.8,
mds_bal_need_max: 1.2,
mds_bal_midchunk: 0.3,
mds_bal_minchunk: 0.001,
mds_bal_target_removal_min: 5,
mds_bal_target_removal_max: 10,
mds_replay_interval: 1,
mds_shutdown_check: 0,
mds_thrash_exports: 0,
mds_thrash_fragments: 0,
mds_dump_cache_on_map: false,
mds_dump_cache_after_rejoin: false,
mds_verify_scatter: false,
mds_debug_scatterstat: false,
mds_debug_frag: false,
mds_debug_auth_pins: false,
mds_debug_subtrees: false,
mds_kill_mdstable_at: 0,
mds_kill_export_at: 0,
mds_kill_import_at: 0,
mds_kill_link_at: 0,
mds_kill_rename_at: 0,
mds_kill_openc_at: 0,
mds_kill_journal_at: 0,
mds_kill_journal_expire_at: 0,
mds_kill_journal_replay_at: 0,
mds_journal_format: 1,
mds_kill_create_at: 0,
mds_inject_traceless_reply_probability: 0,
mds_wipe_sessions: false,
mds_wipe_ino_prealloc: false,
mds_skip_ino: 0,
max_mds: 1,
mds_standby_for_name: ,
mds_standby_for_rank: -1,
mds_standby_replay: false,
mds_enable_op_tracker: true,
mds_op_history_size: 20,
mds_op_history_duration: 600,
mds_op_complaint_time: 30,
mds_op_log_threshold: 5,
mds_snap_min_uid: 0,
mds_snap_max_uid: 65536,
mds_verify_backtrace: 1,
mds_action_on_write_error: 1,
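
(For anyone who wants to pull the same settings from their own running MDS, a
minimal sketch, assuming the admin socket is at its default path:)

# dump the running config and keep the mds-related entries
ceph daemon mds.tree01 config show | grep mds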