Re: [ceph-users] mds0: Client failing to respond to cache pressure
On 15/07/15 04:06, Eric Eastman wrote:

    Hi John, I cut the test down to a single client running only Ganesha NFS, without any ceph drivers loaded on the Ceph FS client. After deleting all the files in the Ceph file system and rebooting all the nodes, I restarted the create-5-million-file test using 2 NFS clients against the one Ceph file system node running Ganesha NFS. After a couple of hours I am seeing the "Client ede-c2-gw01 failing to respond to cache pressure" error:

Thanks -- that's a very useful datapoint. I've created a ticket here: http://tracker.ceph.com/issues/12334

Looking forward to seeing if samba has the same issue.

John
Re: [ceph-users] mds0: Client failing to respond to cache pressure
Hi John,

I cut the test down to a single client running only Ganesha NFS, without any ceph drivers loaded on the Ceph FS client. After deleting all the files in the Ceph file system and rebooting all the nodes, I restarted the create-5-million-file test using 2 NFS clients against the one Ceph file system node running Ganesha NFS. After a couple of hours I am seeing the "Client ede-c2-gw01 failing to respond to cache pressure" error:

$ ceph -s
    cluster 6d8aae1e-1125-11e5-a708-001b78e265be
     health HEALTH_WARN
            mds0: Client ede-c2-gw01 failing to respond to cache pressure
     monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
            election epoch 22, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
     mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby
     osdmap e323: 8 osds: 8 up, 8 in
      pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects
            182 GB used, 78459 MB / 263 GB avail
                 832 active+clean

Dumping the mds daemon counters shows inodes > inode_max:

# ceph daemon mds.ede-c2-mds02 perf dump mds
{
    "mds": {
        "request": 21862302,
        "reply": 21862302,
        "reply_latency": {
            "avgcount": 21862302,
            "sum": 16728.480772060
        },
        "forward": 0,
        "dir_fetch": 13,
        "dir_commit": 50788,
        "dir_split": 0,
        "inode_max": 100000,
        "inodes": 100010,
        "inodes_top": 0,
        "inodes_bottom": 0,
        "inodes_pin_tail": 100010,
        "inodes_pinned": 100010,
        "inodes_expired": 4308279,
        "inodes_with_caps": 8,
        "caps": 8,
        "subtrees": 2,
        "traverse": 30802465,
        "traverse_hit": 26394836,
        "traverse_forward": 0,
        "traverse_discover": 0,
        "traverse_dir_fetch": 0,
        "traverse_remote_ino": 0,
        "traverse_lock": 0,
        "load_cent": 2186230200,
        "q": 0,
        "exported": 0,
        "exported_inodes": 0,
        "imported": 0,
        "imported_inodes": 0
    }
}

Once this test finishes and I verify the files were all correctly written, I will retest using the SAMBA VFS interface, followed by the kernel test. Please let me know if there is more info you need and whether you want me to open a ticket.

Best regards,
Eric
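(For anyone wanting to drive a similar load: the thread does not show the actual test tooling, so the following is only a rough, hypothetical sketch of a small-file create loop. The mount point is an assumption; point it at the Ganesha NFS export and run it from each NFS client.)

    #!/bin/bash
    # Hypothetical small-file creation load, roughly in the spirit of the
    # 5-million-file test described above.
    TARGET=/mnt/cephfs-nfs/createtest   # assumed NFS mount of the Ganesha export
    COUNT=5000000
    for (( i = 1; i <= COUNT; i++ )); do
        dir="$TARGET/dir$(( i / 10000 ))"   # keep directories to ~10k entries
        mkdir -p "$dir"
        echo "payload $i" > "$dir/file_$i"
    done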
Re: [ceph-users] mds0: Client failing to respond to cache pressure
I changed mds_cache_size from 100000 to 500000 and that got rid of the WARN, at least temporarily. Dumping the mds daemon now shows:

    inode_max: 500000,
    inodes: 124213,

But I have no idea what to do if inodes rises above 500000. Change mds_cache_size again?

Thanks.
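(For reference, one way to apply a larger cache limit to a running MDS is sketched below; this is only an illustrative sequence, addressing the daemon by rank 0 and mirroring the 500000 value discussed above. It changes the running daemon only and does not persist across an MDS restart unless ceph.conf is updated too. A larger mds_cache_size also means proportionally more MDS memory.)

    # runtime-only change on the active MDS (rank 0)
    ceph tell mds.0 injectargs '--mds-cache-size 500000'
    # then watch whether the cache pressure WARN clears
    ceph -s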
Re: [ceph-users] mds0: Client failing to respond to cache pressure
Thanks John. I will back the test down to the simple case of 1 client, without the kernel driver and only running NFS Ganesha, and work forward until I trip the problem, then report my findings.

Eric
Re: [ceph-users] mds0: Client failing to respond to cache pressure
On 13/07/2015 04:02, Eric Eastman wrote:

    Hi John, I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all nodes. This system is using 4 Ceph FS client systems. They all have the kernel driver version of CephFS loaded, but none are mounting the file system. All 4 clients are using the libcephfs VFS interface to Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to share out the Ceph file system.

It would help if we knew whether it's the kernel clients or the userspace clients that are generating the warnings here. You've probably already done this, but I'd get rid of any unused kernel client mounts to simplify the situation.

We haven't tested the cache limit enforcement with NFS Ganesha, so there is a decent chance that it is broken. The ganesha FSAL is doing ll_get/ll_put reference counting on inodes, so it seems quite possible that its cache is pinning things that we would otherwise be evicting in response to cache pressure. You mention samba as well.

You can see if the MDS cache is indeed exceeding its limit by looking at the output of:

    ceph daemon mds.<daemon id> perf dump mds

...where the inodes value tells you how many are in the cache, vs. inode_max.

If you can, it would be useful to boil this down to a straightforward test case: if you start with a healthy cluster, mount a single ganesha client, and do your 5 million file procedure, do you get the warning? Same for samba/kernel mounts -- this is likely to be a client side issue, so we need to confirm which client is misbehaving.

Cheers,
John
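(A minimal sketch of that check, run on the MDS host, with the daemon name taken from Eric's output earlier in the thread; substitute your own. It assumes the admin socket is readable and uses a short python one-liner just to pull the two counters out of the JSON.)

    ceph daemon mds.ede-c2-mds02 perf dump mds | \
      python -c 'import json,sys; d=json.load(sys.stdin)["mds"]; print("inodes=%d inode_max=%d over_limit=%s" % (d["inodes"], d["inode_max"], d["inodes"] > d["inode_max"]))'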
Re: [ceph-users] mds0: Client failing to respond to cache pressure
Hi John,

I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all nodes. This system is using 4 Ceph FS client systems. They all have the kernel driver version of CephFS loaded, but none are mounting the file system. All 4 clients are using the libcephfs VFS interface to Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to share out the Ceph file system.

# ceph -s
    cluster 6d8aae1e-1125-11e5-a708-001b78e265be
     health HEALTH_WARN
            4 near full osd(s)
            mds0: Client ede-c2-gw01 failing to respond to cache pressure
            mds0: Client ede-c2-gw02:cephfs failing to respond to cache pressure
            mds0: Client ede-c2-gw03:cephfs failing to respond to cache pressure
     monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0}
            election epoch 8, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
     mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
     osdmap e272: 8 osds: 8 up, 8 in
      pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
            212 GB used, 48715 MB / 263 GB avail
                 832 active+clean
  client io 1379 kB/s rd, 20653 B/s wr, 98 op/s

# cat /proc/version
Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19 UTC 2015

# ceph -v
ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)

The systems are all running Ubuntu Trusty upgraded to the 4.1 kernel. These are all physical machines, no VMs. The test run that caused the problem was creating and verifying 5 million small files. We have some tools that flag when Ceph is in a WARN state, so it would be nice to get rid of this warning.

Please let me know what additional information you need.

Thanks,
Eric
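(A minimal sketch of the kind of WARN-state check mentioned above; the actual in-house tooling is not shown in the thread. It exits non-zero whenever the cluster is not HEALTH_OK, which makes it easy to hook into cron or a monitoring system.)

    #!/bin/bash
    # Alert when the cluster leaves HEALTH_OK (hypothetical monitoring hook).
    status=$(ceph health)
    case "$status" in
        HEALTH_OK*) exit 0 ;;
        *) echo "ceph health: $status" >&2
           exit 1 ;;
    esac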
Re: [ceph-users] mds0: Client failing to respond to cache pressure
In my last email I stated that the clients were not mounted using the Ceph file system kernel driver. Re-checking the client systems, the file systems are in fact mounted, but all the IO is going through Ganesha NFS using the Ceph file system library interface.
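(One way to see which specific client session is holding on to caps, and therefore whether the kernel mount or the Ganesha/libcephfs instance is the one ignoring cache pressure, is to list sessions over the MDS admin socket. A sketch, using the active MDS name from the output above; exact field names vary a little between releases.)

    # list all client sessions known to the active MDS
    ceph daemon mds.ede-c2-mds03 session ls
    # each entry shows the client id, its address/entity name and the number
    # of caps it holds; a kernel mount and a Ganesha (libcephfs) instance on
    # the same host show up as separate sessions.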
Re: [ceph-users] mds0: Client failing to respond to cache pressure
This is usually caused by use of older kernel clients. I don't remember exactly what version it was fixed in, but iirc we've seen the problem with 3.14 and seen it go away with 3.18.

If your system is otherwise functioning well, this is not a critical error -- it just means that the MDS might not be able to fully control its memory usage (i.e. it can exceed mds_cache_size).

John
Re: [ceph-users] mds0: Client failing to respond to cache pressure
Thank you John. All my servers are Ubuntu 14.04 with the 3.16 kernel. Not all of the clients show this problem, and the cluster seems to be functioning well now. As you say, I will change mds_cache_size to 500000 from 100000 as a test. Thanks again!
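(If the larger limit works out, a sketch of making it persistent is below. It assumes the standard /etc/ceph/ceph.conf location on the MDS host and the 500000 value discussed above; a runtime injectargs change alone is lost when the MDS restarts, and a larger mds_cache_size also means more MDS memory.)

    # append an [mds] override to ceph.conf on the MDS host (or merge it
    # into an existing [mds] section by hand):
    printf '\n[mds]\n    mds cache size = 500000\n' >> /etc/ceph/ceph.conf
    # then restart the MDS daemon so the new limit takes effect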
[ceph-users] mds0: Client failing to respond to cache pressure
hi,

I use CephFS in a production environment with 7 OSDs, 1 MDS and 3 MONs now. So far so good, but I have a problem with it today. The ceph status reports this:

    cluster ad3421a43-9fd4-4b7a-92ba-09asde3b1a228
     health HEALTH_WARN
            mds0: Client 34271 failing to respond to cache pressure
            mds0: Client 74175 failing to respond to cache pressure
            mds0: Client 74181 failing to respond to cache pressure
            mds0: Client 34247 failing to respond to cache pressure
            mds0: Client 64162 failing to respond to cache pressure
            mds0: Client 136744 failing to respond to cache pressure
     monmap e2: 3 mons at {node01=10.3.1.2:6789/0,node02=10.3.1.3:6789/0,node03=10.3.1.4:6789/0}
            election epoch 186, quorum 0,1,2 node01,node02,node03
     mdsmap e46: 1/1/1 up {0=tree01=up:active}
     osdmap e717: 7 osds: 7 up, 7 in
      pgmap v995836: 264 pgs, 3 pools, 51544 MB data, 118 kobjects
            138 GB used, 1364 GB / 1502 GB avail
                 264 active+clean
  client io 1018 B/s rd, 1273 B/s wr, 0 op/s

Yesterday I added two OSDs running version 0.94.2, while the other, older OSDs are 0.94.1. Does this matter? What does the warning mean, and how can I solve this problem? Thanks!

This is my cluster's mds config:

    name: mds.tree01
    debug_mds: 1/5
    debug_mds_balancer: 1/5
    debug_mds_locker: 1/5
    debug_mds_log: 1/5
    debug_mds_log_expire: 1/5
    debug_mds_migrator: 1/5
    admin_socket: /var/run/ceph/ceph-mds.tree01.asok
    log_file: /var/log/ceph/ceph-mds.tree01.log
    keyring: /var/lib/ceph/mds/ceph-tree01/keyring
    mon_max_mdsmap_epochs: 500
    mon_mds_force_trim_to: 0
    mon_debug_dump_location: /var/log/ceph/ceph-mds.tree01.tdump
    client_use_random_mds: false
    mds_data: /var/lib/ceph/mds/ceph-tree01
    mds_max_file_size: 1099511627776
    mds_cache_size: 100000
    mds_cache_mid: 0.7
    mds_max_file_recover: 32
    mds_mem_max: 1048576
    mds_dir_max_commit_size: 10
    mds_decay_halflife: 5
    mds_beacon_interval: 4
    mds_beacon_grace: 15
    mds_enforce_unique_name: true
    mds_blacklist_interval: 1440
    mds_session_timeout: 120
    mds_revoke_cap_timeout: 60
    mds_recall_state_timeout: 60
    mds_freeze_tree_timeout: 30
    mds_session_autoclose: 600
    mds_health_summarize_threshold: 10
    mds_reconnect_timeout: 45
    mds_tick_interval: 5
    mds_dirstat_min_interval: 1
    mds_scatter_nudge_interval: 5
    mds_client_prealloc_inos: 1000
    mds_early_reply: true
    mds_default_dir_hash: 2
    mds_log: true
    mds_log_skip_corrupt_events: false
    mds_log_max_events: -1
    mds_log_events_per_segment: 1024
    mds_log_segment_size: 0
    mds_log_max_segments: 30
    mds_log_max_expiring: 20
    mds_bal_sample_interval: 3
    mds_bal_replicate_threshold: 8000
    mds_bal_unreplicate_threshold: 0
    mds_bal_frag: false
    mds_bal_split_size: 1
    mds_bal_split_rd: 25000
    mds_bal_split_wr: 1
    mds_bal_split_bits: 3
    mds_bal_merge_size: 50
    mds_bal_merge_rd: 1000
    mds_bal_merge_wr: 1000
    mds_bal_interval: 10
    mds_bal_fragment_interval: 5
    mds_bal_idle_threshold: 0
    mds_bal_max: -1
    mds_bal_max_until: -1
    mds_bal_mode: 0
    mds_bal_min_rebalance: 0.1
    mds_bal_min_start: 0.2
    mds_bal_need_min: 0.8
    mds_bal_need_max: 1.2
    mds_bal_midchunk: 0.3
    mds_bal_minchunk: 0.001
    mds_bal_target_removal_min: 5
    mds_bal_target_removal_max: 10
    mds_replay_interval: 1
    mds_shutdown_check: 0
    mds_thrash_exports: 0
    mds_thrash_fragments: 0
    mds_dump_cache_on_map: false
    mds_dump_cache_after_rejoin: false
    mds_verify_scatter: false
    mds_debug_scatterstat: false
    mds_debug_frag: false
    mds_debug_auth_pins: false
    mds_debug_subtrees: false
    mds_kill_mdstable_at: 0
    mds_kill_export_at: 0
    mds_kill_import_at: 0
    mds_kill_link_at: 0
    mds_kill_rename_at: 0
    mds_kill_openc_at: 0
    mds_kill_journal_at: 0
    mds_kill_journal_expire_at: 0
    mds_kill_journal_replay_at: 0
    mds_journal_format: 1
    mds_kill_create_at: 0
    mds_inject_traceless_reply_probability: 0
    mds_wipe_sessions: false
    mds_wipe_ino_prealloc: false
    mds_skip_ino: 0
    max_mds: 1
    mds_standby_for_name:
    mds_standby_for_rank: -1
    mds_standby_replay: false
    mds_enable_op_tracker: true
    mds_op_history_size: 20
    mds_op_history_duration: 600
    mds_op_complaint_time: 30
    mds_op_log_threshold: 5
    mds_snap_min_uid: 0
    mds_snap_max_uid: 65536
    mds_verify_backtrace: 1
    mds_action_on_write_error: 1
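(For reference, a per-daemon configuration dump like the one above can also be read back from a running MDS over its admin socket; a couple of example invocations using the MDS name from this cluster.)

    # full running configuration of the MDS
    ceph daemon mds.tree01 config show
    # or just the one setting relevant to this thread
    ceph daemon mds.tree01 config get mds_cache_size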