Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions
On 06/02/2015 07:08 PM, Nick Fisk wrote:
> Hi Kenneth,
>
> I suggested an idea which may help with this; it is currently being
> developed: https://github.com/ceph/ceph/pull/4792
>
> In short, there is a high and a low threshold with different flushing
> priorities. Hopefully this will help with bursty workloads.
>
> Nick

Thanks! Will this also increase the absolute flushing speed? I think the
problem is more the absolute speed: it is not my workload that is bursty,
but the actual writing to the ceph cluster, because the cache flushes more
slowly than new data enters. I now see that my cold-storage disks aren't
doing a lot of work (see the iostat output in my other email), so is there
a way to increase the flushing speed by tuning the cache agent, e.g. its
parallelism?
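For anyone digging into the same question: the cache-tiering agent does
expose a per-OSD concurrency knob. A minimal sketch, assuming
osd_agent_max_ops (documented as the maximum number of simultaneous flush
operations per OSD for the tiering agent, default 4) is the relevant
option:

    # Sketch, not a tested recommendation: allow each cache-tier OSD's
    # tiering agent to run up to 8 flush operations in parallel instead
    # of the default 4 (option name assumed from the Ceph OSD config
    # reference).
    ceph tell osd.* injectargs '--osd_agent_max_ops 8'

Whether this raises absolute flush throughput depends on whether the
agent, rather than the backing pool, is actually the bottleneck.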
Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kenneth Waegeman
Sent: 03 June 2015 10:51
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions

> On 06/02/2015 07:08 PM, Nick Fisk wrote:
>> Hi Kenneth,
>>
>> I suggested an idea which may help with this; it is currently being
>> developed: https://github.com/ceph/ceph/pull/4792
>>
>> In short, there is a high and a low threshold with different flushing
>> priorities. Hopefully this will help with bursty workloads.
>
> Thanks! Will this also increase the absolute flushing speed? I think
> the problem is more the absolute speed: it is not my workload that is
> bursty, but the actual writing to the ceph cluster, because the cache
> flushes more slowly than new data enters. I now see that my
> cold-storage disks aren't doing a lot of work (see the iostat output
> in my other email), so is there a way to increase the flushing speed
> by tuning the cache agent, e.g. its parallelism?

To be honest with you, I'm not 100% sure. I see issues similar to yours,
where performance seems really poor compared to the actual amount of disk
activity. My best hunch is that it is probably down to overwriting
existing blocks that first have to be promoted to the cache tier before
being overwritten. These promotions will always block the operation, as
it cannot continue until the object is in the cache tier.

A few things I have been toying with seem to help in my case; I'm not
sure whether they will help in yours:

1. Making the object size smaller, i.e. for RBD dropping from 4MB objects
   to 1MB, which decreases the latency of promotion/demotion operations
   (see the sketch after this message).
2. Making the cache tier a lot bigger, to reduce promotions/demotions.
3. Using something like flashcache in front of the RBD to hide the
   promotion latency from the workload.

Nick
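As a concrete illustration of point 1, a sketch assuming the Hammer-era
rbd CLI, where --order gives the object size as a power of two (the
default order 22 means 4 MiB objects):

    # Create a 100 GiB image backed by 1 MiB objects (2^20) instead of
    # the 4 MiB default (2^22); smaller objects make each individual
    # promotion/demotion cheaper. 'rbd/test-image' is a placeholder name.
    rbd create rbd/test-image --size 102400 --order 20

Note this only affects newly created images; existing images keep the
object size they were created with.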
Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions
On 06/02/2015 07:21 PM, Paul Evans wrote:
> Kenneth, my guess is that you're hitting the cache_target_full_ratio on
> an individual OSD, which is easy to do, since most of us tend to think
> of cache_target_full_ratio as an aggregate of the OSDs (which it is
> not, according to Greg Farnum). This posting may shed more light on the
> issue, if it is indeed what you are bumping up against:
> https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg20207.html

It does look like this indeed; the question then is why it is not
flushing more.

> BTW: how are you determining that your OSDs are 'not overloaded?' Are
> you judging that by iostat utilization, or by capacity consumed?

iostat is showing low utilisation on all disks; some disks are doing
'nothing':

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdn               0.00     0.00  813.50  415.00    16.90    15.49    53.99     0.42    0.35    0.15    0.72   0.10  12.00
sdm               0.00     0.00  820.50  490.50    13.06    21.99    54.76     0.70    0.54    0.18    1.13   0.12  15.50
sdq               0.00     1.50   14.00   47.00     0.98     0.33    43.99     0.55    8.93   18.93    5.96   6.31  38.50
sdr               0.00     0.00    0.00    0.50     0.00     0.00    14.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     9.50    4.00   21.50     0.27     1.47   140.00     0.12    4.71    2.50    5.12   4.31  11.00
sda               0.00     8.50    2.50   14.50     0.26     0.71   116.91     0.08    4.41    4.00    4.48   4.71   8.00
sdh               0.00     6.00    2.00   15.00     0.25     1.10   162.59     0.07    3.82    7.50    3.33   3.53   6.00
sdf               0.00    17.50    3.00   25.00     0.32     1.01    97.48     0.23    8.21    5.00    8.60   8.21  23.00
sdi               0.00    11.00    1.00   31.50     0.07     2.23   144.60     0.14    4.46    0.00    4.60   3.85  12.50
sdo               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdk               0.00     0.00   22.50    0.00     1.58     0.00   143.82     0.13    5.78    5.78    0.00   4.00   9.00
sdg               0.00     2.50    0.00   30.00     0.00     3.35   228.52     0.14    4.50    0.00    4.50   1.33   4.00
sdc               0.00    12.50    1.50   23.50     0.01     1.36   111.68     0.17    6.80    0.00    7.23   6.20  15.50
sdj               0.00    18.50   27.50   30.50     2.28     1.65   138.82     0.43    7.33    7.82    6.89   5.86  34.00
sde               0.00     4.00    0.50   15.00     0.04     0.10    18.10     0.07    4.84   10.00    4.67   2.58   4.00
sdl               0.00    23.00    6.00   33.00     0.58     1.31    99.22     0.28    7.05   17.50    5.15   6.79  26.50
sdb               0.00     5.00    3.00    9.00     0.12     0.47   100.29     0.05    4.58    1.67    5.56   3.75   4.50

In my opinion there should be enough resources to do the flushing, so the
cache should not be getting full.
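To make Paul's point concrete: the agent evaluates fullness against each
PG's even share of target_max_bytes, not against the pool aggregate, so
one over-full PG (and hence one OSD) can trigger blocking early. A worked
sketch with this thread's numbers and an assumed pg_num of 1024:

    # target_max_bytes as set in the thread:
    TARGET=$((14*75*1024*1024*1024))   # 1127428915200 bytes (~1050 GiB)
    PG_NUM=1024                        # assumed pg_num of the cache pool
    # Even per-PG share of the budget:
    echo $((TARGET / PG_NUM))          # 1101004800 bytes (~1.03 GiB)
    # With cache_target_full_ratio=0.8, a PG holding more than ~80% of
    # its share counts as full even if the pool-wide average is lower:
    echo $((TARGET * 8 / 10 / PG_NUM)) # 880803840 bytes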
[ceph-users] bursty IO, ceph cache pool can not follow evictions
Hi,

we were rsync-streaming with 4 cephfs clients to a ceph cluster with a
cache layer on top of an erasure-coded pool. This was going on for some
time, and we didn't have real problems. Today we added 2 more streams,
and very soon we saw some strange behaviour:

- We are getting blocked requests on our cache pool OSDs
- Our cache pool is often near/at max ratio
- Our data streams have very bursty IO (streaming a few hundred MB for a
  minute and then nothing)

Our OSDs are not overloaded (neither the EC ones nor the cache ones,
checked with iostat), but it seems the cache pool cannot evict objects in
time, and requests get blocked until it can, each time again. If I raise
the target_max_bytes limit, it starts streaming again until it is full
again.

The cache parameters we have are these:

ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((14*75*1024*1024*1024))
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8

What can be the issue here? I tried to find some information about the
'cache agent', but can only find some old references..

Thank you!

Kenneth
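For context, the target_max_bytes expression expands as below, and the
effective values can be read back per pool; a small sketch (the 14*75
factors presumably reflect the cluster's cache capacity, but they are not
explained in the thread, and the pool-get keys are assumed to be exposed
as in recent releases):

    # Shell arithmetic used above: 14 * 75 GiB
    echo $((14*75*1024*1024*1024))    # -> 1127428915200 bytes (~1050 GiB)

    # Read back what the pool is actually configured with:
    ceph osd pool get cache target_max_bytes
    ceph osd pool get cache cache_target_dirty_ratio
    ceph osd pool get cache cache_target_full_ratio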
Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions
Hi Kenneth,

I suggested an idea which may help with this; it is currently being
developed: https://github.com/ceph/ceph/pull/4792

In short, there is a high and a low threshold with different flushing
priorities. Hopefully this will help with bursty workloads (a sketch of
the proposed settings follows below).

Nick
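Assuming the pull request lands in the form discussed there, it adds a
second, higher dirty threshold above which the agent flushes at higher
priority; a speculative sketch (the option name
cache_target_dirty_high_ratio is taken from the PR discussion and could
change before merge):

    # Below 0.4 dirty: the agent flushes lazily at low priority.
    # Above 0.6 dirty: the agent flushes aggressively at high priority
    # (names and semantics assumed from the PR, not yet in a release at
    # the time of writing).
    ceph osd pool set cache cache_target_dirty_ratio 0.4
    ceph osd pool set cache cache_target_dirty_high_ratio 0.6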
Re: [ceph-users] bursty IO, ceph cache pool can not follow evictions
Kenneth,

My guess is that you're hitting the cache_target_full_ratio on an
individual OSD, which is easy to do, since most of us tend to think of
cache_target_full_ratio as an aggregate of the OSDs (which it is not,
according to Greg Farnum). This posting may shed more light on the issue,
if it is indeed what you are bumping up against:
https://www.mail-archive.com/ceph-users%40lists.ceph.com/msg20207.html

BTW: how are you determining that your OSDs are 'not overloaded?' Are you
judging that by iostat utilization, or by capacity consumed?

--
Paul

On Jun 2, 2015, at 9:53 AM, Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:

> [snip]
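One way to test this hypothesis is to compare per-OSD fill levels against
the pool average; a sketch, assuming the ceph osd df command (available
from Hammer onwards):

    # Per-OSD utilisation; look for an outlier OSD whose %USE is well
    # above the others, since that OSD can hit the full threshold while
    # the pool-wide average still looks comfortable.
    ceph osd df
    # Pool-level aggregate for comparison:
    ceph df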