Well, you never know! It depends on a lot of factors, ranging from your workload to kernel params to the RAID controller, etc. I have shared my observation from my environment with a 4K pseudo-random fio_rbd workload. True random reads should not kick off read_ahead, though. The OP_QUEUE optimization brings more parallelism to the filestore read path, so more reads going to disk in parallel may have exposed this. Anyway, I am in the process of analyzing why the default read_ahead is causing a problem for me; I will update if I find anything.
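For a truly random workload there is also a per-fd way to suppress readahead without touching the global read_ahead_kb setting: the POSIX access-pattern hint. A minimal standalone sketch (plain POSIX calls, not the actual FileStore code path; the helper name is made up):

  #include <cerrno>
  #include <cstdio>
  #include <cstring>
  #include <fcntl.h>

  // Hypothetical helper: open a file read-only and hint that access will be
  // random; on Linux, POSIX_FADV_RANDOM disables readahead for this fd only.
  int open_for_random_read(const char *path)
  {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0) {
      std::fprintf(stderr, "open %s: %s\n", path, std::strerror(errno));
      return -1;
    }
    int r = ::posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (r != 0)   // posix_fadvise returns the error code directly
      std::fprintf(stderr, "posix_fadvise: %s\n", std::strerror(r));
    return fd;
  }

That is roughly equivalent to read_ahead_kb = 0, but scoped to the descriptors we know are serving random reads.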
Thanks & Regards
Somnath

-----Original Message-----
From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com]
Sent: Wednesday, September 24, 2014 10:00 PM
To: Somnath Roy; Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

Have you ever seen a large readahead_kb hurt random performance? We usually set it very large (2M), and the random read performance keeps steady, even in an all-SSD setup. Maybe with your optimization code for OP_QUEUE, things are different?

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, September 25, 2014 11:15 AM
To: Haomai Wang
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: RE: Impact of page cache on OSD read performance for SSD

It will definitely be hampered. There will not be a single solution that fits all; these parameters need to be tuned based on the workload.

Thanks & Regards
Somnath

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, September 24, 2014 7:56 PM
To: Somnath Roy
Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org
Subject: Re: Impact of page cache on OSD read performance for SSD

On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy <somnath....@sandisk.com> wrote:
> Hi,
> After going through the blktrace, I think I have figured out what is
> going on there. I think kernel read_ahead is causing the extra reads
> in the case of buffered reads. If I set read_ahead = 0, the performance I
> am getting is similar (or better, when a cache hit actually happens) to
> direct_io :-)

Hmm, BTW, if you set read_ahead = 0, what about seq read performance compared to before?

> IMHO, if a user doesn't want these nasty kernel effects and is sure of a
> random workload pattern, we should provide a configurable direct_io read
> option (we need to quantify direct_io write also) as Sage suggested.
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiw...@gmail.com]
> Sent: Wednesday, September 24, 2014 9:06 AM
> To: Sage Weil
> Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
> Subject: Re: Impact of page cache on OSD read performance for SSD
>
> On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil <sw...@redhat.com> wrote:
>> On Wed, 24 Sep 2014, Haomai Wang wrote:
>>> I agree that direct reads will help for disk reads. But if the read
>>> data is hot and small enough to fit in memory, the page cache is a good
>>> place to hold the data cache. If we discard the page cache, we need to
>>> implement a cache with an effective lookup impl.
>>
>> This is true for some workloads, but not necessarily true for all.
>> Many clients (notably RBD) will be caching at the client side (in the
>> VM's fs, and possibly in librbd itself) such that caching at the OSD
>> is largely wasted effort. For RGW the same is likely true, unless
>> there is a varnish cache or something in front.
>
> Even now, I don't think the librbd cache can meet all the cache demands
> for rbd usage. Even if we had an effective librbd cache impl, we would
> still need a buffer cache at the ObjectStore level, just like what
> databases do. Client cache and host cache are both needed.
>
>>
>> We should probably have a direct_io config option for filestore. But
>> even better would be some hint from the client about whether it is
>> caching or not so that FileStore could conditionally cache...
>
> Yes, I remember we already did some early work like this.
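To make the idea concrete, a direct_io read boils down to an O_DIRECT open plus an aligned buffer. A minimal standalone sketch (not the actual FileStore::read change or Sage's proposal; the 4 KiB alignment is an assumption and should really come from the device's logical block size):

  #include <cerrno>
  #include <cstdlib>
  #include <fcntl.h>
  #include <unistd.h>

  // Sketch of a page-cache-bypassing read; O_DIRECT requires the buffer,
  // offset and length to be aligned (4096 bytes is assumed here).
  ssize_t direct_read(const char *path, off_t off, size_t len, void **out)
  {
    const size_t align = 4096;              // assumed logical block size
    void *buf = nullptr;
    if (posix_memalign(&buf, align, len) != 0)
      return -ENOMEM;

    int fd = ::open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
      std::free(buf);
      return -errno;
    }

    ssize_t r = ::pread(fd, buf, len, off);  // one aligned, uncached read
    int err = errno;
    ::close(fd);
    if (r < 0) {
      std::free(buf);
      return -err;
    }
    *out = buf;                              // caller owns/frees the buffer
    return r;
  }

Whether something like this should sit behind a filestore config option or be driven by a client-side caching hint is exactly the open question above.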
>
>> sage
>>
>
>>> BTW, on whether to use direct io, we can refer to MySQL's InnoDB engine
>>> (direct io) versus PostgreSQL (page cache).
>>>
>>> On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy <somnath....@sandisk.com> wrote:
>>> > Haomai,
>>> > I am considering only random reads here, and the changes I made only
>>> > affect reads. For writes, I have not measured yet. But, yes, the page
>>> > cache may be helpful for write coalescing. I still need to evaluate how
>>> > it behaves compared to direct_io on SSD, though. I think the Ceph code
>>> > path will be much shorter if we use direct_io in the write path where
>>> > it actually executes the transactions. Probably the sync thread and all
>>> > will not be needed.
>>> >
>>> > I am trying to analyze where the extra reads are coming from in the
>>> > case of buffered io by using blktrace etc. This should give us a clear
>>> > understanding of what exactly is going on there, and it may turn out
>>> > that by tuning kernel parameters alone we can achieve performance
>>> > similar to direct_io.
>>> >
>>> > Thanks & Regards
>>> > Somnath
>>> >
>>> > -----Original Message-----
>>> > From: Haomai Wang [mailto:haomaiw...@gmail.com]
>>> > Sent: Tuesday, September 23, 2014 7:07 PM
>>> > To: Sage Weil
>>> > Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org
>>> > Subject: Re: Impact of page cache on OSD read performance for SSD
>>> >
>>> > Good point, but have you considered the impact on write ops? And if we
>>> > skip the page cache, is FileStore then responsible for the data cache?
>>> >
>>> > On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <sw...@redhat.com> wrote:
>>> >> On Tue, 23 Sep 2014, Somnath Roy wrote:
>>> >>> Milosz,
>>> >>> Thanks for the response. I will see if I can get any information out
>>> >>> of perf.
>>> >>>
>>> >>> Here is my OS information.
>>> >>>
>>> >>> root@emsclient:~# lsb_release -a
>>> >>> No LSB modules are available.
>>> >>> Distributor ID: Ubuntu
>>> >>> Description:    Ubuntu 13.10
>>> >>> Release:        13.10
>>> >>> Codename:       saucy
>>> >>> root@emsclient:~# uname -a
>>> >>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>>> >>>
>>> >>> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters
>>> >>> I was able to get almost *2X* performance improvement with direct_io.
>>> >>> It's not only the page cache (memory) lookup; in the case of
>>> >>> buffered_io the following could be problems:
>>> >>>
>>> >>> 1. Double copy (disk -> file buffer cache, file buffer cache -> user
>>> >>> buffer)
>>> >>>
>>> >>> 2. As the iostat output shows, it is not reading 4K only; it is
>>> >>> reading more data from disk than required, and in the end it will be
>>> >>> wasted in case of a random workload..
>>> >>
>>> >> It might be worth using blktrace to see what the IOs it is issuing are.
>>> >> Which ones are > 4K and what they point to...
>>> >>
>>> >> sage
>>> >>
>>> >>
>>> >>>
>>> >>> Thanks & Regards
>>> >>> Somnath
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: Milosz Tanski [mailto:mil...@adfin.com]
>>> >>> Sent: Tuesday, September 23, 2014 12:09 PM
>>> >>> To: Somnath Roy
>>> >>> Cc: ceph-devel@vger.kernel.org
>>> >>> Subject: Re: Impact of page cache on OSD read performance for SSD
>>> >>>
>>> >>> Somnath,
>>> >>>
>>> >>> I wonder if there's a bottleneck or a point of contention in the
>>> >>> kernel. For an entirely uncached workload I expect the page cache
>>> >>> lookup to cause a slowdown (since the lookup should be wasted).
>>> >>> What I wouldn't expect is a 45% performance drop. Memory speed should
>>> >>> be an order of magnitude faster than a modern SATA SSD drive (so the
>>> >>> overhead should be mostly negligible).
>>> >>>
>>> >>> Is there any way you could perform the same test but monitor what's
>>> >>> going on with the OSD process using the perf tool? Whatever the
>>> >>> default cpu-time-spent hardware counter is, is fine. Make sure you
>>> >>> have the kernel debug info package installed so you can get symbol
>>> >>> information for kernel and module calls. With any luck the diff of the
>>> >>> perf output from the two runs will show us the culprit.
>>> >>>
>>> >>> Also, can you tell us what OS/kernel version you're using on the OSD
>>> >>> machines?
>>> >>>
>>> >>> - Milosz
>>> >>>
>>> >>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>>> >>> > Hi Sage,
>>> >>> > I have created the following setup in order to examine how a single
>>> >>> > OSD behaves when, say, ~80-90% of ios are hitting the SSDs.
>>> >>> >
>>> >>> > My test includes the following steps.
>>> >>> >
>>> >>> > 1. Created a single OSD cluster.
>>> >>> > 2. Created two rbd images (110GB each) on 2 different pools.
>>> >>> > 3. Populated the entire images, so my working set is ~210GB. My
>>> >>> >    system memory is ~16GB.
>>> >>> > 4. Dropped the page cache before every run.
>>> >>> > 5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two images.
>>> >>> >
>>> >>> > Here is my disk iops/bandwidth..
>>> >>> >
>>> >>> > root@emsclient:~/fio_test# fio rad_resd_disk.job
>>> >>> > random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
>>> >>> > 2.0.8
>>> >>> > Starting 1 process
>>> >>> > Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
>>> >>> > random-reads: (groupid=0, jobs=1): err= 0: pid=1431
>>> >>> >   read : io=9316.4MB, bw=158994KB/s, iops=39748, runt= 60002msec
>>> >>> >
>>> >>> > My fio_rbd config..
>>> >>> >
>>> >>> > [global]
>>> >>> > ioengine=rbd
>>> >>> > clientname=admin
>>> >>> > pool=rbd1
>>> >>> > rbdname=ceph_regression_test1
>>> >>> > invalidate=0    # mandatory
>>> >>> > rw=randread
>>> >>> > bs=4k
>>> >>> > direct=1
>>> >>> > time_based
>>> >>> > runtime=2m
>>> >>> > size=109G
>>> >>> > numjobs=8
>>> >>> > [rbd_iodepth32]
>>> >>> > iodepth=32
>>> >>> >
>>> >>> > Now, I have run Giant Ceph on top of that..
>>> >>> >
>>> >>> > 1. OSD config with 25 shards / 1 thread per shard:
>>> >>> > -------------------------------------------------------
>>> >>> >
>>> >>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>> >>> >           22.04    0.00   16.46   45.86    0.00   15.64
>>> >>> >
>>> >>> > Device:  rrqm/s  wrqm/s       r/s   w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> >>> > sda        0.00    9.00      0.00  6.00      0.00   92.00    30.67     0.01   1.33    0.00    1.33   1.33   0.80
>>> >>> > sdd        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sde        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh      181.00    0.00  34961.00  0.00 176740.00    0.00    10.11   102.71   2.92    2.92    0.00   0.03 100.00
>>> >>> > sdc        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > ----------
>>> >>> > root@emsclient:~# ceph -s
>>> >>> >     cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> >      health HEALTH_OK
>>> >>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>> >>> >      osdmap e498: 1 osds: 1 up, 1 in
>>> >>> >       pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> >             366 GB used, 1122 GB / 1489 GB avail
>>> >>> >                  832 active+clean
>>> >>> >   client io 75215 kB/s rd, 18803 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------
>>> >>> > Gradually decreases from ~21 cores (serving from cache) to ~10 cores
>>> >>> > (while serving from disks).
>>> >>> >
>>> >>> > My Analysis:
>>> >>> > -----------------
>>> >>> > In this case "all is well" till the ios are served from cache (XFS
>>> >>> > is smart enough to cache some data). Once we start hitting the
>>> >>> > disks, throughput decreases. As you can see, the disk is delivering
>>> >>> > ~35K iops, but OSD throughput is only ~18.8K! So a cache miss in the
>>> >>> > case of buffered io seems to be very expensive; half of the iops are
>>> >>> > wasted. Also, looking at the bandwidth, it is obvious that not
>>> >>> > everything is a 4K read. Maybe kernel read_ahead is kicking in (?).
>>> >>> >
>>> >>> >
>>> >>> > Now, I thought of making the ceph disk reads direct_io and doing the
>>> >>> > same experiment. I have changed FileStore::read to do direct_io
>>> >>> > only; the rest is kept as is. Here is the result with that.
>>> >>> >
>>> >>> >
>>> >>> > Iostat:
>>> >>> > -------
>>> >>> >
>>> >>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>> >>> >           24.77    0.00   19.52   21.36    0.00   34.36
>>> >>> >
>>> >>> > Device:  rrqm/s  wrqm/s       r/s   w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> >>> > sda        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sde        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh        0.00    0.00  25295.00  0.00 101180.00    0.00     8.00    12.73   0.50    0.50    0.00   0.04 100.80
>>> >>> > sdc        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> >     cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> >      health HEALTH_OK
>>> >>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>> >>> >      osdmap e522: 1 osds: 1 up, 1 in
>>> >>> >       pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> >             366 GB used, 1122 GB / 1489 GB avail
>>> >>> >                  832 active+clean
>>> >>> >   client io 100 MB/s rd, 25618 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > --------
>>> >>> > ~14 cores while serving from disks.
>>> >>> >
>>> >>> > My Analysis:
>>> >>> > ---------------
>>> >>> > No surprises here. Ceph throughput almost matches whatever the disk
>>> >>> > throughput is.
>>> >>> >
>>> >>> >
>>> >>> > Let's tweak the shard/thread settings and see the impact.
>>> >>> >
>>> >>> >
>>> >>> > 2. OSD config with 36 shards and 1 thread/shard:
>>> >>> > -----------------------------------------------------------
>>> >>> >
>>> >>> > Buffered read:
>>> >>> > ------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > ------------------
>>> >>> > Iostat:
>>> >>> > ----------
>>> >>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>> >>> >           33.33    0.00   28.22   23.11    0.00   15.34
>>> >>> >
>>> >>> > Device:  rrqm/s  wrqm/s       r/s   w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> >>> > sda        0.00    0.00      0.00  2.00      0.00   12.00    12.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sde        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh        0.00    0.00  31987.00  0.00 127948.00    0.00     8.00    18.06   0.56    0.56    0.00   0.03 100.40
>>> >>> > sdc        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> >     cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> >      health HEALTH_OK
>>> >>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>> >>> >      osdmap e525: 1 osds: 1 up, 1 in
>>> >>> >       pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> >             366 GB used, 1122 GB / 1489 GB avail
>>> >>> >                  832 active+clean
>>> >>> >   client io 127 MB/s rd, 32763 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > --------------
>>> >>> > ~19 cores while serving from disks.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > ------------------
>>> >>> > It scales with the increased number of shards/threads; the
>>> >>> > parallelism also increased significantly.
>>> >>> >
>>> >>> >
>>> >>> > 3. OSD config with 48 shards and 1 thread/shard:
>>> >>> > ----------------------------------------------------------
>>> >>> > Buffered read:
>>> >>> > -------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > -----------------
>>> >>> > Iostat:
>>> >>> > --------
>>> >>> >
>>> >>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>> >>> >           37.50    0.00   33.72   20.03    0.00    8.75
>>> >>> >
>>> >>> > Device:  rrqm/s  wrqm/s       r/s   w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> >>> > sda        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sde        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh        0.00    0.00  35360.00  0.00 141440.00    0.00     8.00    22.25   0.62    0.62    0.00   0.03 100.40
>>> >>> > sdc        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > --------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> >     cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> >      health HEALTH_OK
>>> >>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>> >>> >      osdmap e534: 1 osds: 1 up, 1 in
>>> >>> >       pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> >             366 GB used, 1122 GB / 1489 GB avail
>>> >>> >                  832 active+clean
>>> >>> >   client io 138 MB/s rd, 35582 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------------
>>> >>> > ~22.5 cores while serving from disks.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > --------------------
>>> >>> > It scales with the increased number of shards/threads; the
>>> >>> > parallelism also increased significantly.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > 4. OSD config with 64 shards and 1 thread/shard:
>>> >>> > ---------------------------------------------------------
>>> >>> > Buffered read:
>>> >>> > ------------------
>>> >>> > No change, output is very similar to 25 shards.
>>> >>> >
>>> >>> > direct_io read:
>>> >>> > -------------------
>>> >>> > Iostat:
>>> >>> > ---------
>>> >>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>> >>> >           40.18    0.00   34.84   19.81    0.00    5.18
>>> >>> >
>>> >>> > Device:  rrqm/s  wrqm/s       r/s   w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>> >>> > sda        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdd        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sde        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdg        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdf        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdh        0.00    0.00  39114.00  0.00 156460.00    0.00     8.00    35.58   0.90    0.90    0.00   0.03 100.40
>>> >>> > sdc        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> > sdb        0.00    0.00      0.00  0.00      0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>> >>> >
>>> >>> > ceph -s:
>>> >>> > ---------------
>>> >>> > root@emsclient:~/fio_test# ceph -s
>>> >>> >     cluster 94991097-7638-4240-b922-f525300a9026
>>> >>> >      health HEALTH_OK
>>> >>> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
>>> >>> >      osdmap e537: 1 osds: 1 up, 1 in
>>> >>> >       pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
>>> >>> >             366 GB used, 1122 GB / 1489 GB avail
>>> >>> >                  832 active+clean
>>> >>> >   client io 153 MB/s rd, 39172 op/s
>>> >>> >
>>> >>> > cpu util:
>>> >>> > ----------------
>>> >>> > ~24.5 cores while serving from disks; ~3% cpu left.
>>> >>> >
>>> >>> > Analysis:
>>> >>> > ------------------
>>> >>> > It scales with the increased number of shards/threads; the
>>> >>> > parallelism also increased significantly. It is disk bound now.
>>> >>> >
>>> >>> >
>>> >>> > Summary:
>>> >>> >
>>> >>> > So, it seems buffered IO has a significant impact on performance
>>> >>> > when the backend is SSD.
>>> >>> > My question is: if the workload is very random and the storage (SSD)
>>> >>> > is very large compared to system memory, shouldn't we always go for
>>> >>> > direct_io instead of buffered io from Ceph?
>>> >>> >
>>> >>> > Please share your thoughts/suggestions on this.
>>> >>> >
>>> >>> > Thanks & Regards
>>> >>> > Somnath
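For reference, the shard/thread counts exercised above correspond to the sharded op-queue settings in the OSD config. A hedged ceph.conf sketch for the 64-shard run (the option names osd_op_num_shards and osd_op_num_threads_per_shard are assumed from Giant's sharded work queue, and the stock defaults are much lower):

  [osd]
  # widen the sharded op worker queue so more filestore reads run in parallel
  osd_op_num_shards = 64
  osd_op_num_threads_per_shard = 1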
>>> >>>
>>> >>> --
>>> >>> Milosz Tanski
>>> >>> CTO
>>> >>> 16 East 34th Street, 15th floor
>>> >>> New York, NY 10016
>>> >>>
>>> >>> p: 646-253-9055
>>> >>> e: mil...@adfin.com
>>> >
>>> > --
>>> > Best Regards,
>>> >
>>> > Wheat
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>
> --
> Best Regards,
>
> Wheat

--
Best Regards,

Wheat