On Thu, 25 Sep 2014, Somnath Roy wrote: > It will definitely be hampered. > There will not be a single solution that fits all. These parameters need to be > tuned based on the workload.
Can you do a test to see if fadvise with FADV_RANDOM is sufficient to prevent the readahead behavior? If so, we can potentially accomplish this with proper IO hinting from the clients. sage > > Thanks & Regards > Somnath > > -----Original Message----- > From: Haomai Wang [mailto:haomaiw...@gmail.com] > Sent: Wednesday, September 24, 2014 7:56 PM > To: Somnath Roy > Cc: Sage Weil; Milosz Tanski; ceph-devel@vger.kernel.org > Subject: Re: Impact of page cache on OSD read performance for SSD > > On Thu, Sep 25, 2014 at 7:49 AM, Somnath Roy <somnath....@sandisk.com> wrote: > > Hi, > > After going through the blktrace, I think I have figured out what is > > going on there. I think kernel read_ahead is causing the extra reads > > in case of buffered read. If I set read_ahead = 0 , the performance I > > am getting similar (or more when cache hit actually happens) to > > direct_io :-) > > Hmm, BTW if set read_ahead=0, what about seq read performance compared to > before? > > > IMHO, if any user doesn't want these nasty kernel effects and be sure of > > the random work pattern, we should provide a configurable direct_io read > > option (Need to quantify direct_io write also) as Sage suggested. > > > > Thanks & Regards > > Somnath > > > > > > -----Original Message----- > > From: Haomai Wang [mailto:haomaiw...@gmail.com] > > Sent: Wednesday, September 24, 2014 9:06 AM > > To: Sage Weil > > Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org > > Subject: Re: Impact of page cache on OSD read performance for SSD > > > > On Wed, Sep 24, 2014 at 8:38 PM, Sage Weil <sw...@redhat.com> wrote: > >> On Wed, 24 Sep 2014, Haomai Wang wrote: > >>> I agree with that direct read will help for disk read. But if read > >>> data is hot and small enough to fit in memory, page cache is a good > >>> place to hold data cache. If discard page cache, we need to > >>> implement a cache to provide with effective lookup impl. > >> > >> This is true for some workloads, but not necessarily true for all. > >> Many clients (notably RBD) will be caching at the client side (in > >> VM's fs, and possibly in librbd itself) such that caching at the OSD > >> is largely wasted effort. For RGW the often is likely true, unless > >> there is a varnish cache or something in front. > > > > Still now, I don't think librbd cache can meet all the cache demand for rbd > > usage. Even though we have a effective librbd cache impl, we still need a > > buffer cache in ObjectStore level just like what database did. Client cache > > and host cache are both needed. > > > >> > >> We should probably have a direct_io config option for filestore. But > >> even better would be some hint from the client about whether it is > >> caching or not so that FileStore could conditionally cache... > > > > Yes, I remember we already did some early works like it. > > > >> > >> sage > >> > >> > > >>> BTW, whether to use direct io we can refer to MySQL Innodb engine > >>> with direct io and PostgreSQL with page cache. > >>> > >>> On Wed, Sep 24, 2014 at 10:29 AM, Somnath Roy <somnath....@sandisk.com> > >>> wrote: > >>> > Haomai, > >>> > I am considering only about random reads and the changes I made only > >>> > affecting reads. For write, I have not measured yet. But, yes, page > >>> > cache may be helpful for write coalescing. Still need to evaluate how > >>> > it is behaving comparing direct_io on SSD though. I think Ceph code > >>> > path will be much shorter if we use direct_io in the write path where > >>> > it is actually executing the transactions. 
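Sage's FADV_RANDOM question at the top of this message can be checked outside of Ceph first. A minimal sketch follows (the test-file path is hypothetical; the block-level knob Somnath is toggling with "read_ahead = 0" is presumably /sys/block/sdh/queue/read_ahead_kb or blockdev --setra):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* hypothetical test file sitting on the OSD data disk (sdh) */
    const char *path = argc > 1 ? argv[1] : "/var/lib/ceph/osd/ceph-0/testfile";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* hint the kernel that access will be random, so it should skip readahead */
    int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (ret != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));

    /* a single buffered 4K read; compare the request size the disk sees
       with and without the hint */
    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof(buf), 4096);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}

Watching avgrq-sz in iostat, or blktrace -d /dev/sdh -o - | blkparse -i -, while the 4K random fio job runs should show whether FADV_RANDOM alone keeps the on-disk requests at 4K without having to force read_ahead to 0.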
Probably, the sync thread > >>> > and all will not be needed. > >>> > > >>> > I am trying to analyze where is the extra reads coming from in case of > >>> > buffered io by using blktrace etc. This should give us a clear > >>> > understanding what exactly is going on there and it may turn out that > >>> > tuning kernel parameters only we can achieve similar performance as > >>> > direct_io. > >>> > > >>> > Thanks & Regards > >>> > Somnath > >>> > > >>> > -----Original Message----- > >>> > From: Haomai Wang [mailto:haomaiw...@gmail.com] > >>> > Sent: Tuesday, September 23, 2014 7:07 PM > >>> > To: Sage Weil > >>> > Cc: Somnath Roy; Milosz Tanski; ceph-devel@vger.kernel.org > >>> > Subject: Re: Impact of page cache on OSD read performance for SSD > >>> > > >>> > Good point, but do you have considered that the impaction for write > >>> > ops? And if skip page cache, FileStore is responsible for data cache? > >>> > > >>> > On Wed, Sep 24, 2014 at 3:29 AM, Sage Weil <sw...@redhat.com> wrote: > >>> >> On Tue, 23 Sep 2014, Somnath Roy wrote: > >>> >>> Milosz, > >>> >>> Thanks for the response. I will see if I can get any information out > >>> >>> of perf. > >>> >>> > >>> >>> Here is my OS information. > >>> >>> > >>> >>> root@emsclient:~# lsb_release -a No LSB modules are available. > >>> >>> Distributor ID: Ubuntu > >>> >>> Description: Ubuntu 13.10 > >>> >>> Release: 13.10 > >>> >>> Codename: saucy > >>> >>> root@emsclient:~# uname -a > >>> >>> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 > >>> >>> 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux > >>> >>> > >>> >>> BTW, it's not a 45% drop, as you can see, by tuning the OSD parameter > >>> >>> I was able to get almost *2X* performance improvement with direct_io. > >>> >>> It's not only page cache (memory) lookup, in case of buffered_io the > >>> >>> following could be problem. > >>> >>> > >>> >>> 1. Double copy (disk -> file buffer cache, file buffer cache -> > >>> >>> user > >>> >>> buffer) > >>> >>> > >>> >>> 2. As the iostat output shows, it is not reading 4K only, it is > >>> >>> reading more data from disk as required and in the end it will > >>> >>> be wasted in case of random workload.. > >>> >> > >>> >> It might be worth using blktrace to see what the IOs it is issueing > >>> >> are. > >>> >> Which ones are > 4K and what they point to... > >>> >> > >>> >> sage > >>> >> > >>> >> > >>> >>> > >>> >>> Thanks & Regards > >>> >>> Somnath > >>> >>> > >>> >>> -----Original Message----- > >>> >>> From: Milosz Tanski [mailto:mil...@adfin.com] > >>> >>> Sent: Tuesday, September 23, 2014 12:09 PM > >>> >>> To: Somnath Roy > >>> >>> Cc: ceph-devel@vger.kernel.org > >>> >>> Subject: Re: Impact of page cache on OSD read performance for > >>> >>> SSD > >>> >>> > >>> >>> Somnath, > >>> >>> > >>> >>> I wonder if there's a bottleneck or a point of contention for the > >>> >>> kernel. For a entirely uncached workload I expect the page cache > >>> >>> lookup to cause a slow down (since the lookup should be wasted). What > >>> >>> I wouldn't expect is a 45% performance drop. Memory speed should be > >>> >>> one magnitude faster then a modern SATA SSD drive (so it should be > >>> >>> more negligible overhead). > >>> >>> > >>> >>> Is there anyway you could perform the same test but monitor what's > >>> >>> going on with the OSD process using the perf tool? Whatever is the > >>> >>> default cpu time spent hardware counter is fine. 
Make sure you have > >>> >>> the kernel debug info package installed so can get symbol information > >>> >>> for kernel and module calls. With any luck the diff in perf output in > >>> >>> two runs will show us the culprit. > >>> >>> > >>> >>> Also, can you tell us what OS/kernel version you're using on the OSD > >>> >>> machines? > >>> >>> > >>> >>> - Milosz > >>> >>> > >>> >>> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy > >>> >>> <somnath....@sandisk.com> wrote: > >>> >>> > Hi Sage, > >>> >>> > I have created the following setup in order to examine how a single > >>> >>> > OSD is behaving if say ~80-90% of ios hitting the SSDs. > >>> >>> > > >>> >>> > My test includes the following steps. > >>> >>> > > >>> >>> > 1. Created a single OSD cluster. > >>> >>> > 2. Created two rbd images (110GB each) on 2 different pools. > >>> >>> > 3. Populated entire image, so my working set is ~210GB. My > >>> >>> > system memory is ~16GB. > >>> >>> > 4. Dumped page cache before every run. > >>> >>> > 5. Ran fio_rbd (QD 32, 8 instances) in parallel on these > >>> >>> > two images. > >>> >>> > > >>> >>> > Here is my disk iops/bandwidth.. > >>> >>> > > >>> >>> > root@emsclient:~/fio_test# fio rad_resd_disk.job > >>> >>> > random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, > >>> >>> > ioengine=libaio, iodepth=64 > >>> >>> > 2.0.8 > >>> >>> > Starting 1 process > >>> >>> > Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 > >>> >>> > iops] [eta 00m:00s] > >>> >>> > random-reads: (groupid=0, jobs=1): err= 0: pid=1431 > >>> >>> > read : io=9316.4MB, bw=158994KB/s, iops=39748 , runt= > >>> >>> > 60002msec > >>> >>> > > >>> >>> > My fio_rbd config.. > >>> >>> > > >>> >>> > [global] > >>> >>> > ioengine=rbd > >>> >>> > clientname=admin > >>> >>> > pool=rbd1 > >>> >>> > rbdname=ceph_regression_test1 > >>> >>> > invalidate=0 # mandatory > >>> >>> > rw=randread > >>> >>> > bs=4k > >>> >>> > direct=1 > >>> >>> > time_based > >>> >>> > runtime=2m > >>> >>> > size=109G > >>> >>> > numjobs=8 > >>> >>> > [rbd_iodepth32] > >>> >>> > iodepth=32 > >>> >>> > > >>> >>> > Now, I have run Giant Ceph on top of that.. > >>> >>> > > >>> >>> > 1. 
OSD config with 25 shards/1 thread per shard : > >>> >>> > ------------------------------------------------------- > >>> >>> > > >>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle > >>> >>> > 22.04 0.00 16.46 45.86 0.00 15.64 > >>> >>> > > >>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > >>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util > >>> >>> > sda 0.00 9.00 0.00 6.00 0.00 92.00 > >>> >>> > 30.67 0.01 1.33 0.00 1.33 1.33 0.80 > >>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdh 181.00 0.00 34961.00 0.00 176740.00 0.00 > >>> >>> > 10.11 102.71 2.92 2.92 0.00 0.03 100.00 > >>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > > >>> >>> > > >>> >>> > ceph -s: > >>> >>> > ---------- > >>> >>> > root@emsclient:~# ceph -s > >>> >>> > cluster 94991097-7638-4240-b922-f525300a9026 > >>> >>> > health HEALTH_OK > >>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch > >>> >>> > 1, quorum 0 a > >>> >>> > osdmap e498: 1 osds: 1 up, 1 in > >>> >>> > pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects > >>> >>> > 366 GB used, 1122 GB / 1489 GB avail > >>> >>> > 832 active+clean > >>> >>> > client io 75215 kB/s rd, 18803 op/s > >>> >>> > > >>> >>> > cpu util: > >>> >>> > ---------- > >>> >>> > Gradually decreases from ~21 core (serving from cache) to ~10 core > >>> >>> > (while serving from disks). > >>> >>> > > >>> >>> > My Analysis: > >>> >>> > ----------------- > >>> >>> > In this case "All is Well" till ios are served from cache > >>> >>> > (XFS is smart enough to cache some data ) . Once started hitting > >>> >>> > disks and throughput is decreasing. As you can see, disk is giving > >>> >>> > ~35K iops , but, OSD throughput is only ~18.8K ! So, cache miss in > >>> >>> > case of buffered io seems to be very expensive. Half of the iops > >>> >>> > are waste. Also, looking at the bandwidth, it is obvious, not > >>> >>> > everything is 4K read, May be kernel read_ahead is kicking (?). > >>> >>> > > >>> >>> > > >>> >>> > Now, I thought of making ceph disk read as direct_io and do the > >>> >>> > same experiment. I have changed the FileStore::read to do the > >>> >>> > direct_io only. Rest kept as is. Here is the result with that. 
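To put numbers on the buffered-read analysis above: 176740 rkB/s over 34961 r/s works out to about 5 kB per disk read (the avgrq-sz of 10.11 sectors says the same thing), against the 4 kB actually requested; and 18803 client op/s out of 34961 disk reads/s means roughly 46% of the disk reads never become a client read.

What switching the read to direct I/O involves, in general terms, is sketched below. This is not the actual FileStore::read change, just the usual O_DIRECT rules: the fd is opened with O_DIRECT and the buffer, offset and length are aligned (4096 is assumed to be a safe alignment here; the file path is hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* hypothetical object file on the OSD data disk */
    const char *path = argc > 1 ? argv[1] : "/var/lib/ceph/osd/ceph-0/testfile";

    /* O_DIRECT bypasses the page cache: no readahead, no extra copy */
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* buffer, offset and length must all be suitably aligned for O_DIRECT */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    ssize_t n = pread(fd, buf, 4096, 4096);
    if (n < 0) perror("pread");
    else printf("read %zd bytes, page cache untouched\n", n);

    free(buf);
    close(fd);
    return 0;
}

The iostat output that follows shows the effect: avgrq-sz drops to exactly 8 sectors (4 kB), and disk reads line up essentially one-to-one with client ops.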
> >>> >>> > > >>> >>> > > >>> >>> > Iostat: > >>> >>> > ------- > >>> >>> > > >>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle > >>> >>> > 24.77 0.00 19.52 21.36 0.00 34.36 > >>> >>> > > >>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > >>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util > >>> >>> > sda 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdh 0.00 0.00 25295.00 0.00 101180.00 0.00 > >>> >>> > 8.00 12.73 0.50 0.50 0.00 0.04 100.80 > >>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > > >>> >>> > ceph -s: > >>> >>> > -------- > >>> >>> > root@emsclient:~/fio_test# ceph -s > >>> >>> > cluster 94991097-7638-4240-b922-f525300a9026 > >>> >>> > health HEALTH_OK > >>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch > >>> >>> > 1, quorum 0 a > >>> >>> > osdmap e522: 1 osds: 1 up, 1 in > >>> >>> > pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects > >>> >>> > 366 GB used, 1122 GB / 1489 GB avail > >>> >>> > 832 active+clean > >>> >>> > client io 100 MB/s rd, 25618 op/s > >>> >>> > > >>> >>> > cpu util: > >>> >>> > -------- > >>> >>> > ~14 core while serving from disks. > >>> >>> > > >>> >>> > My Analysis: > >>> >>> > --------------- > >>> >>> > No surprises here. Whatever is disk throughput ceph throughput is > >>> >>> > almost matching. > >>> >>> > > >>> >>> > > >>> >>> > Let's tweak the shard/thread settings and see the impact. > >>> >>> > > >>> >>> > > >>> >>> > 2. OSD config with 36 shards and 1 thread/shard: > >>> >>> > ----------------------------------------------------------- > >>> >>> > > >>> >>> > Buffered read: > >>> >>> > ------------------ > >>> >>> > No change, output is very similar to 25 shards. 
> >>> >>> > > >>> >>> > > >>> >>> > direct_io read: > >>> >>> > ------------------ > >>> >>> > Iostat: > >>> >>> > ---------- > >>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle > >>> >>> > 33.33 0.00 28.22 23.11 0.00 15.34 > >>> >>> > > >>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > >>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util > >>> >>> > sda 0.00 0.00 0.00 2.00 0.00 12.00 > >>> >>> > 12.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdh 0.00 0.00 31987.00 0.00 127948.00 0.00 > >>> >>> > 8.00 18.06 0.56 0.56 0.00 0.03 100.40 > >>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > > >>> >>> > ceph -s: > >>> >>> > -------------- > >>> >>> > root@emsclient:~/fio_test# ceph -s > >>> >>> > cluster 94991097-7638-4240-b922-f525300a9026 > >>> >>> > health HEALTH_OK > >>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch > >>> >>> > 1, quorum 0 a > >>> >>> > osdmap e525: 1 osds: 1 up, 1 in > >>> >>> > pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects > >>> >>> > 366 GB used, 1122 GB / 1489 GB avail > >>> >>> > 832 active+clean > >>> >>> > client io 127 MB/s rd, 32763 op/s > >>> >>> > > >>> >>> > cpu util: > >>> >>> > -------------- > >>> >>> > ~19 core while serving from disks. > >>> >>> > > >>> >>> > Analysis: > >>> >>> > ------------------ > >>> >>> > It is scaling with increased number of shards/threads. The > >>> >>> > parallelism also increased significantly. > >>> >>> > > >>> >>> > > >>> >>> > 3. OSD config with 48 shards and 1 thread/shard: > >>> >>> > ---------------------------------------------------------- > >>> >>> > Buffered read: > >>> >>> > ------------------- > >>> >>> > No change, output is very similar to 25 shards. 
> >>> >>> > > >>> >>> > > >>> >>> > direct_io read: > >>> >>> > ----------------- > >>> >>> > Iostat: > >>> >>> > -------- > >>> >>> > > >>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle > >>> >>> > 37.50 0.00 33.72 20.03 0.00 8.75 > >>> >>> > > >>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > >>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util > >>> >>> > sda 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdh 0.00 0.00 35360.00 0.00 141440.00 0.00 > >>> >>> > 8.00 22.25 0.62 0.62 0.00 0.03 100.40 > >>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > > >>> >>> > ceph -s: > >>> >>> > -------------- > >>> >>> > root@emsclient:~/fio_test# ceph -s > >>> >>> > cluster 94991097-7638-4240-b922-f525300a9026 > >>> >>> > health HEALTH_OK > >>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch > >>> >>> > 1, quorum 0 a > >>> >>> > osdmap e534: 1 osds: 1 up, 1 in > >>> >>> > pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects > >>> >>> > 366 GB used, 1122 GB / 1489 GB avail > >>> >>> > 832 active+clean > >>> >>> > client io 138 MB/s rd, 35582 op/s > >>> >>> > > >>> >>> > cpu util: > >>> >>> > ---------------- > >>> >>> > ~22.5 core while serving from disks. > >>> >>> > > >>> >>> > Analysis: > >>> >>> > -------------------- > >>> >>> > It is scaling with increased number of shards/threads. The > >>> >>> > parallelism also increased significantly. > >>> >>> > > >>> >>> > > >>> >>> > > >>> >>> > 4. OSD config with 64 shards and 1 thread/shard: > >>> >>> > --------------------------------------------------------- > >>> >>> > Buffered read: > >>> >>> > ------------------ > >>> >>> > No change, output is very similar to 25 shards. 
> >>> >>> > > >>> >>> > > >>> >>> > direct_io read: > >>> >>> > ------------------- > >>> >>> > Iostat: > >>> >>> > --------- > >>> >>> > avg-cpu: %user %nice %system %iowait %steal %idle > >>> >>> > 40.18 0.00 34.84 19.81 0.00 5.18 > >>> >>> > > >>> >>> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s > >>> >>> > avgrq-sz avgqu-sz await r_await w_await svctm %util > >>> >>> > sda 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdd 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sde 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdg 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdf 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdh 0.00 0.00 39114.00 0.00 156460.00 0.00 > >>> >>> > 8.00 35.58 0.90 0.90 0.00 0.03 100.40 > >>> >>> > sdc 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > sdb 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > 0.00 0.00 0.00 0.00 0.00 0.00 0.00 > >>> >>> > > >>> >>> > ceph -s: > >>> >>> > --------------- > >>> >>> > root@emsclient:~/fio_test# ceph -s > >>> >>> > cluster 94991097-7638-4240-b922-f525300a9026 > >>> >>> > health HEALTH_OK > >>> >>> > monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch > >>> >>> > 1, quorum 0 a > >>> >>> > osdmap e537: 1 osds: 1 up, 1 in > >>> >>> > pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects > >>> >>> > 366 GB used, 1122 GB / 1489 GB avail > >>> >>> > 832 active+clean > >>> >>> > client io 153 MB/s rd, 39172 op/s > >>> >>> > > >>> >>> > cpu util: > >>> >>> > ---------------- > >>> >>> > ~24.5 core while serving from disks. ~3% cpu left. > >>> >>> > > >>> >>> > Analysis: > >>> >>> > ------------------ > >>> >>> > It is scaling with increased number of shards/threads. The > >>> >>> > parallelism also increased significantly. It is disk bound now. > >>> >>> > > >>> >>> > > >>> >>> > Summary: > >>> >>> > > >>> >>> > So, it seems buffered IO has significant impact on performance in > >>> >>> > case backend is SSD. > >>> >>> > My question is, if the workload is very random and storage(SSD) is > >>> >>> > very huge compare to system memory, shouldn't we always go for > >>> >>> > direct_io instead of buffered io from Ceph ? > >>> >>> > > >>> >>> > Please share your thoughts/suggestion on this. > >>> >>> > > >>> >>> > Thanks & Regards > >>> >>> > Somnath > >>> >>> > > >>> >>> > ________________________________ > >>> >>> > > >>> >>> > PLEASE NOTE: The information contained in this electronic mail > >>> >>> > message is intended only for the use of the designated recipient(s) > >>> >>> > named above. If the reader of this message is not the intended > >>> >>> > recipient, you are hereby notified that you have received this > >>> >>> > message in error and that any review, dissemination, distribution, > >>> >>> > or copying of this message is strictly prohibited. If you have > >>> >>> > received this communication in error, please notify the sender by > >>> >>> > telephone or e-mail (as shown above) immediately and destroy any > >>> >>> > and all copies of this message in your possession (whether hard > >>> >>> > copies or electronically stored copies). 
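Across the sweep above, client throughput goes 25618 -> 32763 -> 35582 -> 39172 op/s as the shard count goes 25 -> 36 -> 48 -> 64 with one thread per shard, and with direct_io reads it tracks the disk r/s almost exactly. The shard/thread counts being varied are presumably the OSD's sharded op-queue options; a ceph.conf sketch, assuming the Giant option names (the thread does not quote the exact keys):

[osd]
osd_op_num_shards = 64
osd_op_num_threads_per_shard = 1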
> >>> >>> -- > >>> >>> Milosz Tanski > >>> >>> CTO > >>> >>> 16 East 34th Street, 15th floor > >>> >>> New York, NY 10016 > >>> >>> p: 646-253-9055 > >>> >>> e: mil...@adfin.com > -- > Best Regards, > Wheat -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html