On Tue, 23 Sep 2014, Somnath Roy wrote:
> Milosz,
> Thanks for the response. I will see if I can get any information out of perf.
>
> Here is my OS information.
>
> root@emsclient:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 13.10
> Release:        13.10
> Codename:       saucy
> root@emsclient:~# uname -a
> Linux emsclient 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>
> BTW, it's not a 45% drop; as you can see, by tuning the OSD parameters I was
> able to get almost a *2X* performance improvement with direct_io.
> It's not only the page cache (memory) lookup; with buffered_io the
> following can also hurt:
>
> 1. Double copy (disk -> file buffer cache, file buffer cache -> user buffer).
>
> 2. As the iostat output shows, it is not reading only 4K; it reads more data
> from disk than required, which is simply wasted for a random workload.
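Those two costs are exactly what O_DIRECT sidesteps: data goes straight from the device into the caller's buffer, and the kernel does no readahead on that descriptor. A minimal POSIX sketch of the difference follows; it is illustrative only (not Ceph code), and the path and the 4K block size are made-up assumptions.

    // buffered vs. O_DIRECT read of one 4K block -- illustrative sketch only.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE                    // needed for O_DIRECT with glibc
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>

    int main() {
        const char  *path = "/var/lib/ceph/osd/some-object";   // hypothetical path
        const size_t len  = 4096;                               // one 4K block
        const off_t  off  = 0;                                  // block-aligned offset

        // Buffered path: disk -> page cache -> user buffer (two copies),
        // plus whatever readahead the kernel decides to do.
        char buf[4096];
        int bfd = open(path, O_RDONLY);
        if (bfd >= 0) { (void)pread(bfd, buf, len, off); close(bfd); }

        // Direct path: disk -> user buffer. Buffer, length and offset must
        // all be aligned to the logical block size; no readahead is done.
        void *dbuf = nullptr;
        if (posix_memalign(&dbuf, 4096, len) != 0) return 1;
        int dfd = open(path, O_RDONLY | O_DIRECT);
        if (dfd >= 0) { (void)pread(dfd, dbuf, len, off); close(dfd); }
        free(dbuf);
        return 0;
    }

With a purely random 4K workload, the readahead on the buffered path is exactly the wasted extra data visible in the buffered-io iostat output later in this thread.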
It might be worth using blktrace to see what IOs it is issuing: which ones are > 4K, and what do they point to?

sage

> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Milosz Tanski [mailto:mil...@adfin.com]
> Sent: Tuesday, September 23, 2014 12:09 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: Impact of page cache on OSD read performance for SSD
>
> Somnath,
>
> I wonder if there's a bottleneck or a point of contention in the kernel. For
> an entirely uncached workload I expect the page cache lookup to cause a
> slowdown (since the lookup is wasted), but I wouldn't expect a 45%
> performance drop. Memory should be an order of magnitude faster than a
> modern SATA SSD, so the overhead should be closer to negligible.
>
> Is there any way you could perform the same test but monitor what's going on
> with the OSD process using the perf tool? Whatever the default CPU-time
> hardware counter is will be fine. Make sure you have the kernel debug info
> package installed so you can get symbol information for kernel and module
> calls. With any luck the diff of the perf output from the two runs will show
> us the culprit.
>
> Also, can you tell us what OS/kernel version you're using on the OSD machines?
>
> - Milosz
>
> On Tue, Sep 23, 2014 at 2:05 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> > Hi Sage,
> > I have created the following setup in order to examine how a single OSD
> > behaves when, say, ~80-90% of ios hit the SSD.
> >
> > My test includes the following steps.
> >
> > 1. Created a single OSD cluster.
> > 2. Created two rbd images (110GB each) on 2 different pools.
> > 3. Populated both images entirely, so my working set is ~210GB. My system
> >    memory is ~16GB.
> > 4. Dropped the page cache before every run.
> > 5. Ran fio_rbd (QD 32, 8 instances) in parallel on these two images.
> >
> > Here is my disk iops/bandwidth..
> >
> > root@emsclient:~/fio_test# fio rad_resd_disk.job
> > random-reads: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
> > 2.0.8
> > Starting 1 process
> > Jobs: 1 (f=1): [r] [100.0% done] [154.1M/0K /s] [39.7K/0 iops] [eta 00m:00s]
> > random-reads: (groupid=0, jobs=1): err= 0: pid=1431
> >   read : io=9316.4MB, bw=158994KB/s, iops=39748, runt=60002msec
> >
> > My fio_rbd config..
> >
> > [global]
> > ioengine=rbd
> > clientname=admin
> > pool=rbd1
> > rbdname=ceph_regression_test1
> > invalidate=0    # mandatory
> > rw=randread
> > bs=4k
> > direct=1
> > time_based
> > runtime=2m
> > size=109G
> > numjobs=8
> > [rbd_iodepth32]
> > iodepth=32
> >
> > Now, I have run Giant Ceph on top of that..
> >
> > 1. OSD config with 25 shards / 1 thread per shard:
> > -------------------------------------------------------
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           22.04    0.00   16.46   45.86    0.00   15.64
> >
> > Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00    9.00     0.00  6.00       0.00  92.00    30.67     0.01   1.33    0.00    1.33   1.33   0.80
> > sdd        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sde        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdg        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdf        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdh      181.00    0.00 34961.00  0.00  176740.00   0.00    10.11   102.71   2.92    2.92    0.00   0.03 100.00
> > sdc        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> >
> > ceph -s:
> > ----------
> > root@emsclient:~# ceph -s
> >     cluster 94991097-7638-4240-b922-f525300a9026
> >      health HEALTH_OK
> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
> >      osdmap e498: 1 osds: 1 up, 1 in
> >       pgmap v386366: 832 pgs, 7 pools, 308 GB data, 247 kobjects
> >             366 GB used, 1122 GB / 1489 GB avail
> >                  832 active+clean
> >   client io 75215 kB/s rd, 18803 op/s
> >
> > cpu util:
> > ----------
> > Gradually decreases from ~21 cores (serving from cache) to ~10 cores (while serving from disks).
> >
> > My analysis:
> > -----------------
> > In this case all is well while ios are served from cache (XFS is smart
> > enough to cache some data). Once they start hitting the disks, throughput
> > decreases. As you can see, the disk is delivering ~35K iops, but OSD
> > throughput is only ~18.8K! So a cache miss with buffered io seems to be
> > very expensive; half of the iops are wasted. Also, looking at the
> > bandwidth, it is obvious that not everything is a 4K read; maybe kernel
> > read_ahead is kicking in (?).
> >
> >
> > Now, I thought of making the ceph disk reads direct_io and doing the same
> > experiment. I changed FileStore::read to do direct_io only; the rest was
> > kept as is. The results are below.
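The patch itself isn't included in this thread, but a direct-io read path of this kind generally has to cope with offsets and lengths that aren't block-aligned, since O_DIRECT requires the buffer, offset and length to be aligned. A rough, hypothetical sketch of such a helper (names and the 4K block size are assumptions, not the real FileStore code):

    // Hypothetical helper, not the actual FileStore::read patch: read `len`
    // bytes at `off` from a descriptor opened with O_DIRECT by widening the
    // request to block boundaries, reading into an aligned bounce buffer,
    // and copying out the slice that was asked for.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    static const size_t BLK = 4096;        // assumed logical block size

    ssize_t direct_read(int fd, char *out, size_t len, off_t off) {
        off_t  head        = off & ~(off_t)(BLK - 1);                     // round offset down
        size_t aligned_len = ((off - head) + len + BLK - 1) / BLK * BLK;  // round length up

        void *bounce = nullptr;
        if (posix_memalign(&bounce, BLK, aligned_len) != 0)
            return -1;

        ssize_t got = pread(fd, bounce, aligned_len, head);
        if (got <= (ssize_t)(off - head)) { free(bounce); return got < 0 ? -1 : 0; }

        size_t avail = (size_t)got - (size_t)(off - head);
        size_t n     = avail < len ? avail : len;
        memcpy(out, (const char *)bounce + (off - head), n);
        free(bounce);
        return (ssize_t)n;
    }

    // Usage (assumed): fd must have been opened with O_RDONLY | O_DIRECT.

That alignment is also why the direct_io iostat output below shows avgrq-sz of exactly 8.00 sectors (4K), whereas the buffered run above showed 10.11.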
> >
> > Iostat:
> > -------
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           24.77    0.00   19.52   21.36    0.00   34.36
> >
> > Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdd        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sde        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdg        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdf        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdh        0.00    0.00 25295.00  0.00  101180.00   0.00     8.00    12.73   0.50    0.50    0.00   0.04 100.80
> > sdc        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> >
> > ceph -s:
> > --------
> > root@emsclient:~/fio_test# ceph -s
> >     cluster 94991097-7638-4240-b922-f525300a9026
> >      health HEALTH_OK
> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
> >      osdmap e522: 1 osds: 1 up, 1 in
> >       pgmap v386711: 832 pgs, 7 pools, 308 GB data, 247 kobjects
> >             366 GB used, 1122 GB / 1489 GB avail
> >                  832 active+clean
> >   client io 100 MB/s rd, 25618 op/s
> >
> > cpu util:
> > --------
> > ~14 cores while serving from disks.
> >
> > My analysis:
> > ---------------
> > No surprises here: whatever the disk throughput is, ceph throughput almost matches it.
> >
> >
> > Let's tweak the shard/thread settings and see the impact.
> >
> >
> > 2. OSD config with 36 shards and 1 thread/shard:
> > -----------------------------------------------------------
> >
> > Buffered read:
> > ------------------
> > No change, output is very similar to 25 shards.
> >
> >
> > direct_io read:
> > ------------------
> > Iostat:
> > ----------
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           33.33    0.00   28.22   23.11    0.00   15.34
> >
> > Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00    0.00     0.00  2.00       0.00  12.00    12.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdd        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sde        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdg        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdf        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdh        0.00    0.00 31987.00  0.00  127948.00   0.00     8.00    18.06   0.56    0.56    0.00   0.03 100.40
> > sdc        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> >
> > ceph -s:
> > --------------
> > root@emsclient:~/fio_test# ceph -s
> >     cluster 94991097-7638-4240-b922-f525300a9026
> >      health HEALTH_OK
> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
> >      osdmap e525: 1 osds: 1 up, 1 in
> >       pgmap v386746: 832 pgs, 7 pools, 308 GB data, 247 kobjects
> >             366 GB used, 1122 GB / 1489 GB avail
> >                  832 active+clean
> >   client io 127 MB/s rd, 32763 op/s
> >
> > cpu util:
> > --------------
> > ~19 cores while serving from disks.
> >
> > Analysis:
> > ------------------
> > It scales with the increased number of shards/threads; the parallelism also increased significantly.
> >
> >
> > 3. OSD config with 48 shards and 1 thread/shard:
> > ----------------------------------------------------------
> > Buffered read:
> > -------------------
> > No change, output is very similar to 25 shards.
> >
> >
> > direct_io read:
> > -----------------
> > Iostat:
> > --------
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           37.50    0.00   33.72   20.03    0.00    8.75
> >
> > Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdd        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sde        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdg        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdf        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdh        0.00    0.00 35360.00  0.00  141440.00   0.00     8.00    22.25   0.62    0.62    0.00   0.03 100.40
> > sdc        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> >
> > ceph -s:
> > --------------
> > root@emsclient:~/fio_test# ceph -s
> >     cluster 94991097-7638-4240-b922-f525300a9026
> >      health HEALTH_OK
> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
> >      osdmap e534: 1 osds: 1 up, 1 in
> >       pgmap v386830: 832 pgs, 7 pools, 308 GB data, 247 kobjects
> >             366 GB used, 1122 GB / 1489 GB avail
> >                  832 active+clean
> >   client io 138 MB/s rd, 35582 op/s
> >
> > cpu util:
> > ----------------
> > ~22.5 cores while serving from disks.
> >
> > Analysis:
> > --------------------
> > It scales with the increased number of shards/threads; the parallelism also increased significantly.
> >
> >
> >
> > 4. OSD config with 64 shards and 1 thread/shard:
> > ---------------------------------------------------------
> > Buffered read:
> > ------------------
> > No change, output is very similar to 25 shards.
> >
> >
> > direct_io read:
> > -------------------
> > Iostat:
> > ---------
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           40.18    0.00   34.84   19.81    0.00    5.18
> >
> > Device:  rrqm/s  wrqm/s      r/s   w/s      rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdd        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sde        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdg        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdf        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdh        0.00    0.00 39114.00  0.00  156460.00   0.00     8.00    35.58   0.90    0.90    0.00   0.03 100.40
> > sdc        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> > sdb        0.00    0.00     0.00  0.00       0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
> >
> > ceph -s:
> > ---------------
> > root@emsclient:~/fio_test# ceph -s
> >     cluster 94991097-7638-4240-b922-f525300a9026
> >      health HEALTH_OK
> >      monmap e1: 1 mons at {a=10.196.123.24:6789/0}, election epoch 1, quorum 0 a
> >      osdmap e537: 1 osds: 1 up, 1 in
> >       pgmap v386865: 832 pgs, 7 pools, 308 GB data, 247 kobjects
> >             366 GB used, 1122 GB / 1489 GB avail
> >                  832 active+clean
> >   client io 153 MB/s rd, 39172 op/s
> >
> > cpu util:
> > ----------------
> > ~24.5 cores while serving from disks. ~3% cpu left.
> >
> > Analysis:
> > ------------------
> > It scales with the increased number of shards/threads; the
> > parallelism also increased significantly. It is now disk bound.
> >
> >
> > Summary:
> >
> > So, it seems buffered IO has a significant impact on performance when the
> > backend is SSD.
> > My question is: if the workload is very random and the storage (SSD) is
> > very large compared to system memory, shouldn't Ceph always use direct_io
> > instead of buffered io?
> >
> > Please share your thoughts/suggestions on this.
> >
> > Thanks & Regards
> > Somnath
> >
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: mil...@adfin.com