Thanks.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Saturday, April 11, 2015 8:09 AM
To: Duan, Jiangang; Sage Weil; Ning Yao
Cc: ceph-devel
Subject: Re: Initial newstore vs filestore results

Hi Jiangang,

These specific tests are 512K random writes using fio with the librbd engine and an iodepth of 64. RBD volumes have been pre-allocated, and there's no file system present. I also collected results for 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1024k, 2048k, and 4096k random and sequential writes with different overlay sizes:

http://nhm.ceph.com/newstore/20150409/

Client-side performance graphs were posted earlier in the thread here:

http://marc.info/?l=ceph-devel&m=142868123431724&w=2

Mark

On 04/10/2015 06:43 PM, Duan, Jiangang wrote:
> Mark, What is the workload pattern for the data below? Small IO or big IO? New file or in-place update in RBD?
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk. I would have expected to see more sequential-looking writes during the peaks due to journal writes, and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like it's somewhat seek bound. It's very consistent, but actual write performance is low compared to what blktrace reports as the data hitting the disk. Something is happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+8m overlay is interesting. Lots of data gets written out to the disk in seemingly large chunks, but the actual throughput as reported by the client is very slow. I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Saturday, April 11, 2015 4:05 AM
> To: Sage Weil; Ning Yao
> Cc: Duan, Jiangang; ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> Notice, for instance, a comparison of random 512k writes between filestore, newstore with no overlay, and newstore with 8m overlay:
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
>
> The client rbd throughput as reported by fio is:
>
> filestore:            20.44MB/s
> newstore+no_overlay:   4.35MB/s
> newstore+8m_overlay:   3.86MB/s
>
> But notice that in the graphs, we see very different behaviors on disk.
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk. I would have expected to see more sequential-looking writes during the peaks due to journal writes, and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like it's somewhat seek bound. It's very consistent, but actual write performance is low compared to what blktrace reports as the data hitting the disk. Something is happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
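
(For reference, the 512K randwrite numbers above come from fio's librbd engine at iodepth 64 against a pre-allocated RBD volume. The actual job file isn't posted in the thread, so the snippet below is only a rough sketch; the pool, image, and client names are my guesses.)

  ; sketch of a 512K random-write job against a pre-created RBD image
  [global]
  ioengine=rbd          ; fio's librbd engine
  clientname=admin      ; cephx user (client.admin) - assumption
  pool=rbd              ; pool holding the test image - assumption
  rbdname=fio-test      ; pre-allocated 16GB volume  - assumption
  rw=randwrite
  bs=512k
  iodepth=64
  time_based=1
  runtime=300

  [rbd-randwrite-512k]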
>
> newstore+8m overlay is interesting. Lots of data gets written out to the disk in seemingly large chunks, but the actual throughput as reported by the client is very slow. I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> Mark
>
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
>> Seekwatcher movies and graphs finally finished generating for all of the tests:
>>
>> http://nhm.ceph.com/newstore/20150409/
>>
>> Mark
>>
>> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>>> Test results attached for different overlay settings at various IO sizes for writes and random writes. Basically it looks like increasing the overlay size changes the curve. So far we're still not doing as well as filestore (co-located journal), though.
>>>
>>> I imagine the WAL probably does play a big part here.
>>>
>>> Mark
>>>
>>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>>> The KV store introduces too much write amplification; maybe we need a self-implemented WAL?
>>>>
>>>> What we really want is to hint to the kv store that these keys (or this key range) are short-lived and should never get compacted. And/or, we need to just make sure the WAL is sufficiently large so that in practice that never happens to those keys.
>>>>
>>>> Putting them outside the kv store means an additional seek/sync for disks, which defeats most of the purpose. Maybe it makes sense for flash... but the above avoids the problem in either case.
>>>>
>>>> I think we should target rocksdb for our initial tuning attempts. So far all I've done is play a bit with the file size (1mb -> 4mb -> 8mb), but my ad hoc tests didn't see much difference.
>>>>
>>>> sage
>>>>
>>>>> Regards
>>>>> Ning Yao
>>>>>
>>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.d...@intel.com>:
>>>>>> IMHO, newstore performance depends so much on KV store performance due to the WAL - so picking the right KV store, or tuning it, will be the first step.
>>>>>>
>>>>>> -jiangang
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>>> To: Sage Weil
>>>>>> Cc: ceph-devel
>>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>>
>>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>>> What would be very interesting would be to see the 4KB performance with the defaults (newstore overlay max = 32) vs overlays disabled (newstore overlay max = 0) and see if/how much it is helping.
>>>>>>>
>>>>>>> And here we go. 1 OSD, 1X replication, 16GB RBD volume.
>>>>>>>
>>>>>>> 4MB (MB/s)         write     read    randw    randr
>>>>>>> default overlay    36.13   106.61    34.49    92.69
>>>>>>> no overlay         36.29   105.61    34.49    93.55
>>>>>>>
>>>>>>> 128KB (MB/s)       write     read    randw    randr
>>>>>>> default overlay     1.71    97.90     1.65    25.79
>>>>>>> no overlay          1.72    97.80     1.66    25.78
>>>>>>>
>>>>>>> 4KB (MB/s)         write     read    randw    randr
>>>>>>> default overlay     0.40    61.88     1.29     1.11
>>>>>>> no overlay          0.05    61.26     0.05     1.10
>>>>>>>
>>>>>>
>>>>>> Update this morning. Also ran filestore tests for comparison. Next we'll look at how tweaking the overlay for different IO sizes affects things. I.e., the overlay threshold is 64k right now, and it appears that 128K write IOs, for instance, are quite a bit worse with newstore currently than with filestore. Sage also just committed changes that will allow overlay writes during append/create, which may help improve small IO write performance as well in some cases.
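
(Side note for anyone reproducing the overlay comparison quoted above: the two runs map directly onto the "newstore overlay max" option Sage mentions, so the relevant ceph.conf fragment is presumably something like the sketch below; putting it under [osd] is my assumption.)

  [osd]
  ; default run: leave "newstore overlay max" at its default (32)
  ; "no overlay" run: disable overlays entirely
  newstore overlay max = 0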
>>>>>>
>>>>>> 4MB (MB/s)         write     read    randw    randr
>>>>>> default overlay    36.13   106.61    34.49    92.69
>>>>>> no overlay         36.29   105.61    34.49    93.55
>>>>>> filestore          36.17    84.59    34.11    79.85
>>>>>>
>>>>>> 128KB (MB/s)       write     read    randw    randr
>>>>>> default overlay     1.71    97.90     1.65    25.79
>>>>>> no overlay          1.72    97.80     1.66    25.78
>>>>>> filestore          27.15    79.91     8.77    19.00
>>>>>>
>>>>>> 4KB (MB/s)         write     read    randw    randr
>>>>>> default overlay     0.40    61.88     1.29     1.11
>>>>>> no overlay          0.05    61.26     0.05     1.10
>>>>>> filestore           4.14    56.30     0.42     0.76
>>>>>>
>>>>>> Seekwatcher movies and graphs available here:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>>
>>>>>> Note, for instance, the very interesting blktrace patterns for 4K random writes on the OSD in each case:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
>>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
>>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>>>>>>
>>>>>> Mark
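
(For anyone wanting to generate the same kind of per-OSD graphs and movies: they look like seekwatcher renderings of blktrace captures of the OSD data disk. A rough capture/render sequence is sketched below; the device path, trace name, and the exact seekwatcher flags are from memory rather than from this thread, so treat them as assumptions and check the tools' help output.)

  # trace the OSD's data disk for the duration of one fio run (device is an assumption)
  blktrace -d /dev/sdb -o RBD_00524288_randwrite_OSD0 -w 300

  # static throughput/seek graph from the trace
  seekwatcher -t RBD_00524288_randwrite_OSD0 -o RBD_00524288_randwrite_OSD0.png

  # movie version, like the .mpg links above
  seekwatcher -t RBD_00524288_randwrite_OSD0 -o RBD_00524288_randwrite_OSD0.mpg --movie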