Thanks.

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Saturday, April 11, 2015 8:09 AM
To: Duan, Jiangang; Sage Weil; Ning Yao
Cc: ceph-devel
Subject: Re: Initial newstore vs filestore results

Hi Jiangang,

These specific tests are 512K random writes using fio with the librbd engine and an iodepth of 64. RBD volumes have been pre-allocated, and there's no file system present. I also collected results for 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1024k, 2048k, and 4096k random and sequential writes with different overlay sizes:

http://nhm.ceph.com/newstore/20150409/

Client-side performance graphs were posted earlier in the thread here:

http://marc.info/?l=ceph-devel&m=142868123431724&w=2

Mark

On 04/10/2015 06:43 PM, Duan, Jiangang wrote:
> Mark, What is the workload pattern for the data below? Small IO or big IO? New file or in-place update in RBD?
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk. I would have expected to see more sequential-looking writes during the peaks due to journal writes, and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like it's somewhat seek bound. It's very consistent, but actual write performance is low compared to what blktrace reports as the data hitting the disk. Something is happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+8m overlay is interesting. Lots of data gets written out to the disk in seemingly large chunks, but the actual throughput as reported by the client is very slow. I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Saturday, April 11, 2015 4:05 AM
> To: Sage Weil; Ning Yao
> Cc: Duan, Jiangang; ceph-devel
> Subject: Re: Initial newstore vs filestore results
>
> Notice, for instance, a comparison of random 512k writes between filestore, newstore with no overlay, and newstore with 8m overlay:
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
>
> The client rbd throughput as reported by fio is:
>
> filestore:            20.44MB/s
> newstore+no_overlay:   4.35MB/s
> newstore+8m_overlay:   3.86MB/s
>
> But notice that in the graphs, we see very different behaviors on disk.
>
> Filestore does a lot of reads and writes to a couple of specific portions of the device and has peaks/valleys when data gets written out in bulk. I would have expected to see more sequential-looking writes during the peaks due to journal writes, and no reads to that portion of the disk, but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does kind of a flurry of random IO and looks like it's somewhat seek bound. It's very consistent, but actual write performance is low compared to what blktrace reports as the data hitting the disk. Something is happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg
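
(For reference, the 512K randwrite numbers above come from fio's librbd engine at iodepth 64 against a pre-allocated RBD volume. The actual job file isn't posted in the thread, so the snippet below is only a rough sketch; the pool, image, and client names are my guesses.)

  ; sketch of a 512K random-write job against a pre-created RBD image
  [global]
  ioengine=rbd          ; fio's librbd engine
  clientname=admin      ; cephx user (client.admin) - assumption
  pool=rbd              ; pool holding the test image - assumption
  rbdname=fio-test      ; pre-allocated 16GB volume  - assumption
  rw=randwrite
  bs=512k
  iodepth=64
  time_based=1
  runtime=300

  [rbd-randwrite-512k]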
>
> newstore+8m overlay is interesting. Lots of data gets written out to the disk in seemingly large chunks, but the actual throughput as reported by the client is very slow. I assume there's tons of write amplification happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> Mark
>
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
>> Seekwatcher movies and graphs finally finished generating for all of the tests:
>>
>> http://nhm.ceph.com/newstore/20150409/
>>
>> Mark
>>
>> On 04/10/2015 10:53 AM, Mark Nelson wrote:
>>> Test results attached for different overlay settings at various IO sizes for writes and random writes. Basically it looks like increasing the overlay size changes the curve. So far we're still not doing as well as filestore (co-located journal), though.
>>>
>>> I imagine the WAL probably does play a big part here.
>>>
>>> Mark
>>>
>>> On 04/10/2015 10:28 AM, Sage Weil wrote:
>>>> On Fri, 10 Apr 2015, Ning Yao wrote:
>>>>> The KV store introduces too much write amplification; maybe we need a self-implemented WAL?
>>>>
>>>> What we really want is to hint to the kv store that these keys (or this key range) are short-lived and should never get compacted. And/or, we need to just make sure the WAL is sufficiently large so that in practice that never happens to those keys.
>>>>
>>>> Putting them outside the kv store means an additional seek/sync for disks, which defeats most of the purpose. Maybe it makes sense for flash... but the above avoids the problem in either case.
>>>>
>>>> I think we should target rocksdb for our initial tuning attempts. So far all I've done is play a bit with the file size (1mb -> 4mb -> 8mb), but my ad hoc tests didn't see much difference.
>>>>
>>>> sage
>>>>
>>>>> Regards
>>>>> Ning Yao
>>>>>
>>>>> 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.d...@intel.com>:
>>>>>> IMHO, newstore performance depends so much on KV store performance due to the WAL - so picking the right KV store, or tuning it, will be the first step.
>>>>>>
>>>>>> -jiangang
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Friday, April 10, 2015 1:01 AM
>>>>>> To: Sage Weil
>>>>>> Cc: ceph-devel
>>>>>> Subject: Re: Initial newstore vs filestore results
>>>>>>
>>>>>> On 04/08/2015 10:19 PM, Mark Nelson wrote:
>>>>>>> On 04/07/2015 09:58 PM, Sage Weil wrote:
>>>>>>>> What would be very interesting would be to see the 4KB performance with the defaults (newstore overlay max = 32) vs overlays disabled (newstore overlay max = 0) and see if/how much it is helping.
>>>>>>>
>>>>>>> And here we go. 1 OSD, 1X replication, 16GB RBD volume.
>>>>>>>
>>>>>>> 4MB (MB/s)         write     read    randw    randr
>>>>>>> default overlay    36.13   106.61    34.49    92.69
>>>>>>> no overlay         36.29   105.61    34.49    93.55
>>>>>>>
>>>>>>> 128KB (MB/s)       write     read    randw    randr
>>>>>>> default overlay     1.71    97.90     1.65    25.79
>>>>>>> no overlay          1.72    97.80     1.66    25.78
>>>>>>>
>>>>>>> 4KB (MB/s)         write     read    randw    randr
>>>>>>> default overlay     0.40    61.88     1.29     1.11
>>>>>>> no overlay          0.05    61.26     0.05     1.10
>>>>>>>
>>>>>>
>>>>>> Update this morning. Also ran filestore tests for comparison. Next we'll look at how tweaking the overlay for different IO sizes affects things. I.e., the overlay threshold is 64k right now, and it appears that 128K write IOs, for instance, are quite a bit worse with newstore currently than with filestore. Sage also just committed changes that will allow overlay writes during append/create, which may help improve small IO write performance as well in some cases.
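
(Side note for anyone reproducing the overlay comparison quoted above: the two runs map directly onto the "newstore overlay max" option Sage mentions, so the relevant ceph.conf fragment is presumably something like the sketch below; putting it under [osd] is my assumption.)

  [osd]
  ; default run: leave "newstore overlay max" at its default (32)
  ; "no overlay" run: disable overlays entirely
  newstore overlay max = 0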
>>>>>>
>>>>>> 4MB (MB/s)         write     read    randw    randr
>>>>>> default overlay    36.13   106.61    34.49    92.69
>>>>>> no overlay         36.29   105.61    34.49    93.55
>>>>>> filestore          36.17    84.59    34.11    79.85
>>>>>>
>>>>>> 128KB (MB/s)       write     read    randw    randr
>>>>>> default overlay     1.71    97.90     1.65    25.79
>>>>>> no overlay          1.72    97.80     1.66    25.78
>>>>>> filestore          27.15    79.91     8.77    19.00
>>>>>>
>>>>>> 4KB (MB/s)         write     read    randw    randr
>>>>>> default overlay     0.40    61.88     1.29     1.11
>>>>>> no overlay          0.05    61.26     0.05     1.10
>>>>>> filestore           4.14    56.30     0.42     0.76
>>>>>>
>>>>>> Seekwatcher movies and graphs available here:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/
>>>>>>
>>>>>> Note, for instance, the very interesting blktrace patterns for 4K random writes on the OSD in each case:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
>>>>>> http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
>>>>>> http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
>>>>>>
>>>>>> Mark
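
(For anyone wanting to generate the same kind of per-OSD graphs and movies: they look like seekwatcher renderings of blktrace captures of the OSD data disk. A rough capture/render sequence is sketched below; the device path, trace name, and the exact seekwatcher flags are from memory rather than from this thread, so treat them as assumptions and check the tools' help output.)

  # trace the OSD's data disk for the duration of one fio run (device is an assumption)
  blktrace -d /dev/sdb -o RBD_00524288_randwrite_OSD0 -w 300

  # static throughput/seek graph from the trace
  seekwatcher -t RBD_00524288_randwrite_OSD0 -o RBD_00524288_randwrite_OSD0.png

  # movie version, like the .mpg links above
  seekwatcher -t RBD_00524288_randwrite_OSD0 -o RBD_00524288_randwrite_OSD0.mpg --movie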