Hi Haomai,

Is there any write-up on the KeyValueStore header cache and strip size? Based 
on what you stated, it appears that strip size improves performance for large 
object sizes. How would the header cache impact 4KB object sizes?
We'd like to roughly estimate the improvement from strip size and the header 
cache. I'm not sure about the header cache implementation yet, but fdcache had 
similar serialization issues, and a sharded fdcache was introduced to address 
them (under review, I believe).
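
For what it's worth, the sharding I have in mind is just hash-partitioning the 
cache and its lock, so lookups of different objects stop contending on a single 
mutex. A minimal sketch (illustrative only, not the actual fdcache code; all 
names are made up):

// Sketch of a sharded cache: the single lock is split into N shards
// selected by a hash of the key, so concurrent lookups of different
// keys take different locks.
#include <array>
#include <functional>
#include <map>
#include <mutex>
#include <string>

struct ShardedCache {
  static constexpr size_t NUM_SHARDS = 16;
  struct Shard {
    std::mutex lock;
    std::map<std::string, std::string> entries;  // key -> cached value
  };
  std::array<Shard, NUM_SHARDS> shards;

  Shard &shard_for(const std::string &key) {
    return shards[std::hash<std::string>()(key) % NUM_SHARDS];
  }

  bool lookup(const std::string &key, std::string *out) {
    Shard &s = shard_for(key);
    std::lock_guard<std::mutex> l(s.lock);  // only this shard is held
    auto it = s.entries.find(key);
    if (it == s.entries.end())
      return false;
    *out = it->second;
    return true;
  }

  void insert(const std::string &key, const std::string &value) {
    Shard &s = shard_for(key);
    std::lock_guard<std::mutex> l(s.lock);
    s.entries[key] = value;
  }
};

The real implementation would of course also need eviction and reference 
counting, which the sketch leaves out.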

I believe the header_lock serialization exists in all Ceph branches so far, 
including master.

Thanks,
Sushma

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com] 
Sent: Tuesday, July 01, 2014 1:06 AM
To: Somnath Roy
Cc: Sushma Gurram; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; 
ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support

Hi,

I don't know why OSD capacity would be at the PB level; in practice most use 
cases are several TBs (1-4 TB) per OSD. As for cache hits, it depends entirely 
on the IO characteristics. In my opinion, the header cache in KeyValueStore can 
achieve a good hit rate if the object size and strip size (KeyValueStore) are 
configured properly.
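
As a back-of-envelope example (assuming roughly one header per object and a few 
hundred bytes per cached header): a 4TB OSD with 4MB objects has on the order of 
one million objects, so the entire header set fits in a few hundred MB of memory 
and a header cache can cover almost all lookups. The same 4TB with 4KB objects 
means on the order of a billion headers, so the cache can only hold the hot 
subset, and the hit rate then depends entirely on how skewed the workload is.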

But I'm also interested in your lock comments: which Ceph version did you 
evaluate when you observed the serialization issue?

On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> Hi Haomai,
> But the cache hit rate will be very low or zero if the actual storage per 
> node is very large (say at the PB level). So it will mostly be hitting omap, 
> won't it?
> How is this header cache going to resolve the serialization issue then?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Monday, June 30, 2014 11:10 PM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; 
> ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> Hi Sushma,
>
> Thanks for your investigation! We already noticed the serialization risk in 
> GenericObjectMap/DBObjectMap. To improve performance we added a header cache 
> to DBObjectMap.
>
> As for KeyValueStore, a cache branch is under review; it can greatly reduce 
> lookup_header calls. Of course, replacing the lock with an RWLock is a good 
> suggestion, and I would like to try it and evaluate!
>
> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <sushma.gur...@sandisk.com> 
> wrote:
>> Hi Haomai/Greg,
>>
>> I tried to analyze this a bit more, and it appears that the 
>> GenericObjectMap::header_lock is serializing READ requests in the 
>> following path, which explains the low performance numbers with KeyValueStore:
>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>> GenericObjectMap::lookup_header()
>>
>> I modified the code to bypass this lock for a specific run and observed that 
>> the performance is then similar to FileStore.
>>
>> In our earlier investigations we also noticed similar serialization issues 
>> with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>
>> Can you please help us understand the reason for this lock, and whether it 
>> can be replaced with an RWLock? Any other suggestions for avoiding the 
>> serialization due to this lock would also be welcome.
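>>
>> To make the suggestion concrete, the change we have in mind is simply taking 
>> the header lock shared on the read path and exclusively only where headers 
>> are created or modified. A minimal sketch of the idea (names and types are 
>> illustrative, not the actual GenericObjectMap code, which uses Ceph's own 
>> lock classes):
>>
>> // Sketch only: readers take the header lock shared, so concurrent
>> // getattr/lookup_header calls no longer serialize; writers still
>> // take it exclusively.
>> #include <map>
>> #include <mutex>
>> #include <shared_mutex>
>> #include <string>
>>
>> struct HeaderIndex {
>>   std::shared_mutex header_lock;               // previously a plain mutex
>>   std::map<std::string, std::string> headers;  // oid -> encoded header
>>
>>   bool lookup_header(const std::string &oid, std::string *out) {
>>     std::shared_lock<std::shared_mutex> l(header_lock);  // shared (read)
>>     auto it = headers.find(oid);
>>     if (it == headers.end())
>>       return false;
>>     *out = it->second;
>>     return true;
>>   }
>>
>>   void set_header(const std::string &oid, const std::string &blob) {
>>     std::unique_lock<std::shared_mutex> l(header_lock);  // exclusive (write)
>>     headers[oid] = blob;
>>   }
>> };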
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiw...@gmail.com]
>> Sent: Friday, June 27, 2014 1:08 AM
>> To: Sushma Gurram
>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; 
>> ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> As I mentioned days ago:
>>
>> There are two points related to kvstore performance:
>> 1. The order (object size) of the image and the strip size matter for 
>> performance. Because a header, like an inode in a filesystem, is much more 
>> lightweight than an fd, the image order can be set lower. The strip size can 
>> also be configured (e.g. to 4KB) to improve large-IO performance.
>> 2. The header cache (https://github.com/ceph/ceph/pull/1649) is not merged 
>> yet, and it is important for performance. It plays the same role as the 
>> fdcache in FileStore (a rough sketch follows below).
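>>
>> Conceptually, such a header cache is just a small bounded LRU map from object 
>> to its decoded header, so that lookup_header becomes a memory hit instead of 
>> an omap read. A rough sketch of the idea (illustrative only, not the code in 
>> the pull request, and without the locking the real cache needs):
>>
>> #include <list>
>> #include <string>
>> #include <unordered_map>
>> #include <utility>
>>
>> // Bounded LRU cache: object id -> encoded header. On a hit the omap
>> // lookup is skipped entirely; on a miss the caller falls back to omap
>> // and then put()s the result.
>> class HeaderCache {
>>   typedef std::pair<std::string, std::string> Entry;  // (oid, header)
>>   size_t max_size;
>>   std::list<Entry> lru;                                // front = most recent
>>   std::unordered_map<std::string, std::list<Entry>::iterator> index;
>>
>>  public:
>>   explicit HeaderCache(size_t max) : max_size(max) {}
>>
>>   bool get(const std::string &oid, std::string *header) {
>>     auto it = index.find(oid);
>>     if (it == index.end())
>>       return false;
>>     lru.splice(lru.begin(), lru, it->second);  // bump to most recent
>>     *header = it->second->second;
>>     return true;
>>   }
>>
>>   void put(const std::string &oid, const std::string &header) {
>>     auto it = index.find(oid);
>>     if (it != index.end()) {
>>       it->second->second = header;
>>       lru.splice(lru.begin(), lru, it->second);
>>       return;
>>     }
>>     lru.push_front(Entry(oid, header));
>>     index[oid] = lru.begin();
>>     if (lru.size() > max_size) {                // evict least recently used
>>       index.erase(lru.back().first);
>>       lru.pop_back();
>>     }
>>   }
>> };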
>>
>> As for the detailed performance numbers, I think the results based on the 
>> master branch are roughly right. Once strip-size tuning and the header cache 
>> are in place, I expect them to be better.
>>
>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <sushma.gur...@sandisk.com> 
>> wrote:
>>> Delivery failure due to table format. Resending as plain text.
>>>
>>> _____________________________________________
>>> From: Sushma Gurram
>>> Sent: Thursday, June 26, 2014 5:35 PM
>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>>
>>> Hi Xinxin,
>>>
>>> Thanks for providing the results of the performance tests.
>>>
>>> I used fio (with support for the rbd ioengine) to compare XFS and RocksDB 
>>> with a single OSD. I also confirmed with rados bench, and both sets of 
>>> numbers are of the same order.
>>> My findings show that XFS is better than RocksDB. Can you please let us know 
>>> the rocksdb configuration you used, as well as the object size and duration 
>>> of the run for rados bench?
>>> For the random write tests, I see the "rocksdb:bg0" thread as the top CPU 
>>> consumer (it is at about 50% CPU, while every other thread in the OSD is 
>>> below 10%).
>>> Is there a ceph.conf option to configure the number of background threads in 
>>> rocksdb?
>>>
>>> We ran our tests with the following configuration:
>>> System: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores), 
>>> HT disabled, 16 GB memory
>>>
>>> The rocksdb configuration was set to the following values in ceph.conf:
>>>         rocksdb_write_buffer_size = 4194304
>>>         rocksdb_cache_size = 4194304
>>>         rocksdb_bloom_size = 0
>>>         rocksdb_max_open_files = 10240
>>>         rocksdb_compression = false
>>>         rocksdb_paranoid = false
>>>         rocksdb_log = /dev/null
>>>         rocksdb_compact_on_mount = false
>>>
>>> We used the fio rbd ioengine with numjobs=1 for writes, numjobs=16 for 
>>> reads, and iodepth=32. Unlike rados bench, fio rbd can create multiple 
>>> (numjobs) client connections to the OSD, which stresses the OSD harder.
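>>>
>>> For reference, the fio job file is essentially of the following shape (the 
>>> cephx user, pool and image names below are placeholders; bs, rw and numjobs 
>>> were varied per test as described above):
>>>
>>> [global]
>>> ; fio's librbd engine, talking to the cluster directly
>>> ioengine=rbd
>>> ; cephx user, pool and image are placeholders
>>> clientname=admin
>>> pool=rbd
>>> rbdname=testimage
>>> direct=1
>>> iodepth=32
>>>
>>> [randwrite-4k]
>>> rw=randwrite
>>> bs=4k
>>> ; numjobs=16 was used for the read tests
>>> numjobs=1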
>>>
>>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>>> -------------------------------------------------------------------
>>> IO Pattern      XFS (IOPs)      Rocksdb (IOPs)
>>> 4K writes       ~1450           ~670
>>> 4K reads        ~65000          ~2000
>>> 64K writes      ~431            ~57
>>> 64K reads       ~17500          ~180
>>>
>>>
>>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>>> -------------------------------------------------------------------
>>> IO Pattern      XFS (IOPs)      Rocksdb (IOPs)
>>> 4K writes       ~1450           ~962
>>> 4K reads        ~65000          ~1641
>>> 64K writes      ~431            ~426
>>> 64K reads       ~17500          ~209
>>>
>>> I suppose the lower rocksdb performance can theoretically be attributed to 
>>> compaction during writes and merging during reads, but I'm not sure READs 
>>> should be lower by this magnitude.
>>> However, your results seem to show otherwise. Can you please help us with 
>>> the rocksdb config and with how rados bench was run?
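>>>
>>> To be specific about what we are asking: the invocation we would compare 
>>> against is something along the lines of the following (pool name, run 
>>> length, thread count and object size being exactly the parameters we'd like 
>>> you to confirm):
>>>
>>> # write phase, keeping the objects so they can be read back afterwards
>>> rados bench -p <pool> 60 write -t 16 -b 4096 --no-cleanup
>>> # sequential read phase over the objects written above
>>> rados bench -p <pool> 60 seq -t 16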
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: Shu, Xinxin [mailto:xinxin....@intel.com]
>>> Sent: Sunday, June 22, 2014 6:18 PM
>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>>
>>> Hi all,
>>>
>>> We enabled rocksdb as the data store in our test setup (10 OSDs on two 
>>> servers; each server has 5 HDDs as OSDs, 2 SSDs as journals, and an Intel(R) 
>>> Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb and rocksdb, 
>>> using rados bench as the test tool. The chart below shows the details. For 
>>> writes, with a small number of threads, leveldb performance is lower than 
>>> the other two backends; from 16 threads onward, rocksdb performs a little 
>>> better than xfs and leveldb, and both leveldb and rocksdb perform much 
>>> better than xfs at higher thread counts.
>>>
>>>                        xfs                  leveldb              rocksdb
>>>                    throughput  latency  throughput  latency  throughput  latency
>>> 1 thread write        84.029    0.048      52.430    0.076      71.920    0.056
>>> 2 threads write      166.417    0.048      97.917    0.082     155.148    0.052
>>> 4 threads write      304.099    0.052     156.094    0.102     270.461    0.059
>>> 8 threads write      323.047    0.099     221.370    0.144     339.455    0.094
>>> 16 threads write     295.040    0.216     272.032    0.235     348.849    0.183
>>> 32 threads write     324.467    0.394     290.072    0.441     338.103    0.378
>>> 64 threads write     313.713    0.812     293.261    0.871     324.603    0.787
>>> 1 thread read         75.687    0.053      71.629    0.056      72.526    0.055
>>> 2 threads read       182.329    0.044     151.683    0.053     153.125    0.052
>>> 4 threads read       320.785    0.050     307.180    0.052     312.016    0.051
>>> 8 threads read       504.880    0.063     512.295    0.062     519.683    0.062
>>> 16 threads read      477.706    0.134     643.385    0.099     654.149    0.098
>>> 32 threads read      517.670    0.247     666.696    0.192     678.480    0.189
>>> 64 threads read      516.599    0.495     668.360    0.383     680.673    0.376
>>>
>>> -----Original Message-----
>>> From: Shu, Xinxin
>>> Sent: Saturday, June 14, 2014 11:50 AM
>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Currently ceph gets a stable rocksdb from branch 3.0.fb of ceph/rocksdb. 
>>> Since PR https://github.com/ceph/rocksdb/pull/2 has not been merged yet, if 
>>> you use 'git submodule update --init' to fetch the rocksdb submodule, the 
>>> tree you get does not support autoconf/automake.
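>>>
>>> In other words, the submodule can be populated with the usual command, 
>>> roughly as below, but until that PR is merged the checked-out tree will not 
>>> build via autoconf/automake:
>>>
>>> # from the top of the ceph tree; fetches the pinned commit from
>>> # ceph/rocksdb (branch 3.0.fb) into src/rocksdb
>>> git submodule update --init src/rocksdb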
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sushma Gurram
>>> Sent: Saturday, June 14, 2014 2:52 AM
>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory 
>>> seems to be empty. Do I need to put autoconf/automake files in this 
>>> directory?
>>> It doesn't seem to have any other source files, and compilation fails with:
>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or 
>>> directory compilation terminated.
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Shu, Xinxin
>>> Sent: Monday, June 09, 2014 10:00 PM
>>> To: Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Hi mark
>>>
>>> I have finished development of the rocksdb submodule support. A pull request 
>>> adding autoconf/automake support to rocksdb has been created; you can find 
>>> it at https://github.com/ceph/rocksdb/pull/2. If that patch is OK, I will 
>>> create a pull request for the rocksdb submodule support; currently the patch 
>>> can be found at https://github.com/xinxinsh/ceph/tree/wip-rocksdb.
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>> Hi Sage,
>>>> I will add two configure options, --with-librocksdb-static and 
>>>> --with-librocksdb. With --with-librocksdb-static, ceph will compile the 
>>>> rocksdb code fetched from the ceph repository; with --with-librocksdb, in 
>>>> the case of distro packages for rocksdb, ceph will not compile the rocksdb 
>>>> code and will use the pre-installed library instead. Is that OK with you?
>>>>
>>>> Since current rocksdb does not support autoconf/automake, I will add 
>>>> autoconf/automake support to rocksdb, but before that I think we should 
>>>> fork a stable branch (maybe 3.0) for ceph.
>>>
>>> I'm looking at testing out the rocksdb support as well, both for the OSD 
>>> and for the monitor, based on some issues we've been seeing lately. Any news 
>>> on the 3.0 fork and autoconf/automake support in rocksdb?
>>>
>>> Thanks,
>>> Mark
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mark.nel...@inktank.com]
>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>> To: Shu, Xinxin; Sage Weil
>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>> Hi, sage
>>>>>
>>>>> I will add the rocksdb submodule to the makefile. Currently we want to 
>>>>> run full performance tests on the key-value DB backends, both leveldb and 
>>>>> rocksdb, and then optimize rocksdb performance.
>>>>
>>>> I'm definitely interested in any performance tests you do here. Last 
>>>> winter I started doing some fairly high-level tests on raw 
>>>> leveldb/hyperleveldb/riak leveldb. I'm very interested in what you see 
>>>> with rocksdb as a backend.
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:s...@inktank.com]
>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>> To: Shu, Xinxin
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that 
>>>>> includes the latest set of patches with the groundwork and your rocksdb 
>>>>> patch.  There is also a commit that adds rocksdb as a git submodule.  I'm 
>>>>> thinking that, since there aren't any distro packages for rocksdb at this 
>>>>> point, this is going to be the easiest way to make this usable for people.
>>>>>
>>>>> If you can wire the submodule into the makefile, we can merge this in so 
>>>>> that rocksdb support is in the ceph.com packages. I suspect that the 
>>>>> distros will prefer to turn this off in favor of separate shared libs, 
>>>>> but they can do that at their option if/when they include rocksdb in the 
>>>>> distro. I think the key is just to have both --with-librocksdb and 
>>>>> --with-librocksdb-static (or similar) options so that you can use either 
>>>>> the statically or the dynamically linked one.
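>>>>>
>>>>> (Concretely, I'd expect the configure.ac side to be little more than the 
>>>>> sketch below; the option and conditional names are only illustrative:)
>>>>>
>>>>> dnl Sketch: two options, one for a system librocksdb, one for the
>>>>> dnl bundled submodule built and linked statically.
>>>>> AC_ARG_WITH([librocksdb],
>>>>>   [AS_HELP_STRING([--with-librocksdb], [link against a system librocksdb])],
>>>>>   [], [with_librocksdb=no])
>>>>> AC_ARG_WITH([librocksdb-static],
>>>>>   [AS_HELP_STRING([--with-librocksdb-static],
>>>>>                   [build and statically link the bundled rocksdb submodule])],
>>>>>   [], [with_librocksdb_static=no])
>>>>> AM_CONDITIONAL([WITH_DLIBROCKSDB], [test "x$with_librocksdb" = xyes])
>>>>> AM_CONDITIONAL([WITH_SLIBROCKSDB], [test "x$with_librocksdb_static" = xyes])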
>>>>>
>>>>> Has your group done further testing with rocksdb?  Anything interesting 
>>>>> to share?
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat



--
Best Regards,

Wheat
