Hi Yehuda,

Can you take a look at the code change at a very high level? Here is the pull request: https://github.com/ceph/ceph/pull/1929.
If things look good to you, I will continue the effort and make it more clear/complete by the end of next week.

Thanks,
Guang

On Jun 2, 2014, at 9:37 PM, Guang Yang <yguan...@outlook.com> wrote:
> Hi Yehuda and Sage,
> Can you help to comment on the ticket? I would like to send out a pull
> request some time this week for you to review, but before that, it would be
> nice to see your comments on the interface and any other concerns you may
> have. Thanks.
>
> Thanks,
> Guang
>
> On May 30, 2014, at 8:35 AM, Guang Yang <yguan...@outlook.com> wrote:
>
>> Hi Yehuda,
>> I opened an issue here: http://tracker.ceph.com/issues/8473, please help
>> to review and comment.
>>
>> Thanks,
>> Guang
>>
>> On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yeh...@inktank.com> wrote:
>>
>>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguan...@outlook.com> wrote:
>>>> On May 19, 2014, at 7:05 AM, Sage Weil <s...@inktank.com> wrote:
>>>>
>>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>>> radosgw is using the omap key/value API for objects, which is more
>>>>>>>> or less equivalent to what Swift is doing with sqlite. This data
>>>>>>>> passes straight into leveldb on the backend (or whatever other
>>>>>>>> backend you are using). Using something like rocksdb in its place
>>>>>>>> is pretty simple, and there are unmerged patches to do that; the
>>>>>>>> user would just need to adjust their crush map so that the rgw
>>>>>>>> index pool is mapped to a different set of OSDs with the better
>>>>>>>> k/v backend.
>>>>>> Not sure if I am missing anything, but the key difference from
>>>>>> Swift's implementation is that they use a table for the bucket index,
>>>>>> which can actually be updated in parallel and is therefore more
>>>>>> scalable for writes, though at a certain point the SQL table would
>>>>>> suffer performance degradation as well.
>>>>>
>>>>> As I understand it, the same limitation is present there too: the
>>>>> index is in a single sqlite table.
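The single-object bucket index discussed above can be pictured as one sorted key space per bucket: listings are cheap range scans, but every write funnels through the same object. A minimal sketch of that data model (plain Python, an illustration only; this is not the actual librados omap API, and the class and lock are stand-ins for RADOS per-object operation serialization):

```python
import bisect
import threading

class BucketIndexObject:
    """Toy model of one RGW bucket index object: a sorted, omap-like
    key space guarded by a single lock, so concurrent writers serialize.
    (Illustration only; the real index lives in a RADOS object's omap.)"""

    def __init__(self):
        self._keys = []                 # sorted object names
        self._meta = {}                 # name -> entry metadata
        self._lock = threading.Lock()   # stands in for per-object serialization

    def put(self, name, meta):
        with self._lock:                # every write funnels through here
            if name not in self._meta:
                bisect.insort(self._keys, name)
            self._meta[name] = meta

    def list_prefix(self, prefix, max_entries=1000):
        # Sorted keys make prefix listing a cheap range scan.
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while (i < len(self._keys) and self._keys[i].startswith(prefix)
               and len(out) < max_entries):
            out.append(self._keys[i])
            i += 1
        return out

idx = BucketIndexObject()
for n in ["photos/a.jpg", "photos/b.jpg", "logs/1.txt"]:
    idx.put(n, {"size": 0})
print(idx.list_prefix("photos/"))   # ['photos/a.jpg', 'photos/b.jpg']
```

This is the tension the thread circles around: the sorted structure is what makes prefix scans possible, and the single object is what makes parallel writes serialize.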
>>>>>
>>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>>> design. It needs to be flexible enough to be able to grow (and
>>>>>>> maybe shrink), and I assume there would be some kind of background
>>>>>>> operation that will enable that. I also believe that making it hash
>>>>>>> based is the way to go. It looks like the more complicated issue
>>>>>>> here is how to handle the transition in which we shard buckets.
>>>>>> Yeah, I agree. I think the conflicting goals here are that we want a
>>>>>> sorted list (so that it enables prefix scans for listing purposes)
>>>>>> and we want to shard from the very beginning (the problem we are
>>>>>> facing is that parallel writes updating the same bucket index object
>>>>>> need to be serialized).
>>>>>
>>>>> Given how infrequent container listings are, pre-sharding containers
>>>>> across several objects makes some sense. Paying the cost of doing
>>>>> listings in parallel across N shards (where N is not too big) is not
>>>>> a big price to pay. However, there will always need to be a way to
>>>>> re-shard further when containers/buckets get extremely big. Perhaps a
>>>>> starting point would be support for static sharding, where the number
>>>>> of shards is specified at container/bucket creation time…
>>>> Considering the scope of the change, I also think this is a good
>>>> starting point for making bucket index updates more scalable.
>>>> Yehuda,
>>>> What do you think?
>>>
>>> Sharding will help with scaling up to a certain point. As Sage
>>> mentioned, we can start with a static setting as a first, simpler
>>> approach, and move to a dynamic approach later on.
>>>
>>> Yehuda
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
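The static sharding Sage and Yehuda converge on above can be sketched as: hash the object name to pick one of N index shards at write time, and merge the N sorted per-shard scans at read time. A hedged illustration (plain Python; the shard count, hash choice, and function names are assumptions for the sketch, not the eventual RGW implementation):

```python
import hashlib
import heapq

NUM_SHARDS = 8  # assumed: fixed at bucket creation time (static sharding)

def shard_for(object_name: str) -> int:
    # Stable hash so every writer picks the same shard for a given name.
    digest = hashlib.md5(object_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Toy stand-ins for the N per-shard bucket index objects.
shards = [dict() for _ in range(NUM_SHARDS)]

def index_put(name, meta):
    # Writers hitting different shards no longer serialize on one object.
    shards[shard_for(name)][name] = meta

def index_list(prefix="", max_entries=1000):
    # Listing merges N already-sorted per-shard scans into one sorted list,
    # so prefix listing still works; the cost is N parallel scans.
    per_shard = (sorted(k for k in s if k.startswith(prefix)) for s in shards)
    return list(heapq.merge(*per_shard))[:max_entries]

for n in ["a/1", "a/2", "b/1", "c/9"]:
    index_put(n, {})
print(index_list(prefix="a/"))   # ['a/1', 'a/2']
```

The harder problem the thread flags, re-sharding a live bucket when N becomes too small, is exactly what this static sketch sidesteps: changing NUM_SHARDS moves nearly every key to a different shard, which is why a background transition mechanism would be needed.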