Hi Yehuda,

Can you take a look at the code change at a very high level? Here is the pull request: https://github.com/ceph/ceph/pull/1929.
If things look good to you, I will continue the effort and make it more clear/complete by the end of next week.

Thanks,
Guang

On Jun 2, 2014, at 9:37 PM, Guang Yang <yguan...@outlook.com> wrote:
> Hi Yehuda and Sage,
> Can you help to comment on the ticket? I would like to send out a pull
> request some time this week for you to review, but before that, it would be
> nice to see your comments on the interface and any other concerns you may
> have. Thanks.
>
> Thanks,
> Guang
>
> On May 30, 2014, at 8:35 AM, Guang Yang <yguan...@outlook.com> wrote:
>
>> Hi Yehuda,
>> I opened an issue here: http://tracker.ceph.com/issues/8473, please help
>> to review and comment.
>>
>> Thanks,
>> Guang
>>
>> On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yeh...@inktank.com> wrote:
>>
>>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguan...@outlook.com> wrote:
>>>> On May 19, 2014, at 7:05 AM, Sage Weil <s...@inktank.com> wrote:
>>>>
>>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>>> radosgw is using the omap key/value API for objects, which is more
>>>>>>>> or less equivalent to what Swift is doing with sqlite. This data
>>>>>>>> passes straight into leveldb on the backend (or whatever other
>>>>>>>> backend you are using). Using something like rocksdb in its place
>>>>>>>> is pretty simple, and there are unmerged patches to do that; the
>>>>>>>> user would just need to adjust their crush map so that the rgw
>>>>>>>> index pool is mapped to a different set of OSDs with the better
>>>>>>>> k/v backend.
>>>>>> Not sure if I am missing anything, but the key difference from
>>>>>> Swift's implementation is that they use a table for the bucket index,
>>>>>> which can actually be updated in parallel and is therefore more
>>>>>> scalable for writes, though at a certain point the SQL table would
>>>>>> suffer performance degradation as well.
>>>>>
>>>>> As I understand it, the same limitation is present there too: the
>>>>> index is in a single sqlite table.
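The single-object bucket index discussed above can be pictured as one sorted key space per bucket: listings are cheap range scans, but every write funnels through the same object. A minimal sketch of that data model (plain Python, an illustration only; this is not the actual librados omap API, and the class and lock are stand-ins for RADOS per-object operation serialization):

```python
import bisect
import threading

class BucketIndexObject:
    """Toy model of one RGW bucket index object: a sorted, omap-like
    key space guarded by a single lock, so concurrent writers serialize.
    (Illustration only; the real index lives in a RADOS object's omap.)"""

    def __init__(self):
        self._keys = []                 # sorted object names
        self._meta = {}                 # name -> entry metadata
        self._lock = threading.Lock()   # stands in for per-object serialization

    def put(self, name, meta):
        with self._lock:                # every write funnels through here
            if name not in self._meta:
                bisect.insort(self._keys, name)
            self._meta[name] = meta

    def list_prefix(self, prefix, max_entries=1000):
        # Sorted keys make prefix listing a cheap range scan.
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while (i < len(self._keys) and self._keys[i].startswith(prefix)
               and len(out) < max_entries):
            out.append(self._keys[i])
            i += 1
        return out

idx = BucketIndexObject()
for n in ["photos/a.jpg", "photos/b.jpg", "logs/1.txt"]:
    idx.put(n, {"size": 0})
print(idx.list_prefix("photos/"))   # ['photos/a.jpg', 'photos/b.jpg']
```

This is the tension the thread circles around: the sorted structure is what makes prefix scans possible, and the single object is what makes parallel writes serialize.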
>>>>>
>>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>>> design. It needs to be flexible enough to be able to grow (and
>>>>>>> maybe shrink), and I assume there would be some kind of background
>>>>>>> operation that will enable that. I also believe that making it hash
>>>>>>> based is the way to go. It looks like the more complicated issue
>>>>>>> here is how to handle the transition in which we shard buckets.
>>>>>> Yeah, I agree. I think the conflicting goals here are that we want a
>>>>>> sorted list (so that it enables prefix scans for listing purposes)
>>>>>> and we want to shard from the very beginning (the problem we are
>>>>>> facing is that parallel writes updating the same bucket index object
>>>>>> need to be serialized).
>>>>>
>>>>> Given how infrequent container listings are, pre-sharding containers
>>>>> across several objects makes some sense. Paying the cost of doing
>>>>> listings in parallel across N shards (where N is not too big) is not
>>>>> a big price to pay. However, there will always need to be a way to
>>>>> re-shard further when containers/buckets get extremely big. Perhaps a
>>>>> starting point would be support for static sharding, where the number
>>>>> of shards is specified at container/bucket creation time…
>>>> Considering the scope of the change, I also think this is a good
>>>> starting point for making bucket index updates more scalable.
>>>> Yehuda,
>>>> What do you think?
>>>
>>> Sharding will help with scaling up to a certain point. As Sage
>>> mentioned, we can start with a static setting as a first, simpler
>>> approach, and move to a dynamic approach later on.
>>>
>>> Yehuda
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
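The static sharding Sage and Yehuda converge on above can be sketched as: hash the object name to pick one of N index shards at write time, and merge the N sorted per-shard scans at read time. A hedged illustration (plain Python; the shard count, hash choice, and function names are assumptions for the sketch, not the eventual RGW implementation):

```python
import hashlib
import heapq

NUM_SHARDS = 8  # assumed: fixed at bucket creation time (static sharding)

def shard_for(object_name: str) -> int:
    # Stable hash so every writer picks the same shard for a given name.
    digest = hashlib.md5(object_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Toy stand-ins for the N per-shard bucket index objects.
shards = [dict() for _ in range(NUM_SHARDS)]

def index_put(name, meta):
    # Writers hitting different shards no longer serialize on one object.
    shards[shard_for(name)][name] = meta

def index_list(prefix="", max_entries=1000):
    # Listing merges N already-sorted per-shard scans into one sorted list,
    # so prefix listing still works; the cost is N parallel scans.
    per_shard = (sorted(k for k in s if k.startswith(prefix)) for s in shards)
    return list(heapq.merge(*per_shard))[:max_entries]

for n in ["a/1", "a/2", "b/1", "c/9"]:
    index_put(n, {})
print(index_list(prefix="a/"))   # ['a/1', 'a/2']
```

The harder problem the thread flags, re-sharding a live bucket when N becomes too small, is exactly what this static sketch sidesteps: changing NUM_SHARDS moves nearly every key to a different shard, which is why a background transition mechanism would be needed.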