Bonjour,

For the record, here is a summary of the key takeaways from this conversation
(so far):

* Ambry[0] is a perfect match and I'll keep exploring it[1].
* To keep billions of small objects manageable, they must be packed together.
* Immutable, never-deleted objects can be grouped together for the purpose of
  packing them without a central database. For this to work, the ID of the
  group to which an object belongs is included in the object ID (e.g. SHA256 +
  UUID of the group). That's what Ambry does.

To be continued.

[0] https://github.com/linkedin/ambry/wiki
[1] https://forge.softwareheritage.org/T3064

On 17/02/2021 17:36, Loïc Dachary wrote:
> Bonjour,
>
> TL;DR: Is it more advisable to work on Ceph internals to make it friendly to
> this particular workload or write something similar to EOS[0] (i.e. RocksDB +
> XRootD + RBD)?
>
> This is a follow-up to two previous mails[1] sent while researching this
> topic. In a nutshell, the Software Heritage project[1] currently has ~750TB
> and 10 billion objects, 75% of which have a size smaller than 16KB and 50%
> have a size smaller than 4KB. But they only account for ~5% of the 750TB: 25%
> of the objects have a size > 16KB and total ~700TB. The objects can be
> compressed by ~50% and 750TB only needs 350TB of actual storage (if you're
> interested in the details, see [2]).
>
> Let's say those 10 billion objects are stored in a single 4+2 erasure coded
> pool with bluestore compression set for objects that have a size > 32KB and
> the smallest allocation size for bluestore set to 4KB[3]. The 750TB won't use
> the expected 350TB but about 30% more, i.e. ~450TB (see [4] for the maths).
> This space amplification is because storing a 1 byte object uses the same
> space as storing a 16KB object (see [5] to repeat the experiment at home). In
> a 4+2 erasure coded pool, each of the 6 chunks will use no less than 4KB
> because that's the smallest allocation size for bluestore. That's 4 * 4KB =
> 16KB even when all that is needed is 1 byte.
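
The arithmetic in the paragraph just above can be sketched in a few lines of
Python. This is a simplified model under stated assumptions, not Ceph's actual
allocator: the function name ec_raw_footprint and its parameters are
hypothetical, the object is assumed to be split evenly into 4 data chunks with
2 coding chunks of the same size, every chunk is rounded up to the 4KB minimum
allocation size, and compression and metadata are ignored. Counting the two
coding chunks, the raw footprint of a 1 byte object comes out at 6 * 4KB =
24KB, the same as for a 16KB object.

```python
KB = 1024


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def ec_raw_footprint(object_size, k=4, m=2, min_alloc=4 * KB):
    """Raw bytes used by one object in a k+m erasure coded pool.

    Simplified model (an assumption, not Ceph's allocator): the object is
    split evenly into k data chunks, m coding chunks of the same size are
    added, and each chunk is rounded up to a multiple of min_alloc.
    """
    chunk = ceil_div(object_size, k)
    # Even a tiny chunk occupies at least one allocation unit.
    allocated = max(1, ceil_div(chunk, min_alloc)) * min_alloc
    return (k + m) * allocated


for size in (1, 4 * KB, 16 * KB, 32 * KB):
    print(f"{size:>6} bytes -> {ec_raw_footprint(size) // KB} KB raw")
```

In this model every object up to 16KB costs the same 24KB of raw space, which
is why the many objects smaller than 4KB dominate the space amplification.
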
>
> It was suggested[6] to have two different pools: one 4+2 erasure coded pool
> with compression for all objects with a size > 32KB that are expected to
> compress to 16KB, and another with 3 replicas for the smaller objects to
> reduce space amplification to a minimum without compromising on durability. A
> client looking for the object could make two simultaneous requests to the two
> pools. They would get 404 from one of them and the object from the other.
>
> Another workaround is best described in the "Finding a needle in Haystack:
> Facebook’s photo storage"[9] paper and essentially boils down to using a
> database to store a map between the object name and its location. That does
> not scale out (writing the database index is the bottleneck) but it's simple
> enough and is successfully implemented in EOS[0] with >200PB worth of data
> and in seaweedfs[10], another promising object store software based on the
> same idea.
>
> Instead of working around the problem, maybe Ceph could be modified to make
> better use of the immutability of these objects[7], a hint that is apparently
> only used to figure out how to best compress them and for checksum
> calculation[8]. I honestly have no clue how difficult it would be. All I
> know is that it's not easy, otherwise it would have been done already: there
> seems to be a general need for efficiently (space wise and performance wise)
> storing large quantities of objects smaller than 4KB.
>
> Is it more advisable to:
>
> * work on Ceph internals to make it friendly to this particular workload or,
> * write another implementation of "Finding a needle in Haystack: Facebook’s
>   photo storage"[9] based on RBD[11]?
>
> I'm currently leaning toward working on Ceph internals but there are pros and
> cons to both approaches[12]. And since all this is still very new to me,
> there is also the possibility that I'm missing something. Maybe it's *super*
> difficult to improve Ceph in this way.
> I should try to figure that out sooner rather than later.
>
> I realize it's a lot to take in and unless you're facing the exact same
> problem there is very little chance you have read this far :-) But if you
> did... I'm *really* interested to hear what you think. In any case I'll
> report back to this thread once a decision has been made.
>
> Cheers
>
> [0] https://eos-web.web.cern.ch/eos-web/
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
>     https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
> [2] https://forge.softwareheritage.org/T3054
> [3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
> [4] https://forge.softwareheritage.org/T3052#58864
> [5] https://forge.softwareheritage.org/T3052#58917
> [6] https://forge.softwareheritage.org/T3052#58876
> [7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
> [8] https://forge.softwareheritage.org/T3055
> [9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
> [10] https://github.com/chrislusf/seaweedfs/wiki/Components
> [11] https://forge.softwareheritage.org/T3049
> [12] https://forge.softwareheritage.org/T3054#58977
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Loïc Dachary, Artisan Logiciel Libre