Hello,

TL;DR: Is it more advisable to work on Ceph internals to make it friendly to 
this particular workload, or to write something similar to EOS[0] (i.e. RocksDB + 
XRootD + RBD)?

This is a follow-up to two previous mails[1] sent while researching this topic. 
In a nutshell, the Software Heritage project[1] currently has ~750TB and 10 
billion objects, 75% of which are smaller than 16KB and 50% of which are smaller 
than 4KB. But those small objects only account for ~5% of the 750TB: the 25% of 
objects larger than 16KB total ~700TB. The objects can be compressed by ~50%, so 
the 750TB only needs about 350TB of actual storage (if you're interested in the 
details see [2]).
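
To make those proportions concrete, here is a quick back-of-the-envelope check 
(the percentages are approximations, the exact figures are in [2]):

  total_objects = 10_000_000_000     # ~10 billion objects
  total_tb      = 750                # ~750TB overall

  small_objects = total_objects * 0.75   # objects smaller than 16KB...
  small_tb      = total_tb * 0.05        # ...hold only ~5% of the bytes
  large_tb      = total_tb - small_tb    # the 25% larger than 16KB

  print(f"{small_objects:.0f} small objects in ~{small_tb:.1f}TB")
  print(f"remaining large objects in ~{large_tb:.0f}TB")
  # -> ~7.5 billion small objects holding ~37.5TB, ~700TB for the rest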

Let's say those 10 billion objects are stored in a single 4+2 erasure coded pool 
with bluestore compression enabled for objects larger than 32KB and the 
smallest allocation size for bluestore set to 4KB[3]. The 750TB won't use the 
expected 350TB but about 30% more, i.e. ~450TB (see [4] for the maths). This 
space amplification happens because storing a 1 byte object uses the same space 
as storing a 16KB object (see [5] to repeat the experiment at home). In a 4+2 
erasure coded pool, each of the 6 chunks will use no less than 4KB because 
that's the smallest allocation size for bluestore. The 4 data chunks alone 
occupy 4 * 4KB = 16KB even when all that is needed is 1 byte.
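
For what it's worth, here is the core of that calculation as a small Python 
sketch (a back-of-the-envelope version, not the exact maths from [4]):

  # Space used by the data chunks of one object in a 4+2 erasure coded
  # pool when bluestore's smallest allocation size is 4KB.  The 2 parity
  # chunks add their own 2 * 4KB on top of this.
  K, ALLOC = 4, 4096   # data chunks, bluestore min alloc size (bytes)

  def data_footprint(size):
      chunk = -(-size // K)                  # bytes per data chunk
      return -(-chunk // ALLOC) * ALLOC * K  # each chunk rounded up to 4KB

  print(data_footprint(1))      # 16384 -> a 1 byte object uses 16KB
  print(data_footprint(16384))  # 16384 -> same as a real 16KB object

Multiplied by the ~7.5 billion objects smaller than 16KB, that 16KB floor alone 
roughly accounts for the extra ~100TB.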

It was suggested[6] to have two different pools: a 4+2 erasure coded pool with 
compression for all objects larger than 32KB, which are expected to compress 
down to 16KB, and a 3 replica pool for the smaller objects, to reduce space 
amplification to a minimum without compromising on durability. A client looking 
for an object could make two simultaneous requests to the two pools: it would 
get a 404 from one of them and the object from the other.
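
For illustration, a minimal sketch of that read path with the Python rados 
bindings (the pool names are made up, and it tries the pools one after the 
other instead of truly in parallel):

  import rados

  SMALL_POOL = "objects-small"   # 3 replicas, for the small objects
  LARGE_POOL = "objects-large"   # 4+2 erasure coded, compressed

  cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
  cluster.connect()

  def get(oid, max_size=64 * 1024 * 1024):
      # Try both pools; the pool that does not have the object raises
      # ObjectNotFound, which is the "404" case described above.
      for pool in (SMALL_POOL, LARGE_POOL):
          ioctx = cluster.open_ioctx(pool)
          try:
              return ioctx.read(oid, max_size)
          except rados.ObjectNotFound:
              continue
          finally:
              ioctx.close()
      raise KeyError(oid)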

Another workaround is best described in the "Finding a needle in Haystack: 
Facebook’s photo storage"[9] paper and essentially boils down to using a 
database to store a map from object name to location. That does not scale out 
(writing the database index is the bottleneck) but it's simple enough and is 
successfully implemented in EOS[0], with >200PB worth of data, and in 
seaweedfs[10], another promising object store based on the same idea.
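
To illustrate the idea (this is not how EOS or seaweedfs actually implement 
it), the index is just a small record per object pointing into a big 
append-only volume; something along these lines, with sqlite standing in for 
the database:

  import sqlite3

  # Toy Haystack-style index: object name -> (volume, offset, length).
  # The payloads would live in large append-only volumes (RBD images in
  # the variant discussed here), written sequentially.
  db = sqlite3.connect("needles.db")
  db.execute("""CREATE TABLE IF NOT EXISTS needle (
                  name   TEXT PRIMARY KEY,
                  volume TEXT NOT NULL,
                  offset INTEGER NOT NULL,
                  length INTEGER NOT NULL)""")

  def remember(name, volume, offset, length):
      # Writing this index is the part that does not scale out.
      db.execute("INSERT OR REPLACE INTO needle VALUES (?, ?, ?, ?)",
                 (name, volume, offset, length))
      db.commit()

  def locate(name):
      row = db.execute("SELECT volume, offset, length FROM needle"
                       " WHERE name = ?", (name,)).fetchone()
      if row is None:
          raise KeyError(name)
      return row   # caller reads `length` bytes at `offset` in `volume`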

Instead of working around the problem, maybe Ceph could be modified to make 
better use of the immutability of these objects[7], a hint that is apparently 
only used to figure out how best to compress them and for checksum 
calculation[8]. I honestly have no clue how difficult that would be. All I know 
is that it's not easy, otherwise it would have been done already: there seems 
to be a general need for efficiently (space wise and performance wise) storing 
large quantities of objects smaller than 4KB.

Is it more advisable to:

  * work on Ceph internals to make it friendly to this particular workload, or
  * write another implementation of "Finding a needle in Haystack: Facebook’s 
photo storage"[9] based on RBD[11]?

I'm currently leaning toward working on Ceph internals, but there are pros and 
cons to both approaches[12]. And since all this is still very new to me, there 
is also the possibility that I'm missing something. Maybe it's *super* 
difficult to improve Ceph in this way. I should try to figure that out sooner 
rather than later.

I realize it's a lot to take in, and unless you're facing the exact same problem 
there is very little chance you read this far :-) But if you did... I'm 
*really* interested to hear what you think. In any case I'll report back to 
this thread once a decision has been made.

Cheers

[0] https://eos-web.web.cern.ch/eos-web/
[1] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AEMW6O7WVJFMUIX7QGI2KM7HKDSTNIYT/
 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RHQ5ZCHJISXIXOJSH3TU7DLYVYHRGTAT/
[2] https://forge.softwareheritage.org/T3054
[3] https://github.com/ceph/ceph/blob/3f5e778ad6f055296022e8edabf701b6958fb602/src/common/options.cc#L4326-L4330
[4] https://forge.softwareheritage.org/T3052#58864
[5] https://forge.softwareheritage.org/T3052#58917
[6] https://forge.softwareheritage.org/T3052#58876
[7] https://docs.ceph.com/en/latest/rados/api/librados/#c.@3.LIBRADOS_ALLOC_HINT_FLAG_IMMUTABLE
[8] https://forge.softwareheritage.org/T3055
[9] https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
[10] https://github.com/chrislusf/seaweedfs/wiki/Components
[11] https://forge.softwareheritage.org/T3049
[12] https://forge.softwareheritage.org/T3054#58977

-- 
Loïc Dachary, Artisan Logiciel Libre

