On 16 Jun 2017, at 10:51, Clint Byrum wrote:
> This is great work.
>
> I'm sure you've already thought of this, but could you explain why
> you've chosen not to put the small objects in the k/v store as part of
> the value rather than in secondary large files?

I don't want to co-opt an answer from Alex, but I do want to point to
some of the other background on this LOSF work.

https://wiki.openstack.org/wiki/Swift/ideas/small_files
https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation

Look at the second link for some context to your question, but the
summary is "that means writing a file system, and writing a file system
is really hard".

--John

>
> Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
>> Swift stores objects on a regular filesystem (XFS is recommended), one
>> file per object. While this works fine for medium or big objects, with
>> lots of small objects you can run into issues: because of the high
>> inode count on the object servers, the inodes can't stay in cache,
>> which means a lot of memory usage and IO operations to fetch them from
>> disk.
>>
>> In the past few months, we've been working on implementing a new
>> storage backend in Swift. It is highly inspired by haystack[1]. In a
>> few words, objects are stored in big files, and a key/value store
>> provides the information needed to locate an object
>> (object hash -> big_file_id:offset). As the mapping in the K/V
>> consumes less memory than an inode, it is possible to keep all entries
>> in memory, saving many IOs when locating objects. It also allows some
>> performance improvements by limiting XFS metadata updates (e.g. almost
>> no inode updates, as we write objects using fdatasync() instead of
>> fsync()).
>>
>> One of the questions raised while discussing this design is: do we
>> want one K/V store per device, or one K/V store per Swift partition
>> (i.e. multiple K/V stores per device)? The concern is the failure
>> domain: if a device's only K/V store gets corrupted, the whole device
>> must be reconstructed. Memory usage is a major point in making a
>> decision, so we ran some benchmarks.
>>
>> The key/value store is implemented on top of LevelDB.
>> Given a single disk with 20 million files (each either one object
>> replica or one fragment, if using EC), I have tested three cases:
>> - a single KV for the whole disk
>> - one KV per partition, with 100 partitions per disk
>> - one KV per partition, with 1000 partitions per disk
>>
>> Single KV for the disk:
>> - DB size: 750 MB
>> - bytes per object: 38
>>
>> One KV per partition, assuming:
>> - 100 partitions on the disk (=> 100 KVs)
>> - 16 bit part power (=> all keys in a given KV share the same 16 bit
>>   prefix)
>> - 7916 KB per KV, total DB size: 773 MB
>> - bytes per object: 41
>>
>> One KV per partition, assuming:
>> - 1000 partitions on the disk (=> 1000 KVs)
>> - 16 bit part power (=> all keys in a given KV share the same 16 bit
>>   prefix)
>> - 1388 KB per KV, total DB size: 1355 MB
>> - bytes per object: 71
>>
>> A typical server we use for Swift clusters has 36 drives, which gives
>> us:
>> - Single KV: 26 GB
>> - Split KV, 100 partitions: 28 GB (+7%)
>> - Split KV, 1000 partitions: 48 GB (+85%)
>>
>> So, splitting seems reasonable if you don't have too many partitions.
>>
>> Same test, with 10 million files instead of 20:
>>
>> - Single KV: 13 GB
>> - Split KV, 100 partitions: 18 GB (+38%)
>> - Split KV, 1000 partitions: 24 GB (+85%)
>>
>> Finally, if we run a full compaction on the DB after the test, we get
>> the same memory usage in all cases: about 32 bytes per object.
>>
>> We have not run enough tests to know what would happen in production.
>> LevelDB does trigger compaction automatically on parts of the DB, but
>> continuous change means we would probably never reach the smallest
>> possible size.
>>
>> Beyond the size issue, there are other things to consider:
>>
>> File descriptor limits: LevelDB seems to keep at least 4 file
>> descriptors open during operation.
>>
>> Having one KV per partition also means you have to move entries
>> between KVs when you change the part power (if we want to support
>> that).
>>
>> A compromise may be to split KVs on a small prefix of the object's
>> hash, independent of Swift's configuration.
>>
>> As you can see, we're still thinking about this. Any ideas are
>> welcome! We will keep you updated about more "real world" testing.
>> Among other things, we plan to check how resilient the DB is in case
>> of a power loss.
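To make the design in the quoted mail a bit more concrete, here is a
rough sketch of the write path it describes: append the object to a big
"volume" file, fdatasync(), then record the location in LevelDB. This is
only an illustration under my own assumptions; the actual LOSF code
linked above may use a different key encoding, file layout, and LevelDB
binding (I'm using plyvel here).

# Illustration only, not the LOSF implementation: key encoding, file
# layout and the use of the plyvel LevelDB binding are my assumptions.
import os
import struct

import plyvel


class VolumeStore(object):
    def __init__(self, volume_path, kv_path):
        # One big append-only "volume" file (simplified to a single file).
        self.volume = open(volume_path, 'ab')
        # LevelDB maps object hash -> (offset, length) inside the volume.
        self.kv = plyvel.DB(kv_path, create_if_missing=True)

    def put(self, object_hash, data):
        self.volume.seek(0, os.SEEK_END)
        offset = self.volume.tell()
        self.volume.write(data)
        self.volume.flush()
        # fdatasync() pushes the data to disk without forcing the inode
        # (mtime, size metadata) flush that fsync() would also do, which
        # is the metadata-saving point made in the mail.
        os.fdatasync(self.volume.fileno())
        # A fixed-size entry: 8 bytes offset + 8 bytes length.
        self.kv.put(object_hash.encode('ascii'),
                    struct.pack('>QQ', offset, len(data)))

    def get(self, object_hash):
        entry = self.kv.get(object_hash.encode('ascii'))
        if entry is None:
            return None
        offset, length = struct.unpack('>QQ', entry)
        with open(self.volume.name, 'rb') as f:
            f.seek(offset)
            return f.read(length)

With a 16-byte value per object plus the key, an entry lands in the same
ballpark as the 38 bytes per object measured above, and 20 million such
entries fit in memory far more comfortably than 20 million XFS inodes.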
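The per-server numbers follow directly from the quoted bytes-per-object
figures; here is a quick back-of-the-envelope check (my arithmetic,
using only the numbers from the mail):

# Back-of-the-envelope check of the quoted figures: 20 million objects
# per disk, 36 disks per server, bytes-per-object as measured above.
OBJECTS_PER_DISK = 20 * 10**6
DISKS_PER_SERVER = 36

for label, bytes_per_object in [('Single KV', 38),
                                ('100 KVs/disk', 41),
                                ('1000 KVs/disk', 71)]:
    per_disk = OBJECTS_PER_DISK * bytes_per_object
    per_server = per_disk * DISKS_PER_SERVER
    print('%-13s %6.0f MB/disk %6.1f GB/server'
          % (label, per_disk / 2.0**20, per_server / 2.0**30))

# Prints roughly 725 MB / 25.5 GB, 782 MB / 27.5 GB and 1354 MB / 47.6 GB,
# in line with the 26 / 28 / 48 GB per server quoted above.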
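On the last point in the mail (splitting KVs on a small prefix of the
object hash rather than per partition), the routing could look something
like the sketch below. The prefix length, the KV naming, and the
simplified hash computation are all hypothetical; Swift's real object
hash also mixes in the cluster's hash path prefix and suffix.

# Hypothetical sketch of the "split on a small hash prefix" compromise:
# pick the KV from the top bits of the object hash, independent of the
# ring's part power.
import hashlib

PREFIX_BITS = 4  # 2**4 = 16 KVs per disk; a made-up value

def kv_for_object(account, container, obj, hash_suffix=''):
    # Simplified version of Swift's object hash (MD5 of the object path
    # plus the per-cluster hash path suffix).
    path = '/%s/%s/%s%s' % (account, container, obj, hash_suffix)
    object_hash = hashlib.md5(path.encode('utf-8')).hexdigest()
    kv_index = int(object_hash, 16) >> (128 - PREFIX_BITS)
    return object_hash, 'kv.%02d' % kv_index

print(kv_for_object('AUTH_test', 'photos', 'cat.jpg'))

Because the split would not depend on the part power, a part power
change would not require moving entries between KVs, which addresses one
of the concerns above; the trade-off is that a corrupted KV would then
cover a fixed fraction of the device rather than a single partition.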
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev