I am pushing internally to open-source ZetaScale. Recent events may or may not 
affect that trajectory -- stay tuned.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-----Original Message-----
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: Wednesday, October 21, 2015 10:45 PM
To: Allen Samuels <allen.samu...@sandisk.com>; Ric Wheeler 
<rwhee...@redhat.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I agree that moving NewStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface: essentially, you need to make transactionally 
> consistent updates to two indexes, one of which (the file system) doesn't 
> understand transactions and can never be tightly connected to the other one.
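
To make the two-index problem concrete, here is a rough sketch of the
loosely coupled write path (illustrative only: KVStore, write_object and
onode_key are invented names, not NewStore code). The point is that the
file system update and the KV update are two separate durability points
that can never be joined into one atomic transaction:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <string>

    // Hypothetical KV handle; stands in for RocksDB/ZetaScale/etc.
    struct KVStore {
        bool commit_txn(const std::string&, const std::string&) { return true; }  // stub
    };

    // Loosely coupled write: two indexes, two independent commit points.
    bool write_object(KVStore& kv, const std::string& path,
                      const void* data, size_t len, const std::string& onode_key)
    {
        int fd = ::open(path.c_str(), O_WRONLY | O_CREAT, 0644);
        if (fd < 0) return false;

        // Commit point #1: object data plus file metadata, made durable by
        // the file system and its own journal.
        ssize_t wrote = ::pwrite(fd, data, len, 0);
        int rc = ::fsync(fd);              // must finish before the KV commit
        ::close(fd);
        if (wrote != (ssize_t)len || rc != 0) return false;

        // Commit point #2: the ObjectStore's index entry, durable via the KV
        // store's journal.  A crash between #1 and #2 leaves the two indexes
        // disagreeing (an orphaned file), so a cleanup/replay protocol is
        // needed -- the file system cannot participate in the KV transaction,
        // which is why the coupling can only ever be loose.
        return kv.commit_txn(onode_key, "location: " + path);
    }
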
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.
>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work, I will offer the case that the 
> "loosely coupled" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited by bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS is widely deployed in 
> customer environments).
>
> Another example: Sage has just had to substantially rework the journaling 
> code of RocksDB.
>
> In short, as you can tell, I'm full-throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:

Has there been any discussion regarding open-sourcing ZetaScale?

>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (LevelDB/RocksDB). LSM trees experience an exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases, whereas B+-tree write amplification is nearly constant, 
> independent of the size of data under management. As the KV database gets 
> larger (since NewStore is effectively moving the per-file inode into the KV 
> database; don't forget the checksums that Sage wants to add :)), this 
> performance delta swamps all others. (See the rough sketch below.)
> (2) Having both a KV store and a file system causes a double lookup. This 
> costs CPU time and disk accesses to page in data-structure indexes, and 
> metadata efficiency decreases.
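
As a back-of-envelope illustration of point (1), here is a small sketch
that uses common rule-of-thumb formulas rather than measured ZetaScale or
RocksDB numbers; the fanout, memtable size, and B+-tree constant are
assumptions chosen only to show the shape of the effect (each level the
data set grows into adds roughly another fanout's worth of rewrites, while
the B+-tree estimate stays flat):

    #include <cmath>
    #include <cstdio>

    int main() {
        // Illustrative rule-of-thumb constants (assumptions, not measurements).
        const double fanout      = 10.0;  // leveled-LSM size ratio between levels
        const double memtable_gb = 0.25;  // size of a memtable/L0 flush
        const double btree_wa    = 3.0;   // assumed roughly constant B+-tree WA

        for (double db_gb = 1; db_gb <= 4096; db_gb *= 4) {
            // Levels needed to hold db_gb at a 10x fanout.
            double levels = std::ceil(std::log(db_gb / memtable_gb) /
                                      std::log(fanout));
            // Classic leveled-compaction estimate: ~fanout rewrites per level.
            double lsm_wa = levels * fanout;
            std::printf("%6.0f GB under management: LSM WA ~%5.1f, B+-tree WA ~%.1f\n",
                        db_gb, lsm_wa, btree_wa);
        }
        return 0;
    }
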
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes, an LSM tree performs better on HDD than a B-tree does, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>    1) a key/value interface is a better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>    2) a file system is well suited for storing object data (as files).
>>
>> So far #1 is working out well, but I'm questioning the wisdom of #2.  
>> A few things:
>>
>>    - We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's at least 3 IOs: one for the data, one for the 
>> fs journal, one for the kv txn to commit (at least once my rocksdb 
>> changes land... the kv commit is currently 2-3).  So two parties are 
>> managing metadata here: the fs managing the file metadata (with its 
>> own journal) and the kv backend (with its journal).
>
> If all of the fsync()'s fall into the same backing file system, are you sure 
> that each fsync() takes the same time? It depends on the local FS 
> implementation, of course, but the order in which those fsync()'s are issued 
> can effectively make some of them no-ops.
>
>>
>>    - On read we have to open files by name, which means traversing 
>> the fs namespace.  Newstore tries to keep it as flat and simple as 
>> possible, but at a minimum it is a couple btree lookups.  We'd love 
>> to use open by handle (which would reduce this to 1 btree traversal), 
>> but running the daemon as ceph and not root makes that hard...
>
> This seems like a pretty low hurdle to overcome.
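
For reference, the open-by-handle path looks roughly like this on Linux (a
sketch, not NewStore code); the catch is that open_by_handle_at() requires
the CAP_DAC_READ_SEARCH capability, which a daemon running as the
unprivileged ceph user does not normally have:

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>
    #include <cstdlib>

    // Resolve a name to a handle once (one namespace traversal)...
    static struct file_handle* get_handle(const char* path, int* mount_id) {
        struct file_handle* fh =
            (struct file_handle*)std::malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) < 0) {
            std::free(fh);
            return nullptr;
        }
        return fh;  // small enough to stash in the kv store next to the onode
    }

    // ...then reopen later without walking the namespace at all.  This call
    // fails with EPERM unless the caller has CAP_DAC_READ_SEARCH -- the
    // "running as ceph and not root" problem mentioned above.
    static int reopen(int mount_fd, struct file_handle* fh) {
        return open_by_handle_at(mount_fd, fh, O_RDONLY);
    }
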
>
>>
>>    - ...and file systems insist on updating mtime on writes, even 
>> when it is an overwrite with no allocation changes.  (We don't care 
>> about mtime.)  O_NOCMTIME patches exist but it is hard to get these 
>> past the kernel brainfreeze.
>
> Are you using O_DIRECT? Seems like there should be some enterprisey database 
> tricks that we can use here.
>
>>
>>    - XFS is (probably) never going to give us data checksums, which 
>> we want desperately.
>
> What is the goal of having the file system do the checksums? How strong do 
> they need to be, and what size are the chunks?
>
> If you update these on each IO, this will certainly generate more IO (each 
> write will possibly generate at least one other write to update that new 
> checksum).
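
One possible shape for the checksum question, purely as an illustration and
not a settled design: keep a per-chunk checksum vector in the object's KV
metadata, so the checksum update is encoded into a KV commit the write
transaction performs anyway rather than becoming a separate IO. The names
and the toy hash below are invented:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative per-object checksum metadata.
    struct ChunkChecksums {
        uint32_t chunk_size = 64 * 1024;   // e.g. 64 KB granularity
        std::vector<uint32_t> csum;        // one checksum per chunk
    };

    // Toy hash stand-in so the sketch is self-contained; a real implementation
    // would use something like hardware-accelerated crc32c.
    static uint32_t toy_csum(const uint8_t* p, size_t n) {
        uint32_t h = 0x811c9dc5u;
        for (size_t i = 0; i < n; ++i) h = (h ^ p[i]) * 0x01000193u;
        return h;
    }

    // On write (assuming chunk-aligned writes for simplicity): recompute only
    // the chunks the write touched.  The updated vector travels in the same KV
    // transaction as the rest of the object metadata, so no extra IO is issued
    // just for the checksum.
    void update_csums(ChunkChecksums& m, uint64_t off,
                      const uint8_t* buf, size_t len) {
        size_t first   = off / m.chunk_size;
        size_t nchunks = (len + m.chunk_size - 1) / m.chunk_size;
        if (m.csum.size() < first + nchunks) m.csum.resize(first + nchunks, 0);
        for (size_t i = 0; i < nchunks; ++i) {
            size_t n = std::min<size_t>(m.chunk_size, len - i * m.chunk_size);
            m.csum[first + i] = toy_csum(buf + i * m.chunk_size, n);
        }
    }
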
>
>>
>> But what's the alternative?  My thought is to just bite the bullet 
>> and consume a raw block device directly.  Write an allocator, 
>> hopefully keep it pretty simple, and manage it in kv store along with 
>> all of our other metadata.
>
> The big problem with consuming block devices directly is that you ultimately 
> end up recreating most of the features that you had in the file system. Even 
> enterprise databases like Oracle and DB2 have been migrating away from 
> running on raw block devices in favor of file systems over time.  In effect, 
> you are looking at making a simple on-disk file system, which is always 
> easier to start than it is to bring to a stable, production-ready state.
>
> I think that it might be quicker and more maintainable to spend some time 
> working with the local file system people (XFS or other) to see if we can 
> jointly address the concerns you have.
>>
>> Wins:
>>
>>    - 2 IOs for most writes: one to write the data to unused space in 
>> the block device, one to commit our transaction (vs 4+ before).  For 
>> overwrites, we'd have one IO to do our write-ahead log (kv journal), 
>> then do the overwrite async (vs 4+ before).
>>
>>    - No concern about mtime getting in the way
>>
>>    - Faster reads (no fs lookup)
>>
>>    - Similarly sized metadata for most objects.  If we assume most 
>> objects are not fragmented, then the metadata to store the block 
>> offsets is about the same size as the metadata to store the filenames we 
>> have now.
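
To put the 2-IO claim next to the current 4+ IO path, here is a rough
sketch of the new-write case with invented stand-in interfaces (Allocator,
BlockDev, KVStore below are placeholders, not a proposed API):

    #include <cstdint>
    #include <map>
    #include <string>

    // Invented stand-ins for the allocator, the raw device, and the KV store.
    struct Extent    { uint64_t offset = 0, length = 0; };
    struct Allocator {
        uint64_t next = 0;
        Extent allocate(uint64_t len) {          // toy bump allocator
            Extent e; e.offset = next; e.length = len; next += len; return e;
        }
    };
    struct BlockDev  { bool write(uint64_t, const void*, uint64_t) { return true; } };
    struct KVTxn     {
        std::map<std::string, std::string> ops;
        void put(const std::string& k, const std::string& v) { ops[k] = v; }
    };
    struct KVStore   { KVTxn begin() { return KVTxn(); } bool commit(KVTxn&) { return true; } };

    // Write to a previously unwritten region: two IOs total.
    bool write_new(Allocator& alloc, BlockDev& dev, KVStore& kv,
                   const std::string& onode_key, const void* data, uint64_t len)
    {
        Extent e = alloc.allocate(len);                    // in-memory bookkeeping only
        if (!dev.write(e.offset, data, len)) return false; // IO #1: data to unused space
        KVTxn t = kv.begin();
        t.put(onode_key, "extent " + std::to_string(e.offset) +
                         "+" + std::to_string(e.length));
        t.put("allocator/state", "updated in the same txn"); // allocator state in the kv store
        return kv.commit(t);                               // IO #2: one kv journal commit
        // An overwrite would instead log into the kv write-ahead log (1 IO)
        // and apply to the block device asynchronously afterwards.
    }
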
>>
>> Problems:
>>
>>    - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put 
>> metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs 
>> of rgw index data or cephfs metadata?  Suddenly we are pulling 
>> storage out of a different pool and those aren't currently fungible.
>>
>>    - We have to write and maintain an allocator.  I'm still 
>> optimistic this can be reasonably simple, especially for the flash 
>> case (where fragmentation isn't such an issue as long as our blocks 
>> are reasonably sized).  For disk we may need to be moderately clever.
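
On the allocator point, the simple flash-friendly case could be little more
than a free-extent map keyed by offset, with its state persisted through the
same kv store; the sketch below is written under those assumptions and is
not a proposal for the real data structure:

    #include <cstdint>
    #include <iterator>
    #include <map>

    // Minimal first-fit extent allocator: free space as {offset -> length}.
    // The map would be serialized into the kv store so that allocation state
    // commits in the same transaction as the object metadata.
    class ExtentAllocator {
    public:
        explicit ExtentAllocator(uint64_t device_size) { free_[0] = device_size; }

        // Returns the offset of an allocated run, or UINT64_MAX if nothing fits.
        uint64_t allocate(uint64_t len) {
            for (auto it = free_.begin(); it != free_.end(); ++it) {
                if (it->second < len) continue;              // first fit
                uint64_t off = it->first;
                uint64_t remaining = it->second - len;
                free_.erase(it);
                if (remaining) free_[off + len] = remaining;
                return off;
            }
            return UINT64_MAX;
        }

        // Free a run, merging with adjacent free extents to limit fragmentation
        // (the "moderately clever" part that matters far more on spinning disks).
        void release(uint64_t off, uint64_t len) {
            auto next = free_.lower_bound(off);
            if (next != free_.begin()) {
                auto prev = std::prev(next);
                if (prev->first + prev->second == off) {     // merge with predecessor
                    off = prev->first;
                    len += prev->second;
                    free_.erase(prev);
                }
            }
            if (next != free_.end() && off + len == next->first) {
                len += next->second;                         // merge with successor
                free_.erase(next);
            }
            free_[off] = len;
        }

    private:
        std::map<uint64_t, uint64_t> free_;   // offset -> length of free extent
    };
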
>>
>>    - We'll need a fsck to ensure our internal metadata is consistent.
>> The good news is it'll just need to validate what we have stored in 
>> the kv store.
>>
>> Other thoughts:
>>
>>    - We might want to consider whether dm-thin or bcache or other 
>> block layers might help us with elasticity of file vs block areas.
>>
>>    - Rocksdb can push colder data to a second directory, so we could 
>> have a fast ssd primary area (for wal and most metadata) and a second 
>> hdd directory for stuff it has to push off.  Then have a conservative 
>> amount of file space on the hdd.  If our block fills up, use the 
>> existing file mechanism to put data there too.  (But then we have to 
>> maintain both the current kv + file approach and not go all-in on 
>> kv + block.)
>>
>> Thoughts?
>> sage
>> --
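
On the colder-data point above: RocksDB already exposes that kind of split
through DBOptions::db_paths and wal_dir, so the configuration could look
roughly like the sketch below (the paths and target sizes are made-up
placeholders):

    #include <cassert>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
        rocksdb::Options options;
        options.create_if_missing = true;

        // Keep the write-ahead log on the fast device.
        options.wal_dir = "/ssd/ceph-kv/wal";                        // placeholder

        // RocksDB fills earlier paths up to their target sizes and pushes the
        // rest of the SST files to later paths.
        options.db_paths.emplace_back("/ssd/ceph-kv", 64ull << 30);  // ~64 GB on SSD
        options.db_paths.emplace_back("/hdd/ceph-kv", 1ull << 40);   // overflow on HDD

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/ssd/ceph-kv", &db);
        assert(s.ok());
        delete db;
        return 0;
    }
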
>
> I really hate the idea of making a new file system type (even if we call it a 
> raw block store!).
>
> In addition to the technical hurdles, there are also production worries: how 
> long will it take for distros to pick up formal support?  How do we test it 
> properly?
>
> Regards,
>
> Ric
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
