RE: newstore direction

Allen Samuels Wed, 21 Oct 2015 17:54:07 -0700

Fixing the bug doesn't take a long time. Getting it deployed is where the delay 
is. Many companies standardize on a particular release of a particular distro. 
Getting them to switch to a new release -- even a "bug fix" point release -- is 
a major undertaking that often is a complete roadblock. Just my experience. 
YMMV.



Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-----Original Message-----
From: Ric Wheeler [mailto:rwhee...@redhat.com] 
Sent: Wednesday, October 21, 2015 8:24 PM
To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction



On 10/21/2015 06:06 AM, Allen Samuels wrote:
> I agree that moving NewStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface: essentially, you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (the file system) and can never be tightly connected to the 
> other one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.

I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  Very hard to 
understand why the local file system is a barrier for performance in this case 
when it is not an issue in existing enterprise applications.

We need some deep analysis with some local file system experts thrown in to 
validate the concerns.

>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work, I will offer the case that the 
> "loosely coupled" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed at 
> customer environments).

Not clear what bugs you are thinking of or why you think fixing bugs will take 
a long time to hit the field in XFS. Red Hat has most of the XFS developers on 
staff and we actively backport fixes and ship them, other distros do as well.

Never seen a "bug" take a couple of years to hit users.

Regards,

Ric

>
> Another example: Sage has just had to substantially rework the journaling 
> code of RocksDB.
>
> In short, as you can tell, I'm full-throated in favor of going down the 
> optimal route.
>
> Internally at SanDisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:
>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (LevelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+-tree write amplification is nearly constant, independent of the 
> size of data under management. As the KV database gets larger (NewStore is 
> effectively moving the per-file inode into the KV database, and don't forget 
> the checksums that Sage wants to add :)), this performance delta swamps all 
> others.
> (2) Having a KV store and a file system causes a double lookup. This costs 
> CPU time and disk accesses to page in data structure indexes, and metadata 
> efficiency decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes, an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
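
To make the scaling argument in (1) concrete, here is a back-of-the-envelope sketch. It assumes a textbook leveled-compaction LSM (each level `fanout` times larger than the one above, each byte rewritten about `fanout` times per level) and a B+-tree that rewrites one leaf page per insert. Both cost models, the function names, and every parameter value are illustrative assumptions, not measurements of RocksDB or ZetaScale.

```python
def lsm_levels(data_bytes, memtable_bytes, fanout):
    """Count on-disk levels needed to hold data_bytes in the toy model."""
    levels, capacity = 1, memtable_bytes * fanout
    while capacity < data_bytes:
        capacity *= fanout
        levels += 1
    return levels


def lsm_write_amp(data_bytes, memtable_bytes=64 << 20, fanout=10):
    """Assumed cost model: each byte is rewritten about `fanout` times per level."""
    return fanout * lsm_levels(data_bytes, memtable_bytes, fanout)


def btree_write_amp(record_bytes=256, page_bytes=4096):
    """Assumed cost model: one full leaf page is rewritten per record insert."""
    return page_bytes // record_bytes


for size in (1 << 30, 1 << 40):  # 1 GiB and 1 TiB under management
    print(size, lsm_write_amp(size), btree_write_amp())
```

In this toy model the LSM figure rises as levels are added (roughly with the logarithm of the data size), while the B+-tree figure does not depend on data size at all. Real systems will differ depending on compaction style, caching, and record size.
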
>
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>    1) a key/value interface is a better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>    2) a file system is well suited for storing object data (as files).
>>
>> So far #1 is working out well, but I'm questioning the wisdom of #2.  
>> A few things:
>>
>>    - We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's at least 3 IOs: one for the data, one for the 
>> fs journal, one for the kv txn to commit (at least once my rocksdb 
>> changes land... the kv commit is currently 2-3).  So two people are 
>> managing metadata here: the fs managing the file metadata (with its 
>> own journal) and the kv backend (with its journal).
> If all of the fsync()'s fall into the same backing file system, are you sure 
> that each fsync() takes the same time? Depending on the local FS 
> implementation of course, but the order of issuing those fsync()'s can 
> effectively make some of them no-ops.
>
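
The write sequence described above (data write, fsync, then kv commit) can be sketched as follows. This is a minimal illustration only: the append-only JSON log stands in for the kv backend, and `write_object`, the record layout, and the file names are hypothetical, not NewStore code.

```python
import json
import os
import tempfile


def write_object(root, name, data, kv_log_path):
    """Write object data to a file, make it durable, then commit metadata."""
    path = os.path.join(root, name)
    with open(path, "wb") as f:
        f.write(data)          # IO 1: the object data itself
        f.flush()
        os.fsync(f.fileno())   # forces data and file metadata out; on a
                               # journaling fs this usually adds a journal write
    record = json.dumps({"name": name, "size": len(data)}) + "\n"
    with open(kv_log_path, "a") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno()) # the kv transaction commit (toy stand-in)
    return path


with tempfile.TemporaryDirectory() as root:
    write_object(root, "obj1", b"hello", os.path.join(root, "kv.log"))
```

Here both fsync() calls land on the same backing file system, which is the situation the question above is about; how much each one actually costs depends on the local file system.
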
>>    - On read we have to open files by name, which means traversing 
>> the fs namespace.  Newstore tries to keep it as flat and simple as 
>> possible, but at a minimum it is a couple btree lookups.  We'd love 
>> to use open by handle (which would reduce this to 1 btree traversal), 
>> but running the daemon as ceph and not root makes that hard...
> This seems like a pretty low hurdle to overcome.
>
>>    - ...and file systems insist on updating mtime on writes, even 
>> when it is an overwrite with no allocation changes.  (We don't care 
>> about mtime.) O_NOCMTIME patches exist but it is hard to get these 
>> past the kernel brainfreeze.
> Are you using O_DIRECT? Seems like there should be some enterprisey database 
> tricks that we can use here.
>
>>    - XFS is (probably) never going to give us data checksums, 
>> which we want desperately.
> What is the goal of having the file system do the checksums? How strong do 
> they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each 
> write will possibly generate at least one other write to update that new 
> checksum).
>
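
As a rough illustration of the trade-off raised here, the sketch below keeps one CRC32 per fixed-size chunk and counts how many checksum entries an overwrite has to refresh. The 4096-byte chunk size, the choice of CRC32, and the function names are assumptions made for the example; the thread leaves both strength and chunk size open.

```python
import zlib

CHUNK = 4096  # assumed chunk size


def chunk_checksums(buf, chunk=CHUNK):
    """One CRC32 per chunk of the buffer."""
    return [zlib.crc32(bytes(buf[i:i + chunk])) for i in range(0, len(buf), chunk)]


def overwrite(buf, sums, offset, new_bytes, chunk=CHUNK):
    """Overwrite in place (non-empty, within the existing length) and refresh
    only the checksums of the chunks that were touched.
    Returns the number of checksum entries rewritten."""
    buf[offset:offset + len(new_bytes)] = new_bytes
    first = offset // chunk
    last = (offset + len(new_bytes) - 1) // chunk
    for i in range(first, last + 1):
        sums[i] = zlib.crc32(bytes(buf[i * chunk:(i + 1) * chunk]))
    return last - first + 1


buf = bytearray(3 * CHUNK)
sums = chunk_checksums(buf)
touched = overwrite(buf, sums, CHUNK - 6, b"x" * 10)  # straddles two chunks
```

Even a 10-byte write that straddles a chunk boundary refreshes two checksum entries in this sketch, which is the extra metadata update the paragraph above is concerned about.
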
>> But what's the alternative?  My thought is to just bite the bullet 
>> and consume a raw block device directly.  Write an allocator, 
>> hopefully keep it pretty simple, and manage it in kv store along with 
>> all of our other metadata.
> The big problem with consuming block devices directly is that you ultimately 
> end up recreating most of the features that you had in the file system. Even 
> enterprise databases like Oracle and DB2 have been migrating away from 
> running on raw block devices in favor of file systems over time.  In effect, 
> you are looking at making a simple on disk file system which is always easier 
> to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time 
> working with the local file system people (XFS or other) to see if we can 
> jointly address the concerns you have.
>> Wins:
>>
>>    - 2 IOs for most: one to write the data to unused space in the 
>> block device, one to commit our transaction (vs 4+ before).  For 
>> overwrites, we'd have one io to do our write-ahead log (kv journal), 
>> then do the overwrite async (vs 4+ before).
>>
>>    - No concern about mtime getting in the way
>>
>>    - Faster reads (no fs lookup)
>>
>>    - Similarly sized metadata for most objects.  If we assume most 
>> objects are not fragmented, then the metadata to store the block 
>> offsets is about the same size as the metadata to store the filenames we 
>> have now.
>>
>> Problems:
>>
>>    - We have to size the kv backend storage (probably still an XFS 
>> partition) vs the block storage.  Maybe we do this anyway (put 
>> metadata on SSD!) so it won't matter.  But what happens when we are 
>> storing gobs of rgw index data or cephfs metadata?  Suddenly we are 
>> pulling storage out of a different pool and those aren't currently 
>> fungible.
>>
>>    - We have to write and maintain an allocator.  I'm still 
>> optimistic this can be reasonably simple, especially for the flash 
>> case (where fragmentation isn't such an issue as long as our blocks 
>> are reasonably sized).  For disk we may need to be moderately clever.
>>
>>    - We'll need a fsck to ensure our internal metadata is consistent.
>> The good news is it'll just need to validate what we have stored in 
>> the kv store.
>>
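
To give a feel for the size of the allocator and fsck problems listed above, here is a minimal sketch: a first-fit extent allocator over a flat range of blocks, plus a check that the object-to-extent map and the free list together cover the device exactly once. The class, the `fsck` function, and the data layout are hypothetical illustrations, not a proposed design.

```python
class ExtentAllocator:
    """First-fit allocator; free space is a sorted list of (start, length)."""

    def __init__(self, total_blocks):
        self.free = [(0, total_blocks)]

    def allocate(self, nblocks):
        for i, (start, length) in enumerate(self.free):
            if length >= nblocks:
                if length == nblocks:
                    del self.free[i]
                else:
                    self.free[i] = (start + nblocks, length - nblocks)
                return start
        raise RuntimeError("no contiguous extent large enough")

    def release(self, start, nblocks):
        self.free.append((start, nblocks))
        self.free.sort()
        merged = []
        for s, l in self.free:  # coalesce adjacent free extents
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + l)
            else:
                merged.append((s, l))
        self.free = merged


def fsck(extent_map, free_list, total_blocks):
    """Return a list of problems; an empty list means the metadata is consistent."""
    spans = [(s, l, name) for name, (s, l) in extent_map.items()]
    spans += [(s, l, "<free>") for s, l in free_list]
    spans.sort()
    problems, pos = [], 0
    for start, length, owner in spans:
        if start < pos:
            problems.append(f"{owner}: overlaps an earlier extent at block {start}")
        elif start > pos:
            problems.append(f"blocks {pos}-{start - 1} are neither used nor free")
        pos = max(pos, start + length)
    if pos != total_blocks:
        problems.append(f"extents end at block {pos}, device has {total_blocks}")
    return problems
```

In a real store the extent map and free list would live in the kv store and be updated in the same transaction as the object metadata; this sketch keeps them in memory and ignores fragmentation policy entirely.
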
>> Other thoughts:
>>
>>    - We might want to consider whether dm-thin or bcache or other 
>> block layers might help us with elasticity of file vs block areas.
>>
>>    - Rocksdb can push colder data to a second directory, so we could 
>> have a fast ssd primary area (for wal and most metadata) and a second 
>> hdd directory for stuff it has to push off.  Then have a conservative 
>> amount of file space on the hdd.  If our block fills up, use the 
>> existing file mechanism to put data there too.  (But then we have to 
>> maintain both the current kv + file approach and not go all-in on kv 
>> + block.)
>>
>> Thoughts?
>> sage
>> --
> I really hate the idea of making a new file system type (even if we call it a 
> raw block store!).
>
> In addition to the technical hurdles, there are also production worries like 
> how long will it take for distros to pick up formal support?  How do we test 
> it properly?
>
> Regards,
>
> Ric
>
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html