On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs, we need to get rid
> of this filesystem overhead (which I am in the process of measuring). Also,
> it would be good if we can eliminate the dependency on the k/v dbs (for
> storing allocators and all). The reason is the unknown write amps they
> cause.

My hope is to keep this behind the KeyValueDB interface (and/or change it as 
appropriate) so that other backends can be easily swapped in (e.g. a 
btree-based one for high-end flash).
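
To make that concrete, here is a minimal sketch of the kind of pluggable,
transaction-oriented boundary this implies; the names below are illustrative
and hypothetical, not the actual KeyValueDB API:

    // Hypothetical pluggable kv backend boundary: a rocksdb-backed
    // implementation and, say, a btree-on-flash one would both sit
    // behind the same transaction interface.
    #include <memory>
    #include <string>

    struct KVTransaction {
      virtual void set(const std::string& prefix, const std::string& key,
                       const std::string& value) = 0;
      virtual void rm(const std::string& prefix, const std::string& key) = 0;
      virtual ~KVTransaction() {}
    };

    struct KVBackend {
      virtual std::unique_ptr<KVTransaction> new_transaction() = 0;
      virtual int submit(KVTransaction& t) = 0;   // commit atomically
      virtual int get(const std::string& prefix, const std::string& key,
                      std::string* value) = 0;
      virtual ~KVBackend() {}
    };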

sage


> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is a better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, write-ahead 
> logging, overlay data, etc.)
> 
>  2) a file system is well suited for storing object data (as files).
> 
> So far #1 is working out well, but I'm questioning the wisdom of #2.  A few 
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two layers are managing 
> metadata here: the fs managing the file metadata (with its own 
> journal) and the kv backend (with its journal).  (That IO sequence is 
> sketched after this list.)
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but at 
> a minimum it is a couple of btree lookups.  We'd love to use open by handle 
> (which would reduce this to 1 btree traversal; see the second sketch after 
> this list), but running the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is an 
> overwrite with no allocation changes.  (We don't care about mtime.) 
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going to give us data checksums, which we 
> want desperately.
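> 
> To make the write path concrete, a hypothetical outline of the current 
> sequence (the kv_commit callback stands in for the rocksdb transaction 
> submit; names are illustrative, not actual newstore code):
> 
>     // Sketch of the current path; each numbered step is a separate trip
>     // to stable storage.
>     #include <cerrno>
>     #include <functional>
>     #include <unistd.h>
> 
>     int do_object_write(int data_fd, const char* buf, size_t len, off_t off,
>                         const std::function<int()>& kv_commit) {
>       if (pwrite(data_fd, buf, len, off) < 0)   // (1) data into the file
>         return -errno;
>       if (fsync(data_fd) < 0)                   // (2) flush; also forces an
>         return -errno;                          //     fs journal commit
>       return kv_commit();                       // (3) kv txn commit, with
>     }                                           //     its own journal write
> 
> And for reference, the open-by-handle path on Linux would look roughly like 
> this; open_by_handle_at() requires CAP_DAC_READ_SEARCH, which is exactly the 
> privilege an unprivileged ceph daemon lacks:
> 
>     // Sketch: resolve the handle once (e.g. at create time) with
>     // name_to_handle_at(), persist it in the kv metadata, then reopen the
>     // file later with a single call instead of a namespace walk.
>     #define _GNU_SOURCE
>     #include <fcntl.h>
>     #include <cstdlib>
>     #include <cstring>
> 
>     int reopen_by_handle(int mount_fd, const unsigned char* bytes,
>                          unsigned len, int type) {
>       struct file_handle* fh = static_cast<struct file_handle*>(
>           std::malloc(sizeof(*fh) + len));
>       fh->handle_bytes = len;
>       fh->handle_type  = type;
>       std::memcpy(fh->f_handle, bytes, len);
>       int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
>       std::free(fh);
>       return fd;
>     }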
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep it 
> pretty simple, and manage it in the kv store along with all of our other 
> metadata.
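> 
> As a rough illustration of how simple that allocator could stay, a toy 
> first-fit extent allocator whose free map would itself live in the kv store 
> (purely hypothetical; coalescing and persistence omitted):
> 
>     // Toy free-extent allocator: free space is a map of offset -> length.
>     // Allocations and releases would be folded into the same kv
>     // transaction that carries the object metadata.
>     #include <cstdint>
>     #include <map>
> 
>     class ExtentAllocator {
>       std::map<uint64_t, uint64_t> free_;   // offset -> length
>      public:
>       void add_free(uint64_t off, uint64_t len) { free_[off] = len; }
> 
>       // First-fit allocate; returns UINT64_MAX if nothing fits.
>       uint64_t allocate(uint64_t want) {
>         for (auto p = free_.begin(); p != free_.end(); ++p) {
>           if (p->second < want)
>             continue;
>           uint64_t off = p->first;
>           uint64_t rem = p->second - want;
>           free_.erase(p);
>           if (rem)
>             free_[off + want] = rem;        // keep the tail free
>           return off;
>         }
>         return UINT64_MAX;
>       }
> 
>       void release(uint64_t off, uint64_t len) { free_[off] = len; }
>     };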
> 
> Wins:
> 
>  - 2 IOs for most writes: one to write the data to unused space in the block 
> device, one to commit our transaction (vs 4+ before).  For overwrites, we'd 
> have one IO to do our write-ahead log (kv journal), then do the overwrite 
> async (vs 4+ before).  (See the sketch after this list.)
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are 
> not fragmented, then the metadata to store the block offsets is about the 
> same size as the metadata to store the filenames we have now.
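> 
> To make that IO accounting concrete, a hypothetical sketch of the two cases; 
> the stub types below stand in for the block device, allocator, and kv store, 
> and none of this is actual newstore code:
> 
>     // Illustrative two-IO write path on a raw block device + kv store.
>     #include <cstdint>
>     #include <string>
>     #include <vector>
> 
>     struct BlockDev  { void write(uint64_t off, const std::string& d) {} };
>     struct Allocator { uint64_t allocate(uint64_t len) { return 0; } };
>     struct KvTxn {
>       std::vector<std::string> ops;
>       void put(const std::string& k, const std::string& v) {
>         ops.push_back(k + "=" + v);
>       }
>     };
>     struct KvStore   { void submit(const KvTxn& t) {} };
> 
>     // New data (append / fresh object): write to unused space, then commit.
>     void write_new(BlockDev& dev, Allocator& alloc, KvStore& kv,
>                    const std::string& oid, const std::string& data) {
>       uint64_t off = alloc.allocate(data.size());
>       dev.write(off, data);                        // IO #1: data to free space
>       KvTxn t;
>       t.put("extent." + oid, std::to_string(off)); // extent map in the txn
>       kv.submit(t);                                // IO #2: metadata commit
>     }
> 
>     // Overwrite: log intent in the kv journal now, apply the write async.
>     void write_overwrite(KvStore& kv, const std::string& oid,
>                          const std::string& data) {
>       KvTxn t;
>       t.put("wal." + oid, data);                   // IO #1: wal entry
>       kv.submit(t);                                // (overwrite applied later,
>     }                                              //  off the commit path)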
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
> index data or cephfs metadata?  Suddenly we are pulling storage out of a 
> different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be reasonably simple, especially for the flash case (where fragmentation isn't 
> such an issue as long as our blocks are reasonably sized).  For disk we may 
> need to be moderately clever.
> 
>  - We'll need an fsck to ensure our internal metadata is consistent.  The good 
> news is it'll just need to validate what we have stored in the kv store.
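> 
> One example of the kind of invariant such an fsck would check, as a sketch 
> (hypothetical types; the real checks would also cover object metadata, wal 
> entries, etc.): every block should appear in at most one of the allocated 
> extent map and the free map.
> 
>     // Sketch: verify that no allocated extent overlaps a free extent,
>     // given offset -> length maps reconstructed from the kv store.
>     #include <cstdint>
>     #include <map>
> 
>     typedef std::map<uint64_t, uint64_t> ExtentMap;  // offset -> length
> 
>     bool check_no_overlap(const ExtentMap& allocated,
>                           const ExtentMap& free_map) {
>       for (const auto& a : allocated) {
>         uint64_t a_end = a.first + a.second;
>         // first free extent starting at or after this allocation
>         auto f = free_map.lower_bound(a.first);
>         if (f != free_map.end() && f->first < a_end)
>           return false;                    // free extent starts inside it
>         if (f != free_map.begin()) {
>           --f;
>           if (f->first + f->second > a.first)
>             return false;                  // previous free extent overlaps
>         }
>       }
>       return true;
>     }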
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block layers 
> might help us with elasticity of file vs block areas.
> 
>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd directory 
> for stuff it has to push off.  Then have a conservative amount of file space 
> on the hdd.  If our block area fills up, use the existing file mechanism to put 
> data there too.  (But then we have to maintain both the current kv + file 
> approach and not go all-in on kv + block.)
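> 
> (For what it's worth, rocksdb exposes this as the db_paths option; a rough 
> sketch, with made-up paths and sizes, assuming my reading of that option is 
> right:)
> 
>     // Sketch: let rocksdb keep newer/hotter SST files on a fast path and
>     // spill older/colder files to a second, larger path.
>     #include <rocksdb/options.h>
> 
>     rocksdb::Options make_tiered_options() {
>       rocksdb::Options opts;
>       opts.create_if_missing = true;
>       // Fill the ssd path up to ~10 GB, then place files on the hdd path.
>       opts.db_paths.push_back(
>           rocksdb::DbPath("/var/lib/ceph/osd/0/db.ssd", 10ull << 30));
>       opts.db_paths.push_back(
>           rocksdb::DbPath("/var/lib/ceph/osd/0/db.hdd", 1ull << 40));
>       opts.wal_dir = "/var/lib/ceph/osd/0/db.wal";  // keep the wal on fast media
>       return opts;
>     }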
> 
> Thoughts?
> sage