On Tue, 20 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage,
> Sorry for the confusion. SSDs with key/value interfaces are still under development by several vendors, and their design approach is quite different from that of Open-Channel SSDs. I met Matias several months ago and discussed the possibility of key/value interface support on Open-Channel SSDs, but I have not followed the progress since then. If Matias is in this group, he can certainly give us a better explanation. Here is his presentation on key/value support with Open-Channel SSDs, for your reference:
>
> http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf
Ok cool. I saw Matias' talk at Vault and was very pleased to see that there is some real effort to get away from black-box FTLs. And I am eagerly awaiting the arrival of SSDs with a kv interface... open channel especially, but even proprietary devices exposing kv would be an improvement over proprietary devices exposing block. :)

sage

> Regards,
> James
>
> -----Original Message-----
> From: Sage Weil [mailto:sw...@redhat.com]
> Sent: Tuesday, October 20, 2015 5:34 AM
> To: James (Fei) Liu-SSI
> Cc: Somnath Roy; ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
>
> On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> > Hi Sage and Somnath,
> > In my humble opinion, there is another, more aggressive solution than a key/value store on a raw block device as the objectstore backend. A new key/value SSD device with transaction support would be ideal for solving these issues. First, it is a raw SSD device. Second, it provides a key/value interface directly from the SSD. Third, it can provide transaction support, so consistency is guaranteed by the hardware. It pretty much satisfies all of the objectstore's needs without any extra overhead, since there is no extra layer between the device and the objectstore.
>
> Are you talking about open channel SSDs? Or something else? Everything I'm familiar with that is currently shipping exposes either a vanilla block interface (conventional SSDs) that hides all of that, or NVMe (which isn't much better).
>
> If there is a low-level KV interface we can consume, that would be great--especially if we can glue it to our KeyValueDB abstract API. Even so, we need to make sure that the object *data* also has an efficient API we can utilize, one that handles block-sized/aligned data well.
>
> sage
>
> > Either way, I strongly support having Ceph's own data format instead of relying on a filesystem.
> >
> > Regards,
> > James
> >
> > -----Original Message-----
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that. If we want to saturate SSDs, we need to get rid of this filesystem overhead (which I am in the process of measuring). Also, it would be good if we could eliminate the dependency on the k/v dbs (for storing allocators and so on). The reason is the unknown write amplification they cause.
> >
> > My hope is to stay behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash).
> >
> > sage
> >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > > 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.)
> > >
> > > 2) a file system is well suited for storing object data (as files).
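For illustration, a rough sketch of the kind of KeyValueDB-style transaction that idea 1 relies on, and that a KV-capable device (or any other backend) would need to sit behind. This is not the actual Ceph KeyValueDB class; the interface, key prefixes, and values below are simplified stand-ins.

    // Simplified stand-in for a KeyValueDB-style transaction interface.
    // A backend could be RocksDB, a btree on flash, or (hypothetically)
    // a KV-interface SSD that provides transactional semantics in hardware.
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct KVTransaction {
      // Batched mutations, applied atomically on submit.
      std::vector<std::pair<std::string, std::string>> sets;
      std::vector<std::string> removes;
      void set(const std::string& k, const std::string& v) { sets.emplace_back(k, v); }
      void rmkey(const std::string& k) { removes.push_back(k); }
    };

    struct KVBackend {
      std::map<std::string, std::string> store;  // in-memory placeholder
      void submit_transaction(const KVTransaction& t) {
        for (const auto& k : t.removes) store.erase(k);
        for (const auto& kv : t.sets) store[kv.first] = kv.second;
      }
    };

    int main() {
      KVBackend db;
      KVTransaction t;
      // Object metadata, attrs, and collection membership all become keys
      // under per-object prefixes; a single commit covers all of them.
      t.set("meta/pool1/obj123", "size=4096 extents=[0x10000+0x1000]");
      t.set("attr/pool1/obj123/_user.version", "7");
      t.set("coll/pg1.2/obj123", "");
      db.submit_transaction(t);
      std::cout << db.store.size() << " keys stored\n";
    }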
> > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few things:
> > >
> > > - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two parties are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal).
> > >
> > > - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple of btree lookups. We'd love to use open-by-handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard...
> > >
> > > - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist, but it is hard to get these past the kernel brainfreeze.
> > >
> > > - XFS is (probably) never going to give us data checksums, which we want desperately.
> > >
> > > But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata.
> > >
> > > Wins:
> > >
> > > - 2 IOs for most writes: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one IO to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before).
> > >
> > > - No concern about mtime getting in the way
> > >
> > > - Faster reads (no fs lookup)
> > >
> > > - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now.
> > >
> > > Problems:
> > >
> > > - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool, and those aren't currently fungible.
> > >
> > > - We have to write and maintain an allocator. I'm still optimistic this can be reasonably simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonably sized). For disk we may need to be moderately clever.
> > >
> > > - We'll need an fsck to ensure our internal metadata is consistent. The good news is it'll just need to validate what we have stored in the kv store.
> > >
> > > Other thoughts:
> > >
> > > - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas.
> > >
> > > - Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for the wal and most metadata) and a second hdd directory for stuff it has to push off. Then have a conservative amount of file space on the hdd. If our block area fills up, use the existing file mechanism to put data there too. (But then we have to maintain both the current kv + file approach and not go all-in on kv + block.)
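To make the "write an allocator, keep it pretty simple" point concrete, here is a minimal first-fit extent allocator over the raw device's space. It is only an illustration of the idea, not Ceph code; a real allocator would also need persistence in the kv store, coalescing of freed extents, and alignment policy.

    // Minimal first-fit extent allocator over a raw block device's space.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <optional>

    class SimpleAllocator {
      std::map<uint64_t, uint64_t> free_;  // offset -> length of free extents
    public:
      explicit SimpleAllocator(uint64_t device_size) { free_[0] = device_size; }

      // First fit: return the offset of the first free extent big enough.
      std::optional<uint64_t> allocate(uint64_t len) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
          if (it->second >= len) {
            uint64_t off = it->first;
            uint64_t remaining = it->second - len;
            free_.erase(it);
            if (remaining) free_[off + len] = remaining;  // keep the tail free
            return off;
          }
        }
        return std::nullopt;  // no extent large enough
      }

      // Return an extent to the free map (no coalescing, for brevity).
      void release(uint64_t off, uint64_t len) { free_[off] = len; }
    };

    int main() {
      SimpleAllocator alloc(1ull << 30);       // 1 GiB device
      auto a = alloc.allocate(64 * 1024);      // 64 KiB object write
      auto b = alloc.allocate(4 * 1024);       // 4 KiB object write
      if (a && b) std::cout << "a at " << *a << ", b at " << *b << "\n";
      alloc.release(*a, 64 * 1024);            // object deleted, space reusable
    }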
> > >
> > > Thoughts?
> > > sage
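To spell out the IO accounting in the "Wins" section above: a new write costs one data IO into freshly allocated space plus one kv commit, while an overwrite costs one synchronous kv WAL commit, with the device write applied asynchronously afterwards. The following is a minimal sketch under those assumptions; the function names and key formats are illustrative, not actual Ceph interfaces.

    // Illustrative flow of the proposed write path (not Ceph code).
    #include <cstdint>
    #include <iostream>
    #include <string>

    struct Extent { uint64_t off; uint64_t len; };

    // Stand-ins for the raw block device and the kv backend.
    void write_block(const Extent& e, const std::string& data) {
      std::cout << "data IO: " << data.size() << " bytes at offset " << e.off << "\n";
    }
    void kv_commit(const std::string& txn) {
      std::cout << "kv commit: " << txn << "\n";  // journaled by the kv store itself
    }

    // New data: write into unused space, then commit the metadata. 2 IOs total.
    void write_new(const std::string& obj, const std::string& data, const Extent& fresh) {
      write_block(fresh, data);                                              // IO #1
      kv_commit("meta/" + obj + " -> extent@" + std::to_string(fresh.off));  // IO #2
    }

    // Overwrite of already-allocated space: one synchronous kv WAL commit,
    // then the device write is applied off the commit path.
    void write_overwrite(const std::string& obj, const std::string& data, const Extent& e) {
      kv_commit("wal/" + obj + " overlay " + std::to_string(data.size()) + " bytes");  // IO #1
      write_block(e, data);                  // applied asynchronously later
      kv_commit("wal/" + obj + " trimmed");  // wal entry cleaned up afterwards
    }

    int main() {
      write_new("pool1/objA", std::string(4096, 'x'), Extent{0x10000, 4096});
      write_overwrite("pool1/objA", std::string(4096, 'y'), Extent{0x10000, 4096});
    }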