On Tue, Oct 20, 2015 at 6:19 AM, Mark Nelson <mnel...@redhat.com> wrote:
> On 10/20/2015 07:30 AM, Sage Weil wrote:
>>
>> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>>>
>>> +1, nowadays K-V DBs care more about very small key-value pairs, say
>>> several bytes to a few KB, but in the SSD case we only care about 4KB or
>>> 8KB. In this way, NVMKV is a good design, and it seems some of the SSD
>>> vendors are also trying to build this kind of interface; we have an
>>> NVM-L library, but it is still under development.
>>
>> Do you have an NVMKV link? I see a paper and a stale github repo... not
>> sure if I'm looking at the right thing.
>>
>> My concern with using a key/value interface for the object data is that
>> you end up with lots of key/value pairs (e.g., $inode_$offset =
>> $4kb_of_data) that are pretty inefficient to store and (depending on the
>> implementation) tend to break alignment. I don't think these interfaces
>> are targeted toward block-sized/aligned payloads. Storing just the
>> metadata (block allocation map) w/ the kv api and storing the data
>> directly on a block/page interface makes more sense to me.
>>
>> sage
>
> I get the feeling that some of the folks that were involved with nvmkv at
> Fusion IO have left. Nisha Talagala is now out at Parallel Systems, for
> instance. http://pmem.io might be a better bet, though I haven't looked
> closely at it.
>
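To make Sage's concern above concrete, here is a minimal C++ sketch of the
two key layouts being contrasted: object data chunked directly into the KV
store under per-block keys, versus a single per-object extent map in the KV
store with the data written block-aligned to a raw block/page device. The
key formats, struct, and helper names are illustrative assumptions, not
NewStore's actual schema.

// Layout (a): object data chunked into the KV store itself.  Every block
// of an object becomes its own key/value pair, so a 4 MB object at 4 KB
// granularity is ~1000 pairs whose packing and alignment are up to the
// KV engine.
//
// Layout (b): the KV store holds only the block allocation map; the data
// is written block-aligned to the raw device.
//
// Key formats below are made up for illustration.
#include <cstdint>
#include <cstdio>
#include <string>

std::string data_key(uint64_t ino, uint64_t offset) {        // layout (a)
  char buf[64];
  std::snprintf(buf, sizeof(buf), "data.%016llx.%016llx",
                (unsigned long long)ino, (unsigned long long)offset);
  return buf;   // value: one 4 KB chunk of object data
}

struct Extent {   // layout (b): where a piece of the object lives on disk
  uint64_t lba;   // device offset
  uint32_t len;   // length in bytes
};

std::string extent_map_key(uint64_t ino) {                    // layout (b)
  char buf[32];
  std::snprintf(buf, sizeof(buf), "map.%016llx", (unsigned long long)ino);
  return buf;   // value: a short serialized list of Extents
}

With the first layout the KV engine decides how those thousands of small
values get packed; with the second, the KV store only ever sees one small,
metadata-sized value per object.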
IMO pmem.io is more suited for SCM (Storage Class Memory) than for SSDs.
If Newstore is targeted towards production deployments (eventually
replacing FileStore someday), then IMO I agree with Sage, i.e. rely on a
file system for doing block allocation.

-Neo

> Mark
>
>>
>>>> -----Original Message-----
>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>>> To: Sage Weil; Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> Hi Sage and Somnath,
>>>> In my humble opinion, there is another, more aggressive solution than
>>>> a raw-block-device-based key-value store as the backend for the
>>>> objectstore. A new key-value SSD device with transaction support would
>>>> be ideal to solve these issues. First of all, it is a raw SSD device.
>>>> Secondly, it provides a key-value interface directly from the SSD.
>>>> Thirdly, it can provide transaction support; consistency will be
>>>> guaranteed by the hardware device. It pretty much satisfies all of the
>>>> objectstore's needs without any extra overhead, since there is no
>>>> extra layer between the device and the objectstore.
>>>> Either way, I strongly support having Ceph's own data format instead
>>>> of relying on a filesystem.
>>>>
>>>> Regards,
>>>> James
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 1:55 PM
>>>> To: Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>>>
>>>>> Sage,
>>>>> I fully support that. If we want to saturate SSDs, we need to get
>>>>> rid of this filesystem overhead (which I am in the process of
>>>>> measuring). Also, it will be good if we can eliminate the dependency
>>>>> on the k/v dbs (for storing allocators and all). The reason is the
>>>>> unknown write amps they cause.
>>>>
>>>> My hope is to keep behind the KeyValueDB interface (and/or change it
>>>> as appropriate) so that other backends can be easily swapped in (e.g.
>>>> a btree-based one for high-end flash).
>>>>
>>>> sage
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-ow...@vger.kernel.org
>>>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>>> To: ceph-devel@vger.kernel.org
>>>>> Subject: newstore direction
>>>>>
>>>>> The current design is based on two simple ideas:
>>>>>
>>>>> 1) a key/value interface is a better way to manage all of our
>>>>> internal metadata (object metadata, attrs, layout, collection
>>>>> membership, write-ahead logging, overlay data, etc.)
>>>>>
>>>>> 2) a file system is well suited for storing object data (as files).
>>>>>
>>>>> So far 1 is working out well, but I'm questioning the wisdom of #2.
>>>>> A few things:
>>>>>
>>>>> - We currently write the data to the file, fsync, then commit the kv
>>>>> transaction. That's at least 3 IOs: one for the data, one for the fs
>>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>>> changes land... the kv commit is currently 2-3). So two layers are
>>>>> managing metadata here: the fs managing the file metadata (with its
>>>>> own journal) and the kv backend (with its journal).
>>>>>
>>>>> - On read we have to open files by name, which means traversing the
>>>>> fs namespace. Newstore tries to keep it as flat and simple as
>>>>> possible, but at a minimum it is a couple of btree lookups. We'd love
>>>>> to use open by handle (which would reduce this to 1 btree traversal),
>>>>> but running the daemon as ceph and not root makes that hard...
>>>>>
>>>>> - ...and file systems insist on updating mtime on writes, even when
>>>>> it is an overwrite with no allocation changes. (We don't care about
>>>>> mtime.) O_NOCMTIME patches exist but it is hard to get these past the
>>>>> kernel brainfreeze.
>>>>>
>>>>> - XFS is (probably) never going to give us data checksums, which we
>>>>> want desperately.
>>>>>
>>>>> But what's the alternative? My thought is to just bite the bullet and
>>>>> consume a raw block device directly. Write an allocator, hopefully
>>>>> keep it pretty simple, and manage it in the kv store along with all
>>>>> of our other metadata.
>>>>>
>>>>> Wins:
>>>>>
>>>>> - 2 IOs for most writes: one to write the data to unused space in the
>>>>> block device, one to commit our transaction (vs 4+ before). For
>>>>> overwrites, we'd have one IO to do our write-ahead log (kv journal),
>>>>> then do the overwrite async (vs 4+ before).
>>>>>
>>>>> - No concern about mtime getting in the way.
>>>>>
>>>>> - Faster reads (no fs lookup).
>>>>>
>>>>> - Similarly sized metadata for most objects. If we assume most
>>>>> objects are not fragmented, then the metadata to store the block
>>>>> offsets is about the same size as the metadata to store the filenames
>>>>> we have now.
>>>>>
>>>>> Problems:
>>>>>
>>>>> - We have to size the kv backend storage (probably still an XFS
>>>>> partition) vs the block storage. Maybe we do this anyway (put
>>>>> metadata on SSD!) so it won't matter. But what happens when we are
>>>>> storing gobs of rgw index data or cephfs metadata? Suddenly we are
>>>>> pulling storage out of a different pool and those aren't currently
>>>>> fungible.
>>>>>
>>>>> - We have to write and maintain an allocator. I'm still optimistic
>>>>> this can be reasonably simple, especially for the flash case (where
>>>>> fragmentation isn't such an issue as long as our blocks are
>>>>> reasonably sized). For disk we may need to be moderately clever.
>>>>>
>>>>> - We'll need an fsck to ensure our internal metadata is consistent.
>>>>> The good news is it'll just need to validate what we have stored in
>>>>> the kv store.
>>>>>
>>>>> Other thoughts:
>>>>>
>>>>> - We might want to consider whether dm-thin or bcache or other block
>>>>> layers might help us with elasticity of file vs block areas.
>>>>>
>>>>> - Rocksdb can push colder data to a second directory, so we could
>>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>>> hdd directory for stuff it has to push off. Then have a conservative
>>>>> amount of file space on the hdd. If our block fills up, use the
>>>>> existing file mechanism to put data there too. (But then we have to
>>>>> maintain both the current kv + file approach and not go all-in on
>>>>> kv + block.)
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> sage
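As a rough illustration of the "2 IOs for most writes" win above: allocate
unused space, write the data there, then commit the object's extent map
(and the allocator state, which also lives in the kv store) in a single KV
transaction. The interfaces below are placeholders invented for this
sketch, not Ceph's actual KeyValueDB or NewStore code.

// Placeholder interfaces invented for this sketch; not Ceph's KeyValueDB
// or NewStore code.  The point is the IO count: one device write for the
// data, one synchronous KV commit for all of the metadata.
#include <cstdint>
#include <memory>
#include <string>

struct Extent { uint64_t lba = 0; uint32_t len = 0; };

struct BlockDevice {                       // raw block device wrapper
  virtual int write(uint64_t lba, const std::string& data) = 0;   // IO #1
  virtual ~BlockDevice() = default;
};

struct KVTransaction {                     // batched, atomic KV updates
  virtual void set(const std::string& key, const std::string& val) = 0;
  virtual ~KVTransaction() = default;
};

struct KVStore {
  virtual std::unique_ptr<KVTransaction> get_transaction() = 0;
  virtual int submit_sync(KVTransaction& t) = 0;                  // IO #2
  virtual ~KVStore() = default;
};

struct Allocator {              // a naive version is sketched further below
  virtual Extent allocate(uint32_t len) = 0;
  virtual std::string serialized_free_map() const = 0;
  virtual ~Allocator() = default;
};

static std::string serialize(const Extent& e) {
  return std::to_string(e.lba) + ":" + std::to_string(e.len);
}

// Write brand-new object data: no fs journal, no inode, no mtime update.
int write_new_object(BlockDevice& bdev, KVStore& kv, Allocator& alloc,
                     const std::string& object, const std::string& data) {
  Extent e = alloc.allocate(static_cast<uint32_t>(data.size()));
  int r = bdev.write(e.lba, data);                // IO #1: the data itself
  if (r < 0)
    return r;
  auto t = kv.get_transaction();
  t->set("map." + object, serialize(e));          // object -> extent map
  t->set("free", alloc.serialized_free_map());    // allocator state
  return kv.submit_sync(*t);                      // IO #2: one atomic commit
}

An overwrite would instead add the new bytes to the same kind of
transaction as a write-ahead-log entry and apply them to the allocated
extent asynchronously, which is the one-IO-up-front case described above.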
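On the "we have to write and maintain an allocator" problem above, a
deliberately naive free-extent allocator gives a feel for how simple the
flash case could stay. This is a sketch only: no persistence (the free map
would be serialized into the same KV transactions as the rest of the
metadata), no locking, and no policy beyond first-fit allocation and
merging adjacent extents on release.

// A deliberately naive free-extent allocator: a map of free extents keyed
// by device offset, first-fit allocation, and adjacent-extent merging on
// release.  No persistence, no locking, no fragmentation policy.
#include <cstdint>
#include <iterator>
#include <map>
#include <stdexcept>

class SimpleAllocator {
  std::map<uint64_t, uint64_t> free_;   // free extents: offset -> length

public:
  explicit SimpleAllocator(uint64_t device_size) { free_[0] = device_size; }

  // First-fit: carve the request out of the first big-enough free extent.
  uint64_t allocate(uint64_t len) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second >= len) {
        uint64_t off = it->first;
        uint64_t remainder = it->second - len;
        free_.erase(it);
        if (remainder)
          free_[off + len] = remainder;   // keep the unused tail free
        return off;
      }
    }
    throw std::runtime_error("out of space");
  }

  // Return an extent to the free map, merging with its neighbours so the
  // map stays small when objects are deleted in bulk.
  void release(uint64_t off, uint64_t len) {
    auto next = free_.lower_bound(off);
    if (next != free_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == off) {   // merge with left neighbour
        off = prev->first;
        len += prev->second;
        free_.erase(prev);
      }
    }
    if (next != free_.end() && off + len == next->first) {   // merge right
      len += next->second;
      free_.erase(next);
    }
    free_[off] = len;
  }
};

For spinning disks, the same structure could grow a size-bucketed index so
allocations prefer large contiguous extents; that is presumably the
"moderately clever" part mentioned above.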