On Tue, Oct 20, 2015 at 6:19 AM, Mark Nelson <mnel...@redhat.com> wrote:
> On 10/20/2015 07:30 AM, Sage Weil wrote:
>>
>> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>>>
>>> +1. Nowadays K-V DBs care mostly about very small key-value pairs, say
>>> several bytes to a few KB, but in the SSD case we only care about 4KB or
>>> 8KB. In that respect NVMKV is a good design, and it seems some of the SSD
>>> vendors are also trying to build this kind of interface; we have an NVM-L
>>> library as well, but it is still under development.
>>
>>
>> Do you have an NVMKV link?  I see a paper and a stale github repo.. not
>> sure if I'm looking at the right thing.
>>
>> My concern with using a key/value interface for the object data is that
>> you end up with lots of key/value pairs (e.g., $inode_$offset =
>> $4kb_of_data) that are pretty inefficient to store and (depending on the
>> implementation) tend to break alignment.  I don't think these interfaces
>> are targeted toward block-sized/aligned payloads.  Storing just the
>> metadata (block allocation map) with the kv api and storing the data
>> directly on a block/page interface makes more sense to me.
>>
>> sage
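
To make the contrast above concrete, here is a rough sketch of the two key
layouts.  The names and encodings are mine, purely for illustration (nothing
from newstore itself):

    #include <cstdint>
    #include <string>

    // Scheme A: object data lives in the KV store itself, one pair per 4KB
    // block.  Every block becomes a ~4KB value keyed by (object, offset) --
    // lots of pairs, and nothing guarantees the values stay block-aligned
    // on the media.
    std::string data_block_key(uint64_t ino, uint64_t offset) {
      return std::to_string(ino) + "_" + std::to_string(offset);  // value = 4KB of data
    }

    // Scheme B: only the allocation map lives in the KV store; the data
    // itself is written to the allocated extent on a raw block/page device.
    struct Extent { uint64_t lba; uint32_t length; };
    std::string extent_map_key(uint64_t ino) {
      return "emap_" + std::to_string(ino);  // small value: a list of Extents
    }

In scheme A every 4KB write turns into a roughly 4KB value the KV engine packs
however it likes; in scheme B the KV store only ever sees small metadata.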
>
>
> I get the feeling that some of the folks who were involved with nvmkv at
> Fusion IO have left.  Nisha Talagala is now at Parallel Systems, for
> instance.  http://pmem.io might be a better bet, though I haven't looked
> closely at it.
>

IMO pmem.io is better suited to SCM (Storage Class Memory) than to SSDs.

If Newstore is targeted toward production deployments (eventually
replacing FileStore), then I agree with Sage, i.e. rely on a file
system for doing block allocation.

-Neo


> Mark
>
>
>>
>>
>>>> -----Original Message-----
>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>>> To: Sage Weil; Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> Hi Sage and Somnath,
>>>>    In my humble opinion, there is an even more aggressive solution than
>>>> a key-value store on a raw block device as the objectstore backend: a
>>>> key-value SSD device with transaction support would be ideal for solving
>>>> these issues.  First, it is a raw SSD device.  Second, it provides a
>>>> key-value interface directly from the SSD.  Third, it can provide
>>>> transaction support, so consistency is guaranteed by the hardware.  It
>>>> pretty much satisfies all of the objectstore's needs without any extra
>>>> overhead, since there is no extra layer between the device and the
>>>> objectstore.
>>>>    Either way, I strongly support having Ceph own its data format
>>>> instead of relying on a filesystem.
>>>>
>>>>    Regards,
>>>>    James
>>>>
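
I'm not aware of a published API for such a device, but as a thought
experiment, the interface being described might look roughly like the
following.  This is entirely hypothetical, not any vendor's actual SDK:

    #include <cstdint>
    #include <string>

    // Hypothetical transactional key-value SSD interface, for discussion only.
    class KvSsd {
     public:
      class Txn {
       public:
        virtual void put(const std::string& key, const void* val, uint32_t len) = 0;
        virtual void del(const std::string& key) = 0;
        virtual int commit() = 0;   // atomic + durable, enforced by the device
        virtual ~Txn() = default;
      };
      virtual Txn* begin_txn() = 0;
      virtual int get(const std::string& key, std::string* out) = 0;
      virtual ~KvSsd() = default;
    };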
>>>> -----Original Message-----
>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 1:55 PM
>>>> To: Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>>>
>>>>> Sage,
>>>>> I fully support that.  If we want to saturate SSDs, we need to get
>>>>> rid of this filesystem overhead (which I am in the process of
>>>>> measuring).  Also, it would be good if we could eliminate the
>>>>> dependency on the k/v dbs (for storing allocators and so on), because
>>>>> of the unknown write amplification they cause.
>>>>
>>>>
>>>> My hope is to keep things behind the KeyValueDB interface (and/or change
>>>> it as appropriate) so that other backends can be easily swapped in
>>>> (e.g. a btree-based one for high-end flash).
>>>>
>>>> sage
>>>>
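
The value of staying behind a KV abstraction is that the backend becomes a
swappable detail.  Ceph's actual KeyValueDB interface differs in its
specifics; the sketch below only illustrates the shape of the idea:

    #include <map>
    #include <string>
    #include <vector>

    // Minimal pluggable KV backend sketch (illustrative; not Ceph's KeyValueDB).
    class KVBackend {
     public:
      // Apply a batch of puts/deletes atomically and durably.
      virtual int submit_transaction(const std::map<std::string, std::string>& puts,
                                     const std::vector<std::string>& deletes) = 0;
      virtual int get(const std::string& key, std::string* value) = 0;
      virtual ~KVBackend() = default;
    };

    // Concrete backends (an LSM such as rocksdb, or a hypothetical
    // btree-based store for high-end flash) would implement this and be
    // chosen at startup, leaving the objectstore logic untouched.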
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-ow...@vger.kernel.org
>>>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>>>>> Sent: Monday, October 19, 2015 12:49 PM
>>>>> To: ceph-devel@vger.kernel.org
>>>>> Subject: newstore direction
>>>>>
>>>>> The current design is based on two simple ideas:
>>>>>
>>>>>   1) a key/value interface is a better way to manage all of our internal
>>>>> metadata (object metadata, attrs, layout, collection membership,
>>>>> write-ahead logging, overlay data, etc.)
>>>>>
>>>>>   2) a file system is well suited for storing object data (as files).
>>>>>
>>>>> So far #1 is working out well, but I'm questioning the wisdom of #2.
>>>>> A few things:
>>>>>
>>>>>   - We currently write the data to the file, fsync, then commit the kv
>>>>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>>>>> journal, one for the kv txn to commit (at least once my rocksdb
>>>>> changes land... the kv commit is currently 2-3).  So two systems are
>>>>> managing metadata here: the fs managing the file metadata (with its
>>>>> own journal) and the kv backend (with its journal).  [see the first
>>>>> sketch after this list]
>>>>>
>>>>>   - On read we have to open files by name, which means traversing the
>>>>> fs namespace.  Newstore tries to keep it as flat and simple as
>>>>> possible, but at a minimum it is a couple of btree lookups.  We'd love
>>>>> to use open by handle (which would reduce this to 1 btree traversal),
>>>>> but running the daemon as ceph and not root makes that hard...  [see
>>>>> the second sketch after this list]
>>>>>
>>>>>
>>>>>   - ...and file systems insist on updating mtime on writes, even when
>>>>> it is an overwrite with no allocation changes.  (We don't care about
>>>>> mtime.)  O_NOCMTIME patches exist, but it is hard to get these past
>>>>> the kernel brainfreeze.
>>>>>
>>>>>
>>>>>   - XFS is (probably) never going to give us data checksums, which we
>>>>> want desperately.
>>>>>
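
Sketching the first point above (simplified pseudocode, not the real newstore
code; KVBackend is the illustrative interface from earlier in this mail):

    #include <cerrno>
    #include <string>
    #include <unistd.h>

    // Current newstore-on-a-filesystem write path, simplified: the data
    // write, the fs journal commit forced by fsync, and the kv transaction
    // commit are three (or more) separate IOs, with both the fs and the kv
    // backend keeping their own metadata journals.
    int write_via_fs(int fd, const char* buf, size_t len, off_t off,
                     KVBackend* kv, const std::string& meta_key,
                     const std::string& meta_val) {
      if (pwrite(fd, buf, len, off) < 0) return -errno;  // IO 1: the data
      if (fsync(fd) < 0) return -errno;                  // IO 2: fs journal commit
      return kv->submit_transaction({{meta_key, meta_val}}, {});  // IO 3+: kv commit
    }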
>>>>>
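And on the open-by-handle point from the second item: the syscalls do exist
(name_to_handle_at() / open_by_handle_at()), but open_by_handle_at() requires
CAP_DAC_READ_SEARCH, which is exactly why running as the unprivileged ceph
user gets in the way.  A minimal sketch:

    #include <cstdlib>
    #include <fcntl.h>

    // Resolve a path to a handle once, then reopen later without walking the
    // fs namespace.  open_by_handle_at() needs CAP_DAC_READ_SEARCH, so a
    // daemon running as "ceph" rather than root can't use it without extra
    // capability plumbing.
    int open_via_handle(int mount_fd, const char* path) {
      struct file_handle* fh =
          static_cast<struct file_handle*>(malloc(sizeof(*fh) + MAX_HANDLE_SZ));
      fh->handle_bytes = MAX_HANDLE_SZ;
      int mount_id;
      if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {  // namespace walk
        free(fh);
        return -1;
      }
      int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);  // no path lookup here
      free(fh);
      return fd;
    }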
>>>>> But what's the alternative?  My thought is to just bite the bullet and
>>>>> consume a raw block device directly.  Write an allocator, hopefully
>>>>> keep it pretty simple, and manage it in the kv store along with all of
>>>>> our other metadata.
>>>>>
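
For what it's worth, the simplest version of that allocator could be little
more than a free-extent map whose state is persisted as kv entries alongside
everything else.  A toy first-fit sketch (my own naming, no coalescing or
fragmentation handling):

    #include <cstdint>
    #include <map>

    // Toy first-fit extent allocator: free space tracked as {offset -> length}.
    // A real allocator would coalesce on release, persist its state through
    // the kv transaction, and be smarter about fragmentation (especially on HDD).
    class SimpleAllocator {
      std::map<uint64_t, uint64_t> free_;   // offset -> length
     public:
      static constexpr uint64_t NO_SPACE = UINT64_MAX;

      explicit SimpleAllocator(uint64_t device_size) { free_[0] = device_size; }

      uint64_t allocate(uint64_t want) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
          if (it->second < want) continue;
          uint64_t off = it->first, len = it->second;
          free_.erase(it);
          if (len > want) free_[off + want] = len - want;  // keep the remainder
          return off;
        }
        return NO_SPACE;
      }

      void release(uint64_t off, uint64_t len) { free_[off] = len; }  // no coalescing
    };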
>>>>>
>>>>> Wins:
>>>>>
>>>>>   - 2 IOs for most writes: one to write the data to unused space in
>>>>> the block device, one to commit our transaction (vs 4+ before).  For
>>>>> overwrites, we'd have one IO to do our write-ahead log (kv journal),
>>>>> then do the overwrite async (vs 4+ before).  [see the sketch after
>>>>> this list]
>>>>>
>>>>>
>>>>>   - No concern about mtime getting in the way
>>>>>
>>>>>   - Faster reads (no fs lookup)
>>>>>
>>>>>   - Similarly sized metadata for most objects.  If we assume most
>>>>> objects are not fragmented, then the metadata to store the block
>>>>> offsets is about the same size as the metadata to store the filenames
>>>>> we have now.
>>>>>
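A sketch of that 2-IO path from the first win (illustrative pseudocode,
reusing the toy SimpleAllocator and KVBackend sketches from earlier;
encode_extent() is a hypothetical serializer):

    #include <cerrno>
    #include <string>
    #include <unistd.h>

    // Proposed raw-block write path for a fresh (non-overwrite) write:
    // allocate an extent, write the data there, then commit extent map +
    // allocator state + attrs in one kv transaction.  Two IOs in the common case.
    int write_via_block(int block_fd, const char* buf, size_t len,
                        SimpleAllocator* alloc, KVBackend* kv,
                        const std::string& extent_map_key) {
      uint64_t off = alloc->allocate(len);
      if (off == SimpleAllocator::NO_SPACE) return -ENOSPC;
      if (pwrite(block_fd, buf, len, off) < 0) return -errno;  // IO 1: the data
      // IO 2: single kv commit; encode_extent() is a hypothetical helper.
      return kv->submit_transaction({{extent_map_key, encode_extent(off, len)}}, {});
    }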
>>>>>
>>>>> Problems:
>>>>>
>>>>>   - We have to size the kv backend storage (probably still an XFS
>>>>> partition) vs the block storage.  Maybe we do this anyway (put
>>>>> metadata on SSD!) so it won't matter.  But what happens when we are
>>>>> storing gobs of rgw index data or cephfs metadata?  Suddenly we are
>>>>> pulling storage out of a different pool and those aren't currently
>>>>> fungible.
>>>>>
>>>>>
>>>>>   - We have to write and maintain an allocator.  I'm still optimistic
>>>>> this can be reasonably simple, especially for the flash case (where
>>>>> fragmentation isn't such an issue as long as our blocks are reasonably
>>>>> sized).  For disk we may need to be moderately clever.
>>>>>
>>>>>
>>>>>   - We'll need an fsck to ensure our internal metadata is consistent.
>>>>> The good news is that it'll just need to validate what we have stored
>>>>> in the kv store.  [see the sketch after this list]
>>>>>
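The core of that fsck could be a single pass over the kv store,
cross-checking per-object extent maps against each other and against the
allocator's free list.  A sketch, where the helpers that would decode the kv
entries are left hypothetical:

    #include <cstdint>
    #include <vector>

    struct ExtentRef { uint64_t off; uint64_t len; };

    // Every allocated extent must belong to exactly one object and must not
    // also appear in the free list.  The vectors would come from hypothetical
    // helpers that decode the extent-map and allocator entries in the kv store.
    bool extents_consistent(const std::vector<ExtentRef>& object_extents,
                            const std::vector<ExtentRef>& free_extents) {
      auto overlap = [](const ExtentRef& a, const ExtentRef& b) {
        return a.off < b.off + b.len && b.off < a.off + a.len;
      };
      for (size_t i = 0; i < object_extents.size(); ++i) {
        for (size_t j = i + 1; j < object_extents.size(); ++j)
          if (overlap(object_extents[i], object_extents[j])) return false;  // double-allocated
        for (const auto& f : free_extents)
          if (overlap(object_extents[i], f)) return false;  // allocated but marked free
      }
      return true;
    }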
>>>>>
>>>>> Other thoughts:
>>>>>
>>>>>   - We might want to consider whether dm-thin or bcache or other block
>>>>> layers might help us with elasticity of file vs block areas.
>>>>>
>>>>>
>>>>>   - Rocksdb can push colder data to a second directory, so we could
>>>>> have a fast ssd primary area (for wal and most metadata) and a second
>>>>> hdd directory for stuff it has to push off.  Then have a conservative
>>>>> amount of file space on the hdd.  If our block fills up, use the
>>>>> existing file mechanism to put data there too.  (But then we have to
>>>>> maintain both the current kv + file approach and not go all-in on kv +
>>>>> block.)
>>>>>
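On the rocksdb point: it can indeed be given multiple data paths with target
sizes, roughly as in the sketch below.  The paths are invented for the
example, and how cleanly this maps onto "push colder data to the second path"
depends on the rocksdb version and compaction settings:

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    // Illustrative only: small fast primary area on SSD, large spillover on
    // HDD, WAL pinned to the SSD.  Paths and sizes are made up for the example.
    rocksdb::DB* open_tiered_db() {
      rocksdb::Options opts;
      opts.create_if_missing = true;
      opts.db_paths.emplace_back("/ssd/newstore/db", 16ull << 30);  // ~16 GB on SSD
      opts.db_paths.emplace_back("/hdd/newstore/db", 1ull << 40);   // ~1 TB on HDD
      opts.wal_dir = "/ssd/newstore/db.wal";                        // keep the WAL fast
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/ssd/newstore/db", &db);
      return s.ok() ? db : nullptr;
    }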
>>>>> Thoughts?
>>>>> sage