Re: newstore direction

2015-10-23 Thread Gregory Farnum
On Fri, Oct 23, 2015 at 7:59 AM, Howard Chu  wrote:
> If the stream of writes is large enough, you could omit fsync because
> everything is being forced out of the cache to disk anyway. In that
> scenario, the only thing that matters is that the writes get forced out in
> the order you intended, so that an interruption or crash leaves you in a
> known (or knowable) state vs unknown.

The RADOS storage semantics actually require that we know it's durable
on disk as well, unfortunately. But ordered writes would probably let
us batch up commit points in ways that are a lot friendlier for the
drives!
-Greg


Re: newstore direction

2015-10-23 Thread Howard Chu

Ric Wheeler wrote:

On 10/23/2015 07:06 AM, Ric Wheeler wrote:

On 10/23/2015 02:21 AM, Howard Chu wrote:

> Normally, best practice is to use batching to avoid paying worst case latency
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems would support ordered writes you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced wrt other ordered writes.)



One other note, the file & storage kernel people discussed using ordering
years ago. One of the issues is that the devices themselves need to support it.
While S-ATA devices are portrayed as SCSI in the kernel, ATA does not (and
still does not as far as I know?) support ordered tags.


Yes, that's a bigger problem. ATA NCQ/TCQ aren't up to the job.

>>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
>>> nothing above that layer makes use of it.
>>
>> I think that if the stream on either side of the barrier is large enough,
>> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
>> should have the same performance.

>> Not clear to me if we could do away with an fsync to trigger a cache flush
>> here either - do SCSI ordered tags require that the writes be acknowledged
>> only when durable, or can the device ack them once the target has them
>> (including in a volatile write cache)?

fsync() is too blunt a tool; its use gives you both C and D of ACID 
(Consistency and Durability). Ordered tags give you Consistency; there are 
lots of applications that can live without perfect Durability but losing 
Consistency is a major headache.


If the stream of writes is large enough, you could omit fsync because 
everything is being forced out of the cache to disk anyway. In that scenario, 
the only thing that matters is that the writes get forced out in the order you 
intended, so that an interruption or crash leaves you in a known (or knowable) 
state vs unknown.


--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: newstore direction

2015-10-23 Thread Milosz Tanski
On Thu, Oct 22, 2015 at 11:16 PM, Howard Chu  wrote:
> Milosz Tanski  adfin.com> writes:
>
>>
>> On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil  redhat.com> wrote:
>> > On Tue, 20 Oct 2015, John Spray wrote:
>> >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  redhat.com> wrote:
>> >> >  - We have to size the kv backend storage (probably still an XFS
>> >> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> >> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> >> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> >> > a different pool and those aren't currently fungible.
>> >>
>> >> This is the concerning bit for me -- the other parts one "just" has to
>> >> get the code right, but this problem could linger and be something we
>> >> have to keep explaining to users indefinitely.  It reminds me of cases
>> >> in other systems where users had to make an educated guess about inode
>> >> size up front, depending on whether you're expecting to efficiently
>> >> store a lot of xattrs.
>> >>
>> >> In practice it's rare for users to make these kinds of decisions well
>> >> up-front: it really needs to be adjustable later, ideally
>> >> automatically.  That could be pretty straightforward if the KV part
>> >> was stored directly on block storage, instead of having XFS in the
>> >> mix.  I'm not quite up with the state of the art in this area: are
>> >> there any reasonable alternatives for the KV part that would consume
>> >> some defined range of a block device from userspace, instead of
>> >> sitting on top of a filesystem?
>> >
>> > I agree: this is my primary concern with the raw block approach.
>> >
>> > There are some KV alternatives that could consume block, but the problem
>> > would be similar: we need to dynamically size up or down the kv portion of
>> > the device.
>> >
>> > I see two basic options:
>> >
>> > 1) Wire into the Env abstraction in rocksdb to provide something just
>> > smart enough to let rocksdb work.  It isn't much: named files (not that
>> > many--we could easily keep the file table in ram), always written
>> > sequentially, to be read later with random access. All of the code is
>> > written around abstractions of SequentialFileWriter so that everything
>> > posix is neatly hidden in env_posix (and there are various other env
>> > implementations for in-memory mock tests etc.).
>> >
>> > 2) Use something like dm-thin to sit between the raw block device and XFS
>> > (for rocksdb) and the block device consumed by newstore.  As long as XFS
>> > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
>> > files in their entirety) we can fstrim and size down the fs portion.  If
>> > we similarly make newstore's allocator stick to large blocks only we would
>> > be able to size down the block portion as well.  Typical dm-thin block
>> > sizes seem to range from 64KB to 512KB, which seems reasonable enough to
>> > me.  In fact, we could likely just size the fs volume at something
>> > conservatively large (like 90%) and rely on -o discard or periodic fstrim
>> > to keep its actual utilization in check.
>> >
>>
>> I think you could prototype a raw block device OSD store using LMDB as
>> a starting point. I know there's been some experiments using LMDB as
>> KV store before with positive read numbers and not great write
>> numbers.
>>
>> 1. It mmaps, just mmap the raw disk device / partition. I've done this
>> as an experiment before, I can dig up a patch for LMDB.
>> 2. It already has a free space management strategy. It's probably not
>> right for the OSDs in the long term but there's something to start
>> there with.
>> 3. It already supports transactions / COW.
>> 4. LMDB isn't a huge code base so it might be a good place to start /
>> evolve code from.
>> 5. You're not starting a multi-year effort at the 0 point.
>>
>> As to the not great write performance, that could be addressed by
>> write transaction merging (what mysql implemented a few years ago).
>
> We have a heavily hacked version of LMDB contributed by VMware that
> implements a WAL. In my preliminary testing it performs synchronous writes
> 30x faster (on average) than current LMDB. Their version unfortunately
> slashed'n'burned a lot of LMDB features that other folks actually need, so
> we can't use it as-is. Currently working on rationalizing the approach and
> merging it into mdb.master.
>
> The reasons for the WAL approach:
>   1) obviously sequential writes are cheaper than random writes.
>   2) fsync() of a small log file will always be faster than fsync() of a
> large DB. I.e., fsync() latency is proportional to the total number of pages
> in the file, not just the number of dirty pages.

This is a bit off topic (from newstore); it's more for Howard, about LMDB
internals and write serialization.

Howard, there is a way to make progress on pending transactions without
WAL. LMDB is already COW so hypothetically further 

Re: newstore direction

2015-10-23 Thread Ric Wheeler

On 10/23/2015 02:21 AM, Howard Chu wrote:

> Normally, best practice is to use batching to avoid paying worst case latency
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems would support ordered writes you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced wrt other ordered writes.)

A bit of a shame that Linux's SCSI drivers support Ordering attributes but
nothing above that layer makes use of it.


I think that if the stream on either side of the barrier is large enough, using 
ordered tags (SCSI speak) versus doing stream1, fsync(), stream2, should have 
the same performance.


Not clear to me if we could do away with an fsync to trigger a cache flush here 
either - do SCSI ordered tags require that the writes be acknowledged only when 
durable, or can the device ack them once the target has them (including in a 
volatile write cache)?


Ric



Re: newstore direction

2015-10-23 Thread Ric Wheeler

On 10/23/2015 07:06 AM, Ric Wheeler wrote:

On 10/23/2015 02:21 AM, Howard Chu wrote:

> Normally, best practice is to use batching to avoid paying worst case latency
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems would support ordered writes you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced wrt other ordered writes.)

A bit of a shame that Linux's SCSI drivers support Ordering attributes but
nothing above that layer makes use of it.


I think that if the stream on either side of the barrier is large enough, 
using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2, should 
have the same performance.


Not clear to me if we could do away with an fsync to trigger a cache flush 
here either - do SCSI ordered tags require that the writes be acknowledged 
only when durable, or can the device ack them once the target has them 
(including in a volatile write cache)?


Ric




One other note, the file & storage kernel people discussed using ordering years 
ago. One of the issues is that the devices themselves need to support it. While 
S-ATA devices are portrayed as SCSI in the kernel, ATA does not (and still does 
not as far as I know?) support ordered tags.


Regards,

Ric




Re: newstore direction

2015-10-23 Thread Ric Wheeler

On 10/23/2015 10:59 AM, Howard Chu wrote:

Ric Wheeler wrote:

On 10/23/2015 07:06 AM, Ric Wheeler wrote:

On 10/23/2015 02:21 AM, Howard Chu wrote:

> Normally, best practice is to use batching to avoid paying worst case latency
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems would support ordered writes you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced wrt other ordered writes.)



One other note, the file & storage kernel people discussed using ordering
years ago. One of the issues is that the devices themselves need to support it.
While S-ATA devices are portrayed as SCSI in the kernel, ATA does not (and
still does not as far as I know?) support ordered tags.


Yes, that's a bigger problem. ATA NCQ/TCQ aren't up to the job.

>>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
>>> nothing above that layer makes use of it.
>>
>> I think that if the stream on either side of the barrier is large enough,
>> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
>> should have the same performance.

>> Not clear to me if we could do away with an fsync to trigger a cache flush
>> here either - do SCSI ordered tags require that the writes be acknowledged
>> only when durable, or can the device ack them once the target has them
>> (including in a volatile write cache)?

fsync() is too blunt a tool; its use gives you both C and D of ACID 
(Consistency and Durability). Ordered tags give you Consistency; there are 
lots of applications that can live without perfect Durability but losing 
Consistency is a major headache.


If the stream of writes is large enough, you could omit fsync because 
everything is being forced out of the cache to disk anyway. In that scenario, 
the only thing that matters is that the writes get forced out in the order you 
intended, so that an interruption or crash leaves you in a known (or knowable) 
state vs unknown.




I do agree that fsync is quite a blunt tool, but you cannot assume that a stream 
of writes will flush the cache - that is extremely firmware dependent.


Pretty common to leave small IO's in cache and let larger IO's stream directly 
to the backing device (platter, etc) - those small objects can stay live and 
non-durable for days under some heavy workloads :)


ric



Re: newstore direction

2015-10-23 Thread Howard Chu
Ric Wheeler  redhat.com> writes:

> 
> On 10/21/2015 09:32 AM, Sage Weil wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >>> Now:
> >>>   1 io  to write a new file
> >>> 1-2 ios to sync the fs journal (commit the inode, alloc change)
> >>> (I see 2 journal IOs on XFS and only 1 on ext4...)
> >>>   1 io  to commit the rocksdb journal (currently 3, but will drop to
> >>> 1 with xfs fix and my rocksdb change)
> >> I think that might be too pessimistic - the number of discrete IO's sent down
> >> to a spinning disk makes much less impact on performance than the number of
> >> fsync()'s since they IO's all land in the write cache.  Some newer spinning
> >> drives have a non-volatile write cache, so even an fsync() might not end up
> >> doing the expensive data transfer to the platter.
> > True, but in XFS's case at least the file data and journal are not
> > colocated, so it's 2 seeks for the new file write+fdatasync and another for
> > the rocksdb journal commit.  Of course, with a deep queue, we're doing
> > lots of these so there'd be fewer journal commits on both counts, but the
> > lower bound on latency of a single write is still 3 seeks, and that bound
> > is pretty critical when you also have network round trips and replication
> > (worst out of 2) on top.
> 
> What are the performance goals we are looking for?
> 
> Small, synchronous writes/second?
> 
> File creates/second?
> 
> I suspect that looking at things like seeks/write is probably looking at the
> wrong level of performance challenges.  Again, when you write to a modern drive,
> you write to its write cache and it decides internally when/how to destage to
> the platter.
> 
> If you look at the performance of XFS with streaming workloads, it will tend to
> max out the bandwidth of the underlying storage.
> 
> If we need IOP's/file writes, etc, we should be clear on what we are aiming at.
> 
> >
> >> It would be interesting to get the timings on the IO's you see to measure the
> >> actual impact.
> > I observed this with the journaling workload for rocksdb, but I assume the
> > journaling behavior is the same regardless of what is being journaled.
> > For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
> > blktrace showed an IO to the file, and 2 IOs to the journal.  I believe
> > the first one is the record for the inode update, and the second is the
> > journal 'commit' record (though I forget how I decided that).  My guess is
> > that XFS is being extremely careful about journal integrity here and not
> > writing the commit record until it knows that the preceding records landed
> > on stable storage.  For ext4, the latency was about ~20ms, and blktrace
> > showed the IO to the file and then a single journal IO.  When I made the
> > rocksdb change to overwrite an existing, prewritten file, the latency
> > dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
> > (XFS still showed the 2 journal commit IOs, but Dave just posted the fix
> > for that on the XFS list today.)

> Normally, best practice is to use batching to avoid paying worst case latency 
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems would support ordered writes you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced wrt other ordered writes.)

A bit of a shame that Linux's SCSI drivers support Ordering attributes but
nothing above that layer makes use of it.
-- 
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/ 



Re: newstore direction

2015-10-23 Thread Howard Chu

Gregory Farnum wrote:

On Fri, Oct 23, 2015 at 7:59 AM, Howard Chu  wrote:

If the stream of writes is large enough, you could omit fsync because
everything is being forced out of the cache to disk anyway. In that
scenario, the only thing that matters is that the writes get forced out in
the order you intended, so that an interruption or crash leaves you in a
known (or knowable) state vs unknown.


The RADOS storage semantics actually require that we know it's durable
on disk as well, unfortunately. But ordered writes would probably let
us batch up commit points in ways that are a lot friendlier for the
drives!


Ah, that's too bad. LMDB does fine with only ordering, but never mind.

--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: newstore direction

2015-10-22 Thread Orit Wasserman
On Thu, 2015-10-22 at 02:12 +, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is 
> the way basic trade-offs in storage management software architecture are 
> being affected. In the HDD world CPU time per IOP was relatively 
> inconsequential, i.e., it had little effect on overall performance which was 
> limited by the physics of the hard drive. Flash is now inverting that 
> situation. When you look at the performance levels being delivered in the 
> latest generation of NVMe SSDs you rapidly see that the storage itself is 
> generally no longer the bottleneck (speaking about BW, not latency of course) 
> but rather it's the system sitting in front of the storage that is the 
> bottleneck. Generally it's the CPU cost of an IOP.
> 
> When Sandisk first started working with Ceph (Dumpling), the design of 
> librados and the OSD led to a situation where the CPU cost of an IOP was 
> dominated by context switches and network socket handling. Over time, much of 
> that has been addressed. The socket handling code has been re-written (more 
> than once!) and some of the internal queueing in the OSD (and the associated 
> context switches) have been eliminated. As the CPU costs have dropped, 
> performance on flash has improved accordingly.
> 
> Because we didn't want to completely re-write the OSD (time-to-market and 
> stability drove that decision), we didn't move it from the current "thread 
> per IOP" model into a truly asynchronous "thread per CPU core" model that 
> essentially eliminates context switches in the IO path. But a fully optimized 
> OSD would go down that path (at least part-way). I believe it's been proposed 
> in the past. Perhaps a hybrid "fast-path" style could get most of the 
> benefits while preserving much of the legacy code.
> 

+1
It's not just about reducing context switches but also about removing contention
and data copies and getting better cache utilization.

Scylladb just did this to cassandra (using seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also 
> tend to support the "do it in user-space" trend. That's because most of the 
> kernel and file-system interface is architected around the blocking 
> "thread-per-IOP" model and is unlikely to change in the future.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samu...@sandisk.com
> 
> -Original Message-
> From: Martin Millnert [mailto:mar...@millnert.se]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson <mnel...@redhat.com>
> Cc: Ric Wheeler <rwhee...@redhat.com>; Allen Samuels 
> <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
> ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
> 
> Adding 2c
> 
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us.  I don't know if we've
>> truly hit that point, but it's tough for me to find flaws with
> > Sage's argument.
> 
> Regarding the userland / kernel land aspect of the topic, there are further 
> aspects AFAIK not yet addressed in the thread:
> In the networking world, there's been development on memory mapped (multiple 
> approaches exist) userland networking, which for packet management has the 
> benefit of - for very, very specific applications of networking code - 
> avoiding e.g. per-packet context switches etc, and streamlining processor 
> cache management performance. People have gone as far as removing CPU cores 
> from CPU scheduler to completely dedicate them to the networking task at hand 
> (cache optimizations). There are various latency/throughput (bulking) 
> optimizations applicable, but at the end of the day, it's about keeping the 
> CPU bus busy with "revenue" bus traffic.
> 
> Granted, storage IO operations may be much heavier in cycle counts for 
> context switches to ever appear as a problem in themselves, certainly for 
> slower SSDs and HDDs. However, when going for truly high performance IO, 
> *every* hurdle in the data path counts toward the total latency.
> (And really, high performance random IO characteristics approaches the 
> networking, per-packet handling characteristics).  Now, I'm n

Re: newstore direction

2015-10-22 Thread Christoph Hellwig
On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is 
> atomic with respect to a larger ceph transaction (we're updating a bunch 
> of other metadata at the same time, possibly overwriting or appending to 
> multiple files, etc.).  XFS and ext4 aren't cow file systems, so plugging 
> into the transaction infrastructure isn't really an option (and even after 
> several years of trying to do it with btrfs it proved to be impractical).  

Not that I'm disagreeing with most of your points, but we can do things
like that with swapext-like hacks.  Below is my half year old prototype
of an O_ATOMIC implementation for XFS that gives you atomic out of place
writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 * is defined as O_NONBLOCK on some platforms and not on others.
 */
-   BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+   BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
O_RDONLY| O_WRONLY  | O_RDWR|
O_CREAT | O_EXCL| O_NOCTTY  |
O_TRUNC | O_APPEND  | /* O_NONBLOCK | */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
O_DIRECT| O_LARGEFILE   | O_DIRECTORY   |
O_NOFOLLOW  | O_NOATIME | O_CLOEXEC |
__FMODE_EXEC| O_PATH| __O_TMPFILE   |
+   O_ATOMIC|
__FMODE_NONOTIFY
));
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@ xfs_bmap_del_extent(
xfs_btree_cur_t *cur,   /* if null, not a btree */
xfs_bmbt_irec_t *del,   /* data to remove from extents */
int *logflagsp, /* inode logging flags */
-   int whichfork) /* data or attr fork */
+   int whichfork, /* data or attr fork */
+   boolfree_blocks) /* free extent at end of routine */
 {
xfs_filblks_t   da_new; /* new delay-alloc indirect blocks */
xfs_filblks_t   da_old; /* old delay-alloc indirect blocks */
xfs_fsblock_t   del_endblock=0; /* first block past del */
xfs_fileoff_t   del_endoff; /* first offset past del */
int delay;  /* current block is delayed allocated */
-   int do_fx;  /* free extent at end of routine */
xfs_bmbt_rec_host_t *ep;/* current extent entry pointer */
int error;  /* error return value */
int flags;  /* inode logging flags */
@@ -4712,8 +4712,8 @@ xfs_bmap_del_extent(
 
mp = ip->i_mount;
ifp = XFS_IFORK_PTR(ip, whichfork);
-   ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-   (uint)sizeof(xfs_bmbt_rec_t)));
+   ASSERT(*idx >= 0);
+   ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
ASSERT(del->br_blockcount > 0);
ep = xfs_iext_get_ext(ifp, *idx);
	xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@ xfs_bmap_del_extent(
len = del->br_blockcount;
do_div(bno, mp->m_sb.sb_rextsize);
do_div(len, mp->m_sb.sb_rextsize);
-   error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-   if (error)
-   goto done;
-   do_fx = 0;
+   if (free_blocks) {
+   error = xfs_rtfree_extent(tp, bno,
+   (xfs_extlen_t)len);
+   if (error)
+   goto done;
+   free_blocks = 0;
+   }
nblks = len * mp->m_sb.sb_rextsize;
qfield = XFS_TRANS_DQ_RTBCOUNT;
}
@@ -4757,7 +4760,6 @@ xfs_bmap_del_extent(
 * Ordinary allocation.
 */
else {
-   do_fx = 1;
nblks = del->br_blockcount;
qfield = XFS_TRANS_DQ_BCOUNT;
}
@@ -4777,7 +4779,7 @@ xfs_bmap_del_extent(
da_old = startblockval(got.br_startblock);
da_new = 0;
nblks = 0;
-   do_fx = 0;
+   free_blocks = 0;
}
/*
 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@ xfs_bmap_del_extent(
/*
 * If we 

RE: newstore direction

2015-10-22 Thread James (Fei) Liu-SSI
Hi Sage and other fellow cephers,
  I truly share the pain with you all about filesystems while I am working on
the objectstore to improve performance. As mentioned, there is nothing wrong 
with filesystems; it's just that Ceph, as one use case, needs more support than 
filesystems will provide in the near future, for whatever reasons.

   There are many techniques popping up which can help improve the performance 
of the OSD.  User-space drivers (DPDK from Intel) are one of them: they not only 
give you the storage allocator, but also thread scheduling support, CPU 
affinity, NUMA friendliness, and polling, which might fundamentally change the 
performance of the objectstore.  It should not be hard to improve CPU 
utilization 3x~5x, with higher IOPS, etc.
I totally agree that the goal of filestore is to give enough support for 
filesystems with either the 1, 1b, or 2 solutions. In my humble opinion, the new 
design goal of the objectstore should focus on giving the best performance for 
the OSD with new techniques. These two goals are not going to conflict with each 
other; they are just for different purposes, to make Ceph not only more stable 
but also better.

  Scylla mentioned by Orit is a good example .

  Thanks all.

  Regards,
  James   

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Thursday, October 22, 2015 5:50 AM
To: Ric Wheeler
Cc: Orit Wasserman; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to 
> pretty much all of our key customers about local file systems and 
> storage - customers all have migrated over to using normal file systems under 
> Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard 
> file systems and only have seen one account running on a raw block 
> store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO 
> path is identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some 
> time talking to the local file system gurus about this in detail.  I 
> can help with that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents is marked unwritten), then
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or a 
few) huge files and the user space app already has all the complexity of a 
filesystem-like thing (with its own internal journal, allocators, garbage 
collection, etc.).  Do they just do this to ease administrative tasks like 
backup?


This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that there are 
two independent layers journaling and managing different types of consistency 
penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file 
system to work around what it is used to: we swap extents to avoid write-ahead 
(see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, 
O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that lives 
within it (pretending the file is a block device).  The file system rarely gets 
in the way (assuming the file is prewritten and we don't do anything stupid).  
But it doesn't give us anything a block device wouldn't, and it doesn't save us 
any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) 
complexity to 2.  On the other hand, if you step back and view the entire stack 
(ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet 
still slower.  Given we ultimately have to support both (both as an upstream 
and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the beaten 
path (1) to anything mildly exotic (1b) we have been bitten by obscure file 
system bugs.  And that assumes we get everything we need upstream... which is 
probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better 
support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a huge 
amount of sense for a ton of different systems.  But our situation is a bit 
different: we always own the entire device (and often the server), so there is 
no need to share with other users or apps (and when you do, you just use the 
existing FileStore backend).  And as you know performance is a huge pain point. 
 We are already handicapped by virtue of being distributed and strongly 
consistent; we can't afford to give away more to a storage layer that isn't 
providing us much (or the 

Re: newstore direction

2015-10-22 Thread Samuel Just
Since the changes which moved the pg log and the pg info into the pg
object space, I think it's now the case that any transaction submitted
to the objectstore updates a disjoint range of objects determined by
the sequencer.  It might be easier to exploit that parallelism if we
control allocation and allocation related metadata.  We could split
the store into N pieces which partition the pg space (one additional
one for the meta sequencer?) with one rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of
global allocation decisions) and managed more finely within each
partition.  The main challenge would be avoiding internal
fragmentation of those, but at least defragmentation can be managed on
a per-partition basis.  Such parallelism is probably necessary to
exploit the full throughput of some ssds.
-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
<james@ssi.samsung.com> wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pains with you  all about filesystem while I am working 
> on  objectstore to improve the performance. As mentioned , there is nothing 
> wrong with filesystem. Just the Ceph as one of  use case need more supports 
> but not provided in near future by filesystem no matter what reasons.
>
>There are so many techniques  pop out which can help to improve 
> performance of OSD.  User space driver(DPDK from Intel) is one of them. It 
> not only gives you the storage allocator,  also gives you the thread 
> scheduling support,  CPU affinity , NUMA friendly, polling  which  might 
> fundamentally change the performance of objectstore.  It should not be hard 
> to improve CPU utilization 3x~5x times, higher IOPS etc.
> I totally agreed that goal of filestore is to gives enough support for 
> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new 
> design goal of objectstore should focus on giving the best  performance for 
> OSD with new techniques. These two goals are not going to conflict with each 
> other.  They are just for different purposes to make Ceph not only more 
> stable but also better.
>
>   Scylla mentioned by Orit is a good example .
>
>   Thanks all.
>
>   Regards,
>   James
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems 
>> under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents is marked unwritten), then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a 
> few) huge files and the user space app already has all the complexity of a 
> filesystem-like thing (with its own internal journal, allocators, garbage 
> collection, etc.).  Do they just do this to ease administrative tasks like 
> backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that there 
> are two independent layers journaling and managing different types of 
> consistency penalizes us.
>
> 1b) We get clever and start using obscure and/or custom ioctls in the file 
> system to work around what it is used to: we swap extents to avoid 
> write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, 
> batch fsync, O_ATOMIC, setext ioctl, etc.
>
> 2) We preallocate huge files and write a user-space object system that lives 
> within it (pretending the file is a block device).  The file system rarely 
> gets in the way (assuming the file is prewritten and we don't do anything 
> stupid).  But it doesn't give us anything a block device wouldn't, and it 
> doesn't save us any complexity in our code.
>
> At the end of the day, 1 an

Re: newstore direction

2015-10-22 Thread Samuel Just
Ah, except for the snapmapper.  We can split the snapmapper in the
same way, though, as long as we are careful with the name.
-Sam

On Thu, Oct 22, 2015 at 4:42 PM, Samuel Just <sj...@redhat.com> wrote:
> Since the changes which moved the pg log and the pg info into the pg
> object space, I think it's now the case that any transaction submitted
> to the objectstore updates a disjoint range of objects determined by
> the sequencer.  It might be easier to exploit that parallelism if we
> control allocation and allocation related metadata.  We could split
> the store into N pieces which partition the pg space (one additional
> one for the meta sequencer?) with one rocksdb instance for each.
> Space could then be parcelled out in large pieces (small frequency of
> global allocation decisions) and managed more finely within each
> partition.  The main challenge would be avoiding internal
> fragmentation of those, but at least defragmentation can be managed on
> a per-partition basis.  Such parallelism is probably necessary to
> exploit the full throughput of some ssds.
> -Sam
>
> On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI
> <james@ssi.samsung.com> wrote:
>> Hi Sage and other fellow cephers,
>>   I truly share the pains with you  all about filesystem while I am working 
>> on  objectstore to improve the performance. As mentioned , there is nothing 
>> wrong with filesystem. Just the Ceph as one of  use case need more supports 
>> but not provided in near future by filesystem no matter what reasons.
>>
>>There are so many techniques  pop out which can help to improve 
>> performance of OSD.  User space driver(DPDK from Intel) is one of them. It 
>> not only gives you the storage allocator,  also gives you the thread 
>> scheduling support,  CPU affinity , NUMA friendly, polling  which  might 
>> fundamentally change the performance of objectstore.  It should not be hard 
>> to improve CPU utilization 3x~5x times, higher IOPS etc.
>> I totally agreed that goal of filestore is to gives enough support for 
>> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new 
>> design goal of objectstore should focus on giving the best  performance for 
>> OSD with new techniques. These two goals are not going to conflict with each 
>> other.  They are just for different purposes to make Ceph not only more 
>> stable but also better.
>>
>>   Scylla mentioned by Orit is a good example .
>>
>>   Thanks all.
>>
>>   Regards,
>>   James
>>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>> Sent: Thursday, October 22, 2015 5:50 AM
>> To: Ric Wheeler
>> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
>> Subject: Re: newstore direction
>>
>> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>>> You will have to trust me on this as the Red Hat person who spoke to
>>> pretty much all of our key customers about local file systems and
>>> storage - customers all have migrated over to using normal file systems 
>>> under Oracle/DB2.
>>> Typically, they use XFS or ext4.  I don't know of any non-standard
>>> file systems and only have seen one account running on a raw block
>>> store in 8 years
>>> :)
>>>
>>> If you have a pre-allocated file and write using O_DIRECT, your IO
>>> path is identical in terms of IO's sent to the device.
>>>
>>> If we are causing additional IO's, then we really need to spend some
>>> time talking to the local file system gurus about this in detail.  I
>>> can help with that conversation.
>>
>> If the file is truly preallocated (that is, prewritten with zeros...
>> fallocate doesn't help here because the extents is marked unwritten), then
>> sure: there is very little change in the data path.
>>
>> But at that point, what is the point?  This only works if you have one (or a 
>> few) huge files and the user space app already has all the complexity of a 
>> filesystem-like thing (with its own internal journal, allocators, garbage 
>> collection, etc.).  Do they just do this to ease administrative tasks like 
>> backup?
>>
>>
>> This is the fundamental tradeoff:
>>
>> 1) We have a file per object.  We fsync like crazy and the fact that there 
>> are two independent layers journaling and managing different types of 
>> consistency penalizes us.
>>
>> 1b) We get clever and start using obscure and/or custom ioctls in the file 
>> system to work around what it 

RE: newstore direction

2015-10-22 Thread Allen Samuels
How would this kind of split affect small transactions? Will each split be 
separately transactionally consistent or is there some kind of meta-transaction 
that synchronizes each of the splits?


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Friday, October 23, 2015 8:42 AM
To: James (Fei) Liu-SSI <james@ssi.samsung.com>
Cc: Sage Weil <sw...@redhat.com>; Ric Wheeler <rwhee...@redhat.com>; Orit 
Wasserman <owass...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Since the changes which moved the pg log and the pg info into the pg object 
space, I think it's now the case that any transaction submitted to the 
objectstore updates a disjoint range of objects determined by the sequencer.  
It might be easier to exploit that parallelism if we control allocation and 
allocation related metadata.  We could split the store into N pieces which 
partition the pg space (one additional one for the meta sequencer?) with one 
rocksdb instance for each.
Space could then be parcelled out in large pieces (small frequency of global 
allocation decisions) and managed more finely within each partition.  The main 
challenge would be avoiding internal fragmentation of those, but at least 
defragmentation can be managed on a per-partition basis.  Such parallelism is 
probably necessary to exploit the full throughput of some ssds.
-Sam

On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI 
<james@ssi.samsung.com> wrote:
> Hi Sage and other fellow cephers,
>   I truly share the pains with you  all about filesystem while I am working 
> on  objectstore to improve the performance. As mentioned , there is nothing 
> wrong with filesystem. Just the Ceph as one of  use case need more supports 
> but not provided in near future by filesystem no matter what reasons.
>
>There are so many techniques  pop out which can help to improve 
> performance of OSD.  User space driver(DPDK from Intel) is one of them. It 
> not only gives you the storage allocator,  also gives you the thread 
> scheduling support,  CPU affinity , NUMA friendly, polling  which  might 
> fundamentally change the performance of objectstore.  It should not be hard 
> to improve CPU utilization 3x~5x times, higher IOPS etc.
> I totally agreed that goal of filestore is to gives enough support for 
> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new 
> design goal of objectstore should focus on giving the best  performance for 
> OSD with new techniques. These two goals are not going to conflict with each 
> other.  They are just for different purposes to make Ceph not only more 
> stable but also better.
>
>   Scylla mentioned by Orit is a good example .
>
>   Thanks all.
>
>   Regards,
>   James
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, October 22, 2015 5:50 AM
> To: Ric Wheeler
> Cc: Orit Wasserman; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On Wed, 21 Oct 2015, Ric Wheeler wrote:
>> You will have to trust me on this as the Red Hat person who spoke to
>> pretty much all of our key customers about local file systems and
>> storage - customers all have migrated over to using normal file systems 
>> under Oracle/DB2.
>> Typically, they use XFS or ext4.  I don't know of any non-standard
>> file systems and only have seen one account running on a raw block
>> store in 8 years
>> :)
>>
>> If you have a pre-allocated file and write using O_DIRECT, your IO
>> path is identical in terms of IO's sent to the device.
>>
>> If we are causing additional IO's, then we really need to spend some
>> time talking to the local file system gurus about this in detail.  I
>> can help with that conversation.
>
> If the file is truly preallocated (that is, prewritten with zeros...
> fallocate doesn't help here because the extents is marked unwritten),
> then
> sure: there is very little change in the data path.
>
> But at that point, what is the point?  This only works if you have one (or a 
> few) huge files and the user space app already has all the complexity of a 
> filesystem-like thing (with its own internal journal, allocators, garbage 
> collection, etc.).  Do they just do this to ease administrative tasks like 
> backup?
>
>
> This is the fundamental tradeoff:
>
> 1) We have a file per object.  We fsync like crazy and the fact that the

Re: newstore direction

2015-10-22 Thread Ric Wheeler
I disagree with your point still - your argument was that customers don't like 
to update their code so we cannot rely on them moving to better file system 
code.  Those same customers would be *just* as reluctant to upgrade OSD code.  
Been there, done that in pure block storage, pure object storage and in file 
system code (customers just don't care about the protocol, the conservative 
nature is consistent).


Not a casual observation, I have been building storage systems since the 
mid-80's.

Regards,

Ric

On 10/21/2015 09:22 PM, Allen Samuels wrote:

I agree. My only point was that you still have to factor this time into the argument that 
by continuing to put NewStore on top of a file system you'll get to a stable system much 
sooner than the longer development path of doing your own raw storage allocator. IMO, 
once you factor that into the equation the "on top of an FS" path doesn't look 
like such a clear winner.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com]
Sent: Thursday, October 22, 2015 10:17 AM
To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 08:53 PM, Allen Samuels wrote:

Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many 
companies standardize on a particular release of a particular distro. Getting them to 
switch to a new release -- even a "bug fix" point release -- is a major 
undertaking that often is a complete roadblock. Just my experience. YMMV.


Customers do control the pace that they upgrade their machines, but we put out 
fixes on a very regular pace.  A lot of customers will get fixes without having 
to qualify a full new release (i.e., fixes come out between major and minor 
releases are easy).

If someone is deploying a critical server for storage, then it falls back on 
the storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).

ric









Re: newstore direction

2015-10-22 Thread Ric Wheeler

On 10/22/2015 08:50 AM, Sage Weil wrote:

On Wed, 21 Oct 2015, Ric Wheeler wrote:

You will have to trust me on this as the Red Hat person who spoke to pretty
much all of our key customers about local file systems and storage - customers
all have migrated over to using normal file systems under Oracle/DB2.
Typically, they use XFS or ext4.  I don't know of any non-standard file
systems and only have seen one account running on a raw block store in 8 years
:)

If you have a pre-allocated file and write using O_DIRECT, your IO path is
identical in terms of IO's sent to the device.

If we are causing additional IO's, then we really need to spend some time
talking to the local file system gurus about this in detail.  I can help with
that conversation.

If the file is truly preallocated (that is, prewritten with zeros...
fallocate doesn't help here because the extents is marked unwritten), then
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or
a few) huge files and the user space app already has all the complexity of
a filesystem-like thing (with its own internal journal, allocators,
garbage collection, etc.).  Do they just do this to ease administrative
tasks like backup?


I think that the key here is that if we fsync() like crazy - regardless of 
writing to a file system or to some new, yet-to-be-defined block device primitive 
store - we are limited to the IOP's of that particular block device.


Ignoring exotic hardware configs for people who can afford to go all-SSD, we 
will have rotating, high capacity, slow spinning drives for *a long time* as the 
eventual tier.  Given that assumption, we need to do better then to be limited 
to synchronous IOP's for a slow drive.  When we have commodity pricing for 
things like persistent DRAM, then I agree that writing directly to that medium 
makes sense (but you can do that with DAX by effectively mapping that into the 
process address space).


Specifically, moving from a file system with some inefficiencies will only boost 
performance from say 20-30 IOP's to roughly 40-50 IOP's.


The way this has been handled traditionally for things like databases, etc is:

* batch up the transactions that need to be destaged
* issue an O_DIRECT async IO for all of the elements that need to be written 
(bypassed the page cache, direct to the backing store)

* wait for completion

We should probably add to that sequence an fsync() of the directory (or a file 
in the file system) to ensure that any volatile write cache is flushed, but 
there is *no* reason to fsync() each file.


I think that we need to look at why the write pattern is so heavily synchronous 
and single threaded if we are hoping to extract from any given storage tier its 
maximum performance.


Doing this can raise your file creations per second (or allocations per second) 
from a few dozen to a few hundred or more per second.


The complexity you take on by writing a new block-level allocation strategy 
(complexity that the file system currently saves you) is:

* if you lay out a lot of small objects on the block store that can grow, we 
will quickly end up doing very complicated techniques that file systems solved a 
long time ago (pre-allocation, etc)
* multi-stream aware allocation if you have multiple processes writing to the 
same store
* tracking things like allocated but unwritten (can happen if some process 
"pokes" a hole in an object, common with things like virtual machine images)


Once we end up handling all of that in new, untested code, I think that we end up 
with a lot of pain and only minimal gain in terms of performance.


ric




This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that
there are two independent layers journaling and managing different types
of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file
system to work around what it is used to: we swap extents to avoid
write-ahead (see Christoph's patch), O_NOMTIME, unprivileged
open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that
lives within it (pretending the file is a block device).  The file system
rarely gets in the way (assuming the file is prewritten and we don't do
anything stupid).  But it doesn't give us anything a block device
wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.
And although 1b performs a bit better than 1, it has similar (user-space)
complexity to 2.  On the other hand, if you step back and view the
entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex
than 2... and yet still slower.  Given we ultimately have to support both
(both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the
beaten path (1) to anything mildly exotic 

Re: newstore direction

2015-10-22 Thread Howard Chu
Milosz Tanski  adfin.com> writes:

> 
> On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil  redhat.com> wrote:
> > On Tue, 20 Oct 2015, John Spray wrote:
> >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  redhat.com> wrote:
> >> >  - We have to size the kv backend storage (probably still an XFS
> >> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> >> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> >> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> >> > a different pool and those aren't currently fungible.
> >>
> >> This is the concerning bit for me -- the other parts one "just" has to
> >> get the code right, but this problem could linger and be something we
> >> have to keep explaining to users indefinitely.  It reminds me of cases
> >> in other systems where users had to make an educated guess about inode
> >> size up front, depending on whether you're expecting to efficiently
> >> store a lot of xattrs.
> >>
> >> In practice it's rare for users to make these kinds of decisions well
> >> up-front: it really needs to be adjustable later, ideally
> >> automatically.  That could be pretty straightforward if the KV part
> >> was stored directly on block storage, instead of having XFS in the
> >> mix.  I'm not quite up with the state of the art in this area: are
> >> there any reasonable alternatives for the KV part that would consume
> >> some defined range of a block device from userspace, instead of
> >> sitting on top of a filesystem?
> >
> > I agree: this is my primary concern with the raw block approach.
> >
> > There are some KV alternatives that could consume block, but the problem
> > would be similar: we need to dynamically size up or down the kv portion of
> > the device.
> >
> > I see two basic options:
> >
> > 1) Wire into the Env abstraction in rocksdb to provide something just
> > smart enough to let rocksdb work.  It isn't much: named files (not that
> > many--we could easily keep the file table in ram), always written
> > sequentially, to be read later with random access. All of the code is
> > written around abstractions of SequentialFileWriter so that everything
> > posix is neatly hidden in env_posix (and there are various other env
> > implementations for in-memory mock tests etc.).
> >
> > 2) Use something like dm-thin to sit between the raw block device and XFS
> > (for rocksdb) and the block device consumed by newstore.  As long as XFS
> > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> > files in their entirety) we can fstrim and size down the fs portion.  If
> > we similarly make newstore's allocator stick to large blocks only we would
> > be able to size down the block portion as well.  Typical dm-thin block
> > sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> > me.  In fact, we could likely just size the fs volume at something
> > conservatively large (like 90%) and rely on -o discard or periodic fstrim
> > to keep its actual utilization in check.
> >
> 
> I think you could prototype a raw block device OSD store using LMDB as
> a starting point. I know there's been some experiments using LMDB as
> KV store before with positive read numbers and not great write
> numbers.
> 
> 1. It mmaps, just mmap the raw disk device / partition. I've done this
> as an experiment before, I can dig up a patch for LMDB.
> 2. It already has a free space management strategy. It's probably not
> right for the OSDs in the long term, but there's something to start
> with there.
> 3. It already supports transactions / COW.
> 4. LMDB isn't a huge code base so it might be a good place to start /
> evolve code from.
> 5. You're not starting a multi-year effort at the 0 point.
> 
> As to the not great write performance, that could be addressed by
> write transaction merging (what mysql implemented a few years ago).

We have a heavily hacked version of LMDB contributed by VMware that
implements a WAL. In my preliminary testing it performs synchronous writes
30x faster (on average) than current LMDB. Their version unfortunately
slashed'n'burned a lot of LMDB features that other folks actually need, so
we can't use it as-is. Currently working on rationalizing the approach and
merging it into mdb.master.

The reasons for the WAL approach:
  1) obviously sequential writes are cheaper than random writes.
  2) fsync() of a small log file will always be faster than fsync() of a
large DB. I.e., fsync() latency is proportional to the total number of pages
in the file, not just the number of dirty pages.
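
As a rough illustration of (1) and (2), a minimal WAL-append sketch.  This is
not LMDB's (or VMware's) actual code, just the shape of the idea: records go
sequentially into a small log file and become durable with one fdatasync() of
that small file; the big DB file is only fsync()'d at checkpoint time.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct wal {
    int   fd;      /* small, append-only log file */
    off_t tail;    /* current end of the log */
};

/* Append one record; durability costs one fdatasync() of the small log
 * rather than an fsync() of the large main DB file. */
int wal_append(struct wal *w, const void *rec, uint32_t len)
{
    char hdr[8];
    uint32_t crc = 0;                 /* real code would checksum the record */
    memcpy(hdr, &len, 4);
    memcpy(hdr + 4, &crc, 4);

    if (pwrite(w->fd, hdr, sizeof(hdr), w->tail) != (ssize_t)sizeof(hdr) ||
        pwrite(w->fd, rec, len, w->tail + sizeof(hdr)) != (ssize_t)len)
        return -1;
    w->tail += sizeof(hdr) + len;
    return fdatasync(w->fd);          /* sequential write + small-file sync */
}

/* Checkpoint: apply the logged records to the main DB pages, sync the DB
 * once, then truncate the log so it stays small. */
int wal_checkpoint(struct wal *w, int db_fd)
{
    /* ... replay/merge records into db_fd here ... */
    if (fsync(db_fd) < 0)
        return -1;
    w->tail = 0;
    return ftruncate(w->fd, 0);
}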

LMDB on a raw block device is a simpler proposition, and one we intend to
integrate soon as well. (Milosz, did you ever submit your changes?)

> Here you have an opportunity to do it two ways. One, you can do it in
> the application layer while waiting for the fsync from transaction to
> complete. This is probably the easier route. Two, you can do it in the
> DB layer (the LMDB 

Re: newstore direction

2015-10-22 Thread Sage Weil
On Wed, 21 Oct 2015, Ric Wheeler wrote:
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.
> 
> If we are causing additional IO's, then we really need to spend some time
> talking to the local file system gurus about this in detail.  I can help with
> that conversation.

If the file is truly preallocated (that is, prewritten with zeros... 
fallocate doesn't help here because the extents are marked unwritten), then 
sure: there is very little change in the data path.

But at that point, what is the point?  This only works if you have one (or 
a few) huge files and the user space app already has all the complexity of 
a filesystem-like thing (with its own internal journal, allocators, 
garbage collection, etc.).  Do they just do this to ease administrative 
tasks like backup?


This is the fundamental tradeoff:

1) We have a file per object.  We fsync like crazy and the fact that 
there are two independent layers journaling and managing different types 
of consistency penalizes us.

1b) We get clever and start using obscure and/or custom ioctls in the file 
system to work around what it is used to: we swap extents to avoid 
write-ahead (see Christoph's patch), O_NOMTIME, unprivileged 
open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc.

2) We preallocate huge files and write a user-space object system that 
lives within it (pretending the file is a block device).  The file system 
rarely gets in the way (assuming the file is prewritten and we don't do 
anything stupid).  But it doesn't give us anything a block device 
wouldn't, and it doesn't save us any complexity in our code.

At the end of the day, 1 and 1b are always going to be slower than 2.  
And although 1b performs a bit better than 1, it has similar (user-space) 
complexity to 2.  On the other hand, if you step back and view the 
entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex 
than 2... and yet still slower.  Given we ultimately have to support both 
(both as an upstream and as a distro), that's not very attractive.

Also note that every time we have strayed off the reservation from the 
beaten path (1) to anything mildly exotic (1b) we have been bitten by 
obscure file system bugs.  And that's assuming we get everything we need 
upstream... which is probably a year's endeavour.

Don't get me wrong: I'm all for making changes to file systems to better 
support systems like Ceph.  Things like O_NOCMTIME and O_ATOMIC make a 
huge amount of sense for a ton of different systems.  But our situation is 
a bit different: we always own the entire device (and often the server), 
so there is no need to share with other users or apps (and when you do, 
you just use the existing FileStore backend).  And as you know performance 
is a huge pain point.  We are already handicapped by virtue of being 
distributed and strongly consistent; we can't afford to give away more to 
a storage layer that isn't providing us much (or the right) value.

And I'm tired of half measures.  I want the OSD to be as fast as we can 
make it given the architectural constraints (RADOS consistency and 
ordering semantics).  This is truly low-hanging fruit: it's modular, 
self-contained, pluggable, and this will be my third time around this 
particular block.

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-22 Thread Milosz Tanski
On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil  wrote:
> On Tue, 20 Oct 2015, John Spray wrote:
>> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  wrote:
>> >  - We have to size the kv backend storage (probably still an XFS
>> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> > a different pool and those aren't currently fungible.
>>
>> This is the concerning bit for me -- the other parts one "just" has to
>> get the code right, but this problem could linger and be something we
>> have to keep explaining to users indefinitely.  It reminds me of cases
>> in other systems where users had to make an educated guess about inode
>> size up front, depending on whether you're expecting to efficiently
>> store a lot of xattrs.
>>
>> In practice it's rare for users to make these kinds of decisions well
>> up-front: it really needs to be adjustable later, ideally
>> automatically.  That could be pretty straightforward if the KV part
>> was stored directly on block storage, instead of having XFS in the
>> mix.  I'm not quite up with the state of the art in this area: are
>> there any reasonable alternatives for the KV part that would consume
>> some defined range of a block device from userspace, instead of
>> sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
>
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore.  As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion.  If
> we similarly make newstore's allocator stick to large blocks only we would
> be able to size down the block portion as well.  Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me.  In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
>

I think you could prototype a raw block device OSD store using LMDB as
a starting point. I know there's been some experiments using LMDB as
KV store before with positive read numbers and not great write
numbers.

1. It mmaps; just mmap the raw disk device / partition. I've done this
as an experiment before, I can dig up a patch for LMDB. (A minimal
sketch of the idea follows this list.)
2. It already has a free space management strategy. It's probably not
right for the OSDs in the long term, but there's something to start
with there.
3. It already supports transactions / COW.
4. LMDB isn't a huge code base so it might be a good place to start /
evolve code from.
5. You're not starting a multi-year effort at the 0 point.
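
The mmap sketch referenced in point 1, for illustration only (the device path
is an example, and a real store would obviously need its own layout on top of
the mapping):

#include <fcntl.h>
#include <linux/fs.h>      /* BLKGETSIZE64 */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdb1", O_RDWR);          /* raw partition */
    if (fd < 0)
        return 1;

    uint64_t dev_bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &dev_bytes) < 0) /* block device size */
        return 1;

    /* Block devices can be mmap()'d just like a regular file; dirty pages
     * are pushed out with msync()/fsync() as usual. */
    void *map = mmap(NULL, dev_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    printf("mapped %llu bytes of raw device\n", (unsigned long long)dev_bytes);

    msync(map, dev_bytes, MS_SYNC);              /* flush, LMDB-sync style */
    munmap(map, dev_bytes);
    close(fd);
    return 0;
}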

As to the not great write performance, that could be addressed by
write transaction merging (what mysql implemented a few years ago).
Here you have an opportunity to do it two ways. One, you can do it in
the application layer while waiting for the fsync from transaction to
complete. This is probably the easier route. Two, you can do it in the
DB layer (the LMDB transaction handling / locking) where you're
already started processing the following transactions using the
currently committing transaction (COW) as a starting point. This is
harder mostly because of the synchronization needed or involved.
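
A sketch of the application-layer variant (group commit): one committer becomes
the "leader" and issues a single fdatasync() covering every transaction that
queued up while the previous sync was in flight.  Assumptions: db_fd is opened
elsewhere and each transaction has already appended its records before calling
this; it illustrates the merging idea, not LMDB or OSD code.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static long next_id;        /* id handed to each committing transaction */
static long synced_upto;    /* highest id known to be durable */
static int  syncing;        /* nonzero while a leader is inside fdatasync() */
static int  db_fd;          /* log/DB file descriptor, opened elsewhere */

void txn_commit_durable(void)
{
    pthread_mutex_lock(&lk);
    long my_id = ++next_id;

    while (synced_upto < my_id) {
        if (!syncing) {
            long batch_end = next_id;       /* everything appended so far */
            syncing = 1;
            pthread_mutex_unlock(&lk);
            fdatasync(db_fd);               /* one sync covers the whole batch */
            pthread_mutex_lock(&lk);
            synced_upto = batch_end;
            syncing = 0;
            pthread_cond_broadcast(&cv);
        } else {
            pthread_cond_wait(&cv, &lk);    /* piggyback on the next leader's sync */
        }
    }
    pthread_mutex_unlock(&lk);
}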

I've actually spent some time thinking about doing LMDB write
transaction merging outside the OSD context. This was for another
project.

My 2 cents.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: mil...@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-21 Thread Mark Nelson

On 10/21/2015 05:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface: 
essentially you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (file systems) and can never be 
tightly connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a suboptimal 
implementation is something that you can live with compared to the longer gestation 
period of the more optimal implementation. Clearly, Sage believes that the performance 
difference is significant or he wouldn't have kicked off this discussion in the first 
place.

While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a 
significant amount of work. I will offer the case that the "loosely coupled" 
scheme may not have as much time-to-market advantage as it appears to have. One example: 
NewStore performance is limited due to bugs in XFS that won't be fixed in the field for 
quite some time (it'll take at least a couple of years before a patched version of XFS 
will be widely deployed at customer environments).

Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:


Has there been any discussion regarding opensourcing zetascale?



(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+tree write-amplification is nearly constant independent of the 
size of data under management. As the KV database gets larger (Since newStore 
is effectively moving the per-file inode into the kv database. Don't forget 
checksums that Sage wants to add :)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs CPU time 
and disk accesses to page in data structure indexes, metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is a better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).


If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some 
of them no-ops.



   - On read we have to open files by name, which means traversing the
fs namespace.  Newstore tries to keep it as flat and simple as
possible, but at a minimum it is a couple btree lookups.  We'd love to
use open by handle (which would reduce this to 1 btree traversal), but
running the daemon as ceph and not root makes that hard...


This seems like a pretty low hurdle to overcome.



   - ...and file systems insist on updating mtime on writes, even when
it is an overwrite with no allocatio

Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 09:32 AM, Sage Weil wrote:

On Tue, 20 Oct 2015, Ric Wheeler wrote:

Now:
  1 io  to write a new file
1-2 ios to sync the fs journal (commit the inode, alloc change)
(I see 2 journal IOs on XFS and only 1 on ext4...)
  1 io  to commit the rocksdb journal (currently 3, but will drop to
1 with xfs fix and my rocksdb change)

I think that might be too pessimistic - the number of discrete IOs sent down
to a spinning disk makes much less impact on performance than the number of
fsync()'s, since the IOs all land in the write cache.  Some newer spinning
drives have a non-volatile write cache, so even an fsync() might not end up
doing the expensive data transfer to the platter.

True, but in XFS's case at least the file data and journal are not
colocated, so it's 2 seeks for the new file write+fdatasync and another for
the rocksdb journal commit.  Of course, with a deep queue, we're doing
lots of these so there'll be fewer journal commits on both counts, but the
lower bound on latency of a single write is still 3 seeks, and that bound
is pretty critical when you also have network round trips and replication
(worst out of 2) on top.


What are the performance goals we are looking for?

Small, synchronous writes/second?

File creates/second?

I suspect that looking at things like seeks/write is probably looking at the 
wrong level of performance challenges.  Again, when you write to a modern drive, 
you write to its write cache and it decides internally when/how to destage to 
the platter.


If you look at the performance of XFS with streaming workloads, it will tend to 
max out the bandwidth of the underlying storage.


If we need IOP's/file writes, etc, we should be clear on what we are aiming at.




It would be interesting to get the timings on the IO's you see to measure the
actual impact.

I observed this with the journaling workload for rocksdb, but I assume the
journaling behavior is the same regardless of what is being journaled.
For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
blktrace showed an IO to the file, and 2 IOs to the journal.  I believe
the first one is the record for the inode update, and the second is the
journal 'commit' record (though I forget how I decided that).  My guess is
that XFS is being extremely careful about journal integrity here and not
writing the commit record until it knows that the preceding records landed
on stable storage.  For ext4, the latency was about ~20ms, and blktrace
showed the IO to the file and then a single journal IO.  When I made the
rocksdb change to overwrite an existing, prewritten file, the latency
dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
(XFS still showed the 2 journal commit IOs, but Dave just posted the fix
for that on the XFS list today.)
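
For anyone who wants to reproduce that kind of number, a minimal sketch of the
append+fdatasync microbenchmark (run it on the fs under test while watching
blktrace; the file name is an example):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    int fd = open("testfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    for (int i = 0; i < 100; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, buf, sizeof(buf)) != sizeof(buf) || fdatasync(fd) < 0)
            return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("4KB append + fdatasync: %.2f ms\n", ms);
    }
    close(fd);
    return 0;
}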


Right, if we want to avoid metadata-related IOs, we can preallocate a file and 
use O_DIRECT. Effectively, there should be no updates outside of the data write 
itself.  It won't bring new performance optimizations, but we could avoid 
redoing allocation and defragmentation work all over again.


Normally, best practice is to use batching to avoid paying worst case latency 
when you do a synchronous IO. Write a batch of files or appends without fsync, 
then go back and fsync and you will pay that latency once (not per file/op).
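
The buffered-IO version of that pattern, as a hedged sketch (the batch/
directory is an example path and must already exist; error handling is
minimal):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BATCH 64

int main(void)
{
    int fds[BATCH];
    char buf[4096], path[64];
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < BATCH; i++) {
        snprintf(path, sizeof(path), "batch/obj.%d", i);
        fds[i] = open(path, O_WRONLY | O_CREAT, 0644);
        if (fds[i] < 0 || write(fds[i], buf, sizeof(buf)) != sizeof(buf))
            return 1;                  /* no fsync yet: data sits in page cache */
    }

    for (int i = 0; i < BATCH; i++) {  /* pay the flush latency once per batch */
        if (fsync(fds[i]) < 0)
            return 1;
        close(fds[i]);
    }

    int dirfd = open("batch", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0 || fsync(dirfd) < 0) /* make the new directory entries durable */
        return 1;
    close(dirfd);
    return 0;
}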





Plumbing for T10 DIF/DIX already exists; what is missing is the normal block
device that handles them (not enterprise SAS/disk array class).

Yeah... which unfortunately means that unless the cheap drives
suddenly start shipping with DIF/DIX support we'll need to do the
checksums ourselves.  This is probably a good thing anyway as it doesn't
constrain our choice of checksum or checksum granularity, and will
still work with other storage devices (ssds, nvme, etc.).

sage
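
A hedged sketch of what doing the checksums ourselves could look like: one CRC
per 4KB chunk, emitted as key/value pairs to be batched into the same kv
transaction that records the allocation metadata.  The key layout and the
kv_batch_put name are purely illustrative; zlib's crc32 stands in for whatever
checksum (crc32c, xxhash, ...) is actually chosen.  Build with -lz.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 4096

/* Compute per-chunk checksums for a buffer being written at `offset` of
 * object `oid`; printf stands in for a hypothetical kv_batch_put(key, crc). */
void checksum_chunks(const char *oid, uint64_t offset,
                     const unsigned char *buf, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = len - off < CHUNK ? len - off : CHUNK;
        uint32_t crc = (uint32_t)crc32(0L, buf + off, (uInt)n);
        char key[128];
        snprintf(key, sizeof(key), "csum/%s/%016llx", oid,
                 (unsigned long long)(offset + off));
        printf("%s -> %08x\n", key, crc);
    }
}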


Might be interesting to see if a device mapper target could be written to 
support DIF/DIX.  For what it's worth, XFS developers have talked loosely about 
looking at data block checksums (could do something like btrfs does, store the 
checksums in another btree).


ric

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-21 Thread Chen, Xiaoxi
We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. 
re-inventing an NVMKV; the final conclusion was that it's not hard with 
persistent memory (which will be available soon).  But yeah, NVMKV will not work 
if no PM is present---persisting the hashing table to SSD is not practical.

Range queries seem not to be a very big issue, as the random read performance of 
today's SSDs is more than enough; I mean, even if we break all sequential reads 
into random ones (typically 70-80K IOPS, which is ~300MB/s), the performance is 
still good enough.

Anyway, I think for the high-IOPS case it's hard for the consumer to play 
well with SSDs from different vendors; it would be better to leave it to the SSD 
vendor, something like OpenStack Cinder's structure: a vendor has the responsibility 
to maintain their driver for Ceph and take care of the performance.

> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
> 
> Thanks Allen!  The devil is always in the details.  Know of anything else that
> looks promising?
> 
> Mark
> 
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities of
> > the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com>
> > Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy
> > <somnath@sandisk.com>; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1; nowadays K-V DBs care more about very small key-value pairs, say
> >>> several bytes to a few KB, but in the SSD case we only care about 4KB or
> >>> 8KB. In this way, NVMKV is a good design, and it seems some of the SSD
> >>> vendors are also trying to build this kind of interface; we have an
> >>> NVM-L library but it is still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targeted toward block-sized/aligned payloads.
> >> Storing just the metadata (block allocation map) w/ the kv api and
> >> storing the data directly on a block/page interface makes more sense to
> me.
> >>
> >> sage
> >
> > I get the feeling that some of the folks that were involved with nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
> instance.
> http://pmem.io might be a better bet, though I haven't looked closely at it.
> >
> > Mark
> >
> >>
> >>
> >>>> -Original Message-
> >>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> >>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> >>>> Sent: Tuesday, October 20, 2015 6:21 AM
> >>>> To: Sage Weil; Somnath Roy
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: RE: newstore direction
> >>>>
> >>>> Hi Sage and Somnath,
> >>>> In my humble opinion, there is another, more aggressive
> >>>> solution than a raw-block-device-based key-value store as backend for
> >>>> the objectstore: a new key-value SSD device with transaction support
> >>>> would be ideal to solve the issues.
> >>>> First of all, it is a raw SSD device. Secondly, it provides key
> >>

Re: newstore direction

2015-10-21 Thread Ric Wheeler
M trees experience exponential increase in write
amplification (cost of an insert) as the amount of data under
management increases. B+tree write-amplification is nearly constant
independent of the size of data under management. As the KV database
gets larger (Since newStore is effectively moving the per-file inode
into the kv database. Don't forget checksums that Sage wants to add
:)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs
CPU time and disk accesses to page in data structure indexes, metadata
efficiency decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a
good argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is a better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are
you sure that each fsync() takes the same time? Depending on the local
FS implementation of course, but the order of issuing those fsync()'s
can effectively make some of them no-ops.


   - On read we have to open files by name, which means traversing the
fs namespace.  Newstore tries to keep it as flat and simple as
possible, but at a minimum it is a couple btree lookups. We'd love to
use open by handle (which would reduce this to 1 btree traversal), but
running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.


   - ...and file systems insist on updating mtime on writes, even when
it is an overwrite with no allocation changes.  (We don't care about
mtime.) O_NOCMTIME patches exist but it is hard to get these past the
kernel brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey
database tricks that we can use here.


   - XFS is (probably) never going to give us data checksums,
which we want desperately.

What is the goal of having the file system do the checksums? How
strong do they need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO
(each write will possibly generate at least one other write to update
that new checksum).


But what's the alternative?  My thought is to just bite the bullet and
consume a raw block device directly.  Write an allocator, hopefully
keep it pretty simple, and manage it in kv store along with all of our
other metadata.

The big problem with consuming block devices directly is that you
ultimately end up recreating most of the features that you had in the
file system. Even enterprise databases like Oracle and DB2 have been
migrating away from running on raw block devices in favor of file
systems over time.  In effect, you are looking at making a simple on
disk file system which is always easier to start than it is to get
back to a stable, production ready state.

I think that it might be quicker and more maintainable to spend some
time working with the local file system people (XFS or other) to see
if we can jointly address the concerns you have.

Wins:

   - 2 IOs for most: one to write the data to unused space in the block
device, one to commit our transaction (vs 4+ before).  For overwrites,
we'd have one io to do our write-ahead log (kv journal), then do the
overwrite async (vs 4+ before).

   - No concern about mtime getting in the way

   - Faster reads (no fs lookup)

   - Similarly sized metadata for most objects.  If we assume most
objects are not fragmented, then the metadata to store the block
offsets is about the same size as the metadata to store the filenames
we have now.

Problems:

   - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put
metadata on
SSD!) so it won

Re: newstore direction

2015-10-21 Thread Mark Nelson
Thanks Allen!  The devil is always in the details.  Know of anything 
else that looks promising?


Mark

On 10/21/2015 05:06 AM, Allen Samuels wrote:

I doubt that NVMKV will be useful for two reasons:

(1) It relies on the unique sparse-mapping addressing capabilities of the 
FusionIO VSL interface, it won't run on standard SSDs
(2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range 
operations on keys). This is pretty much required for deep scrubbing.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, October 20, 2015 6:20 AM
To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com>
Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy 
<somnath@sandisk.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/20/2015 07:30 AM, Sage Weil wrote:

On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:

+1; nowadays K-V DBs care more about very small key-value pairs, say
several bytes to a few KB, but in the SSD case we only care about 4KB or
8KB. In this way, NVMKV is a good design, and it seems some of the SSD
vendors are also trying to build this kind of interface; we have an
NVM-L library but it is still under development.


Do you have an NVMKV link?  I see a paper and a stale github repo..
not sure if I'm looking at the right thing.

My concern with using a key/value interface for the object data is
that you end up with lots of key/value pairs (e.g., $inode_$offset =
$4kb_of_data) that is pretty inefficient to store and (depending on
the
implementation) tends to break alignment.  I don't think these
interfaces are targeted toward block-sized/aligned payloads.  Storing
just the metadata (block allocation map) w/ the kv api and storing the
data directly on a block/page interface makes more sense to me.

sage


I get the feeling that some of the folks that were involved with nvmkv at 
Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
instance.  http://pmem.io might be a better bet, though I haven't looked 
closely at it.

Mark





-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, October 20, 2015 6:21 AM
To: Sage Weil; Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

Hi Sage and Somnath,
In my humble opinion, there is another, more aggressive solution
than a raw-block-device-based key-value store as backend for
the objectstore: a new key-value SSD device with transaction support would be
ideal to solve the issues.
First of all, it is a raw SSD device. Secondly, it provides a key-value
interface directly from the SSD. Thirdly, it can provide transaction
support; consistency will be guaranteed by the hardware device. It
pretty much satisfies all of the objectstore's needs without any extra
overhead, since there is no extra layer in between the device and the objectstore.
 Either way, I strongly support having Ceph own its data format
instead of relying on a filesystem.

Regards,
James

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 1:55 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, Somnath Roy wrote:

Sage,
I fully support that.  If we want to saturate SSDs , we need to get
rid of this filesystem overhead (which I am in process of measuring).
Also, it will be good if we can eliminate the dependency on the k/v
dbs (for storing allocators and all). The reason is the unknown
write amps they cause.


My hope is to keep this behind the KeyValueDB interface (and/or change
it as appropriate) so that other backends can be easily swapped in
(e.g. a btree-based one for high-end flash).

sage




Thanks & Regards
Somnath


-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 12:49 PM
To: ceph-devel@vger.kernel.org
Subject: newstore direction

The current design is based on two simple ideas:

   1) a key/value interface is a better way to manage all of our
internal metadata (object metadata, attrs, layout, collection
membership, write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.
A few
things:

   - We currently write the data to the file, fsync, then commit the
kv transaction.  That's at least 3 IOs: one for the data, one for
the fs journal, one for the kv txn to commit (at least once my
rocksdb changes land... the kv commit is cu

Re: newstore direction

2015-10-21 Thread Mark Nelson
 swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs
CPU time and disk accesses to page in data structure indexes, metadata
efficiency decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a
good argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is a better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are
you sure that each fsync() takes the same time? Depending on the local
FS implementation of course, but the order of issuing those fsync()'s
can effectively make some of them no-ops.


   - On read we have to open files by name, which means traversing the
fs namespace.  Newstore tries to keep it as flat and simple as
possible, but at a minimum it is a couple btree lookups.  We'd love to
use open by handle (which would reduce this to 1 btree traversal), but
running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.


   - ...and file systems insist on updating mtime on writes, even when
it is an overwrite with no allocation changes.  (We don't care about
mtime.) O_NOCMTIME patches exist but it is hard to get these past the
kernel brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey
database tricks that we can use here.


   - XFS is (probably) never going to give us data checksums,
which we want desperately.

What is the goal of having the file system do the checksums? How
strong do they need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO
(each write will possibly generate at least one other write to update
that new checksum).


But what's the alternative?  My thought is to just bite the bullet and
consume a raw block device directly.  Write an allocator, hopefully
keep it pretty simple, and manage it in kv store along with all of our
other metadata.

The big problem with consuming block devices directly is that you
ultimately end up recreating most of the features that you had in the
file system. Even enterprise databases like Oracle and DB2 have been
migrating away from running on raw block devices in favor of file
systems over time.  In effect, you are looking at making a simple on
disk file system which is always easier to start than it is to get
back to a stable, production ready state.

I think that it might be quicker and more maintainable to spend some
time working with the local file system people (XFS or other) to see
if we can jointly address the concerns you have.

Wins:

   - 2 IOs for most: one to write the data to unused space in the block
device, one to commit our transaction (vs 4+ before).  For overwrites,
we'd have one io to do our write-ahead log (kv journal), then do the
overwrite async (vs 4+ before).

   - No concern about mtime getting in the way

   - Faster reads (no fs lookup)

   - Similarly sized metadata for most objects.  If we assume most
objects are not fragmented, then the metadata to store the block
offsets is about the same size as the metadata to store the filenames
we have now.

Problems:

   - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put
metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs
of rgw index data or cephfs metadata?  Suddenly we are pulling storage
out of a different pool and those aren't currently fungible.

   - We have to write and maintain an allocator (a rough sketch follows
below).  I'm still optimistic this can be reasonably simple, especially
for the flash case (where fragmentation isn't such an issue as long as
our blocks are reasonably sized).  For disk
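
The kind of "reasonably simple" allocator referred to above, as a hedged
sketch: a bitmap of fixed-size blocks whose state would be persisted as just
another value in the kv store (names and the block size are illustrative,
not anything decided in this thread).

#include <stdint.h>
#include <stdlib.h>

struct block_alloc {
    uint64_t nblocks;    /* device size / block size */
    uint64_t next;       /* rotor for next-fit scanning */
    uint8_t *bitmap;     /* 1 bit per block; persisted via the kv backend */
};

struct block_alloc *ba_new(uint64_t nblocks)
{
    struct block_alloc *a = calloc(1, sizeof(*a));
    a->nblocks = nblocks;
    a->bitmap = calloc((nblocks + 7) / 8, 1);
    return a;
}

/* Allocate one block; returns the block number, or UINT64_MAX when full.
 * The caller records the new state in the same kv transaction as the
 * object metadata it belongs to. */
uint64_t ba_alloc(struct block_alloc *a)
{
    for (uint64_t i = 0; i < a->nblocks; i++) {
        uint64_t b = (a->next + i) % a->nblocks;
        if (!(a->bitmap[b / 8] & (1 << (b % 8)))) {
            a->bitmap[b / 8] |= (uint8_t)(1 << (b % 8));
            a->next = b + 1;
            return b;
        }
    }
    return UINT64_MAX;
}

void ba_free(struct block_alloc *a, uint64_t b)
{
    a->bitmap[b / 8] &= (uint8_t)~(1 << (b % 8));
}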

Re: newstore direction

2015-10-21 Thread Sage Weil
On Wed, 21 Oct 2015, Ric Wheeler wrote:
> On 10/21/2015 04:22 AM, Orit Wasserman wrote:
> > On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> > > On 10/19/2015 03:49 PM, Sage Weil wrote:
> > > > The current design is based on two simple ideas:
> > > > 
> > > >1) a key/value interface is a better way to manage all of our internal
> > > > metadata (object metadata, attrs, layout, collection membership,
> > > > write-ahead logging, overlay data, etc.)
> > > > 
> > > >2) a file system is well suited for storing object data (as files).
> > > > 
> > > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > > > few
> > > > things:
> > > > 
> > > >- We currently write the data to the file, fsync, then commit the kv
> > > > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > > > journal, one for the kv txn to commit (at least once my rocksdb changes
> > > > land... the kv commit is currently 2-3).  So two people are managing
> > > > metadata, here: the fs managing the file metadata (with its own
> > > > journal) and the kv backend (with its journal).
> > > If all of the fsync()'s fall into the same backing file system, are you
> > > sure
> > > that each fsync() takes the same time? Depending on the local FS
> > > implementation
> > > of course, but the order of issuing those fsync()'s can effectively make
> > > some of
> > > them no-ops.
> > > 
> > > >- On read we have to open files by name, which means traversing the
> > > > fs
> > > > namespace.  Newstore tries to keep it as flat and simple as possible,
> > > > but
> > > > at a minimum it is a couple btree lookups.  We'd love to use open by
> > > > handle (which would reduce this to 1 btree traversal), but running
> > > > the daemon as ceph and not root makes that hard...
> > > This seems like a pretty low hurdle to overcome.
> > > 
> > > >- ...and file systems insist on updating mtime on writes, even when
> > > > it is
> > > > an overwrite with no allocation changes.  (We don't care about mtime.)
> > > > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > > > brainfreeze.
> > > Are you using O_DIRECT? Seems like there should be some enterprisey
> > > database
> > > tricks that we can use here.
> > > 
> > > >- XFS is (probably) never going to give us data checksums,
> > > > which we
> > > > want desperately.
> > > What is the goal of having the file system do the checksums? How strong do
> > > they
> > > need to be and what size are the chunks?
> > > 
> > > If you update this on each IO, this will certainly generate more IO (each
> > > write
> > > will possibly generate at least one other write to update that new
> > > checksum).
> > > 
> > > > But what's the alternative?  My thought is to just bite the bullet and
> > > > consume a raw block device directly.  Write an allocator, hopefully keep
> > > > it pretty simple, and manage it in kv store along with all of our other
> > > > metadata.
> > > The big problem with consuming block devices directly is that you
> > > ultimately end
> > > up recreating most of the features that you had in the file system. Even
> > > enterprise databases like Oracle and DB2 have been migrating away from
> > > running
> > > on raw block devices in favor of file systems over time.  In effect, you
> > > are
> > > looking at making a simple on disk file system which is always easier to
> > > start
> > > than it is to get back to a stable, production ready state.
> > The best performance is still on a block device (SAN).
> > File systems simplify the operational tasks, which is worth the performance
> > penalty for a database. I think in a storage system this is not the
> > case.
> > In many cases they can use their own file system that is tailored for
> > the database.
> 
> You will have to trust me on this as the Red Hat person who spoke to pretty
> much all of our key customers about local file systems and storage - customers
> all have migrated over to using normal file systems under Oracle/DB2.
> Typically, they use XFS or ext4.  I don't know of any non-standard file
> systems and only have seen one account running on a raw block store in 8 years
> :)
> 
> If you have a pre-allocated file and write using O_DIRECT, your IO path is
> identical in terms of IO's sent to the device.

...except it's not.  Preallocating the file gives you contiguous space, 
but you still have to mark the extent written (not zero/prealloc).  The 
only way to get an identical IO pattern is to *pre-write* zeros (or 
whatever) to the file... which is hours on modern HDDs.

Ted asked for a way to force prealloc to expose preexisting disk bits a 
couple years back at LSF and it was shot down for security reasons (and 
rightly so, IMO).

If you're going down this path, you already have a "file system" in user 
space sitting on top of the preallocated file, and you could just as 
easily use the block device directly.

If you're not, then you're writing smaller files (e.g., 

Re: newstore direction

2015-10-21 Thread Mark Nelson
5-10/msg00545.html

rolled out into RHEL/CentOS/Ubuntu.  I have no idea how long these
things typically take, but this might be a good test case.


How quickly things land in a distro is up to the interested parties
making the case for it.


My thought is that there is some inflection point where the userland 
kvstore/block approach is going to be less work, for everyone I think, 
than trying to quickly discover, understand, fix, and push upstream 
patches that sometimes only really benefit us.  I don't know if we've 
truly hit that point, but it's tough for me to find flaws with 
Sage's argument.




Ric





Regards,

Ric



Another example: Sage has just had to substantially rework the
journaling code of rocksDB.

In short, as you can tell, I'm full throated in favor of going down
the optimal route.

Internally at Sandisk, we have a KV store that is optimized for flash
(it's called ZetaScale). We have extended it with a raw block
allocator just as Sage is now proposing to do. Our internal
performance measurements show a significant advantage over the current
NewStore. That performance advantage stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree
(levelDB/RocksDB). LSM trees experience exponential increase in write
amplification (cost of an insert) as the amount of data under
management increases. B+tree write-amplification is nearly constant
independent of the size of data under management. As the KV database
gets larger (Since newStore is effectively moving the per-file inode
into the kv database. Don't forget checksums that Sage wants to add
:)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs
CPU time and disk accesses to page in data structure indexes, metadata
efficiency decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a
good argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is a better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit
the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are
you sure that each fsync() takes the same time? Depending on the local
FS implementation of course, but the order of issuing those fsync()'s
can effectively make some of them no-ops.


   - On read we have to open files by name, which means traversing the
fs namespace.  Newstore tries to keep it as flat and simple as
possible, but at a minimum it is a couple btree lookups. We'd love to
use open by handle (which would reduce this to 1 btree traversal), but
running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.


   - ...and file systems insist on updating mtime on writes, even when
it is an overwrite with no allocation changes.  (We don't care about
mtime.) O_NOCMTIME patches exist but it is hard to get these past the
kernel brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey
database tricks that we can use here.


   - XFS is (probably) never going to give us data checksums,
which we want desperately.

What is the goal of having the file system do the checksums? How
strong do they need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO
(each write will possibly generate at least one other write to update
that new checksum).


But what's the alternative?  My thought is to just bite the bullet and
consume a raw block device directly.  Write an allocator, hopefully
keep it pretty simple, and manage it in kv store along with all of our
other metadata.

The big problem with consuming block devices directly is that you
ultimately end up recreating most of the features t

Re: newstore direction

2015-10-21 Thread Martin Millnert
Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland 
> kvstore/block approach is going to be less work, for everyone I think, 
> than trying to quickly discover, understand, fix, and push upstream 
> patches that sometimes only really benefit us.  I don't know if we've 
> truly hit that point, but it's tough for me to find flaws with 
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are
further aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory mapped
(multiple approaches exist) userland networking, which for packet
management has the benefit of - for very, very specific applications of
networking code - avoiding e.g. per-packet context switches etc, and
streamlining processor cache management performance. People have gone as
far as removing CPU cores from the CPU scheduler to completely dedicate them
to the networking task at hand (cache optimizations). There are various
latency/throughput (bulking) optimizations applicable, but at the end of
the day, it's about keeping the CPU bus busy with "revenue" bus traffic.

Granted, storage IO operations may be much heavier in cycle counts for
context switches to ever appear as a problem in themselves, certainly
for slower SSDs and HDDs. However, when going for truly high performance
IO, *every* hurdle in the data path counts toward the total latency.
(And really, high-performance random IO characteristics approach the
networking, per-packet handling characteristics).  Now, I'm not really
suggesting memory-mapping a storage device to user space, not at all,
but having better control over the data path for a very specific use
case, reduces dependency on the code that works as best as possible for
the general case, and allows for very purpose-built code, to address a
narrow set of requirements. ("Ceph storage cluster backend" isn't a
typical FS use case.) It also decouples dependencies on users i.e.
waiting for the next distro release before being able to take up the
benefits of improvements to the storage code.

A random google came up with related data on where "doing something way
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html 

I (FWIW) certainly agree there is merit to the idea.
The scientific approach here could perhaps be to simply enumerate all
corner cases of "generic FS" that actually cause the experienced
issues, and assess the probability of them being solved (and if so, when).
That *could* improve the chances of approaching consensus, which wouldn't
hurt, I suppose?

BR,
Martin

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-21 Thread Allen Samuels
I am pushing internally to open-source ZetaScale. Recent events may or may not 
affect that trajectory -- stay tuned.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] 
Sent: Wednesday, October 21, 2015 10:45 PM
To: Allen Samuels <allen.samu...@sandisk.com>; Ric Wheeler 
<rwhee...@redhat.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 05:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface: essentially you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (File Systems) and can never be tightly-connected to the other 
> one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.
>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work. I will offer the case that the 
> "loosely couple" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed at 
> customer environments).
>
> Another example: Sage has just had to substantially rework the journaling 
> code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:

Has there been any discussion regarding opensourcing zetascale?

>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (levelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+tree write-amplification is nearly constant independent of the 
> size of data under management. As the KV database gets larger (Since newStore 
> is effectively moving the per-file inode into the kv database. Don't forget 
> checksums that Sage wants to add :)) this performance delta swamps all 
> others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time 
> and disk accesses to page in data structure indexes, metadata efficiency 
> decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
> Sent: Tuesday, October 20, 2015 11:32 AM
> To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>> The current design is based on two simple ideas:
>>
>>1) a key/value interface is better way to manage all of our 
>> internal metadata (object metadata, attrs, layout, collection 
>> membership, write-ahead logging, overlay data, etc.)
>>
>>2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  
>> A few
>> things:
>>
>>- We currently write the data to the file, fsync, then commit the 
>> kv transaction.  That's

RE: newstore direction

2015-10-21 Thread Allen Samuels
One of the biggest changes that flash is making in the storage world is the way 
basic trade-offs in storage management software architecture are being 
affected. In the HDD world CPU time per IOP was relatively inconsequential, 
i.e., it had little effect on overall performance, which was limited by the 
physics of the hard drive. Flash is now inverting that situation. When you look 
at the performance levels being delivered in the latest generation of NVMe SSDs 
you rapidly see that the storage itself is generally no longer the bottleneck 
(speaking about BW, not latency of course) but rather the system sitting 
in front of the storage. Generally it's the CPU cost of 
an IOP.

When Sandisk first started working with Ceph (Dumpling), the design of librados 
and the OSD led to a situation where the CPU cost of an IOP was dominated by 
context switches and network socket handling. Over time, much of that has been 
addressed. The socket handling code has been re-written (more than once!), and some 
of the internal queueing in the OSD (and the associated context switches) has 
been eliminated. As the CPU costs have dropped, performance on flash has 
improved accordingly.

Because we didn't want to completely re-write the OSD (time-to-market and 
stability drove that decision), we didn't move it from the current "thread per 
IOP" model into a truly asynchronous "thread per CPU core" model that 
essentially eliminates context switches in the IO path. But a fully optimized 
OSD would go down that path (at least part-way). I believe it's been proposed 
in the past. Perhaps a hybrid "fast-path" style could get most of the benefits 
while preserving much of the legacy code.

I believe this trend toward thread-per-core software development will also tend 
to support the "do it in user-space" trend. That's because most of the kernel 
and file-system interface is architected around the blocking "thread-per-IOP" 
model and is unlikely to change in the future.
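
To make the thread-per-core idea concrete, here is a rough, hypothetical sketch
(none of these names exist in the OSD; this is illustration only): one worker
pinned per core, each with its own queue, so an IO is dispatched once and then
stays on that core until completion.

  // Hypothetical sketch of a thread-per-core IO path: one worker pinned per
  // core, each with its own queue, so an IO never migrates between cores and
  // the only hand-off in the hot path is the initial dispatch.
  #include <pthread.h>
  #include <sched.h>
  #include <atomic>
  #include <cstdint>
  #include <deque>
  #include <functional>
  #include <mutex>
  #include <thread>
  #include <vector>

  struct Shard {
    std::mutex m;                          // per-shard, so no global contention
    std::deque<std::function<void()>> q;
    std::atomic<bool> stop{false};
  };

  class ShardedEngine {
    std::vector<Shard> shards;
    std::vector<std::thread> workers;
  public:
    explicit ShardedEngine(unsigned ncores) : shards(ncores) {
      for (unsigned c = 0; c < ncores; ++c) {
        workers.emplace_back([this, c] {
          cpu_set_t set; CPU_ZERO(&set); CPU_SET(c, &set);
          pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
          Shard& s = shards[c];
          while (!s.stop) {
            std::function<void()> op;
            {
              std::lock_guard<std::mutex> l(s.m);
              if (s.q.empty()) continue;     // a real engine would poll or park
              op = std::move(s.q.front());
              s.q.pop_front();
            }
            op();                            // runs to completion on this core
          }
        });
      }
    }
    // Route an IO to the shard that owns its placement group.
    void submit(uint64_t pg, std::function<void()> op) {
      Shard& s = shards[pg % shards.size()];
      std::lock_guard<std::mutex> l(s.m);
      s.q.push_back(std::move(op));
    }
    ~ShardedEngine() {
      for (auto& s : shards) s.stop = true;
      for (auto& t : workers) t.join();
    }
  };

The point isn't this particular toy; it's that once an op is bound to a core
there is nothing left in the IO path to context-switch over.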


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Martin Millnert [mailto:mar...@millnert.se]
Sent: Thursday, October 22, 2015 6:20 AM
To: Mark Nelson <mnel...@redhat.com>
Cc: Ric Wheeler <rwhee...@redhat.com>; Allen Samuels 
<allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

Adding 2c

On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> My thought is that there is some inflection point where the userland
> kvstore/block approach is going to be less work, for everyone I think,
> than trying to quickly discover, understand, fix, and push upstream
> patches that sometimes only really benefit us.  I don't know if we've
> truly hit that that point, but it's tough for me to find flaws with
> Sage's argument.

Regarding the userland / kernel land aspect of the topic, there are further 
aspects AFAIK not yet addressed in the thread:
In the networking world, there's been development on memory mapped (multiple 
approaches exist) userland networking, which for packet management has the 
benefit of - for very, very specific applications of networking code - avoiding 
e.g. per-packet context switches etc, and streamlining processor cache 
management performance. People have gone as far as removing CPU cores from CPU 
scheduler to completely dedicate them to the networking task at hand (cache 
optimizations). There are various latency/throughput (bulking) optimizations 
applicable, but at the end of the day, it's about keeping the CPU bus busy with 
"revenue" bus traffic.

Granted, storage IO operations may be much heavier in cycle counts for context 
switches to ever appear as a problem in themselves, certainly for slower SSDs 
and HDDs. However, when going for truly high performance IO, *every* hurdle in 
the data path counts toward the total latency.
(And really, high performance random IO starts to look a lot like networking, 
with its per-packet handling characteristics.)  Now, I'm not really 
suggesting memory-mapping a storage device to user space, not at all, but 
having better control over the data path for a very specific use case reduces 
dependency on code that works as well as possible for the general case, and 
allows for very purpose-built code to address a narrow set of requirements. 
("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples 
dependencies on users, i.e.
waiting for the next distro release before being able to take up the benefits 
of improvements to the storage code.

A random google came up with related data on where "doing something way 
different" /can/ have significant benefits:
http://phunq.net/pipermail/tux3/2015-April/002147.html

RE: newstore direction

2015-10-21 Thread Allen Samuels
Fixing the bug doesn't take a long time. Getting it deployed is where the delay 
is. Many companies standardize on a particular release of a particular distro. 
Getting them to switch to a new release -- even a "bug fix" point release -- is 
a major undertaking that often is a complete roadblock. Just my experience. 
YMMV. 


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com] 
Sent: Wednesday, October 21, 2015 8:24 PM
To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction



On 10/21/2015 06:06 AM, Allen Samuels wrote:
> I agree that moving newStore to raw block is going to be a significant 
> development effort. But the current scheme of using a KV store combined with 
> a normal file system is always going to be problematic (FileStore or 
> NewStore). This is caused by the transactional requirements of the 
> ObjectStore interface, essentially you need to make transactionally 
> consistent updates to two indexes, one of which doesn't understand 
> transactions (File Systems) and can never be tightly-connected to the other 
> one.
>
> You'll always be able to make this "loosely coupled" approach work, but it 
> will never be optimal. The real question is whether the performance 
> difference of a suboptimal implementation is something that you can live with 
> compared to the longer gestation period of the more optimal implementation. 
> Clearly, Sage believes that the performance difference is significant or he 
> wouldn't have kicked off this discussion in the first place.

I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  Very hard to 
understand why the local file system is a barrier for performance in this case 
when it is not an issue in existing enterprise applications.

We need some deep analysis with some local file system experts thrown in to 
validate the concerns.

>
> While I think we can all agree that writing a full-up KV and raw-block 
> ObjectStore is a significant amount of work. I will offer the case that the 
> "loosely couple" scheme may not have as much time-to-market advantage as it 
> appears to have. One example: NewStore performance is limited due to bugs in 
> XFS that won't be fixed in the field for quite some time (it'll take at least 
> a couple of years before a patched version of XFS will be widely deployed at 
> customer environments).

Not clear what bugs you are thinking of or why you think fixing bugs will take 
a long time to hit the field in XFS. Red Hat has most of the XFS developers on 
staff and we actively backport fixes and ship them, other distros do as well.

Never seen a "bug" take a couple of years to hit users.

Regards,

Ric

>
> Another example: Sage has just had to substantially rework the journaling 
> code of rocksDB.
>
> In short, as you can tell, I'm full throated in favor of going down the 
> optimal route.
>
> Internally at Sandisk, we have a KV store that is optimized for flash (it's 
> called ZetaScale). We have extended it with a raw block allocator just as 
> Sage is now proposing to do. Our internal performance measurements show a 
> significant advantage over the current NewStore. That performance advantage 
> stems primarily from two things:
>
> (1) ZetaScale uses a B+-tree internally rather than an LSM tree 
> (levelDB/RocksDB). LSM trees experience exponential increase in write 
> amplification (cost of an insert) as the amount of data under management 
> increases. B+tree write-amplification is nearly constant independent of the 
> size of data under management. As the KV database gets larger (Since newStore 
> is effectively moving the per-file inode into the kv database. Don't forget 
> checksums that Sage wants to add :)) this performance delta swamps all 
> others.
> (2) Having a KV and a file-system causes a double lookup. This costs CPU time 
> and disk accesses to page in data structure indexes, metadata efficiency 
> decreases.
>
> You can't avoid (2) as long as you're using a file system.
>
> Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
> argument for keeping the KV module pluggable.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
>
> -Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org]

RE: newstore direction

2015-10-21 Thread Allen Samuels
Actually Range queries are an important part of the performance story and 
random read speed doesn't really solve the problem.

When you're doing a scrub, you need to enumerate the objects in a specific 
order on multiple nodes -- so that they can compare the contents of their 
stores in order to determine if data cleaning needs to take place.

If you don't have in-order enumeration in your basic data structure (which 
NVMKV doesn't have) then you're forced to sort the directory before you can 
respond to an enumeration. That sort will either consume huge amounts of IOPS 
OR huge amounts of DRAM. Regardless of the choice, you'll see a significant 
degradation of performance while the scrub is ongoing -- which is one of the 
biggest problems with clustered systems (expensive and extensive maintenance 
operations).
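
To make the contrast concrete: with an ordered KV store the enumeration a scrub
needs is just a prefix range scan, as in the sketch below (RocksDB iterator API
used purely for illustration, and the key layout is made up); a hash-based
store has no equivalent walk and must materialize and sort the listing first.

  // Sketch: in-order object enumeration for scrub as a prefix range scan over
  // an ordered KV store.  RocksDB is used for illustration only and the
  // "pg_prefix" key layout is made up.
  #include <rocksdb/db.h>
  #include <iostream>
  #include <memory>
  #include <string>

  void enumerate_pg(rocksdb::DB* db, const std::string& pg_prefix) {
    rocksdb::ReadOptions ro;
    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
    for (it->Seek(pg_prefix);
         it->Valid() && it->key().starts_with(pg_prefix);
         it->Next()) {
      // Keys arrive in sorted order; the replica does the same walk and the
      // two streams can be compared as they go -- no sort, no extra DRAM.
      std::cout << it->key().ToString() << "\n";
    }
  }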


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com]
Sent: Thursday, October 22, 2015 1:10 AM
To: Mark Nelson <mnel...@redhat.com>; Allen Samuels 
<allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>
Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy 
<somnath@sandisk.com>; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. 
re-inventing an NVMKV; the final conclusion was that it's not hard with 
persistent memory (which will be available soon).  But yeah, NVMKV will not work 
if no PM is present -- persisting the hashing table to SSD is not practical.

Range queries seem not to be a very big issue, as the random read performance of 
today's SSDs is more than enough; I mean, even if we break all sequential access 
into random (typically 70-80K IOPS, which is ~300MB/s at 4KB), the performance is 
still good enough.

Anyway, I think for the high-IOPS case it's hard for the consumer to play 
well with SSDs from different vendors; it would be better to leave it to the SSD 
vendor, something like OpenStack Cinder's structure, where a vendor has the 
responsibility to maintain their driver for Ceph and take care of the performance.

> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Wednesday, October 21, 2015 9:36 PM
> To: Allen Samuels; Sage Weil; Chen, Xiaoxi
> Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Thanks Allen!  The devil is always in the details.  Know of anything
> else that looks promising?
>
> Mark
>
> On 10/21/2015 05:06 AM, Allen Samuels wrote:
> > I doubt that NVMKV will be useful for two reasons:
> >
> > (1) It relies on the unique sparse-mapping addressing capabilities
> > of the FusionIO VSL interface, it won't run on standard SSDs
> > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no
> range operations on keys). This is pretty much required for deep scrubbing.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Tuesday, October 20, 2015 6:20 AM
> > To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi
> > <xiaoxi.c...@intel.com>
> > Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy
> > <somnath@sandisk.com>; ceph-devel@vger.kernel.org
> > Subject: Re: newstore direction
> >
> > On 10/20/2015 07:30 AM, Sage Weil wrote:
> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> >>> +1, nowadays K-V DB care more about very small key-value pairs,
> >>> +say
> >>> several bytes to a few KB, but in SSD case we only care about 4KB
> >>> or 8KB. In this way, NVMKV is a good design and seems some of the
> >>> SSD vendor are also trying to build this kind of interface, we had
> >>> a NVM-L library but still under development.
> >>
> >> Do you have an NVMKV link?  I see a paper and a stale github repo..
> >> not sure if I'm looking at the right thing.
> >>
> >> My concern with using a key/value interface for the object data is
> >> that you end up with lots of key/value pairs (e.g., $inode_$offset
> >> =
> >> $4kb_of_data) that is pretty inefficient to store and (depending on
> >> the
> >> implementation) tends to break alignment.  I don't think these
> >> interfaces are targetted toward block-

Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 08:53 PM, Allen Samuels wrote:

Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many 
companies standardize on a particular release of a particular distro. Getting them to 
switch to a new release -- even a "bug fix" point release -- is a major 
undertaking that often is a complete roadblock. Just my experience. YMMV.



Customers do control the pace that they upgrade their machines, but we put out 
fixes on a very regular pace.  A lot of customers will get fixes without having 
to qualify a full new release (i.e., fixes come out between major and minor 
releases are easy).


If someone is deploying a critical server for storage, then it falls back on the 
storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).


ric

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-21 Thread Allen Samuels
I agree. My only point was that you still have to factor this time into the 
argument that by continuing to put NewStore on top of a file system you'll get 
to a stable system much sooner than the longer development path of doing your 
own raw storage allocator. IMO, once you factor that into the equation the "on 
top of an FS" path doesn't look like such a clear winner.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: Ric Wheeler [mailto:rwhee...@redhat.com]
Sent: Thursday, October 22, 2015 10:17 AM
To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; 
ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/21/2015 08:53 PM, Allen Samuels wrote:
> Fixing the bug doesn't take a long time. Getting it deployed is where the 
> delay is. Many companies standardize on a particular release of a particular 
> distro. Getting them to switch to a new release -- even a "bug fix" point 
> release -- is a major undertaking that often is a complete roadblock. Just my 
> experience. YMMV.
>

Customers do control the pace that they upgrade their machines, but we put out 
fixes on a very regular pace.  A lot of customers will get fixes without having 
to qualify a full new release (i.e., fixes come out between major and minor 
releases are easy).

If someone is deploying a critical server for storage, then it falls back on 
the storage software team to help guide them and encourage them to update when 
needed (and no promises of success, but people move if the win is big. If it is 
not, they can wait).

ric




PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-21 Thread Sage Weil
On Tue, 20 Oct 2015, Ric Wheeler wrote:
> > Now:
> >  1 io  to write a new file
> >1-2 ios to sync the fs journal (commit the inode, alloc change)
> >(I see 2 journal IOs on XFS and only 1 on ext4...)
> >  1 io  to commit the rocksdb journal (currently 3, but will drop to
> >1 with xfs fix and my rocksdb change)
> 
> I think that might be too pessimistic - the number of discrete IO's sent down
> to a spinning disk makes much less of an impact on performance than the number
> of fsync()'s, since the IO's all land in the write cache.  Some newer spinning
> drives have a non-volatile write cache, so even an fsync() might not end up
> doing the expensive data transfer to the platter.

True, but in XFS's case at least the file data and journal are not 
colocated, so it's 2 seeks for the new file write+fdatasync and another for 
the rocksdb journal commit.  Of course, with a deep queue, we're doing 
lots of these so there'd be fewer journal commits on both counts, but the 
lower bound on latency of a single write is still 3 seeks, and that bound 
is pretty critical when you also have network round trips and replication 
(worst out of 2) on top.

> It would be interesting to get the timings on the IO's you see to measure the
> actual impact.

I observed this with the journaling workload for rocksdb, but I assume the 
journaling behavior is the same regardless of what is being journaled.  
For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and 
blktrace showed an IO to the file, and 2 IOs to the journal.  I believe 
the first one is the record for the inode update, and the second is the 
journal 'commit' record (though I forget how I decided that).  My guess is 
that XFS is being extremely careful about journal integrity here and not 
writing the commit record until it knows that the preceding records landed 
on stable storage.  For ext4, the latency was about ~20ms, and blktrace 
showed the IO to the file and then a single journal IO.  When I made the 
rocksdb change to overwrite an existing, prewritten file, the latency 
dropped to ~10ms on ext4, and blktrace showed a single IO as expected.  
(XFS still showed the 2 journal commit IOs, but Dave just posted the fix 
for that on the XFS list today.)
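
For reference, the access pattern being timed is nothing more than the loop
below (a standalone approximation, not the actual rocksdb workload), with
blktrace running on the side.

  // Standalone sketch of the measurement: 4KB append + fdatasync latency.
  // Run blktrace alongside to see the file IO and journal IOs described above.
  #include <fcntl.h>
  #include <unistd.h>
  #include <chrono>
  #include <cstdio>
  #include <vector>

  int main() {
    int fd = open("bench.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    std::vector<char> buf(4096, 'x');
    for (int i = 0; i < 100; ++i) {
      auto t0 = std::chrono::steady_clock::now();
      if (write(fd, buf.data(), buf.size()) != (ssize_t)buf.size()) {
        perror("write"); return 1;
      }
      fdatasync(fd);   // this is where the fs journal commit latency shows up
      auto t1 = std::chrono::steady_clock::now();
      printf("append+fdatasync: %.2f ms\n",
             std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    close(fd);
    return 0;
  }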

> Plumbing for T10 DIF/DIX already exist, what is missing is the normal block
> device that handles them (not enterprise SAS/disk array class)

Yeah... which unfortunately means that unless the cheap drives 
suddenly start shipping with DIF/DIX support we'll need to do the 
checksums ourselves.  This is probably a good thing anyway as it doesn't 
constrain our choice of checksum or checksum granularity, and will 
still work with other storage devices (ssds, nvme, etc.).
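
Doing it ourselves is conceptually simple; something like the sketch below,
where both the checksum function and the block granularity remain our choice
(the bitwise crc32c here is only a stand-in for a real table-driven or SSE4.2
implementation).

  // Sketch: application-level checksums at our own granularity, stored next to
  // the object metadata in the kv store and verified on read.
  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Plain bitwise CRC-32C (Castagnoli); a stand-in for a fast implementation.
  uint32_t crc32c(uint32_t crc, const uint8_t* data, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
      crc ^= data[i];
      for (int k = 0; k < 8; ++k)
        crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
    }
    return ~crc;
  }

  // One checksum per csum_block_size bytes of object data; the resulting
  // vector is serialized into the object's metadata record.
  std::vector<uint32_t> checksum_extent(const uint8_t* buf, size_t len,
                                        size_t csum_block_size = 4096) {
    std::vector<uint32_t> out;
    for (size_t off = 0; off < len; off += csum_block_size) {
      size_t n = std::min(csum_block_size, len - off);
      out.push_back(crc32c(0, buf + off, n));
    }
    return out;
  }

On read we recompute over the returned blocks and compare; a mismatch means we
read from a replica instead of silently handing back bad data.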

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-21 Thread Allen Samuels
I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface, 
essentially you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (File Systems) and can never be 
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a 
suboptimal implementation is something that you can live with compared to the 
longer gestation period of the more optimal implementation. Clearly, Sage 
believes that the performance difference is significant or he wouldn't have 
kicked off this discussion in the first place.
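
To spell out what "loosely coupled" means in practice: every object write
becomes two separately journaled commits that can never be made atomic with
each other, roughly as in this hypothetical sketch (not the actual NewStore
code).

  // Sketch of the "loosely coupled" write path: data goes through the file
  // system (with its journal), then metadata goes through the kv store (with
  // its journal).  The two commits cannot be made atomic together, so a WAL
  // and cleanup logic have to paper over the window between them.
  #include <rocksdb/db.h>
  #include <rocksdb/write_batch.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <string>

  bool write_object(rocksdb::DB* db, const std::string& path,
                    const std::string& onode_key, const std::string& onode_val,
                    const char* data, size_t len) {
    // 1) data via the file system: the fs journal pays for the inode update
    int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    if (write(fd, data, len) != (ssize_t)len || fdatasync(fd) != 0) {
      close(fd);
      return false;
    }
    close(fd);
    // 2) metadata via the kv store: the kv journal pays again
    rocksdb::WriteBatch batch;
    batch.Put(onode_key, onode_val);
    rocksdb::WriteOptions wo;
    wo.sync = true;
    return db->Write(wo, &batch).ok();
  }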

While I think we can all agree that writing a full-up KV and raw-block 
ObjectStore is a significant amount of work, I will offer the case that the 
"loosely coupled" scheme may not have as much time-to-market advantage as it 
appears to have. One example: NewStore performance is limited due to bugs in 
XFS that won't be fixed in the field for quite some time (it'll take at least a 
couple of years before a patched version of XFS will be widely deployed at 
customer environments).

Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+tree write-amplification is nearly constant independent of the 
size of data under management. As the KV database gets larger (Since newStore 
is effectively moving the per-file inode into the kv database. Don't forget 
checksums that Sage wants to add :)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs CPU time 
and disk accesses to page in data structure indexes, metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
>
>   1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>   2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> few
> things:
>
>   - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb
> changes land... the kv commit is currently 2-3).  So two people are
> managing metadata, here: the fs managing the file metadata (with its
> own
> journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some 
of them no-ops.

>
>   - On read we have to open files by name, which means traversing the
> fs namespace.  Newstore tries to keep it as flat and simple as
> possible, but at a minimum it is a couple btree lookups.  We'd love to
> use open by handle (which would reduce this to 1 btree traversal), but
> running the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.

>
>   - ...and file systems insist on updating mtime on writes, even when
> it is a overwrite with no

RE: newstore direction

2015-10-21 Thread Allen Samuels
I doubt that NVMKV will be useful for two reasons:

(1) It relies on the unique sparse-mapping addressing capabilities of the 
FusionIO VSL interface, it won't run on standard SSDs
(2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range 
operations on keys). This is pretty much required for deep scrubbing.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, October 20, 2015 6:20 AM
To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com>
Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy 
<somnath@sandisk.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/20/2015 07:30 AM, Sage Weil wrote:
> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>> +1, nowadays K-V DB care more about very small key-value pairs, say
>> several bytes to a few KB, but in SSD case we only care about 4KB or
>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>> vendor are also trying to build this kind of interface, we had a
>> NVM-L library but still under development.
>
> Do you have an NVMKV link?  I see a paper and a stale github repo..
> not sure if I'm looking at the right thing.
>
> My concern with using a key/value interface for the object data is
> that you end up with lots of key/value pairs (e.g., $inode_$offset =
> $4kb_of_data) that is pretty inefficient to store and (depending on
> the
> implementation) tends to break alignment.  I don't think these
> interfaces are targetted toward block-sized/aligned payloads.  Storing
> just the metadata (block allocation map) w/ the kv api and storing the
> data directly on a block/page interface makes more sense to me.
>
> sage

I get the feeling that some of the folks that were involved with nvmkv at 
Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for 
instance.  http://pmem.io might be a better bet, though I haven't looked 
closely at it.

Mark

>
>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>> To: Sage Weil; Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> Hi Sage and Somnath,
>>>In my humble opinion, There is another more aggressive  solution
>>> than raw block device base keyvalue store as backend for
>>> objectstore. The new key value  SSD device with transaction support would 
>>> be  ideal to solve the issues.
>>> First of all, it is raw SSD device. Secondly , It provides key value
>>> interface directly from SSD. Thirdly, it can provide transaction
>>> support, consistency will be guaranteed by hardware device. It
>>> pretty much satisfied all of objectstore needs without any extra
>>> overhead since there is not any extra layer in between device and 
>>> objectstore.
>>> Either way, I strongly support to have CEPH own data format
>>> instead of relying on filesystem.
>>>
>>>Regards,
>>>James
>>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>>> Sent: Monday, October 19, 2015 1:55 PM
>>> To: Somnath Roy
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: RE: newstore direction
>>>
>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>> Sage,
>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>> dbs (for storing allocators and all). The reason is the unknown
>>>> write amps they causes.
>>>
>>> My hope is to keep behind the KeyValueDB interface (and/or change
>>> it as
>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>> btree- based one for high-end flash).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>>
>>>> -Original Message-
>>>> From: ceph-devel-ow...@vger.kernel.org
>>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behal

Re: newstore direction

2015-10-21 Thread Ric Wheeler



On 10/21/2015 06:06 AM, Allen Samuels wrote:

I agree that moving newStore to raw block is going to be a significant 
development effort. But the current scheme of using a KV store combined with a 
normal file system is always going to be problematic (FileStore or NewStore). 
This is caused by the transactional requirements of the ObjectStore interface, 
essentially you need to make transactionally consistent updates to two indexes, 
one of which doesn't understand transactions (File Systems) and can never be 
tightly-connected to the other one.

You'll always be able to make this "loosely coupled" approach work, but it will 
never be optimal. The real question is whether the performance difference of a suboptimal 
implementation is something that you can live with compared to the longer gestation 
period of the more optimal implementation. Clearly, Sage believes that the performance 
difference is significant or he wouldn't have kicked off this discussion in the first 
place.


I think that we need to work with the existing stack - measure and do some 
collaborative analysis - before we throw out decades of work.  Very hard to 
understand why the local file system is a barrier for performance in this case 
when it is not an issue in existing enterprise applications.


We need some deep analysis with some local file system experts thrown in to 
validate the concerns.




While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a 
significant amount of work. I will offer the case that the "loosely couple" 
scheme may not have as much time-to-market advantage as it appears to have. One example: 
NewStore performance is limited due to bugs in XFS that won't be fixed in the field for 
quite some time (it'll take at least a couple of years before a patched version of XFS 
will be widely deployed at customer environments).


Not clear what bugs you are thinking of or why you think fixing bugs will take a 
long time to hit the field in XFS. Red Hat has most of the XFS developers on 
staff and we actively backport fixes and ship them, other distros do as well.


Never seen a "bug" take a couple of years to hit users.

Regards,

Ric



Another example: Sage has just had to substantially rework the journaling code 
of rocksDB.

In short, as you can tell, I'm full throated in favor of going down the optimal 
route.

Internally at Sandisk, we have a KV store that is optimized for flash (it's 
called ZetaScale). We have extended it with a raw block allocator just as Sage 
is now proposing to do. Our internal performance measurements show a 
significant advantage over the current NewStore. That performance advantage 
stems primarily from two things:

(1) ZetaScale uses a B+-tree internally rather than an LSM tree 
(levelDB/RocksDB). LSM trees experience exponential increase in write 
amplification (cost of an insert) as the amount of data under management 
increases. B+tree write-amplification is nearly constant independent of the 
size of data under management. As the KV database gets larger (Since newStore 
is effectively moving the per-file inode into the kv database. Don't forget 
checksums that Sage wants to add :)) this performance delta swamps all others.
(2) Having a KV and a file-system causes a double lookup. This costs CPU time 
and disk accesses to page in data structure indexes, metadata efficiency 
decreases.

You can't avoid (2) as long as you're using a file system.

Yes an LSM tree performs better on HDD than does a B-tree, which is a good 
argument for keeping the KV module pluggable.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samu...@sandisk.com

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler
Sent: Tuesday, October 20, 2015 11:32 AM
To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org
Subject: Re: newstore direction

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure 
that each fs

Re: newstore direction

2015-10-21 Thread Ric Wheeler

On 10/21/2015 04:22 AM, Orit Wasserman wrote:

On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb changes
land... the kv commit is currently 2-3).  So two people are managing
metadata, here: the fs managing the file metadata (with its own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure
that each fsync() takes the same time? Depending on the local FS implementation
of course, but the order of issuing those fsync()'s can effectively make some of
them no-ops.


   - On read we have to open files by name, which means traversing the fs
namespace.  Newstore tries to keep it as flat and simple as possible, but
at a minimum it is a couple btree lookups.  We'd love to use open by
handle (which would reduce this to 1 btree traversal), but running
the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.


   - ...and file systems insist on updating mtime on writes, even when it is
a overwrite with no allocation changes.  (We don't care about mtime.)
O_NOCMTIME patches exist but it is hard to get these past the kernel
brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey database
tricks that we can use here.


   - XFS is (probably) never going to give us data checksums, which we
want desperately.

What is the goal of having the file system do the checksums? How strong do they
need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO (each write
will possibly generate at least one other write to update that new checksum).


But what's the alternative?  My thought is to just bite the bullet and
consume a raw block device directly.  Write an allocator, hopefully keep
it pretty simple, and manage it in kv store along with all of our other
metadata.

The big problem with consuming block devices directly is that you ultimately end
up recreating most of the features that you had in the file system. Even
enterprise databases like Oracle and DB2 have been migrating away from running
on raw block devices in favor of file systems over time.  In effect, you are
looking at making a simple on disk file system which is always easier to start
than it is to get back to a stable, production ready state.

The best performance is still on a block device (SAN).
File systems simplify the operational tasks, which is worth the performance
penalty for a database; I think in a storage system this is not the
case.
In many cases they can use their own file system that is tailored for
the database.


You will have to trust me on this as the Red Hat person who spoke to pretty much 
all of our key customers about local file systems and storage - customers all 
have migrated over to using normal file systems under Oracle/DB2. Typically, 
they use XFS or ext4.  I don't know of any non-standard file systems and have 
only seen one account running on a raw block store in 8 years :)


If you have a pre-allocated file and write using O_DIRECT, your IO path is 
identical in terms of IO's sent to the device.


If we are causing additional IO's, then we really need to spend some time 
talking to the local file system gurus about this in detail.  I can help with 
that conversation.





I think that it might be quicker and more maintainable to spend some time
working with the local file system people (XFS or other) to see if we can
jointly address the concerns you have.

Wins:

   - 2 IOs for most: one to write the data to unused space in the block
device, one to commit our transaction (vs 4+ before).  For overwrites,
we'd have one io to do our write-ahead log (kv journal), then do
the overwrite async (vs 4+ before).

   - No concern about mtime getting in the way

   - Faster reads (no fs lookup)

   - Similarly sized metadata for most objects.  If we assume most objects
are not fragmented, then the metadata to store the block offsets is about
the same size as the metadata to store the filenames we have now.

Problems:

   - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs of
rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
a different pool and those 

Re: newstore direction

2015-10-21 Thread Orit Wasserman
On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
> > The current design is based on two simple ideas:
> >
> >   1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >   2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> >
> >   - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> 
> If all of the fsync()'s fall into the same backing file system, are you sure 
> that each fsync() takes the same time? Depending on the local FS 
> implementation 
> of course, but the order of issuing those fsync()'s can effectively make some 
> of 
> them no-ops.
> 
> >
> >   - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> 
> This seems like a pretty low hurdle to overcome.
> 
> >
> >   - ...and file systems insist on updating mtime on writes, even when it is
> > a overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> 
> Are you using O_DIRECT? Seems like there should be some enterprisey database 
> tricks that we can use here.
> 
> >
> >   - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
> 
> What is the goal of having the file system do the checksums? How strong do 
> they 
> need to be and what size are the chunks?
> 
> If you update this on each IO, this will certainly generate more IO (each 
> write 
> will possibly generate at least one other write to update that new checksum).
> 
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> The big problem with consuming block devices directly is that you ultimately 
> end 
> up recreating most of the features that you had in the file system. Even 
> enterprise databases like Oracle and DB2 have been migrating away from 
> running 
> on raw block devices in favor of file systems over time.  In effect, you are 
> looking at making a simple on disk file system which is always easier to 
> start 
> than it is to get back to a stable, production ready state.

The best performance is still on a block device (SAN).
File systems simplify the operational tasks, which is worth the performance
penalty for a database; I think in a storage system this is not the
case.
In many cases they can use their own file system that is tailored for
the database.

> I think that it might be quicker and more maintainable to spend some time 
> working with the local file system people (XFS or other) to see if we can 
> jointly address the concerns you have.
> >
> > Wins:
> >
> >   - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> >
> >   - No concern about mtime getting in the way
> >
> >   - Faster reads (no fs lookup)
> >
> >   - Similarly sized metadata for most objects.  If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >   - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> >   - We have to write and maintain an allocator.  I'm still optimistic this
> > can be reasonably simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonably
> > sized).  For disk we may need to be moderately clever.
> >
> >   - We'll need a fsck to ensure our internal metadata is consistent.  The
> > good news is it'll just need to validate 

RE: newstore direction

2015-10-20 Thread Dałek , Piotr
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 9:49 PM
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal metadata
> (object metadata, attrs, layout, collection membership, write-ahead logging,
> overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
> [..]
> 
> But what's the alternative?  My thought is to just bite the bullet and consume
> a raw block device directly.  Write an allocator, hopefully keep it pretty
> simple, and manage it in kv store along with all of our other metadata.

This is pretty much reinventing the file system, but...

I actually did something similar for my personal project (e-mail client), 
moving from maildir-like structure (each message was one file) to something 
resembling mbox (one large file per mail folder, containing pre-decoded 
structures for fast and easy access). And this worked out really well, 
especially with searches and bulk processing (filtering by body contents, and 
so on). I don't remember exact figures, but the performance benefit was in at 
least order of magnitude. If huge amounts of small-to-medium (0-128k) objects 
are the target, this is the way to go.

The most serious issue was fragmentation. Since I actually put my box files on 
top of actual FS (here: NTFS), low-level fragmentation was not a problem (each 
message was read and written in one fread/fwrite anyway). High-level 
fragmentation was an issue - each time a message was moved away, it still 
occupied space. To combat this, I wrote a space reclaimer that moved messages 
within box file (consolidated them) and maintained a bitmap of 4k free spaces, 
so I could re-use unused space without taking too much time iterating through 
messages and without calling the reclaimer. Also, the reclaimer was smart enough 
not to move messages one-by-one; instead it loaded up to n messages in at most n 
reads (in the common case it was less than that), wrote them in one call, and did 
its work until some space was actually reclaimed, instead of doing a full garbage 
collection. The machinery was also aware of the fact that messages were (mostly) 
appended to the end of the box, so instead of blindly doing that, it moved the 
end-of-box pointer back once messages at the end of the box were deleted.
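
A minimal version of that 4k free-space bitmap looks something like the sketch
below (illustration only; the real thing also persisted it and coalesced
ranges).

  // Sketch of a 4k-granularity free-space bitmap: find a run of free slots for
  // a new record, mark it used, free the slots again on delete.
  #include <cstdint>
  #include <vector>

  class FreeMap4k {
    std::vector<bool> used;                 // one flag per 4k slot
    static constexpr uint64_t kSlot = 4096;
  public:
    explicit FreeMap4k(uint64_t file_size) : used(file_size / kSlot, false) {}

    // Return byte offset of a free run big enough for 'bytes', or -1 if none.
    int64_t allocate(uint64_t bytes) {
      uint64_t need = (bytes + kSlot - 1) / kSlot;   // round up to 4k slots
      uint64_t run = 0;
      for (uint64_t i = 0; i < used.size(); ++i) {
        run = used[i] ? 0 : run + 1;
        if (run == need) {
          uint64_t start = i + 1 - need;
          for (uint64_t j = start; j <= i; ++j) used[j] = true;
          return start * kSlot;
        }
      }
      return -1;                                     // caller falls back to append
    }

    void release(uint64_t offset, uint64_t bytes) {
      uint64_t start = offset / kSlot;
      uint64_t n = (bytes + kSlot - 1) / kSlot;
      for (uint64_t j = start; j < start + n && j < used.size(); ++j)
        used[j] = false;
    }
  };

Append stays the fast path; the bitmap only exists so holes left by deleted
messages get reused without a full garbage collection.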
The other issue was reliability. Obviously, I had the option of a secondary temp 
file, but still, everything above is doable without that.
Benefits included reduced requirements for metadata storage. Instead of 
generating a unique ID (filename) for each message (apparently, the message-id 
header is not reliable in that regard), I just stored offset and size (8+4 bytes 
per message), which, for 300 thousand messages, worked out to just 3.5MB of 
memory and could be kept in RAM. I/O performance also improved due to a less 
random access pattern (messages were physically close to each other instead of 
being scattered all over the drive).
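
For the record, the index entry was essentially just this (sketch):

  // Sketch of the per-message index entry: offset into the box file plus size.
  #include <cstdint>
  #include <cstdio>

  #pragma pack(push, 1)
  struct MsgRef {
    uint64_t offset;   // where the pre-decoded message starts in the box file
    uint32_t size;     // how many bytes to read
  };
  #pragma pack(pop)

  int main() {
    static_assert(sizeof(MsgRef) == 12, "8 + 4 bytes, packed");
    // 300k messages * 12 bytes ~= 3.5MB of index, easily kept in RAM.
    printf("%zu bytes\n", 300000 * sizeof(MsgRef));
    return 0;
  }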
For Ceph, the benefits could be even greater. I can imagine faster deep scrubs that 
are way more efficient on spinning drives; efficient object storage (no 
per-object fragmentation and less disk-intensive object readahead, maybe with 
better support from hardware); possibly more reliability (when we fsync, we 
actually fsync - we don't get cheated by the underlying FS); and we could get it 
optimized for particular devices (for example, most SSDs suck like a vacuum on 
I/Os below 4k, so we could enforce I/Os of at least 4k).

Just my 0.02$.

With best regards / Pozdrawiam
Piotr Dałek


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Haomai Wang wrote:
> On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil  wrote:
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is
> > a overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> >
> >  - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> This is really a tough decision, although making a block-device-based
> objectstore has never left my mind over the past two years.
> 
> We would be much more concerned about the effectiveness of space utilization
> compared to a local fs, the bugs, and the time it takes to build a tiny
> local filesystem. I'm a little afraid we would get stuck in it.
> 
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> 
> Compared to FileJournal, it seemed the keyvaluedb doesn't play well in the
> WAL area, based on my perf results.

With this change it is close to parity:

https://github.com/facebook/rocksdb/pull/746

> >  - No concern about mtime getting in the way
> >
> >  - Faster reads (no fs lookup)
> >
> >  - Similarly sized metadata for most objects.  If we assume most objects
> > are not fragmented, then the metadata to store the block offsets is about
> > the same size as the metadata to store the filenames we have now.
> >
> > Problems:
> >
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> >
> >  - We have to write and maintain an allocator.  I'm still optimistic this
> > can be reasonably simple, especially for the flash case (where
> > fragmentation isn't such an issue as long as our blocks are reasonably
> > sized).  For disk we may need to be moderately clever.
> >
> >  - We'll need a fsck to ensure our internal metadata is consistent.  The
> > good news is it'll just need to validate what we have stored in the kv
> > store.
> >
> > Other thoughts:
> >
> >  - We might want to consider whether dm-thin or bcache or other block
> > layers might help us with elasticity of file vs block areas.
> >
> >  - Rocksdb can push colder data to a second directory, so we could have a
> > fast ssd primary area (for wal and most metadata) and a second hdd
> > directory for stuff it has to push off.  Then have a conservative amount
> > of file space on the hdd.  If our block fills up, use the existing file
> > mechanism to put data there too.  (But then we have to maintain both the
> > current kv + file approach and not go all-in on kv + block.)
> 
> A complex way...
> 
> Actually I would like to pursue a FileStore2 implementation, which means we
> still use FileJournal (or something like it), but we need more memory to
> keep metadata/xattrs and use aio+dio to flush to disk. A userspace
> pagecache would need to be implemented. Then we can skip the journal for a
> full write; because the OSD has pg isolation, we could put a barrier on a
> single pg when skipping the journal. @Sage, are there other concerns with
> filestore skipping the journal?
> 
> In a word, I like the model that filestore owns, but 

RE: newstore direction

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
> +1, nowadays K-V DB care more about very small key-value pairs, say 
> several bytes to a few KB, but in SSD case we only care about 4KB or 
> 8KB. In this way, NVMKV is a good design and seems some of the SSD 
> vendor are also trying to build this kind of interface, we had a NVM-L 
> library but still under development.

Do you have an NVMKV link?  I see a paper and a stale github repo.. not 
sure if I'm looking at the right thing.

My concern with using a key/value interface for the object data is that 
you end up with lots of key/value pairs (e.g., $inode_$offset = 
$4kb_of_data) that are pretty inefficient to store and (depending on the 
implementation) tend to break alignment.  I don't think these interfaces 
are targeted toward block-sized/aligned payloads.  Storing just the 
metadata (block allocation map) w/ the kv api and storing the data 
directly on a block/page interface makes more sense to me.
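
For illustration, a minimal sketch of that split (the type names are 
hypothetical, not the actual newstore structures): one small kv value per 
object holding its block allocation map, with the payload itself written at 
block-aligned offsets on the device.

    // Hypothetical sketch, not the real newstore types: one small kv value
    // per object ("O_<oid>" -> serialized Onode) instead of one kv pair per
    // 4KB of data ("<oid>_<offset>" -> 4KB blob).
    #include <cstdint>
    #include <map>
    #include <string>

    struct Extent {
      uint64_t offset;   // byte offset on the raw block device
      uint64_t length;   // multiple of the block size
    };

    struct Onode {
      uint64_t size = 0;                          // logical object size
      std::map<uint64_t, Extent> extent_map;      // logical offset -> physical extent
      std::map<std::string, std::string> xattrs;  // attrs stay in the kv store too
    };

    // The kv store only ever sees metadata-sized values; the 4KB/8KB payloads
    // go straight to block-aligned space on the device.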

sage


> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> > Sent: Tuesday, October 20, 2015 6:21 AM
> > To: Sage Weil; Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive  solution than raw
> > block device base keyvalue store as backend for objectstore. The new key
> > value  SSD device with transaction support would be  ideal to solve the 
> > issues.
> > First of all, it is raw SSD device. Secondly , It provides key value 
> > interface
> > directly from SSD. Thirdly, it can provide transaction support, consistency 
> > will
> > be guaranteed by hardware device. It pretty much satisfied all of 
> > objectstore
> > needs without any extra overhead since there is not any extra layer in
> > between device and objectstore.
> >Either way, I strongly support to have CEPH own data format instead of
> > relying on filesystem.
> > 
> >   Regards,
> >   James
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v
> > > dbs (for storing allocators and all). The reason is the unknown write
> > > amps they causes.
> > 
> > My hope is to keep this behind the KeyValueDB interface (and/or change it as
> > appropriate) so that other backends can be easily swapped in (e.g. a btree-
> > based one for high-end flash).
> > 
> > sage
> > 
> > 
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > >
> > > -Original Message-
> > > From: ceph-devel-ow...@vger.kernel.org
> > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > >  1) a key/value interface is better way to manage all of our internal
> > > metadata (object metadata, attrs, layout, collection membership,
> > > write-ahead logging, overlay data, etc.)
> > >
> > >  2) a file system is well suited for storage object data (as files).
> > >
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > > few
> > > things:
> > >
> > >  - We currently write the data to the file, fsync, then commit the kv
> > > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > > journal, one for the kv txn to commit (at least once my rocksdb
> > > changes land... the kv commit is currently 2-3).  So two people are
> > > managing metadata, here: the fs managing the file metadata (with its
> > > own
> > > journal) and the kv backend (with its journal).
> > >
> > >  - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but 
> > at a
> > minim

RE: newstore direction

2015-10-20 Thread Sage Weil
On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive solution than 
> raw block device base keyvalue store as backend for objectstore. The new 
> key value SSD device with transaction support would be ideal to solve 
> the issues. First of all, it is raw SSD device. Secondly , It provides 
> key value interface directly from SSD. Thirdly, it can provide 
> transaction support, consistency will be guaranteed by hardware device. 
> It pretty much satisfied all of objectstore needs without any extra 
> overhead since there is not any extra layer in between device and 
> objectstore.

Are you talking about open channel SSDs?  Or something else?  Everything 
I'm familiar with that is currently shipping is exposing a vanilla block 
interface (conventional SSDs) that hides all of that or NVMe (which isn't 
much better).

If there is a low-level KV interface we can consume that would be 
great--especially if we can glue it to our KeyValueDB abstract API.  Even 
so, we need to make sure that the object *data* also has an efficient API 
we can utilize that efficiently handles block-sized/aligned data.
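
If such a device ever materializes, the glue could be a thin adapter behind 
the same abstraction.  The sketch below is purely illustrative: the kvssd 
namespace is a made-up stand-in for a vendor library (mocked with an in-memory 
map), and the adapter's method names only approximate the KeyValueDB interface 
rather than reproducing it.

    // Purely illustrative: "kvssd" is a made-up stand-in for a vendor
    // key/value SSD library, mocked here with an in-memory map so the
    // sketch is self-contained.
    #include <map>
    #include <string>

    namespace kvssd {
      static std::map<std::string, std::string> dev;   // pretend device
      inline void put(const std::string& k, const std::string& v) { dev[k] = v; }
      inline bool get(const std::string& k, std::string* v) {
        auto it = dev.find(k);
        if (it == dev.end()) return false;
        *v = it->second;
        return true;
      }
    }

    // Adapter in the spirit of the KeyValueDB abstraction (names approximate).
    class KVSSDAdapter {
     public:
      int get(const std::string& prefix, const std::string& key, std::string* out) {
        return kvssd::get(prefix + "." + key, out) ? 0 : -1;
      }
      int submit_transaction(const std::map<std::string, std::string>& sets) {
        // A real device would make this batch atomic in hardware; the mock
        // just applies the writes.
        for (const auto& kv : sets)
          kvssd::put(kv.first, kv.second);
        return 0;
      }
    };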

sage


>Either way, I strongly support to have CEPH own data format instead 
> of relying on filesystem.
> 
>   Regards,
>   James
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get 
> > rid of this filesystem overhead (which I am in process of measuring). 
> > Also, it will be good if we can eliminate the dependency on the k/v 
> > dbs (for storing allocators and all). The reason is the unknown write 
> > amps they causes.
> 
> My hope is to keep this behind the KeyValueDB interface (and/or change it as
> appropriate) so that other backends can be easily swapped in (e.g. a 
> btree-based one for high-end flash).
> 
> sage
> 
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org 
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> > 
> > The current design is based on two simple ideas:
> > 
> >  1) a key/value interface is better way to manage all of our internal 
> > metadata (object metadata, attrs, layout, collection membership, 
> > write-ahead logging, overlay data, etc.)
> > 
> >  2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> > few
> > things:
> > 
> >  - We currently write the data to the file, fsync, then commit the kv 
> > transaction.  That's at least 3 IOs: one for the data, one for the fs 
> > journal, one for the kv txn to commit (at least once my rocksdb 
> > changes land... the kv commit is currently 2-3).  So two people are 
> > managing metadata, here: the fs managing the file metadata (with its 
> > own
> > journal) and the kv backend (with its journal).
> > 
> >  - On read we have to open files by name, which means traversing the fs 
> > namespace.  Newstore tries to keep it as flat and simple as possible, but 
> > at a minimum it is a couple btree lookups.  We'd love to use open by handle 
> > (which would reduce this to 1 btree traversal), but running the daemon as 
> > ceph and not root makes that hard...
> > 
> >  - ...and file systems insist on updating mtime on writes, even when it is 
> > a overwrite with no allocation changes.  (We don't care about mtime.) 
> > O_NOCMTIME patches exist but it is hard to get these past the kernel 
> > brainfreeze.
> > 
> >  - XFS is (probably) never going to give us data checksums, which we 
> > want desperately.
> > 
> > But what's the alternative?  My thought is to just bite the bullet and 
> > consume a raw block device directly.  Write an allocator, hopefully keep it 
> > pretty simple, and manage it in kv store along with all of our other 
> > metadata.
> > 
> > Wins:
> > 
> >  - 2 IOs for most: one to write the data to unused space in the block 
> > device, one to commit our transaction (vs 4+ before).  For overwrites, we'd 
> > have one io to do our write-ahead log

Re: newstore direction

2015-10-20 Thread Ric Wheeler

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

  1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

  2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

  - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb changes
land... the kv commit is currently 2-3).  So two people are managing
metadata, here: the fs managing the file metadata (with its own
journal) and the kv backend (with its journal).


If all of the fsync()'s fall into the same backing file system, are you sure 
that each fsync() takes the same time? Depending on the local FS implementation 
of course, but the order of issuing those fsync()'s can effectively make some of 
them no-ops.
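
A minimal sketch of that batching idea in plain POSIX terms (nothing 
Ceph-specific here): issue a run of writes, then pay the fsync() latency once 
for the whole batch.

    // Minimal POSIX sketch: many writes, one fsync for the whole batch.
    #include <fcntl.h>
    #include <unistd.h>
    #include <string>
    #include <vector>

    bool write_batch(const std::string& path, const std::vector<std::string>& bufs) {
      int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
      if (fd < 0)
        return false;
      for (const auto& b : bufs) {
        if (::write(fd, b.data(), b.size()) != (ssize_t)b.size()) {
          ::close(fd);
          return false;
        }
      }
      // One fsync covers everything appended above; ordering within the
      // batch is preserved by the single append stream.
      int r = ::fsync(fd);
      ::close(fd);
      return r == 0;
    }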




  - On read we have to open files by name, which means traversing the fs
namespace.  Newstore tries to keep it as flat and simple as possible, but
at a minimum it is a couple btree lookups.  We'd love to use open by
handle (which would reduce this to 1 btree traversal), but running
the daemon as ceph and not root makes that hard...


This seems like a pretty low hurdle to overcome.



  - ...and file systems insist on updating mtime on writes, even when it is
a overwrite with no allocation changes.  (We don't care about mtime.)
O_NOCMTIME patches exist but it is hard to get these past the kernel
brainfreeze.


Are you using O_DIRECT? Seems like there should be some enterprisey database 
tricks that we can use here.




  - XFS is (probably) never going to give us data checksums, which we
want desperately.


What is the goal of having the file system do the checksums? How strong do they 
need to be and what size are the chunks?


If you update this on each IO, this will certainly generate more IO (each write 
will possibly generate at least one other write to update that new checksum).




But what's the alternative?  My thought is to just bite the bullet and
consume a raw block device directly.  Write an allocator, hopefully keep
it pretty simple, and manage it in kv store along with all of our other
metadata.


The big problem with consuming block devices directly is that you ultimately end 
up recreating most of the features that you had in the file system. Even 
enterprise databases like Oracle and DB2 have been migrating away from running 
on raw block devices in favor of file systems over time.  In effect, you are 
looking at making a simple on disk file system which is always easier to start 
than it is to get back to a stable, production ready state.


I think that it might be quicker and more maintainable to spend some time 
working with the local file system people (XFS or other) to see if we can 
jointly address the concerns you have.


Wins:

  - 2 IOs for most: one to write the data to unused space in the block
device, one to commit our transaction (vs 4+ before).  For overwrites,
we'd have one io to do our write-ahead log (kv journal), then do
the overwrite async (vs 4+ before).

  - No concern about mtime getting in the way

  - Faster reads (no fs lookup)

  - Similarly sized metadata for most objects.  If we assume most objects
are not fragmented, then the metadata to store the block offsets is about
the same size as the metadata to store the filenames we have now.

Problems:

  - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs of
rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
a different pool and those aren't currently fungible.

  - We have to write and maintain an allocator.  I'm still optimistic this
can be reasonably simple, especially for the flash case (where
fragmentation isn't such an issue as long as our blocks are reasonably
sized).  For disk we may need to be moderately clever.

  - We'll need a fsck to ensure our internal metadata is consistent.  The
good news is it'll just need to validate what we have stored in the kv
store.

Other thoughts:

  - We might want to consider whether dm-thin or bcache or other block
layers might help us with elasticity of file vs block areas.

  - Rocksdb can push colder data to a second directory, so we could have a
fast ssd primary area (for wal and most metadata) and a second hdd
directory for stuff it has to push off.  Then have a conservative amount
of file space on the hdd.  If our block fills up, use the existing file
mechanism to put data there too.  (But then we have to maintain both the
current kv + file approach and not go all-in on kv + block.)

Thoughts?
sage
--


I really hate the 

Re: newstore direction

2015-10-20 Thread kernel neophyte
On Tue, Oct 20, 2015 at 6:19 AM, Mark Nelson <mnel...@redhat.com> wrote:
> On 10/20/2015 07:30 AM, Sage Weil wrote:
>>
>> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:
>>>
>>> +1, nowadays K-V DB care more about very small key-value pairs, say
>>> several bytes to a few KB, but in SSD case we only care about 4KB or
>>> 8KB. In this way, NVMKV is a good design and seems some of the SSD
>>> vendor are also trying to build this kind of interface, we had a NVM-L
>>> library but still under development.
>>
>>
>> Do you have an NVMKV link?  I see a paper and a stale github repo.. not
>> sure if I'm looking at the right thing.
>>
>> My concern with using a key/value interface for the object data is that
>> you end up with lots of key/value pairs (e.g., $inode_$offset =
>> $4kb_of_data) that are pretty inefficient to store and (depending on the
>> implementation) tend to break alignment.  I don't think these interfaces
>> are targeted toward block-sized/aligned payloads.  Storing just the
>> metadata (block allocation map) w/ the kv api and storing the data
>> directly on a block/page interface makes more sense to me.
>>
>> sage
>
>
> I get the feeling that some of the folks that were involved with nvmkv at
> Fusion IO have left.  Nisha Talagala is now out at Parallel Systems for
> instance.  http://pmem.io might be a better bet, though I haven't looked
> closely at it.
>

IMO pmem.io is more suited for SCM (Storage Class Memory) than for SSDs.

If Newstore is targeted towards production deployments (eventually
replacing FileStore someday), then IMO I agree with Sage, i.e. rely on
a file system for doing block allocation.

-Neo


> Mark
>
>
>>
>>
>>>> -Original Message-
>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
>>>> Sent: Tuesday, October 20, 2015 6:21 AM
>>>> To: Sage Weil; Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> Hi Sage and Somnath,
>>>>In my humble opinion, There is another more aggressive  solution than
>>>> raw
>>>> block device base keyvalue store as backend for objectstore. The new key
>>>> value  SSD device with transaction support would be  ideal to solve the
>>>> issues.
>>>> First of all, it is raw SSD device. Secondly , It provides key value
>>>> interface
>>>> directly from SSD. Thirdly, it can provide transaction support,
>>>> consistency will
>>>> be guaranteed by hardware device. It pretty much satisfied all of
>>>> objectstore
>>>> needs without any extra overhead since there is not any extra layer in
>>>> between device and objectstore.
>>>> Either way, I strongly support to have CEPH own data format instead
>>>> of
>>>> relying on filesystem.
>>>>
>>>>Regards,
>>>>James
>>>>
>>>> -Original Message-
>>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>>> ow...@vger.kernel.org] On Behalf Of Sage Weil
>>>> Sent: Monday, October 19, 2015 1:55 PM
>>>> To: Somnath Roy
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: RE: newstore direction
>>>>
>>>> On Mon, 19 Oct 2015, Somnath Roy wrote:
>>>>>
>>>>> Sage,
>>>>> I fully support that.  If we want to saturate SSDs , we need to get
>>>>> rid of this filesystem overhead (which I am in process of measuring).
>>>>> Also, it will be good if we can eliminate the dependency on the k/v
>>>>> dbs (for storing allocators and all). The reason is the unknown write
>>>>> amps they causes.
>>>>
>>>>
>>>> My hope is to keep this behind the KeyValueDB interface (and/or change it
>>>> as
>>>> appropriate) so that other backends can be easily swapped in (e.g. a
>>>> btree-
>>>> based one for high-end flash).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>>
>>>>> -Original Message-
>>>>> From: ceph-devel-ow...@vger.kernel.org
>>>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
>>>>> Sent: Monday, October 19, 2015 12:49 PM
>>

Re: newstore direction

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Ric Wheeler wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
> > The current design is based on two simple ideas:
> > 
> >   1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> > 
> >   2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> > things:
> > 
> >   - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb changes
> > land... the kv commit is currently 2-3).  So two people are managing
> > metadata, here: the fs managing the file metadata (with its own
> > journal) and the kv backend (with its journal).
> 
> If all of the fsync()'s fall into the same backing file system, are you sure
> that each fsync() takes the same time? Depending on the local FS
> implementation of course, but the order of issuing those fsync()'s can
> effectively make some of them no-ops.

Surely, yes, but the fact remains we are maintaining two journals: one 
internal to the fs that manages the allocation metadata, and one layered 
on top that handles the kv store's write stream.  The lower bound on any 
write is 3 IOs (unless we're talking about a COW fs).

> >   - On read we have to open files by name, which means traversing the fs
> > namespace.  Newstore tries to keep it as flat and simple as possible, but
> > at a minimum it is a couple btree lookups.  We'd love to use open by
> > handle (which would reduce this to 1 btree traversal), but running
> > the daemon as ceph and not root makes that hard...
> 
> This seems like a pretty low hurdle to overcome.

I wish you luck convincing upstream to allow unprivileged access to 
open_by_handle or the XFS ioctl.  :)  But even if we had that, any object 
access requires multiple metadata lookups: one in our kv db, and a second 
to get the inode for the backing file.  Again, there's an unnecessary 
lower bound on the number of IOs needed to access a cold object.

> >   - ...and file systems insist on updating mtime on writes, even when it is
> > a overwrite with no allocation changes.  (We don't care about mtime.)
> > O_NOCMTIME patches exist but it is hard to get these past the kernel
> > brainfreeze.
> 
> Are you using O_DIRECT? Seems like there should be some enterprisey database
> tricks that we can use here.

It's not about the data path, but avoiding the useless bookkeeping 
the file system is doing that we don't want or need.  See the recent 
reception of Zach's O_NOCMTIME patches on linux-fsdevel:

http://marc.info/?t=14309496981=1=2

I'm generally an optimist when it comes to introducing new APIs upstream, 
but I still found this to be an unbelievably frustrating exchange.

> >   - XFS is (probably) never going to give us data checksums, which we
> > want desperately.
> 
> What is the goal of having the file system do the checksums? How strong do
> they need to be and what size are the chunks?
> 
> If you update this on each IO, this will certainly generate more IO (each
> write will possibly generate at least one other write to update that new
> checksum).

Not if we keep the checksums with the allocation metadata, in the 
onode/inode, which we're already doing an IO to persist.  But whether that is 
practical depends on the granularity (4KB or 16K or 128K or ...), which may 
in turn depend on the object (RBD block that'll service random 4K reads 
and writes?  or RGW fragment that is always written sequentially?).  I'm 
highly skeptical we'd ever get anything from a general-purpose file system 
that would work well here (if anything at all).
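
As a rough sketch of what keeping checksums with the allocation metadata could 
look like (hypothetical structures; zlib's crc32 is used only as a convenient 
stand-in for whatever checksum we'd actually pick): the chunk size is chosen 
per object and the per-chunk values ride along in the onode we already persist.

    // Sketch only: per-chunk crc32 values kept alongside the allocation
    // metadata, with the chunk size chosen per object (small for random-IO
    // RBD blocks, much coarser for write-once RGW fragments).
    #include <zlib.h>
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct ChunkChecksums {
      uint32_t chunk_size;            // e.g. 4096 or 131072, per object
      std::vector<uint32_t> crcs;     // one value per chunk, persisted with the onode
    };

    void recompute_crcs(ChunkChecksums& cs, const std::vector<unsigned char>& data) {
      cs.crcs.clear();
      for (size_t off = 0; off < data.size(); off += cs.chunk_size) {
        size_t len = std::min<size_t>(cs.chunk_size, data.size() - off);
        cs.crcs.push_back((uint32_t)crc32(0L, data.data() + off, (uInt)len));
      }
    }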

> > But what's the alternative?  My thought is to just bite the bullet and
> > consume a raw block device directly.  Write an allocator, hopefully keep
> > it pretty simple, and manage it in kv store along with all of our other
> > metadata.
> 
> The big problem with consuming block devices directly is that you ultimately
> end up recreating most of the features that you had in the file system. Even
> enterprise databases like Oracle and DB2 have been migrating away from running
> on raw block devices in favor of file systems over time.  In effect, you are
> looking at making a simple on disk file system which is always easier to start
> than it is to get back to a stable, production ready state.

This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had 
everything we were implementing and more: mainly, copy on write and data 
checksums.  But in practice the fact that it's general purpose means it 
targets very different workloads and APIs than what we need.

Now that I've realized the POSIX file namespace is a bad fit for what we 
need and opted to manage that directly, things are 

Re: newstore direction

2015-10-20 Thread Martin Millnert
Adding to this,

On Tue, 2015-10-20 at 05:34 -0700, Sage Weil wrote:
> On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive solution than 
> > raw block device base keyvalue store as backend for objectstore. The new 
> > key value SSD device with transaction support would be ideal to solve 
> > the issues. First of all, it is raw SSD device. Secondly , It provides 
> > key value interface directly from SSD. Thirdly, it can provide 
> > transaction support, consistency will be guaranteed by hardware device. 
> > It pretty much satisfied all of objectstore needs without any extra 
> > overhead since there is not any extra layer in between device and 
> > objectstore.
> 
> Are you talking about open channel SSDs?  Or something else?  Everything 
> I'm familiar with that is currently shipping is exposing a vanilla block 
> interface (conventional SSDs) that hides all of that or NVMe (which isn't 
> much better).
> 
> If there is a low-level KV interface we can consume that would be 
> great--especially if we can glue it to our KeyValueDB abstract API.  Even 
> so, we need to make sure that the object *data* also has an efficient API 
> we can utilize that efficiently handles block-sized/aligned data.

If there's a way to efficiently utilize more generic NVRAM-based block
devices for quick metadata ops such that payload data can fly without
much delay, I'd be quite happy. 

Also, a current concern of mine is backups in some fashion of the
metadata, given risk for (human configuration error||device
malfunction)&&(cluster wide power outage).
Some type of flushing to underlying consistent media, and/or
snapshot-like backups.

As long as the constructs aren't too exotic,  perhaps this could be
addressed using standard Linux FS or device mapper code (bcache, or
other)

Not sure how popular journals on NVRAM is. But here's one user at least.

/M


> sage
> 
> 
> >Either way, I strongly support to have CEPH own data format instead 
> > of relying on filesystem.
> > 
> >   Regards,
> >   James
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org 
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get 
> > > rid of this filesystem overhead (which I am in process of measuring). 
> > > Also, it will be good if we can eliminate the dependency on the k/v 
> > > dbs (for storing allocators and all). The reason is the unknown write 
> > > amps they causes.
> > 
> > My hope is to keep this behind the KeyValueDB interface (and/or change it as
> > appropriate) so that other backends can be easily swapped in (e.g. a 
> > btree-based one for high-end flash).
> > 
> > sage
> > 
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -Original Message-
> > > From: ceph-devel-ow...@vger.kernel.org 
> > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > > 
> > > The current design is based on two simple ideas:
> > > 
> > >  1) a key/value interface is better way to manage all of our internal 
> > > metadata (object metadata, attrs, layout, collection membership, 
> > > write-ahead logging, overlay data, etc.)
> > > 
> > >  2) a file system is well suited for storage object data (as files).
> > > 
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> > > few
> > > things:
> > > 
> > >  - We currently write the data to the file, fsync, then commit the kv 
> > > transaction.  That's at least 3 IOs: one for the data, one for the fs 
> > > journal, one for the kv txn to commit (at least once my rocksdb 
> > > changes land... the kv commit is currently 2-3).  So two people are 
> > > managing metadata, here: the fs managing the file metadata (with its 
> > > own
> > > journal) and the kv backend (with its journal).
> > > 
> > >  - On read we have to open files by name, which means traversing the fs 
> > > namespace.  Newstore tries to keep it as flat and simple as possibl

Re: newstore direction

2015-10-20 Thread Gregory Farnum
On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil  wrote:
> On Tue, 20 Oct 2015, Ric Wheeler wrote:
>> The big problem with consuming block devices directly is that you ultimately
>> end up recreating most of the features that you had in the file system. Even
>> enterprise databases like Oracle and DB2 have been migrating away from 
>> running
>> on raw block devices in favor of file systems over time.  In effect, you are
>> looking at making a simple on disk file system which is always easier to 
>> start
>> than it is to get back to a stable, production ready state.
>
> This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
> everything we were implementing and more: mainly, copy on write and data
> checksums.  But in practice the fact that it's general purpose means it
> targets very different workloads and APIs than what we need.

Try 7 years since ebofs...
That's one of my concerns, though. You ditched ebofs once already
because it had metastasized into an entire FS, and had reached its
limits of maintainability. What makes you think a second time through
would work better? :/

On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil  wrote:
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

I can't work this one out. If you're doing one write for the data and
one for the kv journal (which is on another filesystem), how does the
commit sequence work that it's only 2 IOs instead of the same 3 we
already have? Or are you planning to ditch the LevelDB/RocksDB store
for our journaling and just use something within the block layer?


If we do want to go down this road, we shouldn't need to write an
allocator from scratch. I don't remember exactly which ones it is but
we've read/seen at least a few storage papers where people have reused
existing allocators  — I think the one from ext2? And somebody managed
to get it running in userspace.

Of course, then we also need to figure out how to get checksums on the
block data, since if we're going to put in the effort to reimplement
this much of the stack we'd better get our full data integrity
guarantees along with it!

On Tue, Oct 20, 2015 at 1:00 PM, Sage Weil  wrote:
> On Tue, 20 Oct 2015, John Spray wrote:
>> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  wrote:
>> >  - We have to size the kv backend storage (probably still an XFS
>> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> > SSD!) so it won't matter.  But what happens when we are storing gobs of
>> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> > a different pool and those aren't currently fungible.
>>
>> This is the concerning bit for me -- the other parts one "just" has to
>> get the code right, but this problem could linger and be something we
>> have to keep explaining to users indefinitely.  It reminds me of cases
>> in other systems where users had to make an educated guess about inode
>> size up front, depending on whether you're expecting to efficiently
>> store a lot of xattrs.
>>
>> In practice it's rare for users to make these kinds of decisions well
>> up-front: it really needs to be adjustable later, ideally
>> automatically.  That could be pretty straightforward if the KV part
>> was stored directly on block storage, instead of having XFS in the
>> mix.  I'm not quite up with the state of the art in this area: are
>> there any reasonable alternatives for the KV part that would consume
>> some defined range of a block device from userspace, instead of
>> sitting on top of a filesystem?
>
> I agree: this is my primary concern with the raw block approach.
>
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
>
> I see two basic options:
>
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).

This seems like the obviously correct move to me? Except we might want
to include the rocksdb store on flash instead of hard drives, which
means maybe we do want some unified storage system which can handle
multiple physical storage devices as a single piece of storage space.
(Not that any of those exist in "almost done" hell, or that we're
going through requirements expansion or anything!)
-Greg

Re: newstore direction

2015-10-20 Thread Ric Wheeler

On 10/20/2015 03:44 PM, Sage Weil wrote:

On Tue, 20 Oct 2015, Ric Wheeler wrote:

On 10/19/2015 03:49 PM, Sage Weil wrote:

The current design is based on two simple ideas:

   1) a key/value interface is better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

   2) a file system is well suited for storage object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

   - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb changes
land... the kv commit is currently 2-3).  So two people are managing
metadata, here: the fs managing the file metadata (with its own
journal) and the kv backend (with its journal).

If all of the fsync()'s fall into the same backing file system, are you sure
that each fsync() takes the same time? Depending on the local FS
implementation of course, but the order of issuing those fsync()'s can
effectively make some of them no-ops.

Surely, yes, but the fact remains we are maintaining two journals: one
internal to the fs that manages the allocation metadata, and one layered
on top that handles the kv store's write stream.  The lower bound on any
write is 3 IOs (unless we're talking about a COW fs).


The way storage devices work means that if we can batch these in some way, we 
might get 3 IO's that land in the cache (even for spinning drives) and only one 
that is followed by a cache flush.


The first three IO's are quite quick, you don't need to write through to the 
platter. The cost is mostly in the fsync() call which waits until storage 
destages the cache to the platter.


With SSD's, we have some different considerations.




   - On read we have to open files by name, which means traversing the fs
namespace.  Newstore tries to keep it as flat and simple as possible, but
at a minimum it is a couple btree lookups.  We'd love to use open by
handle (which would reduce this to 1 btree traversal), but running
the daemon as ceph and not root makes that hard...

This seems like a pretty low hurdle to overcome.

I wish you luck convincing upstream to allow unprivileged access to
open_by_handle or the XFS ioctl.  :)  But even if we had that, any object
access requires multiple metadata lookups: one in our kv db, and a second
to get the inode for the backing file.  Again, there's an unnecessary
lower bound on the number of IOs needed to access a cold object.


We should dig into what this actually means when you can do open by handle. If 
you cache the inode (i.e., skip the directory traversal), you still need to 
figure out the mapping back to an actual block on the storage device. Not clear 
to me that you need more IO's with the file system doing this or by having a 
btree on disk - both will require IO.





   - ...and file systems insist on updating mtime on writes, even when it is
a overwrite with no allocation changes.  (We don't care about mtime.)
O_NOCMTIME patches exist but it is hard to get these past the kernel
brainfreeze.

Are you using O_DIRECT? Seems like there should be some enterprisey database
tricks that we can use here.

It's not about the data path, but avoiding the useless bookkeeping
the file system is doing that we don't want or need.  See the recent
reception of Zach's O_NOCMTIME patches on linux-fsdevel:

http://marc.info/?t=14309496981=1=2

I'm generally an optimist when it comes to introducing new APIs upstream,
but I still found this to be an unbelievably frustrating exchange.


We should talk more about this with the local FS people. Might be other ways to 
solve this.





   - XFS is (probably) never going to give us data checksums, which we
want desperately.

What is the goal of having the file system do the checksums? How strong do
they need to be and what size are the chunks?

If you update this on each IO, this will certainly generate more IO (each
write will possibly generate at least one other write to update that new
checksum).

Not if we keep the checksums with the allocation metadata, in the
onode/inode, which we're already doing an IO to persist.  But whether that is
practical depends on the granularity (4KB or 16K or 128K or ...), which may
in turn depend on the object (RBD block that'll service random 4K reads
and writes?  or RGW fragment that is always written sequentially?).  I'm
highly skeptical we'd ever get anything from a general-purpose file system
that would work well here (if anything at all).


XFS (or device mapper) could also store checksums per block. I think that the 
T10 DIF/DIX bits work for enterprise databases (again, bypassing the file 
system). Might be interesting to see if we could put the checksums into dm-thin.





But what's the alternative?  My thought is to just bite the bullet and
consume a raw 

Re: newstore direction

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, Gregory Farnum wrote:
> On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil  wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >> The big problem with consuming block devices directly is that you 
> >> ultimately
> >> end up recreating most of the features that you had in the file system. 
> >> Even
> >> enterprise databases like Oracle and DB2 have been migrating away from 
> >> running
> >> on raw block devices in favor of file systems over time.  In effect, you 
> >> are
> >> looking at making a simple on disk file system which is always easier to 
> >> start
> >> than it is to get back to a stable, production ready state.
> >
> > This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
> > everything we were implementing and more: mainly, copy on write and data
> > checksums.  But in practice the fact that it's general purpose means it
> > targets very different workloads and APIs than what we need.
> 
> Try 7 years since ebofs...

Sigh...

> That's one of my concerns, though. You ditched ebofs once already
> because it had metastasized into an entire FS, and had reached its
> limits of maintainability. What makes you think a second time through
> would work better? :/

A fair point, and I've given this some thought:

1) We know a *lot* more about our workload than I did in 2005.  The things 
I was worrying about then (fragmentation, mainly) are much easier to 
address now, where we have hints from rados and understand what the write 
patterns look like in practice (randomish 4k-128k ios for rbd, sequential 
writes for rgw, and the cephfs wildcard).

2) Most of the ebofs effort was around doing copy-on-write btrees (with 
checksums) and orchestrating commits.  Here our job is *vastly* simplified 
by assuming the existence of a transactional key/value store.  If you look 
at newstore today, we're already half-way through dealing with the 
complexity of doing allocations... we're essentially "allocating" blocks 
that are 1 MB files on XFS, managing that metadata, and overwriting or 
replacing those blocks on write/truncate/clone.  By the time we add in an 
allocator (get_blocks(len), free_block(offset, len)) and rip out all the 
file handling fiddling (like fsync workqueues, file id allocator, 
file truncation fiddling, etc.) we'll probably have something working 
with about the same amount of code we have now.  (Of course, that'll 
grow as we get more sophisticated, but that'll happen either way.)

> On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil  wrote:
> >  - 2 IOs for most: one to write the data to unused space in the block
> > device, one to commit our transaction (vs 4+ before).  For overwrites,
> > we'd have one io to do our write-ahead log (kv journal), then do
> > the overwrite async (vs 4+ before).
> 
> I can't work this one out. If you're doing one write for the data and
> one for the kv journal (which is on another filesystem), how does the
> commit sequence work that it's only 2 IOs instead of the same 3 we
> already have? Or are you planning to ditch the LevelDB/RocksDB store
> for our journaling and just use something within the block layer?

Now:
1 io  to write a new file
  1-2 ios to sync the fs journal (commit the inode, alloc change) 
  (I see 2 journal IOs on XFS and only 1 on ext4...)
1 io  to commit the rocksdb journal (currently 3, but will drop to 
  1 with xfs fix and my rocksdb change)

With block:
1 io to write to block device
1 io to commit to rocksdb journal
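
A rough mock of that two-IO path, with in-memory stand-ins for the device and 
the kv store (the real path would be an aio write plus a rocksdb transaction; 
everything here is just a sketch):

    // Mock of the two-IO path: (1) write the bytes into space nothing
    // references yet, (2) one kv commit publishes the extent.  A crash
    // before step 2 leaves the object absent and the space still free.
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct MockStore {
      std::vector<unsigned char> device;        // stand-in for the raw block device
      std::map<std::string, std::string> kv;    // stand-in for the rocksdb metadata
      uint64_t next_free = 0;                   // trivial bump allocator

      void write_new_object(const std::string& oid,
                            const std::vector<unsigned char>& data) {
        uint64_t off = next_free;               // IO 1: data into unused space
        next_free += data.size();
        if (device.size() < next_free)
          device.resize(next_free);
        std::copy(data.begin(), data.end(), device.begin() + off);

        // IO 2: single kv commit records "logical object -> extent".
        kv["O_" + oid] = std::to_string(off) + "+" + std::to_string(data.size());
      }
    };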

> If we do want to go down this road, we shouldn't need to write an
> allocator from scratch. I don't remember exactly which ones it is but
> we've read/seen at least a few storage papers where people have reused
> existing allocators -- I think the one from ext2? And somebody managed
> to get it running in userspace.

Maybe, but the real win is when we combine the allocator state update with 
our kv transaction.  Even if we adopt an existing algorithm we'll need to 
do some significant rejiggering to persist it in the kv store.

My thought is to start with something simple that works (e.g., linear sweep 
over free space, simple interval_set<>-style freelist) and once it works 
look at existing state of the art for a clever v2.
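
A minimal sketch of what that v1 could look like: free space kept as 
offset->length intervals, get_blocks() does a linear sweep for the first 
extent that fits, and free_block() coalesces with its neighbours.  
(Persistence into the kv store is omitted; this is not the real allocator.)

    // Minimal sketch of a linear-sweep, interval-style freelist.
    #include <cstdint>
    #include <map>

    class SimpleFreelist {
      std::map<uint64_t, uint64_t> free_;   // offset -> length of a free extent
     public:
      explicit SimpleFreelist(uint64_t device_size) { free_[0] = device_size; }

      // Linear sweep: hand out the start of the first extent that fits,
      // or -1 if nothing is large enough.
      int64_t get_blocks(uint64_t len) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
          if (it->second >= len) {
            uint64_t off = it->first;
            uint64_t left = it->second - len;
            free_.erase(it);
            if (left)
              free_[off + len] = left;
            return (int64_t)off;
          }
        }
        return -1;
      }

      // Return an extent and merge it with adjacent free space.
      void free_block(uint64_t off, uint64_t len) {
        auto it = free_.emplace(off, len).first;
        auto next = std::next(it);
        if (next != free_.end() && it->first + it->second == next->first) {
          it->second += next->second;
          free_.erase(next);
        }
        if (it != free_.begin()) {
          auto prev = std::prev(it);
          if (prev->first + prev->second == it->first) {
            prev->second += it->second;
            free_.erase(it);
          }
        }
      }
    };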

BTW, I suspect a modest win here would be to simply use the collection/pg 
as a hint for storing related objects.  That's the best indicator we have 
for aligned lifecycle (think PG migrations/deletions vs flash erase 
blocks).  Good luck plumbing that through XFS...

> Of course, then we also need to figure out how to get checksums on the
> block data, since if we're going to put in the effort to reimplement
> this much of the stack we'd better get our full data integrity
> guarantees along with it!

YES!

Here I think we should make judicious use of the rados hints.  For 
example, rgw always writes complete objects, so we can have coarse 
granularity crcs and only pay for very small reads (that have 

Re: newstore direction

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, John Spray wrote:
> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  wrote:
> >  - We have to size the kv backend storage (probably still an XFS
> > partition) vs the block storage.  Maybe we do this anyway (put metadata on
> > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> > a different pool and those aren't currently fungible.
> 
> This is the concerning bit for me -- the other parts one "just" has to
> get the code right, but this problem could linger and be something we
> have to keep explaining to users indefinitely.  It reminds me of cases
> in other systems where users had to make an educated guess about inode
> size up front, depending on whether you're expecting to efficiently
> store a lot of xattrs.
> 
> In practice it's rare for users to make these kinds of decisions well
> up-front: it really needs to be adjustable later, ideally
> automatically.  That could be pretty straightforward if the KV part
> was stored directly on block storage, instead of having XFS in the
> mix.  I'm not quite up with the state of the art in this area: are
> there any reasonable alternatives for the KV part that would consume
> some defined range of a block device from userspace, instead of
> sitting on top of a filesystem?

I agree: this is my primary concern with the raw block approach.

There are some KV alternatives that could consume block, but the problem 
would be similar: we need to dynamically size up or down the kv portion of 
the device.

I see two basic options:

1) Wire into the Env abstraction in rocksdb to provide something just 
smart enough to let rocksdb work.  It isn't much: named files (not that 
many--we could easily keep the file table in ram), always written 
sequentially, to be read later with random access. All of the code is 
written around abstractions of SequentialFileWriter so that everything 
posix is neatly hidden in env_posix (and there are various other env 
implementations for in-memory mock tests etc.).

2) Use something like dm-thin to sit between the raw block device and XFS 
(for rocksdb) and the block device consumed by newstore.  As long as XFS 
doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb 
files in their entirety) we can fstrim and size down the fs portion.  If 
we similarly make newstore's allocator stick to large blocks only, we would 
be able to size down the block portion as well.  Typical dm-thin block 
sizes seem to range from 64KB to 512KB, which seems reasonable enough to 
me.  In fact, we could likely just size the fs volume at something 
conservatively large (like 90%) and rely on -o discard or periodic fstrim 
to keep its actual utilization in check.
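
For option 1, the state such a shim would keep really is small.  A sketch of 
just that bookkeeping (this is not the rocksdb Env API, only the in-RAM file 
table that would sit behind it):

    // Sketch of the in-RAM bookkeeping behind a minimal rocksdb Env shim:
    // a file table mapping rocksdb's file names to extents on the shared
    // block device, with files only ever written append-only.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct FileExtent { uint64_t offset, length; };

    struct EnvFile {
      uint64_t size = 0;                  // logical file size
      std::vector<FileExtent> extents;    // appended in order, read randomly
    };

    class MiniEnvState {
      std::map<std::string, EnvFile> files_;   // "MANIFEST-000001", "000004.sst", ...
     public:
      EnvFile& create(const std::string& name) { return files_[name]; }
      bool exists(const std::string& name) const { return files_.count(name) > 0; }
      void remove(const std::string& name) { files_.erase(name); }
      // Sequential writes append an extent; a random read walks the extent
      // list to translate a file offset into a device offset.
    };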

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-20 Thread Yehuda Sadeh-Weinraub
On Tue, Oct 20, 2015 at 11:31 AM, Ric Wheeler  wrote:
> On 10/19/2015 03:49 PM, Sage Weil wrote:
>>
>> The current design is based on two simple ideas:
>>
>>   1) a key/value interface is better way to manage all of our internal
>> metadata (object metadata, attrs, layout, collection membership,
>> write-ahead logging, overlay data, etc.)
>>
>>   2) a file system is well suited for storage object data (as files).
>>
>> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
>> things:
>>
>>   - We currently write the data to the file, fsync, then commit the kv
>> transaction.  That's at least 3 IOs: one for the data, one for the fs
>> journal, one for the kv txn to commit (at least once my rocksdb changes
>> land... the kv commit is currently 2-3).  So two people are managing
>> metadata, here: the fs managing the file metadata (with its own
>> journal) and the kv backend (with its journal).
>
>
> If all of the fsync()'s fall into the same backing file system, are you sure
> that each fsync() takes the same time? Depending on the local FS
> implementation of course, but the order of issuing those fsync()'s can
> effectively make some of them no-ops.
>
>>
>>   - On read we have to open files by name, which means traversing the fs
>> namespace.  Newstore tries to keep it as flat and simple as possible, but
>> at a minimum it is a couple btree lookups.  We'd love to use open by
>> handle (which would reduce this to 1 btree traversal), but running
>> the daemon as ceph and not root makes that hard...
>
>
> This seems like a pretty low hurdle to overcome.
>
>>
>>   - ...and file systems insist on updating mtime on writes, even when it
>> is
>> a overwrite with no allocation changes.  (We don't care about mtime.)
>> O_NOCMTIME patches exist but it is hard to get these past the kernel
>> brainfreeze.
>
>
> Are you using O_DIRECT? Seems like there should be some enterprisey database
> tricks that we can use here.
>
>>
>>   - XFS is (probably) never going to give us data checksums, which
>> we
>> want desperately.
>
>
> What is the goal of having the file system do the checksums? How strong do
> they need to be and what size are the chunks?
>
> If you update this on each IO, this will certainly generate more IO (each
> write will possibly generate at least one other write to update that new
> checksum).
>
>>
>> But what's the alternative?  My thought is to just bite the bullet and
>> consume a raw block device directly.  Write an allocator, hopefully keep
>> it pretty simple, and manage it in kv store along with all of our other
>> metadata.
>
>
> The big problem with consuming block devices directly is that you ultimately
> end up recreating most of the features that you had in the file system. Even
> enterprise databases like Oracle and DB2 have been migrating away from
> running on raw block devices in favor of file systems over time.  In effect,
> you are looking at making a simple on disk file system which is always
> easier to start than it is to get back to a stable, production ready state.
>
> I think that it might be quicker and more maintainable to spend some time
> working with the local file system people (XFS or other) to see if we can
> jointly address the concerns you have.
>
>>
>> Wins:
>>
>>   - 2 IOs for most: one to write the data to unused space in the block
>> device, one to commit our transaction (vs 4+ before).  For overwrites,
>> we'd have one io to do our write-ahead log (kv journal), then do
>> the overwrite async (vs 4+ before).
>>
>>   - No concern about mtime getting in the way
>>
>>   - Faster reads (no fs lookup)
>>
>>   - Similarly sized metadata for most objects.  If we assume most objects
>> are not fragmented, then the metadata to store the block offsets is about
>> the same size as the metadata to store the filenames we have now.
>>
>> Problems:
>>
>>   - We have to size the kv backend storage (probably still an XFS
>> partition) vs the block storage.  Maybe we do this anyway (put metadata on
>> SSD!) so it won't matter.  But what happens when we are storing gobs of
>> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
>> a different pool and those aren't currently fungible.
>>
>>   - We have to write and maintain an allocator.  I'm still optimistic this
>> can be reasonably simple, especially for the flash case (where
>> fragmentation isn't such an issue as long as our blocks are reasonably
>> sized).  For disk we may need to be moderately clever.
>>
>>   - We'll need a fsck to ensure our internal metadata is consistent.  The
>> good news is it'll just need to validate what we have stored in the kv
>> store.
>>
>> Other thoughts:
>>
>>   - We might want to consider whether dm-thin or bcache or other block
>> layers might help us with elasticity of file vs block areas.
>>
>>   - Rocksdb can push colder data to a second directory, so we could have a
>> fast ssd primary area (for wal and most metadata) and a 

RE: newstore direction

2015-10-20 Thread James (Fei) Liu-SSI
Varada,

Hopefully it will answer your question too. It is going to be a new type of 
key/value device rather than a traditional hard-drive-based OSD device, and it 
will have its own storage stack rather than the traditional block-based storage 
stack. I have to admit it is a little more aggressive than the block-based 
approach.

Regards,
James

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, October 20, 2015 1:33 PM
To: Sage Weil
Cc: Somnath Roy; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

Hi Sage, 
   Sorry for the confusion. SSDs with key/value interfaces are still under 
development by several vendors.  They take a totally different design approach 
than open-channel SSDs. I met Matias several months ago and discussed the 
possibility of key/value interface support with open-channel SSDs; I have not 
followed the progress since then. If Matias is in this group, he can certainly 
give us better explanations. Here is his presentation on key/value support with 
open-channel SSDs for your reference.

http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf


  Regards,
  James  

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, October 20, 2015 5:34 AM
To: James (Fei) Liu-SSI
Cc: Somnath Roy; ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive solution than 
> raw block device base keyvalue store as backend for objectstore. The 
> new key value SSD device with transaction support would be ideal to 
> solve the issues. First of all, it is raw SSD device. Secondly , It 
> provides key value interface directly from SSD. Thirdly, it can 
> provide transaction support, consistency will be guaranteed by hardware 
> device.
> It pretty much satisfied all of objectstore needs without any extra 
> overhead since there is not any extra layer in between device and 
> objectstore.

Are you talking about open channel SSDs?  Or something else?  Everything I'm 
familiar with that is currently shipping is exposing a vanilla block interface 
(conventional SSDs) that hides all of that or NVMe (which isn't much better).

If there is a low-level KV interface we can consume that would be 
great--especially if we can glue it to our KeyValueDB abstract API.  Even so, 
we need to make sure that the object *data* also has an efficient API we can 
utilize that efficiently handles block-sized/aligned data.

sage


>Either way, I strongly support to have CEPH own data format instead 
> of relying on filesystem.
> 
>   Regards,
>   James
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get 
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v 
> > dbs (for storing allocators and all). The reason is the unknown 
> > write amps they causes.
> 
> My hope is to keep this behind the KeyValueDB interface (and/or change it 
> as
> appropriate) so that other backends can be easily swapped in (e.g. a 
> btree-based one for high-end flash).
> 
> sage
> 
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org 
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> > 
> > The current design is based on two simple ideas:
> > 
> >  1) a key/value interface is better way to manage all of our 
> > internal metadata (object metadata, attrs, layout, collection 
> > membership, write-ahead logging, overlay data, etc.)
> > 
> >  2) a file system is well suited for storage object data (as files).
> > 
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  
> > A few
> > things:
> > 
> >  - We currently write the data to the file, fsync, then commit the 
> > kv transaction.  That's at least 3 IOs: one for the data, one for 
> > the fs journal, one for the kv txn to commit (at least once my 
> > rocksdb changes land... the kv commit is currently 2-3).  So two 
> > people are managing metadata

RE: newstore direction

2015-10-20 Thread Sage Weil
On Tue, 20 Oct 2015, James (Fei) Liu-SSI wrote:
> Hi Sage, 
>Sorry for the confusion. SSDs with key/value interfaces are still 
> under development by several vendors.  They take a totally different design 
> approach than open-channel SSDs. I met Matias several months ago and 
> discussed the possibility of key/value interface support with open-channel 
> SSDs; I have not followed the progress since then. If Matias is in this 
> group, he can certainly give us better explanations. Here is his presentation 
> on key/value support with open-channel SSDs for your reference.
> 
> http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf

Ok cool.  I saw Matias' talk at Vault and was very pleased to see that 
there is some real effort to get away from black box FTLs.

And I am eagerly awaiting the arrival of SSDs with a kv interface... open 
channel especially, but even proprietary devices exposing kv would be an 
improvement over proprietary devices exposing block.  :)

sage


> 
> 
>   Regards,
>   James  
> 
> -Original Message-
> From: Sage Weil [mailto:sw...@redhat.com] 
> Sent: Tuesday, October 20, 2015 5:34 AM
> To: James (Fei) Liu-SSI
> Cc: Somnath Roy; ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote:
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive solution than 
> > raw block device base keyvalue store as backend for objectstore. The 
> > new key value SSD device with transaction support would be ideal to 
> > solve the issues. First of all, it is raw SSD device. Secondly , It 
> > provides key value interface directly from SSD. Thirdly, it can 
> > provide transaction support, consistency will be guaranteed by hardware 
> > device.
> > It pretty much satisfied all of objectstore needs without any extra 
> > overhead since there is not any extra layer in between device and 
> > objectstore.
> 
> Are you talking about open channel SSDs?  Or something else?  Everything I'm 
> familiar with that is currently shipping is exposing a vanilla block 
> interface (conventional SSDs) that hides all of that or NVMe (which isn't 
> much better).
> 
> If there is a low-level KV interface we can consume that would be 
> great--especially if we can glue it to our KeyValueDB abstract API.  Even so, 
> we need to make sure that the object *data* also has an efficient API we can 
> utilize that efficiently handles block-sized/aligned data.
> 
> sage
> 
> 
> >Either way, I strongly support to have CEPH own data format instead 
> > of relying on filesystem.
> > 
> >   Regards,
> >   James
> > 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org 
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> > 
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get 
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v 
> > > dbs (for storing allocators and all). The reason is the unknown 
> > > write amps they causes.
> > 
> > My hope is to keep behing the KeyValueDB interface (and/more change it 
> > as
> > appropriate) so that other backends can be easily swapped in (e.g. a 
> > btree-based one for high-end flash).
> > 
> > sage
> > 
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -Original Message-
> > > From: ceph-devel-ow...@vger.kernel.org 
> > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > > 
> > > The current design is based on two simple ideas:
> > > 
> > >  1) a key/value interface is better way to manage all of our 
> > > internal metadata (object metadata, attrs, layout, collection 
> > > membership, write-ahead logging, overlay data, etc.)
> > > 
> > >  2) a file system is well suited for storage object data (as files).
> > > 
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.  
> > > A few
> > > things:
> > > 
> > >  - We curren

Re: newstore direction

2015-10-20 Thread Matt Benjamin
We mostly assumed that sort-of transactional file systems, perhaps hosted in 
user space, were the most tractable trajectory.  I have seen newstore and 
keyvalue store as essentially congruent approaches using database primitives 
(and I am interested in what you make of Russell Sears).  I'm skeptical of any 
hope of keeping things "simple."  Like Martin downthread, most systems I have 
seen (filers, ZFS) make use of a fast, durable commit log and then flex 
out...something else.

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-707-0660
fax.  734-769-8938
cel.  734-216-5309


- Original Message -
> From: "Sage Weil" <sw...@redhat.com>
> To: "John Spray" <jsp...@redhat.com>
> Cc: "Ceph Development" <ceph-devel@vger.kernel.org>
> Sent: Tuesday, October 20, 2015 4:00:23 PM
> Subject: Re: newstore direction
> 
> On Tue, 20 Oct 2015, John Spray wrote:
> > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sw...@redhat.com> wrote:
> > >  - We have to size the kv backend storage (probably still an XFS
> > > partition) vs the block storage.  Maybe we do this anyway (put metadata
> > > on
> > > SSD!) so it won't matter.  But what happens when we are storing gobs of
> > > rgw index data or cephfs metadata?  Suddenly we are pulling storage out
> > > of
> > > a different pool and those aren't currently fungible.
> > 
> > This is the concerning bit for me -- the other parts one "just" has to
> > get the code right, but this problem could linger and be something we
> > have to keep explaining to users indefinitely.  It reminds me of cases
> > in other systems where users had to make an educated guess about inode
> > size up front, depending on whether you're expecting to efficiently
> > store a lot of xattrs.
> > 
> > In practice it's rare for users to make these kinds of decisions well
> > up-front: it really needs to be adjustable later, ideally
> > automatically.  That could be pretty straightforward if the KV part
> > was stored directly on block storage, instead of having XFS in the
> > mix.  I'm not quite up with the state of the art in this area: are
> > there any reasonable alternatives for the KV part that would consume
> > some defined range of a block device from userspace, instead of
> > sitting on top of a filesystem?
> 
> I agree: this is my primary concern with the raw block approach.
> 
> There are some KV alternatives that could consume block, but the problem
> would be similar: we need to dynamically size up or down the kv portion of
> the device.
> 
> I see two basic options:
> 
> 1) Wire into the Env abstraction in rocksdb to provide something just
> smart enough to let rocksdb work.  It isn't much: named files (not that
> many--we could easily keep the file table in ram), always written
> sequentially, to be read later with random access. All of the code is
> written around abstractions of SequentialFileWriter so that everything
> posix is neatly hidden in env_posix (and there are various other env
> implementations for in-memory mock tests etc.).
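> For a sense of scale, the state such an Env has to track is tiny (a sketch
> only, not the actual rocksdb Env API):
> 
>   #include <cstdint>
>   #include <map>
>   #include <string>
>   #include <vector>
> 
>   // One extent of a rocksdb "file" living directly on the block device.
>   struct Extent { uint64_t offset; uint64_t length; };
> 
>   // In-RAM file table: rocksdb only needs named files that are written
>   // sequentially and read back with random access, so each file is just
>   // an ordered list of extents plus its logical size.
>   struct EnvFile {
>     std::vector<Extent> extents;
>     uint64_t size = 0;
>   };
> 
>   std::map<std::string, EnvFile> file_table;  // persisted via the kv store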
> 
> 2) Use something like dm-thin to sit between the raw block device and XFS
> (for rocksdb) and the block device consumed by newstore.  As long as XFS
> doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
> files in their entirety) we can fstrim and size down the fs portion.  If
> we similarly make newstores allocator stick to large blocks only we would
> be able to size down the block portion as well.  Typical dm-thin block
> sizes seem to range from 64KB to 512KB, which seems reasonable enough to
> me.  In fact, we could likely just size the fs volume at something
> conservatively large (like 90%) and rely on -o discard or periodic fstrim
> to keep its actual utilization in check.
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: newstore direction

2015-10-20 Thread Ric Wheeler

On 10/20/2015 05:47 PM, Sage Weil wrote:

On Tue, 20 Oct 2015, Gregory Farnum wrote:

On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil  wrote:

On Tue, 20 Oct 2015, Ric Wheeler wrote:

The big problem with consuming block devices directly is that you ultimately
end up recreating most of the features that you had in the file system. Even
enterprise databases like Oracle and DB2 have been migrating away from running
on raw block devices in favor of file systems over time.  In effect, you are
looking at making a simple on-disk file system, which is always easier to start
than it is to get to a stable, production-ready state.

This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had
everything we were implementing and more: mainly, copy on write and data
checksums.  But in practice the fact that it's general purpose means it
targets very different workloads and APIs than what we need.

Try 7 years since ebofs...

Sigh...


That's one of my concerns, though. You ditched ebofs once already
because it had metastasized into an entire FS, and had reached its
limits of maintainability. What makes you think a second time through
would work better? :/

A fair point, and I've given this some thought:

1) We know a *lot* more about our workload than I did in 2005.  The things
I was worrying about then (fragmentation, mainly) are much easier to
address now, where we have hints from rados and understand what the write
patterns look like in practice (randomish 4k-128k ios for rbd, sequential
writes for rgw, and the cephfs wildcard).

2) Most of the ebofs effort was around doing copy-on-write btrees (with
checksums) and orchestrating commits.  Here our job is *vastly* simplified
by assuming the existence of a transactional key/value store.  If you look
at newstore today, we're already half-way through dealing with the
complexity of doing allocations... we're essentially "allocating" blocks
that are 1 MB files on XFS, managing that metadata, and overwriting or
replacing those blocks on write/truncate/clone.  By the time we add in an
allocator (get_blocks(len), free_block(offset, len)) and rip out all the
file handling fiddling (like fsync workqueues, file id allocator,
file truncation fiddling, etc.) we'll probably have something working
with about the same amount of code we have now.  (Of course, that'll
grow as we get more sophisticated, but that'll happen either way.)


On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil  wrote:

  - 2 IOs for most: one to write the data to unused space in the block
device, one to commit our transaction (vs 4+ before).  For overwrites,
we'd have one io to do our write-ahead log (kv journal), then do
the overwrite async (vs 4+ before).

I can't work this one out. If you're doing one write for the data and
one for the kv journal (which is on another filesystem), how does the
commit sequence work that it's only 2 IOs instead of the same 3 we
already have? Or are you planning to ditch the LevelDB/RocksDB store
for our journaling and just use something within the block layer?

Now:
 1 io  to write a new file
   1-2 ios to sync the fs journal (commit the inode, alloc change)
   (I see 2 journal IOs on XFS and only 1 on ext4...)
 1 io  to commit the rocksdb journal (currently 3, but will drop to
   1 with xfs fix and my rocksdb change)


I think that might be too pessimistic - the number of discrete IO's sent down to 
a spinning disk makes much less impact on performance than the number of 
fsync()'s, since the IO's all land in the write cache.  Some newer spinning 
drives have a non-volatile write cache, so even an fsync() might not end up 
doing the expensive data transfer to the platter.


It would be interesting to get the timings on the IO's you see to measure the 
actual impact.
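Something as small as this would already show the split between the buffered
writes and the flush (a rough sketch; the path and sizes are made up):

  #include <chrono>
  #include <cstdio>
  #include <fcntl.h>
  #include <unistd.h>
  #include <vector>

  // Time N buffered 4KB writes, then time the single fsync that makes them
  // durable, to see how much of the cost is the flush vs the writes themselves.
  int main() {
    const int N = 1000;
    std::vector<char> buf(4096, 'x');
    int fd = ::open("/tmp/io-timing.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
      ::write(fd, buf.data(), buf.size());          // lands in the page cache
    auto t1 = std::chrono::steady_clock::now();
    ::fsync(fd);                                    // forces it out to media
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
      return std::chrono::duration<double, std::milli>(b - a).count();
    };
    printf("%d writes: %.2f ms, fsync: %.2f ms\n", N, ms(t0, t1), ms(t1, t2));
    ::close(fd);
    return 0;
  }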





With block:
 1 io to write to block device
 1 io to commit to rocksdb journal


If we do want to go down this road, we shouldn't need to write an
allocator from scratch. I don't remember exactly which one it is, but
we've read/seen at least a few storage papers where people have reused
existing allocators (I think the one from ext2?), and somebody managed
to get it running in userspace.

Maybe, but the real win is when we combine the allocator state update with
our kv transaction.  Even if we adopt an existing algorithm we'll need to
do some significant rejiggering to persist it in the kv store.

My thought is to start with something simple that works (e.g., a linear sweep
over free space, a simple interval_set<>-style freelist) and once it works
look at the existing state of the art for a clever v2.
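To make that concrete, a sketch of the interval-style freelist (illustrative
only, using the get_blocks/free_block names from above):

  #include <cstdint>
  #include <iterator>
  #include <map>
  #include <optional>

  // Minimal extent allocator: free space kept as offset -> length intervals,
  // allocation is a first-fit linear sweep, frees are merged with neighbors.
  class SimpleFreelist {
    std::map<uint64_t, uint64_t> free_;  // offset -> length
   public:
    explicit SimpleFreelist(uint64_t size) { free_[0] = size; }

    std::optional<uint64_t> get_blocks(uint64_t len) {
      for (auto it = free_.begin(); it != free_.end(); ++it) {
        if (it->second < len) continue;
        uint64_t off = it->first;
        uint64_t rem = it->second - len;
        free_.erase(it);
        if (rem) free_[off + len] = rem;
        return off;
      }
      return std::nullopt;  // no single extent big enough
    }

    void free_block(uint64_t off, uint64_t len) {
      auto next = free_.lower_bound(off);
      // merge with the previous extent if contiguous
      if (next != free_.begin()) {
        auto prev = std::prev(next);
        if (prev->first + prev->second == off) {
          off = prev->first;
          len += prev->second;
          free_.erase(prev);
        }
      }
      // merge with the following extent if contiguous
      if (next != free_.end() && off + len == next->first) {
        len += next->second;
        free_.erase(next);
      }
      free_[off] = len;
    }
  };

The interesting part is then persisting the offset -> length map in the same kv
transaction as the rest of the metadata, per the point above about combining
the allocator state update with our kv commit.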

BTW, I suspect a modest win here would be to simply use the collection/pg
as a hint for storing related objects.  That's the best indicator we have
for aligned lifecycle (think PG migrations/deletions vs flash erase
blocks).  Good luck plumbing that through XFS...


Of course, then we also need to figure out how to get checksums on the
block data, 

Re: newstore direction

2015-10-20 Thread Mark Nelson

On 10/20/2015 07:30 AM, Sage Weil wrote:

On Tue, 20 Oct 2015, Chen, Xiaoxi wrote:

+1, nowadays K-V DBs care more about very small key-value pairs, say
several bytes to a few KB, but in the SSD case we only care about 4KB or
8KB. In this way, NVMKV is a good design, and it seems some of the SSD
vendors are also trying to build this kind of interface; we have an NVM-L
library but it is still under development.


Do you have an NVMKV link?  I see a paper and a stale github repo.. not
sure if I'm looking at the right thing.

My concern with using a key/value interface for the object data is that
you end up with lots of key/value pairs (e.g., $inode_$offset =
$4kb_of_data) that are pretty inefficient to store and (depending on the
implementation) tend to break alignment.  I don't think these interfaces
are targeted toward block-sized/aligned payloads.  Storing just the
metadata (block allocation map) w/ the kv api and storing the data
directly on a block/page interface makes more sense to me.
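For example, the split could look roughly like this (a sketch only; the key
layout and record are illustrative, not an actual newstore schema):

  #include <cstdint>

  // Only the allocation map goes through the kv api; the data itself is
  // written at raw, block-aligned offsets on the device.
  struct ExtentRecord {        // value stored under a key like "obj!<oid>!<logical_off>"
    uint64_t block_offset;     // where the extent lives on the block device
    uint32_t length;           // bytes, a multiple of the block size
    uint32_t csum;             // checksum of that extent
  };

  // The layout argued against above would instead store the data itself as
  // the value ("<inode>_<offset>" -> 4KB blob), bloating the kv store and
  // losing alignment.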

sage


I get the feeling that some of the folks that were involved with nvmkv 
at Fusion IO have left.  Nisha Talagala is now out at Parallel Systems 
for instance.  http://pmem.io might be a better bet, though I haven't 
looked closely at it.


Mark





-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
Sent: Tuesday, October 20, 2015 6:21 AM
To: Sage Weil; Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

Hi Sage and Somnath,
   In my humble opinion, there is another, more aggressive solution than a raw
block device based key/value store as the backend for the objectstore. A new key
value SSD device with transaction support would be ideal to solve these issues.
First of all, it is a raw SSD device. Secondly, it provides a key value interface
directly from the SSD. Thirdly, it can provide transaction support; consistency will
be guaranteed by the hardware device. It pretty much satisfies all of the objectstore's
needs without any extra overhead, since there is not any extra layer in
between the device and the objectstore.
Either way, I strongly support having Ceph's own data format instead of
relying on a filesystem.

   Regards,
   James

-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 1:55 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, Somnath Roy wrote:

Sage,
I fully support that.  If we want to saturate SSDs, we need to get
rid of this filesystem overhead (which I am in the process of measuring).
Also, it will be good if we can eliminate the dependency on the k/v
dbs (for storing allocators and all). The reason is the unknown write
amps they cause.


My hope is to keep behind the KeyValueDB interface (and/or change it as
appropriate) so that other backends can be easily swapped in (e.g. a btree-
based one for high-end flash).

sage




Thanks & Regards
Somnath


-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 12:49 PM
To: ceph-devel@vger.kernel.org
Subject: newstore direction

The current design is based on two simple ideas:

  1) a key/value interface is a better way to manage all of our internal
metadata (object metadata, attrs, layout, collection membership,
write-ahead logging, overlay data, etc.)

  2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A
few
things:

  - We currently write the data to the file, fsync, then commit the kv
transaction.  That's at least 3 IOs: one for the data, one for the fs
journal, one for the kv txn to commit (at least once my rocksdb
changes land... the kv commit is currently 2-3).  So two people are
managing metadata, here: the fs managing the file metadata (with its
own
journal) and the kv backend (with its journal).

  - On read we have to open files by name, which means traversing the fs

namespace.  Newstore tries to keep it as flat and simple as possible, but at a
minimum it is a couple btree lookups.  We'd love to use open by handle
(which would reduce this to 1 btree traversal), but running the daemon as
ceph and not root makes that hard...


  - ...and file systems insist on updating mtime on writes, even when it is an
overwrite with no allocation changes.  (We don't care about mtime.)
O_NOCMTIME patches exist but it is hard to get these past the kernel
brainfreeze.


  - XFS is (probably) never going to give us data checksums, which we
want desperately.


But what's the alternative?  My thought is to just bite the bullet and

consume a raw block device directly.  Write an allocator, hopefully keep it
pretty simple, and manage it in kv store along with all of our other metadata.


Wins:

  - 2 IOs for most: one to write the 

RE: newstore direction

2015-10-19 Thread Somnath Roy
Sage,
I fully support that.  If we want to saturate SSDs, we need to get rid of this 
filesystem overhead (which I am in the process of measuring).
Also, it will be good if we can eliminate the dependency on the k/v dbs (for 
storing allocators and all). The reason is the unknown write amps they cause.

Thanks & Regards
Somnath


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 12:49 PM
To: ceph-devel@vger.kernel.org
Subject: newstore direction

The current design is based on two simple ideas:

 1) a key/value interface is a better way to manage all of our internal metadata 
(object metadata, attrs, layout, collection membership, write-ahead logging, 
overlay data, etc.)

 2) a file system is well suited for storing object data (as files).

So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
things:

 - We currently write the data to the file, fsync, then commit the kv 
transaction.  That's at least 3 IOs: one for the data, one for the fs journal, 
one for the kv txn to commit (at least once my rocksdb changes land... the kv 
commit is currently 2-3).  So two people are managing metadata, here: the fs 
managing the file metadata (with its own
journal) and the kv backend (with its journal).

 - On read we have to open files by name, which means traversing the fs 
namespace.  Newstore tries to keep it as flat and simple as possible, but at a 
minimum it is a couple btree lookups.  We'd love to use open by handle (which 
would reduce this to 1 btree traversal), but running the daemon as ceph and not 
root makes that hard...

 - ...and file systems insist on updating mtime on writes, even when it is an 
overwrite with no allocation changes.  (We don't care about mtime.) O_NOCMTIME 
patches exist but it is hard to get these past the kernel brainfreeze.

 - XFS is (probably) never going to give us data checksums, which we want 
desperately.

But what's the alternative?  My thought is to just bite the bullet and consume 
a raw block device directly.  Write an allocator, hopefully keep it pretty 
simple, and manage it in kv store along with all of our other metadata.

Wins:

 - 2 IOs for most: one to write the data to unused space in the block device, 
one to commit our transaction (vs 4+ before).  For overwrites, we'd have one io 
to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ 
before).

 - No concern about mtime getting in the way

 - Faster reads (no fs lookup)

 - Similarly sized metadata for most objects.  If we assume most objects are 
not fragmented, then the metadata to store the block offsets is about the same 
size as the metadata to store the filenames we have now.

Problems:

 - We have to size the kv backend storage (probably still an XFS
partition) vs the block storage.  Maybe we do this anyway (put metadata on
SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
index data or cephfs metadata?  Suddenly we are pulling storage out of a 
different pool and those aren't currently fungible.

 - We have to write and maintain an allocator.  I'm still optimistic this can 
be reasonably simple, especially for the flash case (where fragmentation isn't 
such an issue as long as our blocks are reasonably sized).  For disk we may need 
to be moderately clever.

 - We'll need a fsck to ensure our internal metadata is consistent.  The good 
news is it'll just need to validate what we have stored in the kv store.

Other thoughts:

 - We might want to consider whether dm-thin or bcache or other block layers 
might help us with elasticity of file vs block areas.

 - Rocksdb can push colder data to a second directory, so we could have a fast 
ssd primary area (for wal and most metadata) and a second hdd directory for 
stuff it has to push off.  Then have a conservative amount of file space on the 
hdd.  If our block fills up, use the existing file mechanism to put data there 
too.  (But then we have to maintain both the current kv + file approach and not 
go all-in on kv + block.)

Thoughts?
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html




Re: newstore direction

2015-10-19 Thread Robert LeBlanc
I think there is a lot that can be gained by Ceph managing a raw block
device. As I mentioned on ceph-users, I've given this some thought and
a lot of optimizations could be done that are conducive to storing
objects. I didn't think, however, to bypass VFS altogether by opening
the raw device directly, but this would make things simpler as you
don't have to program things for VFS that don't make sense.

Some of my thoughts were to employ a hashing algorithm for inode
lookup (CRUSH-like). Is there a good use case for listing a directory?
We may need to keep a list for deletion, but there may be a better way
to handle this. Is there a need to do snapshots at the block layer if
operations can be atomic? Is there a real advantage to having an
allocation as small as 4K, or does it make sense to use something like
512K?
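A rough sketch of the kind of hashed lookup I mean (purely illustrative; the
table size and layout are made up):

  #include <cstdint>
  #include <functional>
  #include <string>

  // Hash an object name straight to a fixed on-disk slot, so a read needs no
  // directory traversal at all: name -> slot -> byte offset of the "inode".
  struct OnodeSlot {
    static constexpr uint64_t kSlots    = 1ull << 24;  // table sized at mkfs time
    static constexpr uint64_t kSlotSize = 4096;        // one block per slot
    static constexpr uint64_t kTableOff = 1ull << 20;  // table starts after the superblock

    static uint64_t offset_for(const std::string& object_name) {
      uint64_t slot = std::hash<std::string>{}(object_name) % kSlots;
      return kTableOff + slot * kSlotSize;  // collisions would need chaining/probing
    }
  };

  // e.g. pread(fd, buf, OnodeSlot::kSlotSize,
  //            OnodeSlot::offset_for("rbd_data.123.0001"));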

I'm interested in how this might pan out.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Oct 19, 2015 at 1:49 PM, Sage Weil  wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.
>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).
>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonbly simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonbly
> sized).  For disk we may beed to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate 

RE: newstore direction

2015-10-19 Thread Sage Weil
On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs, we need to get rid 
> of this filesystem overhead (which I am in the process of measuring). Also, 
> it will be good if we can eliminate the dependency on the k/v dbs (for 
> storing allocators and all). The reason is the unknown write amps they 
> cause.

My hope is to keep behind the KeyValueDB interface (and/or change it as 
appropriate) so that other backends can be easily swapped in (e.g. a 
btree-based one for high-end flash).
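Roughly the shape of interface I mean, so a rocksdb-backed backend and a
btree-based one are interchangeable (a simplified sketch, not the actual
KeyValueDB header):

  #include <memory>
  #include <string>

  // Simplified sketch of a swappable transactional KV backend.
  struct KVTransaction {
    virtual ~KVTransaction() = default;
    virtual void set(const std::string& prefix, const std::string& key,
                     const std::string& value) = 0;
    virtual void rmkey(const std::string& prefix, const std::string& key) = 0;
  };

  struct KVBackend {
    virtual ~KVBackend() = default;
    virtual std::unique_ptr<KVTransaction> new_transaction() = 0;
    virtual int submit_transaction_sync(KVTransaction* t) = 0;  // durable on return
    virtual int get(const std::string& prefix, const std::string& key,
                    std::string* value) = 0;
  };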

sage


> 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, write-ahead 
> logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two people are managing 
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but at 
> a minimum it is a couple btree lookups.  We'd love to use open by handle 
> (which would reduce this to 1 btree traversal), but running the daemon as 
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a 
> overwrite with no allocation changes.  (We don't care about mtime.) 
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep it 
> pretty simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device, 
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one 
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ 
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are 
> not fragmented, then the metadata to store the block offsets is about the 
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
> index data or cephfs metadata?  Suddenly we are pulling storage out of a 
> different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be reasonbly simple, especially for the flash case (where fragmentation isn't 
> such an issue as long as our blocks are reasonbly sized).  For disk we may 
> beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The good 
> news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block layers 
> might help us with elasticity of file vs block areas.
> 
>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd directory 
> for stuff it has to push off.  Then have a conservative amount of file space 
> on the hdd.  If our block fills up, use the existing file mechanism to put 
> data there too.  (But then we have to maintain both the current kv + file 
> approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
> body of a message to majord...@vger.kernel.org More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
> 
> 
> 

Re: newstore direction

2015-10-19 Thread Wido den Hollander
On 10/19/2015 09:49 PM, Sage Weil wrote:
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few 
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb changes 
> land... the kv commit is currently 2-3).  So two people are managing 
> metadata, here: the fs managing the file metadata (with its own 
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but 
> at a minimum it is a couple btree lookups.  We'd love to use open by 
> handle (which would reduce this to 1 btree traversal), but running 
> the daemon as ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is 
> a overwrite with no allocation changes.  (We don't care about mtime.)  
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep 
> it pretty simple, and manage it in kv store along with all of our other 
> metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block 
> device, one to commit our transaction (vs 4+ before).  For overwrites, 
> we'd have one io to do our write-ahead log (kv journal), then do 
> the overwrite async (vs 4+ before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects 
> are not fragmented, then the metadata to store the block offsets is about 
> the same size as the metadata to store the filenames we have now. 
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS 
> partition) vs the block storage.  Maybe we do this anyway (put metadata on 
> SSD!) so it won't matter.  But what happens when we are storing gobs of 
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of 
> a different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this 
> can be reasonbly simple, especially for the flash case (where 
> fragmentation isn't such an issue as long as our blocks are reasonbly 
> sized).  For disk we may beed to be moderately clever.
> 
>  - We'll need a fsck to ensure our internal metadata is consistent.  The 
> good news is it'll just need to validate what we have stored in the kv 
> store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block 
> layers might help us with elasticity of file vs block areas.
> 

I've been using bcache for a while now in production and that helped a lot.

Intel SSDs with GPT. First few partitions as Journals and then one big
partition for bcache.

/dev/bcache0   2.8T  264G  2.5T  10% /var/lib/ceph/osd/ceph-60
/dev/bcache1   2.8T  317G  2.5T  12% /var/lib/ceph/osd/ceph-61
/dev/bcache2   2.8T  303G  2.5T  11% /var/lib/ceph/osd/ceph-62
/dev/bcache3   2.8T  316G  2.5T  12% /var/lib/ceph/osd/ceph-63
/dev/bcache4   2.8T  167G  2.6T   6% /var/lib/ceph/osd/ceph-64
/dev/bcache5   2.8T  295G  2.5T  11% /var/lib/ceph/osd/ceph-65

The bcache maintainers also presented bcachefs:
https://lkml.org/lkml/2015/8/21/22

"checksumming, compression: currently only zlib is supported for
compression, and for checksumming there's crc32c and a 64 bit checksum."

Wouldn't that be something that could be leveraged? Consuming a raw
block device seems like re-inventing the wheel to me. I might be wrong
though.

I have no idea how stable bcachefs is, but it might be worth looking in to.

>  - Rocksdb can push colder data to a second directory, so we could have a 
> fast ssd primary area (for wal and most metadata) and a second hdd 
> directory for stuff it has to push off.  Then have a conservative amount 
> of file space on the hdd.  If our block fills up, use the existing file 
> mechanism to put data there too.  (But then we have to maintain both the 
> current kv + file approach and not go all-in on kv + block.)
> 
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  

RE: newstore direction

2015-10-19 Thread Varada Kari
Hi Sage,

If we are managing the raw device, does it make sense to have a key value store 
manage the whole space? 
Keeping the allocator's metadata separately might cause other consistency 
problems: getting an fsck for that implementation can be tougher, we might 
have to compute strict crcs on the data, and we have to manage the sanity of 
the DB that tracks it.
If we can have a common mechanism that keeps data and metadata in the same 
keyvalue store, it will improve performance. 
We have integrated a custom-made key value store that works on the raw device 
as the key value store backend, and we have observed better bandwidth 
utilization and iops. Reads/writes can be faster and no fs lookup is needed. 
We have tools like fsck to take care of the consistency of the DB. 

Couple of comments inline.

Thanks,
Varada

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Tuesday, October 20, 2015 1:19 AM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal metadata
> (object metadata, attrs, layout, collection membership, write-ahead logging,
> overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one
> for the kv txn to commit (at least once my rocksdb changes land... the kv
> commit is currently 2-3).  So two people are managing metadata, here: the fs
> managing the file metadata (with its own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and consume
> a raw block device directly.  Write an allocator, hopefully keep it pretty
> simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are
> not fragmented, then the metadata to store the block offsets is about the
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw
> index data or cephfs metadata?  Suddenly we are pulling storage out of a
> different pool and those aren't currently fungible.

[Varada Kari]  Ideally, if we can manage the raw device as a key value store 
that handles both metadata and data, we can benefit from faster lookups and 
writes (if the KV store supports batched atomic transactional writes). SSDs 
might suffer more write amplification if we put only the metadata there; if 
the KV store deals with the raw device itself (handling the small writes as 
well), we can avoid that write amplification and get better throughput from 
the device.

>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be
> reasonbly simple, especially for the flash case (where fragmentation isn't
> such an issue as long as our blocks are reasonbly sized).  For disk we may
> beed to be moderately clever.
> 
[Varada Kari] Yes. If the writes are aligned to the flash programmable page size, 
that will not cause any issues. But writes smaller than the programmable page 
size will cause internal fragmentation, and repeated overwrites to the same page 
will cause more write amplification.
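For illustration, the kind of rounding this implies (the page size is just an
example value; it is device-specific):

  #include <cstdint>
  #include <cstdio>

  // Round every write up to a whole flash page so the device never has to do
  // a read-modify-write of a partially programmed page.
  constexpr uint64_t kFlashPageSize = 16 * 1024;  // example value only

  constexpr uint64_t round_up_to_page(uint64_t len) {
    return (len + kFlashPageSize - 1) / kFlashPageSize * kFlashPageSize;
  }

  int main() {
    // A 5000-byte write pads to one 16KB page: ~11KB of internal fragmentation
    // unless the store batches small writes together before issuing them.
    printf("5000 bytes -> %llu bytes on flash\n",
           (unsigned long long)round_up_to_page(5000));
    return 0;
  }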

>  - We'll need a fsck to ensure our internal metadata is consistent.  The good
> news is it'll just need to validate what we have stored in the kv store.
> 
> Other thoughts:
> 
>  - We might want to consider whether dm-thin or bcache or other block
> layers 

RE: newstore direction

2015-10-19 Thread James (Fei) Liu-SSI
Hi Sage and Somnath,
  In my humble opinion, there is another, more aggressive solution than a raw 
block device based key/value store as the backend for the objectstore. A new key 
value SSD device with transaction support would be ideal to solve these issues. 
First of all, it is a raw SSD device. Secondly, it provides a key value interface 
directly from the SSD. Thirdly, it can provide transaction support; consistency 
will be guaranteed by the hardware device. It pretty much satisfies all of the 
objectstore's needs without any extra overhead, since there is not any extra layer 
in between the device and the objectstore. 
   Either way, I strongly support having Ceph's own data format instead of 
relying on a filesystem.
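To make the idea concrete, the device API I have in mind would look roughly
like this (entirely hypothetical; no shipping device exposes exactly this
today):

  #include <string>
  #include <vector>

  // Hypothetical interface a transactional key/value SSD might expose.
  // The device, not the host, guarantees that a committed batch is atomic
  // and durable, so the objectstore needs no separate journal or fsync.
  struct KvSsdOp {
    enum class Type { PUT, DELETE } type;
    std::string key;
    std::string value;  // empty for DELETE
  };

  class KvSsdDevice {
   public:
    // All ops in the batch become visible atomically; returns once durable.
    virtual int commit(const std::vector<KvSsdOp>& batch) = 0;
    virtual int get(const std::string& key, std::string* value) = 0;
    virtual ~KvSsdDevice() = default;
  };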

  Regards,
  James

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Monday, October 19, 2015 1:55 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org
Subject: RE: newstore direction

On Mon, 19 Oct 2015, Somnath Roy wrote:
> Sage,
> I fully support that.  If we want to saturate SSDs , we need to get 
> rid of this filesystem overhead (which I am in process of measuring). 
> Also, it will be good if we can eliminate the dependency on the k/v 
> dbs (for storing allocators and all). The reason is the unknown write 
> amps they causes.

My hope is to keep behing the KeyValueDB interface (and/more change it as
appropriate) so that other backends can be easily swapped in (e.g. a 
btree-based one for high-end flash).

sage


> 
> Thanks & Regards
> Somnath
> 
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 12:49 PM
> To: ceph-devel@vger.kernel.org
> Subject: newstore direction
> 
> The current design is based on two simple ideas:
> 
>  1) a key/value interface is better way to manage all of our internal 
> metadata (object metadata, attrs, layout, collection membership, 
> write-ahead logging, overlay data, etc.)
> 
>  2) a file system is well suited for storage object data (as files).
> 
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A 
> few
> things:
> 
>  - We currently write the data to the file, fsync, then commit the kv 
> transaction.  That's at least 3 IOs: one for the data, one for the fs 
> journal, one for the kv txn to commit (at least once my rocksdb 
> changes land... the kv commit is currently 2-3).  So two people are 
> managing metadata, here: the fs managing the file metadata (with its 
> own
> journal) and the kv backend (with its journal).
> 
>  - On read we have to open files by name, which means traversing the fs 
> namespace.  Newstore tries to keep it as flat and simple as possible, but at 
> a minimum it is a couple btree lookups.  We'd love to use open by handle 
> (which would reduce this to 1 btree traversal), but running the daemon as 
> ceph and not root makes that hard...
> 
>  - ...and file systems insist on updating mtime on writes, even when it is a 
> overwrite with no allocation changes.  (We don't care about mtime.) 
> O_NOCMTIME patches exist but it is hard to get these past the kernel 
> brainfreeze.
> 
>  - XFS is (probably) never going going to give us data checksums, which we 
> want desperately.
> 
> But what's the alternative?  My thought is to just bite the bullet and 
> consume a raw block device directly.  Write an allocator, hopefully keep it 
> pretty simple, and manage it in kv store along with all of our other metadata.
> 
> Wins:
> 
>  - 2 IOs for most: one to write the data to unused space in the block device, 
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one 
> io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ 
> before).
> 
>  - No concern about mtime getting in the way
> 
>  - Faster reads (no fs lookup)
> 
>  - Similarly sized metadata for most objects.  If we assume most objects are 
> not fragmented, then the metadata to store the block offsets is about the 
> same size as the metadata to store the filenames we have now.
> 
> Problems:
> 
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put 
> metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of rgw 
> index data or cephfs metadata?  Suddenly we are pulling storage out of a 
> different pool and those aren't currently fungible.
> 
>  - We have to write and maintain an allocator.  I'm still optimistic this can 
> be reasonbly simple, especially for the flash case (where fragmentation isn't 
> such an issue as long as our blocks are reasonbly sized).  For disk we may 

Re: newstore direction

2015-10-19 Thread John Spray
On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil  wrote:
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.

This is the concerning bit for me -- the other parts one "just" has to
get the code right, but this problem could linger and be something we
have to keep explaining to users indefinitely.  It reminds me of cases
in other systems where users had to make an educated guess about inode
size up front, depending on whether you're expecting to efficiently
store a lot of xattrs.

In practice it's rare for users to make these kinds of decisions well
up-front: it really needs to be adjustable later, ideally
automatically.  That could be pretty straightforward if the KV part
was stored directly on block storage, instead of having XFS in the
mix.  I'm not quite up with the state of the art in this area: are
there any reasonable alternatives for the KV part that would consume
some defined range of a block device from userspace, instead of
sitting on top of a filesystem?

John
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: newstore direction

2015-10-19 Thread Chen, Xiaoxi
+1, nowadays K-V DBs care more about very small key-value pairs, say several 
bytes to a few KB, but in the SSD case we only care about 4KB or 8KB. In this way, 
NVMKV is a good design, and it seems some of the SSD vendors are also trying to 
build this kind of interface; we have an NVM-L library but it is still under 
development.
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, October 20, 2015 6:21 AM
> To: Sage Weil; Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi Sage and Somnath,
>   In my humble opinion, There is another more aggressive  solution than raw
> block device base keyvalue store as backend for objectstore. The new key
> value  SSD device with transaction support would be  ideal to solve the 
> issues.
> First of all, it is raw SSD device. Secondly , It provides key value interface
> directly from SSD. Thirdly, it can provide transaction support, consistency 
> will
> be guaranteed by hardware device. It pretty much satisfied all of objectstore
> needs without any extra overhead since there is not any extra layer in
> between device and objectstore.
>Either way, I strongly support to have CEPH own data format instead of
> relying on filesystem.
> 
>   Regards,
>   James
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs , we need to get
> > rid of this filesystem overhead (which I am in process of measuring).
> > Also, it will be good if we can eliminate the dependency on the k/v
> > dbs (for storing allocators and all). The reason is the unknown write
> > amps they causes.
> 
> My hope is to keep behing the KeyValueDB interface (and/more change it as
> appropriate) so that other backends can be easily swapped in (e.g. a btree-
> based one for high-end flash).
> 
> sage
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> >
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storage object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb
> > changes land... the kv commit is currently 2-3).  So two people are
> > managing metadata, here: the fs managing the file metadata (with its
> > own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is a
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> >
> >  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep it
> pretty simple, and manage it in kv store along with all of our other metadata.
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block 
> > device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have one
> io to do our write-ahead log (kv journal), then do the overwrite async (vs

RE: newstore direction

2015-10-19 Thread Chen, Xiaoxi
There is something like http://pmem.io/nvml/libpmemobj/ to adapt NVMe to 
transactional object storage.

But it definitely needs some more work.

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Varada Kari
> Sent: Tuesday, October 20, 2015 10:33 AM
> To: James (Fei) Liu-SSI; Sage Weil; Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi James,
> 
> Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ?
> If SCSI OSD is what you are mentioning, drive has to support all osd
> functionality mentioned by T10.
> If not, we have to implement the same functionality in kernel or have a
> wrapper in user space to convert them to read/write calls.  This seems more
> effort.
> 
> Varada
> 
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> > Sent: Tuesday, October 20, 2015 3:51 AM
> > To: Sage Weil <sw...@redhat.com>; Somnath Roy
> > <somnath@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > Hi Sage and Somnath,
> >   In my humble opinion, There is another more aggressive  solution
> > than raw block device base keyvalue store as backend for objectstore.
> > The new key value  SSD device with transaction support would be  ideal
> > to solve the issues. First of all, it is raw SSD device. Secondly , It
> > provides key value interface directly from SSD. Thirdly, it can
> > provide transaction support, consistency will be guaranteed by
> > hardware device. It pretty much satisfied all of objectstore needs
> > without any extra overhead since there is not any extra layer in between
> device and objectstore.
> >Either way, I strongly support to have CEPH own data format instead
> > of relying on filesystem.
> >
> >   Regards,
> >   James
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> > ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 1:55 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: newstore direction
> >
> > On Mon, 19 Oct 2015, Somnath Roy wrote:
> > > Sage,
> > > I fully support that.  If we want to saturate SSDs , we need to get
> > > rid of this filesystem overhead (which I am in process of measuring).
> > > Also, it will be good if we can eliminate the dependency on the k/v
> > > dbs (for storing allocators and all). The reason is the unknown
> > > write amps they causes.
> >
> > My hope is to keep behing the KeyValueDB interface (and/more change it
> > as
> > appropriate) so that other backends can be easily swapped in (e.g. a
> > btree- based one for high-end flash).
> >
> > sage
> >
> >
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > >
> > > -Original Message-
> > > From: ceph-devel-ow...@vger.kernel.org
> > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > > Sent: Monday, October 19, 2015 12:49 PM
> > > To: ceph-devel@vger.kernel.org
> > > Subject: newstore direction
> > >
> > > The current design is based on two simple ideas:
> > >
> > >  1) a key/value interface is better way to manage all of our
> > > internal metadata (object metadata, attrs, layout, collection
> > > membership, write-ahead logging, overlay data, etc.)
> > >
> > >  2) a file system is well suited for storage object data (as files).
> > >
> > > So far 1 is working out well, but I'm questioning the wisdom of #2.
> > > A few
> > > things:
> > >
> > >  - We currently write the data to the file, fsync, then commit the
> > > kv transaction.  That's at least 3 IOs: one for the data, one for
> > > the fs journal, one for the kv txn to commit (at least once my
> > > rocksdb changes land... the kv commit is currently 2-3).  So two
> > > people are managing metadata, here: the fs managing the file
> > > metadata (with its own
> > > journal) and the kv backend (with its journal).
> > >
> > >  - On read we have to open files by name, which means traversing the
> > > fs
> > namespace.  Newstore tries to keep it as flat and simple as possible,
> > but at a minimum it is a couple btree lookups.  We'd love to use open
> > by handle 

RE: newstore direction

2015-10-19 Thread Varada Kari
Hi James,

Are you referring to SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family)? If
SCSI OSD is what you mean, the drive has to support all of the OSD functionality
specified by T10.
If not, we would have to implement the same functionality in the kernel, or have a
wrapper in user space to convert the calls into reads/writes.  That seems like more
effort.

Varada

> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI
> Sent: Tuesday, October 20, 2015 3:51 AM
> To: Sage Weil <sw...@redhat.com>; Somnath Roy
> <somnath@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> Hi Sage and Somnath,
>   In my humble opinion, there is another, more aggressive solution than a
> key/value store on a raw block device as the objectstore backend: a key/value
> SSD with transaction support would be ideal for these issues. First, it is a
> raw SSD device. Second, it provides a key/value interface directly from the
> SSD. Third, it can provide transaction support, so consistency is guaranteed
> by the hardware. It satisfies pretty much all of the objectstore's needs
> without any extra overhead, since there is no extra layer between the device
> and the objectstore.
>   Either way, I strongly support having Ceph own its data format instead of
> relying on a filesystem.
> 
>   Regards,
>   James
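The proposal above presumes a device that exposes key/value operations with
atomic multi-key commits. As a rough illustration of the kind of interface the
objectstore would be programming against, here is a minimal sketch; the
KVDevice/KVTxn names and methods are hypothetical, not any vendor's actual
API, and the in-memory backend exists only so the sketch runs.

// Hypothetical interface for a transactional key/value SSD.  The names are
// illustrative only; a real device SDK would define its own API.
#include <map>
#include <string>
#include <utility>
#include <vector>
#include <iostream>

struct KVTxn {
  // Batched mutations; the device promises all-or-nothing visibility.
  std::vector<std::pair<std::string, std::string>> puts;
  std::vector<std::string> deletes;
  void put(const std::string& k, const std::string& v) { puts.emplace_back(k, v); }
  void del(const std::string& k) { deletes.push_back(k); }
};

class KVDevice {
 public:
  virtual ~KVDevice() = default;
  virtual bool get(const std::string& k, std::string* v) = 0;
  // Assumed semantics: submit() is atomic and durable when it returns, so the
  // device's own write path stands in for the fsync we do today.
  virtual int submit(const KVTxn& txn) = 0;
};

// In-memory stand-in so the sketch compiles and runs without real hardware.
class FakeKVDevice : public KVDevice {
  std::map<std::string, std::string> store_;
 public:
  bool get(const std::string& k, std::string* v) override {
    auto it = store_.find(k);
    if (it == store_.end()) return false;
    *v = it->second;
    return true;
  }
  int submit(const KVTxn& txn) override {
    for (const auto& p : txn.puts) store_[p.first] = p.second;
    for (const auto& k : txn.deletes) store_.erase(k);
    return 0;
  }
};

int main() {
  FakeKVDevice dev;
  // One object write becomes a single atomic device transaction:
  // data chunk + object metadata + pg log entry commit together.
  KVTxn txn;
  txn.put("data/pg1/obj1/0", std::string(4096, 'x'));
  txn.put("meta/pg1/obj1", "size=4096");
  txn.put("pglog/pg1/000001", "modify obj1");
  dev.submit(txn);

  std::string meta;
  if (dev.get("meta/pg1/obj1", &meta)) std::cout << meta << "\n";
  return 0;
}

The interesting property is the last transaction: the data, the onode update,
and the pg log entry commit as one unit, with no separate fsync step.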
> 
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
> ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, October 19, 2015 1:55 PM
> To: Somnath Roy
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: newstore direction
> 
> On Mon, 19 Oct 2015, Somnath Roy wrote:
> > Sage,
> > I fully support that.  If we want to saturate SSDs, we need to get
> > rid of this filesystem overhead (which I am in the process of measuring).
> > Also, it would be good if we could eliminate the dependency on the k/v
> > dbs (for storing allocators and so on). The reason is the unknown write
> > amplification they cause.
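For a sense of where the write-amplification worry comes from, a
back-of-the-envelope model for leveled compaction (a rough estimate of my own,
not a measurement of any particular build):

// Rough estimate of write amplification for an LSM store with leveled
// compaction: every byte is written once to the WAL, once to L0, and then
// rewritten roughly (fanout / 2) times per level it passes through.
// Ballpark model only, not a measurement.
#include <cstdio>

int main() {
  const double fanout = 10.0;  // size ratio between adjacent levels
  const double levels = 4.0;   // populated levels below L0
  const double wa = 1 /*WAL*/ + 1 /*L0 flush*/ + (fanout / 2.0) * levels;
  std::printf("estimated write amplification: ~%.0fx\n", wa);  // ~22x here
  return 0;
}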
> 
> My hope is to stay behind the KeyValueDB interface (and/or change it as
> appropriate) so that other backends can be easily swapped in (e.g. a
> btree-based one for high-end flash).
> 
> sage
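To make the swap-a-backend idea concrete, a much-simplified sketch in the
spirit of that interface (this is not the actual KeyValueDB.h, and the
SimpleKVDB/MemDB/make_kvdb names are invented): the objectstore codes against
the abstract class, and the concrete backend is chosen by a factory at
mkfs/mount time.

// Simplified sketch of a pluggable KV interface; not the real KeyValueDB.h.
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include <iostream>

class SimpleKVDB {
 public:
  virtual ~SimpleKVDB() = default;
  virtual int get(const std::string& key, std::string* value) = 0;
  // A transaction is just an ordered batch applied atomically by submit().
  virtual int submit(const std::vector<std::pair<std::string, std::string>>& batch) = 0;
};

// One possible backend: an in-memory map, standing in for a rocksdb- or
// btree-backed implementation.
class MemDB : public SimpleKVDB {
  std::map<std::string, std::string> m_;
 public:
  int get(const std::string& key, std::string* value) override {
    auto it = m_.find(key);
    if (it == m_.end()) return -1;
    *value = it->second;
    return 0;
  }
  int submit(const std::vector<std::pair<std::string, std::string>>& batch) override {
    for (const auto& kv : batch) m_[kv.first] = kv.second;
    return 0;
  }
};

// Hypothetical factory: the backend name would come from a config option.
std::unique_ptr<SimpleKVDB> make_kvdb(const std::string& backend) {
  // "rocksdb", "leveldb", "btree", ... would map to real implementations here;
  // this sketch only has the in-memory one.
  (void)backend;
  return std::make_unique<MemDB>();
}

int main() {
  auto db = make_kvdb("rocksdb");
  db->submit({{"O_obj1", "onode..."}, {"C_pg1", "collection..."}});
  std::string v;
  if (db->get("O_obj1", &v) == 0) std::cout << v << "\n";
  return 0;
}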
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Monday, October 19, 2015 12:49 PM
> > To: ceph-devel@vger.kernel.org
> > Subject: newstore direction
> >
> > The current design is based on two simple ideas:
> >
> >  1) a key/value interface is a better way to manage all of our internal
> > metadata (object metadata, attrs, layout, collection membership,
> > write-ahead logging, overlay data, etc.)
> >
> >  2) a file system is well suited for storing object data (as files).
> >
> > So far 1 is working out well, but I'm questioning the wisdom of #2.  A
> > few
> > things:
> >
> >  - We currently write the data to the file, fsync, then commit the kv
> > transaction.  That's at least 3 IOs: one for the data, one for the fs
> > journal, one for the kv txn to commit (at least once my rocksdb
> > changes land... the kv commit is currently 2-3).  So two people are
> > managing metadata, here: the fs managing the file metadata (with its
> > own
> > journal) and the kv backend (with its journal).
> >
> >  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but at a
> minimum it is a couple btree lookups.  We'd love to use open by handle
> (which would reduce this to 1 btree traversal), but running the daemon as
> ceph and not root makes that hard...
> >
> >  - ...and file systems insist on updating mtime on writes, even when it is an
> overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
> >
> >  - XFS is (probably) never going to give us data checksums, which we
> want desperately.
> >
> > But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep it
> pretty simple, and manage it in kv store along with all of our other metadata.
> >
> > Wins:
> >
> >  - 2 IOs for most: one to write the data to unused space in the block 
> > device,
> one to commit our transaction (vs 4+ before).  For overwrites, we'd have

Re: newstore direction

2015-10-19 Thread Haomai Wang
On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil  wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is a better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storing object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
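For reference, a stripped-down sketch of the write path described here, with
the kv commit reduced to a placeholder (the real NewStore code is of course
more involved, and the path/key names are made up):

// Minimal sketch of the current "write data, fsync, then commit kv txn"
// sequence.  kv_commit() is a placeholder for the KeyValueDB transaction
// submit, which adds at least one more IO in the kv backend's own WAL.
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>
#include <string>

static int kv_commit(const std::string& key, const std::string& value) {
  // Placeholder: the real code batches the onode update into a kv transaction.
  (void)key; (void)value;
  return 0;
}

int write_object(const char* path, const char* buf, size_t len) {
  int fd = ::open(path, O_WRONLY | O_CREAT, 0644);
  if (fd < 0) return -errno;
  if (::pwrite(fd, buf, len, 0) < 0) {        // IO #1: the data itself
    int err = errno; ::close(fd); return -err;
  }
  if (::fsync(fd) < 0) {                      // IO #2: fs journal commit
    int err = errno; ::close(fd); return -err;
  }
  ::close(fd);
  // IO #3+: commit object metadata in the kv store.
  return kv_commit(std::string("O_") + path, "size=" + std::to_string(len));
}

int main() {
  const char data[4096] = {};
  return write_object("/tmp/newstore-sketch-obj", data, sizeof(data));
}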
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
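The open-by-handle path alluded to here is, as I understand it, the
name_to_handle_at()/open_by_handle_at() pair; a sketch follows.  Note the
catch mentioned above: open_by_handle_at() wants CAP_DAC_READ_SEARCH, which a
daemon running as ceph rather than root does not normally have.

// Sketch: resolve a name to a handle once (and store the handle in the kv
// metadata), then reopen by handle later, skipping the namespace lookup.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const char* path = "/tmp/some-object-file";  // illustrative path only
  struct file_handle* fh =
      (struct file_handle*)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
  fh->handle_bytes = MAX_HANDLE_SZ;
  int mount_id = 0;

  // One-time namespace traversal: name -> opaque handle.
  if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
    perror("name_to_handle_at");
    free(fh);
    return 1;
  }

  // Later reads: reopen by handle, no path lookup.  mount_fd is any open fd
  // on the same filesystem.
  int mount_fd = open("/tmp", O_RDONLY | O_DIRECTORY);
  int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
  if (fd < 0)
    perror("open_by_handle_at");  // EPERM without CAP_DAC_READ_SEARCH
  else
    close(fd);
  close(mount_fd);
  free(fh);
  return 0;
}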
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> an overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.

This is really a tough decision, although the idea of a block-device-based
objectstore has never left my mind these past two years.

My main concerns would be the efficiency of space utilization compared to a
local fs, the potential for bugs, and the time it takes to build even a tiny
local filesystem. I'm a little afraid we would get stuck there.

>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

Compared to FileJournal, the key/value DB doesn't seem to play well in the
WAL area, based on my perf results.
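Spelling out my reading of the proposed path (simplified; block_write() and
kv_submit() are invented stand-ins for the real block and KeyValueDB writes):
writes into freshly allocated space go data-first and then commit metadata,
while overwrites of live data are logged in the kv WAL and applied
asynchronously.

// Sketch of the proposed raw-block write path: two IOs for writes into free
// space, and a kv WAL entry (applied async) for in-place overwrites.
#include <cstdint>
#include <cstdio>
#include <string>

struct Extent { uint64_t offset, length; };

static void block_write(const Extent& e, const std::string& data) {
  std::printf("IO: write %zu bytes at block offset %llu\n",
              data.size(), (unsigned long long)e.offset);
}
static void kv_submit(const std::string& what) {
  std::printf("IO: kv commit (%s)\n", what.c_str());
}

void write_op(bool newly_allocated, const Extent& e, const std::string& data) {
  if (newly_allocated) {
    // Case 1: data lands in unused space, so ordering is enough --
    // write the data, then commit the metadata pointing at it.  2 IOs.
    block_write(e, data);
    kv_submit("onode update + allocation");
  } else {
    // Case 2: overwrite of live data.  Log the new bytes in the kv WAL
    // (1 synchronous IO), ack the client, and apply the overwrite async.
    kv_submit("wal entry with overwrite payload");
    // ... later, from a background thread:
    block_write(e, data);
    kv_submit("clear wal entry");
  }
}

int main() {
  write_op(true,  {4096, 4096}, std::string(4096, 'a'));
  write_op(false, {4096, 4096}, std::string(4096, 'b'));
  return 0;
}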

>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonably simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonably
> sized).  For disk we may need to be moderately clever.
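On the hope that the allocator can stay simple: a toy first-fit free-extent
allocator really is small.  A sketch (no persistence, alignment, or hints,
all of which a real one would need):

// Toy first-fit extent allocator over a raw block device's address space.
// A real allocator would persist its state in the kv store and care about
// alignment, hints, and fragmentation policy; this just shows the core idea.
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>

class ExtentAllocator {
  // free extents keyed by offset -> length
  std::map<uint64_t, uint64_t> free_;
 public:
  explicit ExtentAllocator(uint64_t device_size) { free_[0] = device_size; }

  // Returns true and sets *offset on success (first fit).
  bool allocate(uint64_t length, uint64_t* offset) {
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second < length) continue;
      *offset = it->first;
      uint64_t remaining = it->second - length;
      uint64_t new_off = it->first + length;
      free_.erase(it);
      if (remaining) free_[new_off] = remaining;
      return true;
    }
    return false;  // no free extent big enough
  }

  // Return an extent to the free map, merging with contiguous neighbours.
  void release(uint64_t offset, uint64_t length) {
    auto next = free_.lower_bound(offset);
    if (next != free_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == offset) {
        offset = prev->first;
        length += prev->second;
        free_.erase(prev);
      }
    }
    if (next != free_.end() && offset + length == next->first) {
      length += next->second;
      free_.erase(next);
    }
    free_[offset] = length;
  }
};

int main() {
  ExtentAllocator alloc(1ull << 30);  // 1 GiB device
  uint64_t a = 0, b = 0;
  alloc.allocate(4096, &a);
  alloc.allocate(8192, &b);
  alloc.release(a, 4096);
  std::printf("allocated b at %llu\n", (unsigned long long)b);
  return 0;
}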
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
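And the fsck pass could indeed stay kv-only: walk the object-to-extent
records and check that no block is claimed twice and that nothing in use is
also marked free.  A sketch, under an invented record layout:

// Sketch of a kv-only fsck: walk object extent records and verify that no
// block is claimed twice and that used extents don't appear in the free map.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Extent { uint64_t offset, length; };

bool fsck(const std::map<std::string, std::vector<Extent>>& objects,
          const std::map<uint64_t, uint64_t>& free_map) {
  bool ok = true;
  std::map<uint64_t, uint64_t> used;  // offset -> length of claimed extents

  auto overlaps = [](uint64_t o1, uint64_t l1, uint64_t o2, uint64_t l2) {
    return o1 < o2 + l2 && o2 < o1 + l1;
  };

  for (const auto& obj : objects) {
    for (const Extent& e : obj.second) {
      for (const auto& u : used) {
        if (overlaps(e.offset, e.length, u.first, u.second)) {
          std::printf("fsck: %s overlaps extent at %llu\n",
                      obj.first.c_str(), (unsigned long long)u.first);
          ok = false;
        }
      }
      for (const auto& f : free_map) {
        if (overlaps(e.offset, e.length, f.first, f.second)) {
          std::printf("fsck: %s uses space marked free at %llu\n",
                      obj.first.c_str(), (unsigned long long)f.first);
          ok = false;
        }
      }
      used[e.offset] = e.length;
    }
  }
  return ok;
}

int main() {
  std::map<std::string, std::vector<Extent>> objects = {
      {"obj1", {{0, 4096}}}, {"obj2", {{4096, 4096}}}};
  std::map<uint64_t, uint64_t> free_map = {{8192, 1ull << 20}};
  std::printf("fsck %s\n", fsck(objects, free_map) ? "clean" : "found errors");
  return 0;
}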
>
> Other thoughts:
>
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>  - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)

A complex way...
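For reference, the RocksDB knob being referred to is, I believe, db_paths,
which lets colder levels spill to a second path once the first hits its
target size; roughly (paths and sizes made up):

// Rough illustration of keeping RocksDB's hotter data on an SSD path and
// letting colder levels spill to an HDD path via db_paths.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // WAL and most metadata stay on the fast device.
  opts.wal_dir = "/ssd/newstore/wal";
  // Levels fill the SSD path up to ~16 GiB, then spill to the HDD path.
  opts.db_paths.emplace_back("/ssd/newstore/db", 16ull << 30);
  opts.db_paths.emplace_back("/hdd/newstore/db.slow", 1ull << 40);

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/ssd/newstore/db", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}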

Actually, I would like to pursue a FileStore2 implementation, which means we
still use FileJournal (or something like it). But we would need to use more
memory to keep metadata/xattrs cached and use AIO+DIO to flush to disk, and a
userspace page cache would need to be implemented. Then we could skip the
journal on full-object writes; since the OSD isolates work per PG, we could
place a barrier on a single PG when skipping the journal (see the sketch
below). @Sage, are there other concerns with FileStore skipping the journal?
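A minimal sketch of the per-PG barrier idea (types and names are invented,
and the real OSD locking is ignored): a full-object write may bypass the
journal only after the journaled ops already in flight on that PG have
drained.

// Sketch of a per-PG barrier: full-object writes skip the journal, but only
// after all journaled ops already in flight on that PG have completed, so a
// crash can't leave the PG with the unjournaled write visible ahead of them.
#include <condition_variable>
#include <cstdio>
#include <mutex>

class PGSequencer {
  std::mutex m_;
  std::condition_variable cv_;
  int journaled_in_flight_ = 0;
 public:
  void start_journaled_op() {
    std::lock_guard<std::mutex> l(m_);
    ++journaled_in_flight_;
  }
  void finish_journaled_op() {
    std::lock_guard<std::mutex> l(m_);
    if (--journaled_in_flight_ == 0) cv_.notify_all();
  }
  // Barrier: wait until every journaled op on this PG has committed.
  void barrier() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return journaled_in_flight_ == 0; });
  }
};

void submit_write(PGSequencer& pg, bool full_object_write) {
  if (full_object_write) {
    pg.barrier();                // drain journaled ops for this PG
    std::printf("aio+dio write direct to object, no journal entry\n");
  } else {
    pg.start_journaled_op();
    std::printf("partial write goes through FileJournal as before\n");
    pg.finish_journaled_op();    // would be called from journal completion
  }
}

int main() {
  PGSequencer pg;
  submit_write(pg, false);
  submit_write(pg, true);
  return 0;
}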

In short, I like the model FileStore uses, but the existing implementation
needs a big refactor.

Sorry to interrupt the train of thought

>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--