Re: newstore direction
On Fri, Oct 23, 2015 at 7:59 AM, Howard Chu wrote:
> If the stream of writes is large enough, you could omit fsync because
> everything is being forced out of the cache to disk anyway. In that
> scenario, the only thing that matters is that the writes get forced out in
> the order you intended, so that an interruption or crash leaves you in a
> known (or knowable) state vs unknown.

The RADOS storage semantics actually require that we know it's durable on
disk as well, unfortunately. But ordered writes would probably let us batch
up commit points in ways that are a lot friendlier for the drives!
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: newstore direction
Ric Wheeler wrote:
> On 10/23/2015 07:06 AM, Ric Wheeler wrote:
>> On 10/23/2015 02:21 AM, Howard Chu wrote:
>>>> Normally, best practice is to use batching to avoid paying worst case latency
>>>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>>>> then go back and fsync and you will pay that latency once (not per file/op).
>>> If filesystems would support ordered writes you wouldn't need to fsync at
>>> all. Just spit out a stream of writes and declare that batch N must be
>>> written before batch N+1. (Note that this is not identical to "write
>>> barriers", which imposed the same latencies as fsync by blocking all I/Os
>>> at a barrier boundary. Ordered writes may be freely interleaved with
>>> un-ordered writes, so normal I/O traffic can proceed unhindered. Their
>>> ordering is only enforced wrt other ordered writes.)
> One other note: the file & storage kernel people discussed using ordering
> years ago. One of the issues is that the devices themselves need to support
> it. While S-ATA devices are portrayed as SCSI in the kernel, ATA does not
> (and still does not, as far as I know?) support ordered tags.

Yes, that's a bigger problem. ATA NCQ/TCQ aren't up to the job.

>>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
>>> nothing above that layer makes use of it.
>>
>> I think that if the stream on either side of the barrier is large enough,
>> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
>> should have the same performance.
>> Not clear to me if we could do away with an fsync to trigger a cache flush
>> here either - do SCSI ordered tags require that the writes be acknowledged
>> only when durable, or can the device ack them once the target has them
>> (including in a volatile write cache)?

fsync() is too blunt a tool; its use gives you both C and D of ACID
(Consistency and Durability). Ordered tags give you Consistency; there are
lots of applications that can live without perfect Durability but losing
Consistency is a major headache.

If the stream of writes is large enough, you could omit fsync because
everything is being forced out of the cache to disk anyway. In that
scenario, the only thing that matters is that the writes get forced out in
the order you intended, so that an interruption or crash leaves you in a
known (or knowable) state vs unknown.

--
-- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
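The batch-then-fsync pattern Ric describes can be sketched at the syscall level. This is a hedged Python illustration of the sequence, not Ceph code; the directory layout, file names, and sizes are all invented for the example:

```python
# Sketch of "write a batch without fsync, then pay the sync latency once
# per batch rather than once per op".
import os
import tempfile

def write_batch(dirpath, items):
    """items: list of (name, bytes). Returns the paths made durable."""
    paths = []
    # Phase 1: issue all writes. The page cache absorbs them; no sync
    # latency is paid yet.
    for name, data in items:
        p = os.path.join(dirpath, name)
        fd = os.open(p, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(fd, data)
        os.close(fd)
        paths.append(p)
    # Phase 2: one durability pass over the whole batch. Each fsync still
    # costs a flush, but the deep queue lets the drive coalesce the work.
    for p in paths:
        fd = os.open(p, os.O_WRONLY)
        os.fsync(fd)
        os.close(fd)
    # fsync the directory too, so the new names themselves are durable.
    dfd = os.open(dirpath, os.O_RDONLY)
    os.fsync(dfd)
    os.close(dfd)
    return paths

tmp = tempfile.mkdtemp()
paths = write_batch(tmp, [(f"obj{i}", b"x" * 128) for i in range(4)])
print(len(paths))
```

Ordered writes, as Howard notes, would remove phase 2 entirely for callers that only need consistency, not durability.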
Re: newstore direction
On Thu, Oct 22, 2015 at 11:16 PM, Howard Chu wrote:
> Milosz Tanski adfin.com> writes:
>>
>> On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil redhat.com> wrote:
>> > On Tue, 20 Oct 2015, John Spray wrote:
>> >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil redhat.com> wrote:
>> >> > - We have to size the kv backend storage (probably still an XFS
>> >> > partition) vs the block storage. Maybe we do this anyway (put metadata on
>> >> > SSD!) so it won't matter. But what happens when we are storing gobs of
>> >> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of
>> >> > a different pool and those aren't currently fungible.
>> >>
>> >> This is the concerning bit for me -- the other parts one "just" has to
>> >> get the code right, but this problem could linger and be something we
>> >> have to keep explaining to users indefinitely. It reminds me of cases
>> >> in other systems where users had to make an educated guess about inode
>> >> size up front, depending on whether you're expecting to efficiently
>> >> store a lot of xattrs.
>> >>
>> >> In practice it's rare for users to make these kinds of decisions well
>> >> up-front: it really needs to be adjustable later, ideally
>> >> automatically. That could be pretty straightforward if the KV part
>> >> was stored directly on block storage, instead of having XFS in the
>> >> mix. I'm not quite up with the state of the art in this area: are
>> >> there any reasonable alternatives for the KV part that would consume
>> >> some defined range of a block device from userspace, instead of
>> >> sitting on top of a filesystem?
>> >
>> > I agree: this is my primary concern with the raw block approach.
>> >
>> > There are some KV alternatives that could consume block, but the problem
>> > would be similar: we need to dynamically size up or down the kv portion of
>> > the device.
>> >
>> > I see two basic options:
>> >
>> > 1) Wire into the Env abstraction in rocksdb to provide something just
>> > smart enough to let rocksdb work. It isn't much: named files (not that
>> > many--we could easily keep the file table in ram), always written
>> > sequentially, to be read later with random access. All of the code is
>> > written around abstractions of SequentialFileWriter so that everything
>> > posix is neatly hidden in env_posix (and there are various other env
>> > implementations for in-memory mock tests etc.).
>> >
>> > 2) Use something like dm-thin to sit between the raw block device and XFS
>> > (for rocksdb) and the block device consumed by newstore. As long as XFS
>> > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb
>> > files in their entirety) we can fstrim and size down the fs portion. If
>> > we similarly make newstore's allocator stick to large blocks only we would
>> > be able to size down the block portion as well. Typical dm-thin block
>> > sizes seem to range from 64KB to 512KB, which seems reasonable enough to
>> > me. In fact, we could likely just size the fs volume at something
>> > conservatively large (like 90%) and rely on -o discard or periodic fstrim
>> > to keep its actual utilization in check.
>>
>> I think you could prototype a raw block device OSD store using LMDB as
>> a starting point. I know there's been some experiments using LMDB as
>> KV store before with positive read numbers and not great write
>> numbers.
>>
>> 1. It mmaps; just mmap the raw disk device / partition. I've done this
>> as an experiment before, I can dig up a patch for LMDB.
>> 2. It already has a free space management strategy. It's probably not
>> right for the OSDs in the long term but there's something to start
>> with there.
>> 3. It already supports transactions / COW.
>> 4. LMDB isn't a huge code base so it might be a good place to start /
>> evolve code from.
>> 5. You're not starting a multi-year effort at the 0 point.
>>
>> As to the not great write performance, that could be addressed by
>> write transaction merging (what mysql implemented a few years ago).
>
> We have a heavily hacked version of LMDB contributed by VMware that
> implements a WAL. In my preliminary testing it performs synchronous writes
> 30x faster (on average) than current LMDB. Their version unfortunately
> slashed'n'burned a lot of LMDB features that other folks actually need, so
> we can't use it as-is. Currently working on rationalizing the approach and
> merging it into mdb.master.
>
> The reasons for the WAL approach:
> 1) obviously sequential writes are cheaper than random writes.
> 2) fsync() of a small log file will always be faster than fsync() of a
> large DB. I.e., fsync() latency is proportional to the total number of pages
> in the file, not just the number of dirty pages.

This is a bit off topic (from newstore); more to Howard about LMDB internals
and write serialization.

Howard, there is a way to make progress on pending transactions without a
WAL. LMDB is already COW so hypothetically further
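Milosz's point 1 is plausible because LMDB accesses its store entirely through mmap, so pointing it at a raw partition is mostly a question of where the fd comes from. A minimal Python sketch of the idea, using a regular file as a stand-in for the block device (mapping a real /dev/sdX needs root, but the mmap/msync calls are the same):

```python
import mmap
import os
import tempfile

PAGE = 4096
NPAGES = 16

# Stand-in for a raw partition; with privileges you would os.open("/dev/sdX").
fd, path = tempfile.mkstemp()
os.ftruncate(fd, NPAGES * PAGE)

# Map the whole "device" read-write, LMDB-style.
m = mmap.mmap(fd, NPAGES * PAGE)

# A page-granular update: write a new version of page 3 through the map...
m[3 * PAGE:3 * PAGE + 11] = b"hello world"
# ...then msync just that page at commit time (what LMDB does when
# MDB_NOSYNC is not set). Offset must be page-aligned.
m.flush(3 * PAGE, PAGE)

# A fresh read-only mapping sees the committed page.
m2 = mmap.mmap(fd, NPAGES * PAGE, prot=mmap.PROT_READ)
committed = m2[3 * PAGE:3 * PAGE + 11]
print(committed)
```

The free-space management and COW page chaining that make this a database rather than a flat map are exactly the parts Milosz suggests borrowing from LMDB instead of rewriting.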
Re: newstore direction
On 10/23/2015 02:21 AM, Howard Chu wrote:
>> Normally, best practice is to use batching to avoid paying worst case latency
>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>> then go back and fsync and you will pay that latency once (not per file/op).
> If filesystems would support ordered writes you wouldn't need to fsync at
> all. Just spit out a stream of writes and declare that batch N must be
> written before batch N+1. (Note that this is not identical to "write
> barriers", which imposed the same latencies as fsync by blocking all I/Os
> at a barrier boundary. Ordered writes may be freely interleaved with
> un-ordered writes, so normal I/O traffic can proceed unhindered. Their
> ordering is only enforced wrt other ordered writes.)
> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
> nothing above that layer makes use of it.

I think that if the stream on either side of the barrier is large enough,
using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
should have the same performance.

Not clear to me if we could do away with an fsync to trigger a cache flush
here either - do SCSI ordered tags require that the writes be acknowledged
only when durable, or can the device ack them once the target has them
(including in a volatile write cache)?

Ric
Re: newstore direction
On 10/23/2015 07:06 AM, Ric Wheeler wrote:
> On 10/23/2015 02:21 AM, Howard Chu wrote:
>>> Normally, best practice is to use batching to avoid paying worst case latency
>>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>>> then go back and fsync and you will pay that latency once (not per file/op).
>> If filesystems would support ordered writes you wouldn't need to fsync at
>> all. Just spit out a stream of writes and declare that batch N must be
>> written before batch N+1. (Note that this is not identical to "write
>> barriers", which imposed the same latencies as fsync by blocking all I/Os
>> at a barrier boundary. Ordered writes may be freely interleaved with
>> un-ordered writes, so normal I/O traffic can proceed unhindered. Their
>> ordering is only enforced wrt other ordered writes.)
>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
>> nothing above that layer makes use of it.
> I think that if the stream on either side of the barrier is large enough,
> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
> should have the same performance.
> Not clear to me if we could do away with an fsync to trigger a cache flush
> here either - do SCSI ordered tags require that the writes be acknowledged
> only when durable, or can the device ack them once the target has them
> (including in a volatile write cache)?
> Ric

One other note: the file & storage kernel people discussed using ordering
years ago. One of the issues is that the devices themselves need to support
it. While S-ATA devices are portrayed as SCSI in the kernel, ATA does not
(and still does not, as far as I know?) support ordered tags.

Regards,

Ric
Re: newstore direction
On 10/23/2015 10:59 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 10/23/2015 07:06 AM, Ric Wheeler wrote:
>>> On 10/23/2015 02:21 AM, Howard Chu wrote:
>>>>> Normally, best practice is to use batching to avoid paying worst case latency
>>>>> when you do a synchronous IO. Write a batch of files or appends without fsync,
>>>>> then go back and fsync and you will pay that latency once (not per file/op).
>>>> If filesystems would support ordered writes you wouldn't need to fsync at
>>>> all. Just spit out a stream of writes and declare that batch N must be
>>>> written before batch N+1. (Note that this is not identical to "write
>>>> barriers", which imposed the same latencies as fsync by blocking all I/Os
>>>> at a barrier boundary. Ordered writes may be freely interleaved with
>>>> un-ordered writes, so normal I/O traffic can proceed unhindered. Their
>>>> ordering is only enforced wrt other ordered writes.)
>> One other note: the file & storage kernel people discussed using ordering
>> years ago. One of the issues is that the devices themselves need to support
>> it. While S-ATA devices are portrayed as SCSI in the kernel, ATA does not
>> (and still does not, as far as I know?) support ordered tags.
>
> Yes, that's a bigger problem. ATA NCQ/TCQ aren't up to the job.
>
>>>> A bit of a shame that Linux's SCSI drivers support Ordering attributes but
>>>> nothing above that layer makes use of it.
>>>
>>> I think that if the stream on either side of the barrier is large enough,
>>> using ordered tags (SCSI speak) versus doing stream1, fsync(), stream2,
>>> should have the same performance.
>>> Not clear to me if we could do away with an fsync to trigger a cache flush
>>> here either - do SCSI ordered tags require that the writes be acknowledged
>>> only when durable, or can the device ack them once the target has them
>>> (including in a volatile write cache)?
>
> fsync() is too blunt a tool; its use gives you both C and D of ACID
> (Consistency and Durability). Ordered tags give you Consistency; there are
> lots of applications that can live without perfect Durability but losing
> Consistency is a major headache.
>
> If the stream of writes is large enough, you could omit fsync because
> everything is being forced out of the cache to disk anyway. In that
> scenario, the only thing that matters is that the writes get forced out in
> the order you intended, so that an interruption or crash leaves you in a
> known (or knowable) state vs unknown.

I do agree that fsync is quite a blunt tool, but you cannot assume that a
stream of writes will flush the cache - that is extremely firmware
dependent. It is pretty common to leave small IO's in cache and let larger
IO's stream directly to the backing device (platter, etc) - those small
objects can stay live and non-durable for days under some heavy workloads :)

ric
Re: newstore direction
Ric Wheeler redhat.com> writes:
>
> On 10/21/2015 09:32 AM, Sage Weil wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >>> Now:
> >>>  1 io to write a new file
> >>>  1-2 ios to sync the fs journal (commit the inode, alloc change)
> >>>      (I see 2 journal IOs on XFS and only 1 on ext4...)
> >>>  1 io to commit the rocksdb journal (currently 3, but will drop to
> >>>      1 with xfs fix and my rocksdb change)
> >> I think that might be too pessimistic - the number of discrete IO's sent down
> >> to a spinning disk makes much less impact on performance than the number of
> >> fsync()'s, since the IO's all land in the write cache. Some newer spinning
> >> drives have a non-volatile write cache, so even an fsync() might not end up
> >> doing the expensive data transfer to the platter.
> > True, but in XFS's case at least the file data and journal are not
> > colocated, so it's 2 seeks for the new file write+fdatasync and another for
> > the rocksdb journal commit. Of course, with a deep queue, we're doing
> > lots of these so there'd be fewer journal commits on both counts, but the
> > lower bound on latency of a single write is still 3 seeks, and that bound
> > is pretty critical when you also have network round trips and replication
> > (worst out of 2) on top.
>
> What are the performance goals we are looking for?
>
> Small, synchronous writes/second?
>
> File creates/second?
>
> I suspect that looking at things like seeks/write is probably looking at the
> wrong level of performance challenges. Again, when you write to a modern drive,
> you write to its write cache and it decides internally when/how to destage to
> the platter.
>
> If you look at the performance of XFS with streaming workloads, it will tend to
> max out the bandwidth of the underlying storage.
>
> If we need IOP's/file writes, etc, we should be clear on what we are aiming at.
>
> >> It would be interesting to get the timings on the IO's you see to measure the
> >> actual impact.
> > I observed this with the journaling workload for rocksdb, but I assume the
> > journaling behavior is the same regardless of what is being journaled.
> > For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
> > blktrace showed an IO to the file, and 2 IOs to the journal. I believe
> > the first one is the record for the inode update, and the second is the
> > journal 'commit' record (though I forget how I decided that). My guess is
> > that XFS is being extremely careful about journal integrity here and not
> > writing the commit record until it knows that the preceding records landed
> > on stable storage. For ext4, the latency was about ~20ms, and blktrace
> > showed the IO to the file and then a single journal IO. When I made the
> > rocksdb change to overwrite an existing, prewritten file, the latency
> > dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
> > (XFS still showed the 2 journal commit IOs, but Dave just posted the fix
> > for that on the XFS list today.)
> Normally, best practice is to use batching to avoid paying worst case latency
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).

If filesystems would support ordered writes you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os
at a barrier boundary. Ordered writes may be freely interleaved with
un-ordered writes, so normal I/O traffic can proceed unhindered. Their
ordering is only enforced wrt other ordered writes.)

A bit of a shame that Linux's SCSI drivers support Ordering attributes but
nothing above that layer makes use of it.

--
-- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
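Sage's measurement is easy to reproduce in miniature. Below is a hedged Python sketch of the two cases he compares - a 4KB append + fdatasync versus an overwrite of a prewritten file + fdatasync (the file names and iteration count are arbitrary, and the absolute numbers depend entirely on the device and filesystem, so no ratio is asserted):

```python
import os
import tempfile
import time

PAGE = 4096

def timed_sync_writes(fd, n, size=PAGE):
    """Average seconds per write+fdatasync pair."""
    buf = b"\0" * size
    t0 = time.perf_counter()
    for _ in range(n):
        os.write(fd, buf)
        os.fdatasync(fd)     # the call Sage's blktrace experiment used
    return (time.perf_counter() - t0) / n

tmp = tempfile.mkdtemp()

# Case 1: appends grow the file, so every fdatasync must also push the
# metadata change (size/allocation) through the fs journal.
fd_a = os.open(os.path.join(tmp, "append"), os.O_WRONLY | os.O_CREAT, 0o644)
append_lat = timed_sync_writes(fd_a, 16)
os.close(fd_a)

# Case 2: overwrite of a prewritten file -- the rocksdb change Sage
# describes; no metadata update, so ideally a single data IO per sync.
path_o = os.path.join(tmp, "prewritten")
with open(path_o, "wb") as f:
    f.write(b"\0" * (16 * PAGE))
    f.flush()
    os.fsync(f.fileno())
fd_o = os.open(path_o, os.O_WRONLY)
overwrite_lat = timed_sync_writes(fd_o, 16)
os.close(fd_o)

print(f"append+fdatasync:    {append_lat * 1e3:.3f} ms")
print(f"overwrite+fdatasync: {overwrite_lat * 1e3:.3f} ms")
```

Pairing this with blktrace on a spinning disk is what exposes the extra journal IOs being discussed.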
Re: newstore direction
Gregory Farnum wrote:
> On Fri, Oct 23, 2015 at 7:59 AM, Howard Chu wrote:
>> If the stream of writes is large enough, you could omit fsync because
>> everything is being forced out of the cache to disk anyway. In that
>> scenario, the only thing that matters is that the writes get forced out in
>> the order you intended, so that an interruption or crash leaves you in a
>> known (or knowable) state vs unknown.
>
> The RADOS storage semantics actually require that we know it's durable on
> disk as well, unfortunately. But ordered writes would probably let us batch
> up commit points in ways that are a lot friendlier for the drives!

Ah, that's too bad. LMDB does fine with only ordering, but never mind.

--
-- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
Re: newstore direction
On Thu, 2015-10-22 at 02:12 +, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is in
> the way basic trade-offs in storage management software architecture are
> being affected. In the HDD world CPU time per IOP was relatively
> inconsequential, i.e., it had little effect on overall performance, which
> was limited by the physics of the hard drive. Flash is now inverting that
> situation. When you look at the performance levels being delivered in the
> latest generation of NVMe SSDs you rapidly see that the storage itself is
> generally no longer the bottleneck (speaking about BW, not latency of
> course) but rather it's the system sitting in front of the storage that is
> the bottleneck. Generally it's the CPU cost of an IOP.
>
> When Sandisk first started working with Ceph (Dumpling) the design of
> librados and the OSD led to the situation that the CPU cost of an IOP was
> dominated by context switches and network socket handling. Over time, much
> of that has been addressed. The socket handling code has been re-written
> (more than once!) and some of the internal queueing in the OSD (and the
> associated context switches) has been eliminated. As the CPU costs have
> dropped, performance on flash has improved accordingly.
>
> Because we didn't want to completely re-write the OSD (time-to-market and
> stability drove that decision), we didn't move it from the current "thread
> per IOP" model into a truly asynchronous "thread per CPU core" model that
> essentially eliminates context switches in the IO path. But a fully
> optimized OSD would go down that path (at least part-way). I believe it's
> been proposed in the past. Perhaps a hybrid "fast-path" style could get
> most of the benefits while preserving much of the legacy code.

+1
It's not just about reducing context switches but also about removing
contention and data copies and getting better cache utilization.
Scylladb just did this to cassandra (using the seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also
> tend to support the "do it in user-space" trend. That's because most of the
> kernel and file-system interface is architected around the blocking
> "thread-per-IOP" model and is unlikely to change in the future.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samu...@sandisk.com
>
> -----Original Message-----
> From: Martin Millnert [mailto:mar...@millnert.se]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson <mnel...@redhat.com>
> Cc: Ric Wheeler <rwhee...@redhat.com>; Allen Samuels
> <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>;
> ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Adding 2c
>
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us. I don't know if we've
> > truly hit that point, but it's tough for me to find flaws with
> > Sage's argument.
>
> Regarding the userland / kernel land aspect of the topic, there are further
> aspects AFAIK not yet addressed in the thread:
> In the networking world, there's been development on memory mapped
> (multiple approaches exist) userland networking, which for packet
> management has the benefit of - for very, very specific applications of
> networking code - avoiding e.g. per-packet context switches etc, and
> streamlining processor cache management performance. People have gone as
> far as removing CPU cores from the CPU scheduler to completely dedicate
> them to the networking task at hand (cache optimizations). There are
> various latency/throughput (bulking) optimizations applicable, but at the
> end of the day, it's about keeping the CPU bus busy with "revenue" bus
> traffic.
>
> Granted, storage IO operations may be much heavier in cycle counts for
> context switches to ever appear as a problem in themselves, certainly for
> slower SSDs and HDDs. However, when going for truly high performance IO,
> *every* hurdle in the data path counts toward the total latency.
> (And really, high performance random IO characteristics approach the
> networking, per-packet handling characteristics). Now, I'm n
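The "dedicate cores to the task" idea Martin describes is usually paired with isolating the CPUs from the scheduler (e.g. the kernel's isolcpus= boot option); the userspace half is just an affinity call per shard thread. A minimal Linux-only Python sketch (CPU 0 is an arbitrary choice for illustration):

```python
import os

def pin_to_cpu(cpu):
    """Pin the calling process/thread to a single CPU so the scheduler
    never migrates it and its cache lines stay hot."""
    os.sched_setaffinity(0, {cpu})   # pid 0 means "the caller"
    return os.sched_getaffinity(0)   # read back the effective mask

mask = pin_to_cpu(0)                 # CPU 0 always exists
print(mask)
```

A thread-per-core (seastar-style) design would do this once per shard at startup, then keep all of a request's work on that one core.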
Re: newstore direction
On Wed, Oct 21, 2015 at 10:30:28AM -0700, Sage Weil wrote:
> For example: we need to do an overwrite of an existing object that is
> atomic with respect to a larger ceph transaction (we're updating a bunch
> of other metadata at the same time, possibly overwriting or appending to
> multiple files, etc.). XFS and ext4 aren't cow file systems, so plugging
> into the transaction infrastructure isn't really an option (and even after
> several years of trying to do it with btrfs it proved to be impractical).

Not that I'm disagreeing with most of your points, but we can do things
like that with swapext-like hacks. Below is my half-year-old prototype of
an O_ATOMIC implementation for XFS that gives you atomic out-of-place
writes.

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..001dd49 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -740,7 +740,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
@@ -748,6 +748,7 @@ static int __init fcntl_init(void)
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
 		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_ATOMIC	|
 		__FMODE_NONOTIFY
 		));

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..8eafca6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4681,14 +4681,14 @@
 xfs_bmap_del_extent(
 	xfs_btree_cur_t		*cur,		/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,		/* data to remove from extents */
 	int			*logflagsp,	/* inode logging flags */
-	int			whichfork)	/* data or attr fork */
+	int			whichfork,	/* data or attr fork */
+	bool			free_blocks)	/* free extent at end of routine */
 {
 	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
 	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
 	xfs_fsblock_t		del_endblock=0;	/* first block past del */
 	xfs_fileoff_t		del_endoff;	/* first offset past del */
 	int			delay;	/* current block is delayed allocated */
-	int			do_fx;	/* free extent at end of routine */
 	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
 	int			error;	/* error return value */
 	int			flags;	/* inode logging flags */
@@ -4712,8 +4712,8 @@
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
-	ASSERT((*idx >= 0) && (*idx < ifp->if_bytes /
-		(uint)sizeof(xfs_bmbt_rec_t)));
+	ASSERT(*idx >= 0);
+	ASSERT(*idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t));
 	ASSERT(del->br_blockcount > 0);
 	ep = xfs_iext_get_ext(ifp, *idx);
 	xfs_bmbt_get_all(ep, &got);
@@ -4746,10 +4746,13 @@
 			len = del->br_blockcount;
 			do_div(bno, mp->m_sb.sb_rextsize);
 			do_div(len, mp->m_sb.sb_rextsize);
-			error = xfs_rtfree_extent(tp, bno, (xfs_extlen_t)len);
-			if (error)
-				goto done;
-			do_fx = 0;
+			if (free_blocks) {
+				error = xfs_rtfree_extent(tp, bno,
+						(xfs_extlen_t)len);
+				if (error)
+					goto done;
+				free_blocks = 0;
+			}
 			nblks = len * mp->m_sb.sb_rextsize;
 			qfield = XFS_TRANS_DQ_RTBCOUNT;
 		}
@@ -4757,7 +4760,6 @@
 		 * Ordinary allocation.
 		 */
 		else {
-			do_fx = 1;
 			nblks = del->br_blockcount;
 			qfield = XFS_TRANS_DQ_BCOUNT;
 		}
@@ -4777,7 +4779,7 @@
 		da_old = startblockval(got.br_startblock);
 		da_new = 0;
 		nblks = 0;
-		do_fx = 0;
+		free_blocks = 0;
 	}
 	/*
 	 * Set flag value to use in switch statement.
@@ -4963,7 +4965,7 @@
 	/*
 	 * If we
RE: newstore direction
Hi Sage and other fellow cephers, I truly share the pains with you all about filesystem while I am working on objectstore to improve the performance. As mentioned , there is nothing wrong with filesystem. Just the Ceph as one of use case need more supports but not provided in near future by filesystem no matter what reasons. There are so many techniques pop out which can help to improve performance of OSD. User space driver(DPDK from Intel) is one of them. It not only gives you the storage allocator, also gives you the thread scheduling support, CPU affinity , NUMA friendly, polling which might fundamentally change the performance of objectstore. It should not be hard to improve CPU utilization 3x~5x times, higher IOPS etc. I totally agreed that goal of filestore is to gives enough support for filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new design goal of objectstore should focus on giving the best performance for OSD with new techniques. These two goals are not going to conflict with each other. They are just for different purposes to make Ceph not only more stable but also better. Scylla mentioned by Orit is a good example . Thanks all. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Thursday, October 22, 2015 5:50 AM To: Ric Wheeler Cc: Orit Wasserman; ceph-devel@vger.kernel.org Subject: Re: newstore direction On Wed, 21 Oct 2015, Ric Wheeler wrote: > You will have to trust me on this as the Red Hat person who spoke to > pretty much all of our key customers about local file systems and > storage - customers all have migrated over to using normal file systems under > Oracle/DB2. > Typically, they use XFS or ext4. 
I don't know of any non-standard > file systems and only have seen one account running on a raw block > store in 8 years > :) > > If you have a pre-allocated file and write using O_DIRECT, your IO > path is identical in terms of IO's sent to the device. > > If we are causing additional IO's, then we really need to spend some > time talking to the local file system gurus about this in detail. I > can help with that conversation. If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents is marked unwritten), then sure: there is very little change in the data path. But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup? This is the fundamental tradeoff: 1) We have a file per object. We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us. 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc. 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid). But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code. At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. 
On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive. Also note that every time we have strayed from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs. And that assumes we get everything we need upstream... which is probably a year's endeavour. Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph. Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense for a ton of different systems. But our situation is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend). And as you know performance is a huge pain point. We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value.
Re: newstore direction
Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer. It might be easier to exploit that parallelism if we control allocation and allocation related metadata. We could split the store into N pieces which partition the pg space (one additional one for the meta sequencer?) with one rocksdb instance for each. Space could then be parcelled out in large pieces (small frequency of global allocation decisions) and managed more finely within each partition. The main challenge would be avoiding internal fragmentation of those, but at least defragmentation can be managed on a per-partition basis. Such parallelism is probably necessary to exploit the full throughput of some ssds. -Sam On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI <james@ssi.samsung.com> wrote: > Hi Sage and other fellow cephers, > I truly share the pains with you all about filesystem while I am working > on objectstore to improve the performance. As mentioned , there is nothing > wrong with filesystem. Just the Ceph as one of use case need more supports > but not provided in near future by filesystem no matter what reasons. > >There are so many techniques pop out which can help to improve > performance of OSD. User space driver(DPDK from Intel) is one of them. It > not only gives you the storage allocator, also gives you the thread > scheduling support, CPU affinity , NUMA friendly, polling which might > fundamentally change the performance of objectstore. It should not be hard > to improve CPU utilization 3x~5x times, higher IOPS etc. > I totally agreed that goal of filestore is to gives enough support for > filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new > design goal of objectstore should focus on giving the best performance for > OSD with new techniques. 
These two goals are not going to conflict with each > other. They are just for different purposes to make Ceph not only more > stable but also better. > > Scylla mentioned by Orit is a good example . > > Thanks all. > > Regards, > James > > [...]
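Sam's partitioning idea above can be sketched in a few lines. This is a hypothetical illustration only (the names `Shard`, `shard_for`, and `alloc_extent` are invented here, not Ceph code): pgs hash to one of N partitions, each with its own private KV instance and a coarse bump allocator, so allocation decisions never contend across partitions.

```python
# Hypothetical sketch of the proposal above: partition the pg space across
# N stores, each with a private KV instance and a coarse per-partition
# allocator. Shard/alloc_extent are invented names for illustration only.
import zlib

N_SHARDS = 8
EXTENT = 4 * 1024 * 1024  # space parcelled out in large pieces

class Shard:
    def __init__(self):
        self.kv = {}        # stands in for this partition's rocksdb instance
        self.next_off = 0   # trivial bump allocator, local to the partition

    def alloc_extent(self):
        off = self.next_off
        self.next_off += EXTENT
        return off

shards = [Shard() for _ in range(N_SHARDS)]

def shard_for(pg_id):
    # every transaction for a given sequencer/pg lands in the same shard,
    # so global allocation decisions are rare and contention-free
    return shards[zlib.crc32(pg_id.encode()) % N_SHARDS]

s = shard_for("1.2a")
s.kv["obj1"] = s.alloc_extent()  # first extent handed out in this partition
```

Internal fragmentation then becomes a per-partition problem, which matches the point above that defragmentation can be managed on a per-partition basis.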
Re: newstore direction
Ah, except for the snapmapper. We can split the snapmapper in the same way, though, as long as we are careful with the name. -Sam On Thu, Oct 22, 2015 at 4:42 PM, Samuel Just <sj...@redhat.com> wrote: > Since the changes which moved the pg log and the pg info into the pg > object space, I think it's now the case that any transaction submitted > to the objectstore updates a disjoint range of objects determined by > the sequencer. It might be easier to exploit that parallelism if we > control allocation and allocation related metadata. We could split > the store into N pieces which partition the pg space (one additional > one for the meta sequencer?) with one rocksdb instance for each. > Space could then be parcelled out in large pieces (small frequency of > global allocation decisions) and managed more finely within each > partition. The main challenge would be avoiding internal > fragmentation of those, but at least defragmentation can be managed on > a per-partition basis. Such parallelism is probably necessary to > exploit the full throughput of some ssds. > -Sam > > On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI > <james@ssi.samsung.com> wrote: >> Hi Sage and other fellow cephers, >> I truly share the pains with you all about filesystem while I am working >> on objectstore to improve the performance. As mentioned , there is nothing >> wrong with filesystem. Just the Ceph as one of use case need more supports >> but not provided in near future by filesystem no matter what reasons. >> >>There are so many techniques pop out which can help to improve >> performance of OSD. User space driver(DPDK from Intel) is one of them. It >> not only gives you the storage allocator, also gives you the thread >> scheduling support, CPU affinity , NUMA friendly, polling which might >> fundamentally change the performance of objectstore. It should not be hard >> to improve CPU utilization 3x~5x times, higher IOPS etc. 
>> I totally agreed that goal of filestore is to gives enough support for >> filesystem with either 1 ,1b, or 2 solutions. In my humble opinion , The new >> design goal of objectstore should focus on giving the best performance for >> OSD with new techniques. >> >> Regards, >> James >> >> [...]
RE: newstore direction
How would this kind of split affect small transactions? Will each split be separately transactionally consistent or is there some kind of meta-transaction that synchronizes each of the splits? Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Friday, October 23, 2015 8:42 AM To: James (Fei) Liu-SSI <james@ssi.samsung.com> Cc: Sage Weil <sw...@redhat.com>; Ric Wheeler <rwhee...@redhat.com>; Orit Wasserman <owass...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer. It might be easier to exploit that parallelism if we control allocation and allocation related metadata. We could split the store into N pieces which partition the pg space (one additional one for the meta sequencer?) with one rocksdb instance for each. Space could then be parcelled out in large pieces (small frequency of global allocation decisions) and managed more finely within each partition. The main challenge would be avoiding internal fragmentation of those, but at least defragmentation can be managed on a per-partition basis. Such parallelism is probably necessary to exploit the full throughput of some ssds. -Sam On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI <james@ssi.samsung.com> wrote: > Hi Sage and other fellow cephers, > I truly share the pains with you all about filesystem while I am working > on objectstore to improve the performance. As mentioned , there is nothing > wrong with filesystem. 
Just the Ceph as one of use case need more supports > but not provided in near future by filesystem no matter what reasons. > > [...]
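One possible shape for the "meta-transaction" Allen asks about, sketched under stated assumptions (a small shared intent store; `MetaTxn` and everything in it are invented for illustration, not anything proposed in the thread): persist the cross-split intent first, apply it to each split, then retire it, replaying any survivors on recovery so a multi-split update is all-or-nothing.

```python
# A hedged sketch of one possible "meta-transaction": a durable intent
# record in a small meta store makes a multi-split update all-or-nothing.
# MetaTxn and its stores are invented for illustration, not a real design.
class MetaTxn:
    def __init__(self, shards, meta):
        self.shards = shards  # one dict per split (stand-ins for KV stores)
        self.meta = meta      # shared store for pending intent records

    def commit(self, txn_id, updates):
        # updates maps split index -> {key: value}
        self.meta[txn_id] = updates        # 1) persist intent (one sync)
        for idx, kvs in updates.items():   # 2) apply to each split
            self.shards[idx].update(kvs)
        del self.meta[txn_id]              # 3) retire the intent

    def recover(self):
        # after a crash, re-apply any intents that never retired; applying
        # twice is safe here because the updates are plain overwrites
        for txn_id, updates in list(self.meta.items()):
            for idx, kvs in updates.items():
                self.shards[idx].update(kvs)
            del self.meta[txn_id]

shards = [{}, {}]
meta = {}
mt = MetaTxn(shards, meta)
mt.commit("t1", {0: {"a": 1}, 1: {"b": 2}})

meta["t2"] = {0: {"c": 3}}   # an intent left behind by a simulated crash
mt.recover()
```

The cost is one extra synchronous write per cross-split transaction, which is why keeping most transactions within a single split (as the sequencer argument suggests) matters.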
Re: newstore direction
I disagree with your point still - your argument was that customers don't like to update their code so we cannot rely on them moving to better file system code. Those same customers would be *just* as reluctant to upgrade OSD code. Been there, done that in pure block storage, pure object storage and in file system code (customers just don't care about the protocol, the conservative nature is consistent). Not a casual observation, I have been building storage systems since the mid-80's. Regards, Ric On 10/21/2015 09:22 PM, Allen Samuels wrote: I agree. My only point was that you still have to factor this time into the argument that by continuing to put NewStore on top of a file system you'll get to a stable system much sooner than the longer development path of doing your own raw storage allocator. IMO, once you factor that into the equation the "on top of an FS" path doesn't look like such a clear winner. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Thursday, October 22, 2015 10:17 AM To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 08:53 PM, Allen Samuels wrote: Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Customers do control the pace that they upgrade their machines, but we put out fixes on a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes come out between major and minor releases are easy). 
If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait). ric -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: newstore direction
On 10/22/2015 08:50 AM, Sage Weil wrote: On Wed, 21 Oct 2015, Ric Wheeler wrote: You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers all have migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems and only have seen one account running on a raw block store in 8 years :) If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device. If we are causing additional IO's, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation. If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path. But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup? I think that the key here is that if we fsync() like crazy - regardless of whether we write to a file system or to some new, yet-to-be-defined block device primitive store - we are limited to the IOP's of that particular block device. Ignoring exotic hardware configs and the people who can go all-SSD, we will have rotating, high-capacity, slow spinning drives for *a long time* as the eventual tier. Given that assumption, we need to do better than to be limited to synchronous IOP's for a slow drive. When we have commodity pricing for things like persistent DRAM, then I agree that writing directly to that medium makes sense (but you can do that with DAX by effectively mapping that into the process address space). Specifically, moving from a file system with some inefficiencies will only boost performance from say 20-30 IOP's to roughly 40-50 IOP's. The way this has been handled traditionally for things like databases, etc. is: * batch up the transactions that need to be destaged * issue an O_DIRECT async IO for all of the elements that need to be written (bypassing the page cache, direct to the backing store) * wait for completion We should probably add to that sequence an fsync() of the directory (or a file in the file system) to ensure that any volatile write cache is flushed, but there is *no* reason to fsync() each file. I think that we need to look at why the write pattern is so heavily synchronous and single threaded if we are hoping to extract from any given storage tier its maximum performance. Doing this can raise your file creations per second (or allocations per second) from a few dozen to a few hundred or more per second. The complexity you take on by writing a new block-level allocation strategy is: * if you lay out a lot of small objects on the block store that can grow, we will quickly end up doing very complicated techniques that file systems solved a long time ago (pre-allocation, etc.) * multi-stream aware allocation if you have multiple processes writing to the same store * tracking things like allocated-but-unwritten (can happen if some process "pokes" a hole in an object, common with things like virtual machine images) Once we end up handling all of that in new, untested code, I think that we end up with a lot of pain and only minimal gain in terms of performance. ric This is the fundamental tradeoff: 1) We have a file per object. We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us.
1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc. 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid). But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code. At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive. Also note that every time we have strayed from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs. [...]
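Ric's destage sequence (batch the transactions, write without per-file fsync, pay the synchronous cost once) can be sketched as below. This is a hedged, portable approximation: buffered writes plus a single sync() and a directory fsync stand in for the O_DIRECT async IO and completion wait the real pattern would use, and `destage_batch` is an invented name.

```python
# A minimal sketch of the batched destage described above: write the whole
# batch with no per-file fsync, then pay the synchronous cost once.
# (Buffered writes + sync() stand in for O_DIRECT async IO + completion.)
import os, tempfile

def destage_batch(dirpath, batch):
    for name, data in batch:
        with open(os.path.join(dirpath, name), "wb") as f:
            f.write(data)                 # note: no fsync per file
    os.sync()                             # one flush for the whole batch
    dfd = os.open(dirpath, os.O_RDONLY)   # plus one fsync of the directory
    try:
        os.fsync(dfd)                     # so the new names are durable too
    finally:
        os.close(dfd)

tmpdir = tempfile.mkdtemp()
destage_batch(tmpdir, [("a", b"x" * 10), ("b", b"y" * 10)])
```

The worst-case synchronous latency is amortized over the whole batch, which is exactly the "pay that latency once, not per file/op" argument made earlier in the thread.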
Re: newstore direction
Milosz Tanski <mil...@adfin.com> writes: > > On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil <sw...@redhat.com> wrote: > > On Tue, 20 Oct 2015, John Spray wrote: > >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sw...@redhat.com> wrote: > >> > - We have to size the kv backend storage (probably still an XFS > >> > partition) vs the block storage. Maybe we do this anyway (put metadata on > >> > SSD!) so it won't matter. But what happens when we are storing gobs of > >> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > >> > a different pool and those aren't currently fungible. > >> > >> This is the concerning bit for me -- the other parts one "just" has to > >> get the code right, but this problem could linger and be something we > >> have to keep explaining to users indefinitely. It reminds me of cases > >> in other systems where users had to make an educated guess about inode > >> size up front, depending on whether you're expecting to efficiently > >> store a lot of xattrs. > >> > >> In practice it's rare for users to make these kinds of decisions well > >> up-front: it really needs to be adjustable later, ideally > >> automatically. That could be pretty straightforward if the KV part > >> was stored directly on block storage, instead of having XFS in the > >> mix. I'm not quite up with the state of the art in this area: are > >> there any reasonable alternatives for the KV part that would consume > >> some defined range of a block device from userspace, instead of > >> sitting on top of a filesystem? > > > > I agree: this is my primary concern with the raw block approach. > > > > There are some KV alternatives that could consume block, but the problem > > would be similar: we need to dynamically size up or down the kv portion of > > the device. > > > > I see two basic options: > > > > 1) Wire into the Env abstraction in rocksdb to provide something just > > smart enough to let rocksdb work.
It isn't much: named files (not that > > many--we could easily keep the file table in ram), always written > > sequentially, to be read later with random access. All of the code is > > written around abstractions of SequentialFileWriter so that everything > > posix is neatly hidden in env_posix (and there are various other env > > implementations for in-memory mock tests etc.). > > > > 2) Use something like dm-thin to sit between the raw block device and XFS > > (for rocksdb) and the block device consumed by newstore. As long as XFS > > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb > > files in their entirety) we can fstrim and size down the fs portion. If > > we similarly make newstores allocator stick to large blocks only we would > > be able to size down the block portion as well. Typical dm-thin block > > sizes seem to range from 64KB to 512KB, which seems reasonable enough to > > me. In fact, we could likely just size the fs volume at something > > conservatively large (like 90%) and rely on -o discard or periodic fstrim > > to keep its actual utilization in check. > > > > I think you could prototype a raw block device OSD store using LMDB as > a starting point. I know there's been some experiments using LMDB as > KV store before with positive read numbers and not great write > numbers. > > 1. It mmaps, just mmap the raw disk device / partition. I've done this > as an experiment before, I can dig up a patch for LMDB. > 2. It already has a free space management strategy. I'm prob it's not > right for the OSDs in the long term but there's something to start > there with. > 3. It's already supports transactions / COW. > 4. LMDB isn't a huge code base so it might be a good place to start / > evolve code from. > 5. You're not starting a multi-year effort at the 0 point. > > As to the not great write performance, that could be addressed by > write transaction merging (what mysql implemented a few years ago). 
We have a heavily hacked version of LMDB contributed by VMware that implements a WAL. In my preliminary testing it performs synchronous writes 30x faster (on average) than current LMDB. Their version unfortunately slashed'n'burned a lot of LMDB features that other folks actually need, so we can't use it as-is. Currently working on rationalizing the approach and merging it into mdb.master. The reasons for the WAL approach: 1) obviously sequential writes are cheaper than random writes. 2) fsync() of a small log file will always be faster than fsync() of a large DB. I.e., fsync() latency is proportional to the total number of pages in the file, not just the number of dirty pages. LMDB on a raw block device is a simpler proposition, and one we intend to integrate soon as well. (Milosz, did you ever submit your changes?) > Here you have an opportunity to do it two days. One, you can do it in > the application layer while waiting for the fsync from transaction to > complete. This is probably the easier route. Two, you can do it in the > DB layer (the LMDB transaction handling / locking). [...]
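Howard's two WAL points can be seen in a toy model (purely illustrative Python; this is not LMDB's design nor the VMware patch): an update is durable as soon as the small append-only log is fsync'ed, and the large main structure can be flushed lazily or rebuilt from the log after a crash.

```python
# Toy model of the WAL argument above: fsync a small append-only log to
# make each write durable cheaply; the large main DB is flushed (or
# rebuilt) later. Illustrative only -- not LMDB or the VMware patch.
import json, os, tempfile

class WalKV:
    def __init__(self, log_path):
        self.db = {}                      # stands in for the large main DB
        self.log = open(log_path, "ab")

    def put(self, key, value):
        rec = json.dumps([key, value]) + "\n"
        self.log.write(rec.encode())
        self.log.flush()
        os.fsync(self.log.fileno())       # cheap: the log file stays small
        self.db[key] = value              # main DB pages go out lazily

def replay(log_path):
    # crash recovery: rebuild the main DB state from the log
    db = {}
    with open(log_path, "rb") as f:
        for line in f:
            key, value = json.loads(line)
            db[key] = value
    return db

path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WalKV(path)
store.put("a", 1)
store.put("b", 2)
recovered = replay(path)
```

The fsync here covers only the few pages of the log, which is the "small log file vs large DB" latency argument in miniature; a real WAL would also checkpoint and truncate the log.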
Re: newstore direction
On Wed, 21 Oct 2015, Ric Wheeler wrote: > You will have to trust me on this as the Red Hat person who spoke to pretty > much all of our key customers about local file systems and storage - customers > all have migrated over to using normal file systems under Oracle/DB2. > Typically, they use XFS or ext4. I don't know of any non-standard file > systems and only have seen one account running on a raw block store in 8 years > :) > > If you have a pre-allocated file and write using O_DIRECT, your IO path is > identical in terms of IO's sent to the device. > > If we are causing additional IO's, then we really need to spend some time > talking to the local file system gurus about this in detail. I can help with > that conversation. If the file is truly preallocated (that is, prewritten with zeros... fallocate doesn't help here because the extents are marked unwritten), then sure: there is very little change in the data path. But at that point, what is the point? This only works if you have one (or a few) huge files and the user space app already has all the complexity of a filesystem-like thing (with its own internal journal, allocators, garbage collection, etc.). Do they just do this to ease administrative tasks like backup? This is the fundamental tradeoff: 1) We have a file per object. We fsync like crazy and the fact that there are two independent layers journaling and managing different types of consistency penalizes us. 1b) We get clever and start using obscure and/or custom ioctls in the file system to work around what it is used to: we swap extents to avoid write-ahead (see Christoph's patch), O_NOMTIME, unprivileged open-by-handle, batch fsync, O_ATOMIC, setext ioctl, etc. 2) We preallocate huge files and write a user-space object system that lives within it (pretending the file is a block device). The file system rarely gets in the way (assuming the file is prewritten and we don't do anything stupid).
But it doesn't give us anything a block device wouldn't, and it doesn't save us any complexity in our code. At the end of the day, 1 and 1b are always going to be slower than 2. And although 1b performs a bit better than 1, it has similar (user-space) complexity to 2. On the other hand, if you step back and view the entire stack (ceph-osd + XFS), 1 and 1b are *significantly* more complex than 2... and yet still slower. Given we ultimately have to support both (both as an upstream and as a distro), that's not very attractive. Also note that every time we have strayed from the beaten path (1) to anything mildly exotic (1b) we have been bitten by obscure file system bugs. And that assumes we get everything we need upstream... which is probably a year's endeavour. Don't get me wrong: I'm all for making changes to file systems to better support systems like Ceph. Things like O_NOCMTIME and O_ATOMIC make a huge amount of sense for a ton of different systems. But our situation is a bit different: we always own the entire device (and often the server), so there is no need to share with other users or apps (and when you do, you just use the existing FileStore backend). And as you know performance is a huge pain point. We are already handicapped by virtue of being distributed and strongly consistent; we can't afford to give away more to a storage layer that isn't providing us much (or the right) value. And I'm tired of half measures. I want the OSD to be as fast as we can make it given the architectural constraints (RADOS consistency and ordering semantics). This is truly low-hanging fruit: it's modular, self-contained, pluggable, and this will be my third time around this particular block. sage
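Sage's aside about preallocation is worth making concrete. A hedged, Linux-specific sketch of the two styles he contrasts: posix_fallocate reserves space but typically leaves the extents marked unwritten, so the first real write still converts extents (metadata churn); only actually prewriting zeros gives the "identical to a block device" write path.

```python
# Sketch of the two preallocation styles discussed above. posix_fallocate
# reserves space but (on typical filesystems) leaves extents unwritten, so
# the first real write still converts extents; prewriting zeros does not.
# Linux-specific illustration.
import os, tempfile

SIZE = 1024 * 1024

def prealloc_fallocate(path):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, SIZE)  # space reserved, extents unwritten
    finally:
        os.close(fd)

def prealloc_prewrite(path):
    with open(path, "wb") as f:
        f.write(b"\0" * SIZE)            # extents actually written out
        f.flush()
        os.fsync(f.fileno())

d = tempfile.mkdtemp()
p1 = os.path.join(d, "fallocated")
p2 = os.path.join(d, "prewritten")
prealloc_fallocate(p1)
prealloc_prewrite(p2)
```

Both files report the same size afterwards; the difference is only visible in the extent state (e.g. via `filefrag -v`), which is exactly why fallocate looks like preallocation but doesn't deliver the unchanged data path.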
Re: newstore direction
On Tue, Oct 20, 2015 at 4:00 PM, Sage Weil wrote: > On Tue, 20 Oct 2015, John Spray wrote: >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote: >> > - We have to size the kv backend storage (probably still an XFS >> > partition) vs the block storage. Maybe we do this anyway (put metadata on >> > SSD!) so it won't matter. But what happens when we are storing gobs of >> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of >> > a different pool and those aren't currently fungible. >> >> This is the concerning bit for me -- the other parts one "just" has to >> get the code right, but this problem could linger and be something we >> have to keep explaining to users indefinitely. It reminds me of cases >> in other systems where users had to make an educated guess about inode >> size up front, depending on whether you're expecting to efficiently >> store a lot of xattrs. >> >> In practice it's rare for users to make these kinds of decisions well >> up-front: it really needs to be adjustable later, ideally >> automatically. That could be pretty straightforward if the KV part >> was stored directly on block storage, instead of having XFS in the >> mix. I'm not quite up with the state of the art in this area: are >> there any reasonable alternatives for the KV part that would consume >> some defined range of a block device from userspace, instead of >> sitting on top of a filesystem? > > I agree: this is my primary concern with the raw block approach. > > There are some KV alternatives that could consume block, but the problem > would be similar: we need to dynamically size up or down the kv portion of > the device. > > I see two basic options: > > 1) Wire into the Env abstraction in rocksdb to provide something just > smart enough to let rocksdb work. It isn't much: named files (not that > many--we could easily keep the file table in ram), always written > sequentially, to be read later with random access.
All of the code is > written around abstractions of SequentialFileWriter so that everything > posix is neatly hidden in env_posix (and there are various other env > implementations for in-memory mock tests etc.). > > 2) Use something like dm-thin to sit between the raw block device and XFS > (for rocksdb) and the block device consumed by newstore. As long as XFS > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb > files in their entirety) we can fstrim and size down the fs portion. If > we similarly make newstore's allocator stick to large blocks only we would > be able to size down the block portion as well. Typical dm-thin block > sizes seem to range from 64KB to 512KB, which seems reasonable enough to > me. In fact, we could likely just size the fs volume at something > conservatively large (like 90%) and rely on -o discard or periodic fstrim > to keep its actual utilization in check. > I think you could prototype a raw block device OSD store using LMDB as a starting point. I know there have been some experiments using LMDB as a KV store before, with positive read numbers and not-great write numbers. 1. It mmaps; just mmap the raw disk device / partition. I've done this as an experiment before; I can dig up a patch for LMDB. 2. It already has a free space management strategy. It's probably not right for the OSDs in the long term, but there's something to start with there. 3. It already supports transactions / COW. 4. LMDB isn't a huge code base, so it might be a good place to start / evolve code from. 5. You're not starting a multi-year effort at the zero point. As to the not-great write performance, that could be addressed by write transaction merging (what MySQL implemented a few years ago). Here you have an opportunity to do it two ways. One, you can do it in the application layer while waiting for the fsync from the transaction to complete. This is probably the easier route.
Two, you can do it in the DB layer (the LMDB transaction handling / locking), where you've already started processing the following transactions using the currently committing transaction (COW) as a starting point. This is harder, mostly because of the synchronization involved. I've actually spent some time thinking about doing LMDB write transaction merging outside the OSD context. This was for another project. My 2 cents. -- Milosz Tanski CTO 16 East 34th Street, 15th floor New York, NY 10016 p: 646-253-9055 e: mil...@adfin.com
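The application-layer write transaction merging Milosz describes (batch several logical commits behind a single fsync, as in MySQL's group commit) can be sketched like this. This is an illustrative sketch, not LMDB's or the OSD's actual code; both function names are made up:

```python
import os

def commit_each(path, records):
    """Naive path: one fsync per record, so every transaction pays the
    full flush latency on its own."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for rec in records:
            os.write(fd, rec)
            os.fsync(fd)          # N records -> N flushes
    finally:
        os.close(fd)

def commit_merged(path, records):
    """Merged path: stage the whole batch in the page cache, then let a
    single fsync make every record in it durable at once."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for rec in records:
            os.write(fd, rec)     # cheap: page cache only
        os.fsync(fd)              # one flush covers the whole batch
    finally:
        os.close(fd)
```

The durability guarantee is unchanged: once fsync returns, every record written before it is on stable storage; the merged variant simply issues one flush where the naive one issues N.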
Re: newstore direction
On 10/21/2015 05:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface: essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (file systems) and can never be tightly connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant, or he wouldn't have kicked off this discussion in the first place. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS is widely deployed in customer environments). Another example: Sage has just had to substantially rework the journaling code of RocksDB. In short, as you can tell, I'm full-throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator, just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore.
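The LSM-vs-B+tree write-amplification contrast this argument rests on (spelled out in the two points that follow) can be put in rough numbers. The model below is purely illustrative: the memtable size, fanout, and page size are made-up parameters, not ZetaScale or RocksDB measurements, and it models leveled compaction only:

```python
import math

def lsm_write_amp(data_bytes, memtable_bytes=64 << 20, fanout=10):
    """Simplified leveled-compaction model: a byte is rewritten roughly
    `fanout` times at each level it migrates through, and the level count
    grows with data under management. All parameters are illustrative."""
    ratio = max(data_bytes / memtable_bytes, 1.0)
    levels = max(1, math.ceil(math.log(ratio, fanout)))
    return levels * fanout

def btree_write_amp(page_bytes=4096, value_bytes=128):
    """A B+tree rewrites about one page per update regardless of total
    tree size, so its write amplification stays roughly flat."""
    return page_bytes / value_bytes
```

Under this toy model a 1 TB store suffers 2.5x the LSM write amplification of a 1 GB store (50 vs 20) while the B+tree figure doesn't move, which is the shape of the argument being made here.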
That performance advantage stems primarily from two things: Has there been any discussion regarding opensourcing zetascale? (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv database. Don't forget checksums that Sage wants to add :)) this performance delta swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction.
That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes.
Re: newstore direction
On 10/21/2015 09:32 AM, Sage Weil wrote: On Tue, 20 Oct 2015, Ric Wheeler wrote: Now: 1 io to write a new file 1-2 ios to sync the fs journal (commit the inode, alloc change) (I see 2 journal IOs on XFS and only 1 on ext4...) 1 io to commit the rocksdb journal (currently 3, but will drop to 1 with xfs fix and my rocksdb change) I think that might be too pessimistic - the number of discrete IO's sent down to a spinning disk makes much less impact on performance than the number of fsync()'s, since the IO's all land in the write cache. Some newer spinning drives have a non-volatile write cache, so even an fsync() might not end up doing the expensive data transfer to the platter. True, but in XFS's case at least the file data and journal are not colocated, so it's 2 seeks for the new file write+fdatasync and another for the rocksdb journal commit. Of course, with a deep queue, we're doing lots of these so there'd be fewer journal commits on both counts, but the lower bound on latency of a single write is still 3 seeks, and that bound is pretty critical when you also have network round trips and replication (worst out of 2) on top. What are the performance goals we are looking for? Small, synchronous writes/second? File creates/second? I suspect that looking at things like seeks/write is probably looking at the wrong level of performance challenges. Again, when you write to a modern drive, you write to its write cache and it decides internally when/how to destage to the platter. If you look at the performance of XFS with streaming workloads, it will tend to max out the bandwidth of the underlying storage. If we need IOP's/file writes, etc, we should be clear on what we are aiming at. It would be interesting to get the timings on the IO's you see to measure the actual impact. I observed this with the journaling workload for rocksdb, but I assume the journaling behavior is the same regardless of what is being journaled.
For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and blktrace showed an IO to the file, and 2 IOs to the journal. I believe the first one is the record for the inode update, and the second is the journal 'commit' record (though I forget how I decided that). My guess is that XFS is being extremely careful about journal integrity here and not writing the commit record until it knows that the preceding records landed on stable storage. For ext4, the latency was about ~20ms, and blktrace showed the IO to the file and then a single journal IO. When I made the rocksdb change to overwrite an existing, prewritten file, the latency dropped to ~10ms on ext4, and blktrace showed a single IO as expected. (XFS still showed the 2 journal commit IOs, but Dave just posted the fix for that on the XFS list today.) Right, if we want to avoid metadata-related IO's, we can preallocate a file and use O_DIRECT. Effectively, there should be no updates outside of the data write itself. We won't get any other performance optimizations that way, but we could avoid redoing allocation and defragmentation work. Normally, best practice is to use batching to avoid paying worst-case latency when you do a synchronous IO. Write a batch of files or appends without fsync, then go back and fsync, and you will pay that latency once (not per file/op). Plumbing for T10 DIF/DIX already exists; what is missing is the normal block device that handles them (not enterprise SAS/disk array class). Yeah... which unfortunately means that unless the cheap drives suddenly start shipping with DIF/DIX support we'll need to do the checksums ourselves. This is probably a good thing anyway, as it doesn't constrain our choice of checksum or checksum granularity, and will still work with other storage devices (ssds, nvme, etc.). sage Might be interesting to see if a device mapper target could be written to support DIF/DIX.
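The rocksdb change described above (overwriting a prewritten journal file instead of appending, so fdatasync has no size or allocation metadata to flush) can be sketched as follows. The names and the 1 MB size are illustrative, not the actual rocksdb code:

```python
import os

JOURNAL_SIZE = 1 << 20  # illustrative; real journals are sized to the workload

def prepare_journal(path):
    """Write the whole journal file once up front. Later commits then
    overwrite existing blocks, so no allocation or file-size change needs
    to be journaled by the fs."""
    with open(path, "wb") as f:
        f.write(b"\0" * JOURNAL_SIZE)
        f.flush()
        os.fsync(f.fileno())

def commit_record(path, offset, record):
    """Pure overwrite + fdatasync: ideally a single data IO, since mtime
    and other inode updates do not have to reach stable storage."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, record, offset)
        os.fdatasync(fd)
    finally:
        os.close(fd)
```

This is the pattern behind the ~20ms to ~10ms drop observed on ext4: the append path additionally had to flush the journaled inode update.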
For what it's worth, XFS developers have talked loosely about looking at data block checksums (could do something like btrfs does, store the checksums in another btree) ric
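Doing the checksums ourselves, as Sage suggests, is straightforward to sketch: checksum fixed-size chunks on write, store the sums with the rest of the object metadata (in newstore's case, the KV store), and verify on read. crc32 and the 4 KB granularity here are stand-ins; the thread explicitly leaves both choices open:

```python
import zlib

CHUNK = 4096  # stand-in granularity; choosing this freely is one of the wins

def checksum_chunks(data: bytes):
    """One crc32 per chunk. The resulting list would live with the
    object's metadata in the KV store, not inline with the data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, sums) -> bool:
    """Recompute on read; a mismatch pinpoints which chunk went bad."""
    return checksum_chunks(data) == sums
```

Unlike DIF/DIX, this works identically on HDDs, SSDs, and NVMe, at the cost of CPU on every read and write.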
RE: newstore direction
We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. re-inventing an NVMKV; the final conclusion was that it's not hard with persistent memory (which will be available soon). But yeah, NVMKV will not work if no PM is present---persisting the hash table to SSD is not practical. Range queries seem not to be a very big issue, as the random read performance of today's SSDs is more than enough; I mean, even if we break all sequential I/O into random I/O (typically 70-80K IOPS, which is ~300MB/s), the performance is still good enough. Anyway, I think for the high-IOPS case it's hard for the consumer to play well on SSDs from different vendors; it would be better to leave it to the SSD vendor, something like OpenStack Cinder's structure: a vendor has the responsibility to maintain their driver for Ceph and take care of the performance. > -Original Message- > From: Mark Nelson [mailto:mnel...@redhat.com] > Sent: Wednesday, October 21, 2015 9:36 PM > To: Allen Samuels; Sage Weil; Chen, Xiaoxi > Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > Thanks Allen! The devil is always in the details. Know of anything else that > looks promising? > > Mark > > On 10/21/2015 05:06 AM, Allen Samuels wrote: > > I doubt that NVMKV will be useful for two reasons: > > > > (1) It relies on the unique sparse-mapping addressing capabilities of > > the FusionIO VSL interface, it won't run on standard SSDs > > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no > range operations on keys). This is pretty much required for deep scrubbing.
> > > > > > Allen Samuels > > Software Architect, Fellow, Systems and Software Solutions > > > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson > > Sent: Tuesday, October 20, 2015 6:20 AM > > To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com> > > Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy > > <somnath@sandisk.com>; ceph-devel@vger.kernel.org > > Subject: Re: newstore direction > > > > On 10/20/2015 07:30 AM, Sage Weil wrote: > >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: > >>> +1, nowadays K-V DB care more about very small key-value pairs, say > >>> several bytes to a few KB, but in SSD case we only care about 4KB or > >>> 8KB. In this way, NVMKV is a good design and seems some of the SSD > >>> vendor are also trying to build this kind of interface, we had a > >>> NVM-L library but still under development. > >> > >> Do you have an NVMKV link? I see a paper and a stale github repo.. > >> not sure if I'm looking at the right thing. > >> > >> My concern with using a key/value interface for the object data is > >> that you end up with lots of key/value pairs (e.g., $inode_$offset = > >> $4kb_of_data) that is pretty inefficient to store and (depending on > >> the > >> implementation) tends to break alignment. I don't think these > >> interfaces are targetted toward block-sized/aligned payloads. > >> Storing just the metadata (block allocation map) w/ the kv api and > >> storing the data directly on a block/page interface makes more sense to > me. > >> > >> sage > > > > I get the feeling that some of the folks that were involved with nvmkv at > Fusion IO have left. Nisha Talagala is now out at Parallel Systems for > instance. > http://pmem.io might be a better bet, though I haven't looked closely at it. 
> > > > Mark > > > >> > >> > >>>> -Original Message- > >>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > >>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > >>>> Sent: Tuesday, October 20, 2015 6:21 AM > >>>> To: Sage Weil; Somnath Roy > >>>> Cc: ceph-devel@vger.kernel.org > >>>> Subject: RE: newstore direction > >>>> > >>>> Hi Sage and Somnath, > >>>> In my humble opinion, There is another more aggressive > >>>> solution than raw block device base keyvalue store as backend for > >>>> objectstore. The new key value SSD device with transaction support > would be ideal to solve the issues. > >>>> First of all, it is raw SSD device. Secondly , It provides key > >>
Re: newstore direction
LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (Since newStore is effectively moving the per-file inode into the kv database. Don't forget checksums that Sage wants to add :)) this performance delta swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3).
So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. - XFS is (probably) never going to give us data checksums, which we want desperately. What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks? If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata. The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time.
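The "enterprisey database trick" alluded to here is typically preallocation plus O_DIRECT with sector-aligned buffers. A hedged sketch: O_DIRECT availability depends on the filesystem (tmpfs refuses it, for example), so this falls back to buffered IO, and the 4096-byte alignment is an assumption about the device:

```python
import mmap
import os

BLOCK = 4096  # assumed sector/page alignment that O_DIRECT requires

def open_for_direct_io(path):
    """Open with O_DIRECT where the fs supports it, falling back to
    buffered IO so the sketch still runs anywhere."""
    flags = os.O_RDWR | os.O_CREAT
    try:
        return os.open(path, flags | getattr(os, "O_DIRECT", 0), 0o644)
    except OSError:
        return os.open(path, flags, 0o644)

def write_block(fd, offset, payload):
    """O_DIRECT demands block-aligned offset, length, and memory; an
    anonymous mmap gives page-aligned memory without ctypes tricks."""
    assert offset % BLOCK == 0 and len(payload) <= BLOCK
    buf = mmap.mmap(-1, BLOCK)
    buf[:len(payload)] = payload
    os.pwrite(fd, buf, offset)
    os.fdatasync(fd)
    buf.close()
```

Note this sidesteps the page cache, not the metadata problem: as pointed out later in the thread, unless the file was pre-written (not merely preallocated), the first write to each extent still triggers a metadata update.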
In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state. I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have. Wins: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). - No concern about mtime getting in the way - Faster reads (no fs lookup) - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now. Problems: - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won
Re: newstore direction
Thanks Allen! The devil is always in the details. Know of anything else that looks promising? Mark On 10/21/2015 05:06 AM, Allen Samuels wrote: I doubt that NVMKV will be useful for two reasons: (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, so it won't run on standard SSDs (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Tuesday, October 20, 2015 6:20 AM To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com> Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy <somnath@sandisk.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/20/2015 07:30 AM, Sage Weil wrote: On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: +1, nowadays K-V DBs care more about very small key-value pairs, say several bytes to a few KB, but in the SSD case we only care about 4KB or 8KB. In this way, NVMKV is a good design, and it seems some of the SSD vendors are also trying to build this kind of interface; we had an NVM-L library but it is still under development. Do you have an NVMKV link? I see a paper and a stale github repo.. not sure if I'm looking at the right thing. My concern with using a key/value interface for the object data is that you end up with lots of key/value pairs (e.g., $inode_$offset = $4kb_of_data) that is pretty inefficient to store and (depending on the implementation) tends to break alignment. I don't think these interfaces are targeted toward block-sized/aligned payloads.
Storing just the metadata (block allocation map) w/ the kv api and storing the data directly on a block/page interface makes more sense to me. sage I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left. Nisha Talagala is now out at Parallel Systems for instance. http://pmem.io might be a better bet, though I haven't looked closely at it. Mark -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI Sent: Tuesday, October 20, 2015 6:21 AM To: Sage Weil; Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction Hi Sage and Somnath, In my humble opinion, there is another, more aggressive solution than a raw-block-device-based keyvalue store as the backend for objectstore. The new key value SSD device with transaction support would be ideal to solve the issues. First of all, it is a raw SSD device. Secondly, it provides a key value interface directly from the SSD. Thirdly, it can provide transaction support; consistency will be guaranteed by the hardware device. It pretty much satisfies all of objectstore's needs without any extra overhead, since there is no extra layer between the device and objectstore. Either way, I strongly support having CEPH's own data format instead of relying on a filesystem. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 1:55 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction On Mon, 19 Oct 2015, Somnath Roy wrote: Sage, I fully support that. If we want to saturate SSDs, we need to get rid of this filesystem overhead (which I am in the process of measuring). Also, it will be good if we can eliminate the dependency on the k/v dbs (for storing allocators and all). The reason is the unknown write amps they cause.
My hope is to keep this behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash). sage Thanks & Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 12:49 PM To: ceph-devel@vger.kernel.org Subject: newstore direction The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is cu
Re: newstore direction
swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace.
Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. - XFS is (probably) never going to give us data checksums, which we want desperately. What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks? If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata. The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on-disk file system, which is always easier to start than it is to get back to a stable, production-ready state. I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have. Wins: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before).
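The "2 IOs for most" win can be made concrete with a toy model: data goes to a free extent on the raw device (IO one), then a single KV transaction records the extent (IO two), and reads need no namespace traversal at all. Here a plain file stands in for the block device and a dict for rocksdb; every name is hypothetical, not newstore code:

```python
import os

class ToyBlockStore:
    """Toy of the raw-block proposal: a trivial free-extent allocator plus
    a KV map from object name to (offset, length)."""
    EXTENT = 4096

    def __init__(self, path, size=16 * 4096):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(self.fd, size)
        self.free = list(range(0, size, self.EXTENT))  # stand-in allocator
        self.kv = {}                                   # stand-in for rocksdb

    def write(self, name, data):
        assert len(data) <= self.EXTENT and self.free
        offset = self.free.pop(0)
        os.pwrite(self.fd, data, offset)     # IO 1: data to unused space
        os.fdatasync(self.fd)                # data durable before the kv record
        self.kv[name] = (offset, len(data))  # IO 2: one kv txn commit

    def read(self, name):
        offset, length = self.kv[name]       # no fs namespace lookup
        return os.pread(self.fd, length, offset)
```

A real version needs crash-consistent KV commits and a far smarter allocator, which is exactly the "recreating the file system" maintenance cost Ric warns about in the reply.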
For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). - No concern about mtime getting in the way - Faster reads (no fs lookup) - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now. Problems: - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those aren't currently fungible. - We have to write and maintain an allocator. I'm still optimistic this can be reasonably simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonably sized). For disk
Re: newstore direction
On Wed, 21 Oct 2015, Ric Wheeler wrote: > On 10/21/2015 04:22 AM, Orit Wasserman wrote: > > On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote: > > > On 10/19/2015 03:49 PM, Sage Weil wrote: > > > > The current design is based on two simple ideas: > > > > > > > >1) a key/value interface is better way to manage all of our internal > > > > metadata (object metadata, attrs, layout, collection membership, > > > > write-ahead logging, overlay data, etc.) > > > > > > > >2) a file system is well suited for storage object data (as files). > > > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > > > few > > > > things: > > > > > > > >- We currently write the data to the file, fsync, then commit the kv > > > > transaction. That's at least 3 IOs: one for the data, one for the fs > > > > journal, one for the kv txn to commit (at least once my rocksdb changes > > > > land... the kv commit is currently 2-3). So two people are managing > > > > metadata, here: the fs managing the file metadata (with its own > > > > journal) and the kv backend (with its journal). > > > If all of the fsync()'s fall into the same backing file system, are you > > > sure > > > that each fsync() takes the same time? Depending on the local FS > > > implementation > > > of course, but the order of issuing those fsync()'s can effectively make > > > some of > > > them no-ops. > > > > > > >- On read we have to open files by name, which means traversing the > > > > fs > > > > namespace. Newstore tries to keep it as flat and simple as possible, > > > > but > > > > at a minimum it is a couple btree lookups. We'd love to use open by > > > > handle (which would reduce this to 1 btree traversal), but running > > > > the daemon as ceph and not root makes that hard... > > > This seems like a a pretty low hurdle to overcome. > > > > > > >- ...and file systems insist on updating mtime on writes, even when > > > > it is > > > > a overwrite with no allocation changes. 
(We don't care about mtime.) > > > > O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. > > > Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. > > > > - XFS is (probably) never going to give us data checksums, which we want desperately. > > > What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks? > > > If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). > > > > But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata. > > > The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on-disk file system, which is always easier to start than it is to get to a stable, production-ready state. > > The best performance is still on a block device (SAN). > > File systems simplify operational tasks, which is worth the performance penalty for a database. I think in a storage system this is not the case. > > In many cases they can use their own file system that is tailored for the database.
> > You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers have all migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of any non-standard file systems and have only seen one account running on a raw block store in 8 years :) > > If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device. ...except it's not. Preallocating the file gives you contiguous space, but you still have to mark the extent written (not zero/prealloc). The only way to get an identical IO pattern is to *pre-write* zeros (or whatever) to the file... which is hours on modern HDDs. Ted asked for a way to force prealloc to expose preexisting disk bits a couple of years back at LSF and it was shot down for security reasons (and rightly so, IMO). If you're going down this path, you already have a "file system" in user space sitting on top of the preallocated file, and you could just as easily use the block device directly. If you're not, then you're writing smaller files (e.g.,
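Sage's point about preallocation can be made concrete with a small sketch. Everything below is illustrative: buffered I/O stands in for O_DIRECT (which needs aligned buffers), the path and sizes are made up, and `posix_fallocate` is Linux-specific. The point is only that the first write into a preallocated (unwritten) extent still forces the filesystem to journal an extent-state change:

```python
import os
import tempfile

def preallocated_write(path, data, offset=0, prealloc=1 << 20):
    """Preallocate space, then write and fdatasync.

    Even though blocks are reserved up front, the filesystem still has
    to journal the unwritten->written extent conversion on the first
    write into the preallocated range -- Sage's point above.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, prealloc)  # extents reserved, marked unwritten
        os.pwrite(fd, data, offset)          # converts the touched extent to written
        os.fdatasync(fd)                     # durability: data plus extent metadata
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "obj")
preallocated_write(path, b"x" * 4096)
print(os.path.getsize(path))  # 1048576: the full preallocated length
```

The file size reflects the whole preallocated extent, but only the 4KB actually written has been converted; a later write elsewhere in the range triggers another metadata update.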
Re: newstore direction
5-10/msg00545.html rolled out into RHEL/CentOS/Ubuntu. I have no idea how long these things typically take, but this might be a good test case. How quickly things land in a distro is up to the interested parties making the case for it. My thought is that there is some inflection point where the userland kvstore/block approach is going to be less work, for everyone I think, than trying to quickly discover, understand, fix, and push upstream patches that sometimes only really benefit us. I don't know if we've truly hit that point, but it's tough for me to find flaws with Sage's argument. Regards, Ric Another example: Sage has just had to substantially rework the journaling code of RocksDB. In short, as you can tell, I'm full-throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write amplification is nearly constant, independent of the size of data under management. As the KV database gets larger (since NewStore is effectively moving the per-file inode into the kv database; don't forget the checksums that Sage wants to add :)) this performance delta swamps all others. (2) Having a KV and a file system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes; metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes.
(We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. - XFS is (probably) never going to give us data checksums, which we want desperately. What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks? If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata. The big problem with consuming block devices directly is that you ultimately end up recreating most of the features t
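The LSM-vs-B+tree write-amplification contrast Allen draws elsewhere in this thread can be illustrated with a toy model. The formulas below are rough sketches under assumed parameters (fanout, memtable size, page size), not measurements of RocksDB or ZetaScale:

```python
import math

def lsm_write_amp(data_gb, memtable_gb=0.25, fanout=10):
    """Toy leveled-LSM model: each level rewrites data roughly fanout/2
    times during compaction, and the level count grows with data size,
    so amplification climbs as the store fills."""
    levels = max(1, math.ceil(math.log(data_gb / memtable_gb, fanout)))
    return levels * fanout / 2

def btree_write_amp(page_kb=4, value_bytes=100):
    """Toy B+tree model: an insert rewrites one leaf page regardless of
    how large the tree is, so amplification is flat."""
    return page_kb * 1024 / value_bytes

for gb in (1, 10, 100, 1000):
    print(f"{gb:5d} GB  lsm={lsm_write_amp(gb):5.1f}  btree={btree_write_amp():5.1f}")
```

The numbers are invented, but the shape matches the argument: the LSM figure grows as data under management grows, while the B+tree figure stays constant; which curve wins at a given size depends on the parameters.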
Re: newstore direction
Adding 2c On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote: > My thought is that there is some inflection point where the userland > kvstore/block approach is going to be less work, for everyone I think, > than trying to quickly discover, understand, fix, and push upstream > patches that sometimes only really benefit us. I don't know if we've > truly hit that point, but it's tough for me to find flaws with > Sage's argument. Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: In the networking world, there's been development on memory-mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc., and streamlining processor cache management performance. People have gone as far as removing CPU cores from the CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic. Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency. (And really, high performance random IO characteristics approach the networking, per-packet handling characteristics). Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.)
It also decouples dependencies on users i.e. waiting for the next distro release before being able to take up the benefits of improvements to the storage code. A random google came up with related data on where "doing something way different" /can/ have significant benefits: http://phunq.net/pipermail/tux3/2015-April/002147.html I (FWIW) certainly agree there is merit to the idea. The scientific approach here could perhaps be to simply enumerate all corner cases of "generic FS" that actually are cause for the experienced issues, and assess probability of them being solved (and if so when). That *could* improve chances of approaching consensus which wouldn't hurt I suppose? BR, Martin -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: newstore direction
I am pushing internally to open-source ZetaScale. Recent events may or may not affect that trajectory -- stay tuned. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: Wednesday, October 21, 2015 10:45 PM To: Allen Samuels <allen.samu...@sandisk.com>; Ric Wheeler <rwhee...@redhat.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 05:06 AM, Allen Samuels wrote: > I agree that moving newStore to raw block is going to be a significant > development effort. But the current scheme of using a KV store combined with > a normal file system is always going to be problematic (FileStore or > NewStore). This is caused by the transactional requirements of the > ObjectStore interface, essentially you need to make transactionally > consistent updates to two indexes, one of which doesn't understand > transactions (File Systems) and can never be tightly-connected to the other > one. > > You'll always be able to make this "loosely coupled" approach work, but it > will never be optimal. The real question is whether the performance > difference of a suboptimal implementation is something that you can live with > compared to the longer gestation period of the more optimal implementation. > Clearly, Sage believes that the performance difference is significant or he > wouldn't have kicked off this discussion in the first place. > > While I think we can all agree that writing a full-up KV and raw-block > ObjectStore is a significant amount of work. I will offer the case that the > "loosely couple" scheme may not have as much time-to-market advantage as it > appears to have. 
One example: NewStore performance is limited due to bugs in > XFS that won't be fixed in the field for quite some time (it'll take at least > a couple of years before a patched version of XFS will be widely deployed in > customer environments). > > Another example: Sage has just had to substantially rework the journaling > code of RocksDB. > > In short, as you can tell, I'm full-throated in favor of going down the > optimal route. > > Internally at Sandisk, we have a KV store that is optimized for flash (it's > called ZetaScale). We have extended it with a raw block allocator just as > Sage is now proposing to do. Our internal performance measurements show a > significant advantage over the current NewStore. That performance advantage > stems primarily from two things: Has there been any discussion regarding open-sourcing ZetaScale? > > (1) ZetaScale uses a B+-tree internally rather than an LSM tree > (levelDB/RocksDB). LSM trees experience exponential increase in write > amplification (cost of an insert) as the amount of data under management > increases. B+tree write amplification is nearly constant, independent of the > size of data under management. As the KV database gets larger (since NewStore > is effectively moving the per-file inode into the kv database; don't forget > the checksums that Sage wants to add :)) this performance delta swamps all > others. > (2) Having a KV and a file system causes a double lookup. This costs CPU time > and disk accesses to page in data structure indexes; metadata efficiency > decreases. > > You can't avoid (2) as long as you're using a file system. > > Yes, an LSM tree performs better on HDD than does a B-tree, which is a good > argument for keeping the KV module pluggable.
> > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler > Sent: Tuesday, October 20, 2015 11:32 AM > To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > On 10/19/2015 03:49 PM, Sage Weil wrote: >> The current design is based on two simple ideas: >> >> 1) a key/value interface is a better way to manage all of our >> internal metadata (object metadata, attrs, layout, collection >> membership, write-ahead logging, overlay data, etc.) >> >> 2) a file system is well suited for storing object data (as files). >> >> So far 1 is working out well, but I'm questioning the wisdom of #2. >> A few things: >> >> - We currently write the data to the file, fsync, then commit the >> kv transaction. That's
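The write path Sage describes (data write, then fsync, then kv commit) can be sketched as follows; the filenames and the JSON record format are made up for illustration, and a real implementation would batch commits:

```python
import json
import os
import tempfile

store = tempfile.mkdtemp()

def newstore_style_write(name, data, meta):
    """Sketch of the write path under discussion: object data goes to a
    file (IO #1, plus the fs journal as IO #2), and only once that data
    is durable is the metadata transaction committed to a kv
    write-ahead log (IO #3)."""
    with open(os.path.join(store, name), "wb") as f:
        f.write(data)
        f.flush()
        os.fdatasync(f.fileno())    # data must be durable before the kv commit
    record = json.dumps({"obj": name, **meta})
    with open(os.path.join(store, "kv.log"), "a") as log:
        log.write(record + "\n")
        log.flush()
        os.fdatasync(log.fileno())  # the kv txn commit -- the third sync
    # two fdatasync calls per object write: this serialization is exactly
    # the latency cost being debated in the thread

newstore_style_write("obj1", b"payload", {"len": 7})
```

Collapsing the object data and the metadata commit into one allocator-managed device (and one journal) is the consolidation the raw-block proposal is after.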
RE: newstore direction
One of the biggest changes flash is making in the storage world is in how the basic trade-offs of storage management software architecture play out. In the HDD world, CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs, you rapidly see that the storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP. When Sandisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!), and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly. Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code. I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Martin Millnert [mailto:mar...@millnert.se] Sent: Thursday, October 22, 2015 6:20 AM To: Mark Nelson <mnel...@redhat.com> Cc: Ric Wheeler <rwhee...@redhat.com>; Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction Adding 2c On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote: > My thought is that there is some inflection point where the userland > kvstore/block approach is going to be less work, for everyone I think, > than trying to quickly discover, understand, fix, and push upstream > patches that sometimes only really benefit us. I don't know if we've > truly hit that point, but it's tough for me to find flaws with > Sage's argument. Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: In the networking world, there's been development on memory-mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc., and streamlining processor cache management performance. People have gone as far as removing CPU cores from the CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic. Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency.
(And really, high performance random IO characteristics approach the networking, per-packet handling characteristics). Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case reduces dependency on the code that works as best as possible for the general case, and allows for very purpose-built code to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples dependencies on users, i.e. waiting for the next distro release before being able to take up the benefits of improvements to the storage code. A random google came up with related data on where "doing something way different" /can/ have significant benefits: http://phunq.net/pipermail/tux3/2015-April/002147.ht
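The "thread per CPU core" structure Allen describes can be sketched with long-lived shard workers that each own a queue, so no per-IO thread creation or handoff happens on the data path. This is only a structural sketch of the idea, not OSD code; Python's GIL means it shows the shape, not the performance win:

```python
import queue
import threading

NUM_SHARDS = 2  # stand-in for "one worker per CPU core"

def shard_worker(q, out):
    """Thread-per-core style: one long-lived worker drains its own
    queue, instead of spawning or waking a thread per IOP."""
    while True:
        op = q.get()
        if op is None:      # shutdown sentinel
            return
        key, val = op
        out[key] = val      # stand-in for the actual IO work

shards = [queue.Queue() for _ in range(NUM_SHARDS)]
results = [{} for _ in range(NUM_SHARDS)]
workers = [threading.Thread(target=shard_worker, args=(q, r))
           for q, r in zip(shards, results)]
for w in workers:
    w.start()

for i in range(100):
    # ops are routed to a shard by key, so shards share no mutable state
    shards[i % NUM_SHARDS].put((i, i * i))
for q in shards:
    q.put(None)
for w in workers:
    w.join()

print(sum(len(r) for r in results))  # 100
```

Because each shard owns its keys outright, there are no cross-shard locks on the hot path; the queue hand-off here is the one per-op cost a real run-to-completion design would also eliminate.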
RE: newstore direction
Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Wednesday, October 21, 2015 8:24 PM To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 06:06 AM, Allen Samuels wrote: > I agree that moving newStore to raw block is going to be a significant > development effort. But the current scheme of using a KV store combined with > a normal file system is always going to be problematic (FileStore or > NewStore). This is caused by the transactional requirements of the > ObjectStore interface, essentially you need to make transactionally > consistent updates to two indexes, one of which doesn't understand > transactions (File Systems) and can never be tightly-connected to the other > one. > > You'll always be able to make this "loosely coupled" approach work, but it > will never be optimal. The real question is whether the performance > difference of a suboptimal implementation is something that you can live with > compared to the longer gestation period of the more optimal implementation. > Clearly, Sage believes that the performance difference is significant or he > wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. 
Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. > > While I think we can all agree that writing a full-up KV and raw-block > ObjectStore is a significant amount of work. I will offer the case that the > "loosely couple" scheme may not have as much time-to-market advantage as it > appears to have. One example: NewStore performance is limited due to bugs in > XFS that won't be fixed in the field for quite some time (it'll take at least > a couple of years before a patched version of XFS will be widely deployed at > customer environments). Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. Never seen a "bug" take a couple of years to hit users. Regards, Ric > > Another example: Sage has just had to substantially rework the journaling > code of rocksDB. > > In short, as you can tell, I'm full throated in favor of going down the > optimal route. > > Internally at Sandisk, we have a KV store that is optimized for flash (it's > called ZetaScale). We have extended it with a raw block allocator just as > Sage is now proposing to do. Our internal performance measurements show a > significant advantage over the current NewStore. That performance advantage > stems primarily from two things: > > (1) ZetaScale uses a B+-tree internally rather than an LSM tree > (levelDB/RocksDB). LSM trees experience exponential increase in write > amplification (cost of an insert) as the amount of data under management > increases. B+tree write-amplification is nearly constant independent of the > size of data under management. 
As the KV database gets larger (since NewStore > is effectively moving the per-file inode into the kv database; don't forget > the checksums that Sage wants to add :)) this performance delta swamps all > others. > (2) Having a KV and a file system causes a double lookup. This costs CPU time > and disk accesses to page in data structure indexes; metadata efficiency > decreases. > > You can't avoid (2) as long as you're using a file system. > > Yes, an LSM tree performs better on HDD than does a B-tree, which is a good > argument for keeping the KV module pluggable. > > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > -Original Message----- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org]
RE: newstore direction
Actually, range queries are an important part of the performance story, and random read speed doesn't really solve the problem. When you're doing a scrub, you need to enumerate the objects in a specific order on multiple nodes -- so that they can compare the contents of their stores in order to determine if data cleaning needs to take place. If you don't have in-order enumeration in your basic data structure (which NVMKV doesn't have) then you're forced to sort the directory before you can respond to an enumeration. That sort will either consume huge amounts of IOPS OR huge amounts of DRAM. Regardless of the choice, you'll see a significant degradation of performance while the scrub is ongoing -- which is one of the biggest problems with clustered systems (expensive and extensive maintenance operations). Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com] Sent: Thursday, October 22, 2015 1:10 AM To: Mark Nelson <mnel...@redhat.com>; Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com> Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy <somnath@sandisk.com>; ceph-devel@vger.kernel.org Subject: RE: newstore direction We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. re-invent an NVMKV; the final conclusion sounds like it's not hard with persistent memory (which will be available soon). But yeah, NVMKV will not work if no PM is present --- persisting the hash table to SSD is not practical. A range query seems not a very big issue, as the random read performance of nowadays SSDs is more than enough; I mean, even if we break all sequential IO into random (typically 70-80K IOPS, which is ~300MB/s), the performance is still good enough.
Anyway, I think for the high IOPS case it's hard for the consumer to play well on SSDs from different vendors; it would be better to leave it to the SSD vendor, something like OpenStack Cinder's structure: a vendor has the responsibility to maintain their drivers for Ceph and take care of the performance. > -Original Message- > From: Mark Nelson [mailto:mnel...@redhat.com] > Sent: Wednesday, October 21, 2015 9:36 PM > To: Allen Samuels; Sage Weil; Chen, Xiaoxi > Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > Thanks Allen! The devil is always in the details. Know of anything > else that looks promising? > > Mark > > On 10/21/2015 05:06 AM, Allen Samuels wrote: > > I doubt that NVMKV will be useful for two reasons: > > > > (1) It relies on the unique sparse-mapping addressing capabilities > > of the FusionIO VSL interface, so it won't run on standard SSDs. > > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no > range operations on keys). This is pretty much required for deep scrubbing. > > > > > > Allen Samuels > > Software Architect, Fellow, Systems and Software Solutions > > > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson > > Sent: Tuesday, October 20, 2015 6:20 AM > > To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi > > <xiaoxi.c...@intel.com> > > Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy > > <somnath@sandisk.com>; ceph-devel@vger.kernel.org > > Subject: Re: newstore direction > > > > On 10/20/2015 07:30 AM, Sage Weil wrote: > >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: > >>> +1, nowadays K-V DBs care more about very small key-value pairs, say > >>> several bytes to a few KB, but in the SSD case we only care about 4KB > >>> or 8KB.
In this way, NVMKV is a good design, and it seems some of the > >>> SSD vendors are also trying to build this kind of interface; we have > >>> an NVM-L library, but it is still under development. > >> > >> Do you have an NVMKV link? I see a paper and a stale github repo.. > >> not sure if I'm looking at the right thing. > >> > >> My concern with using a key/value interface for the object data is > >> that you end up with lots of key/value pairs (e.g., $inode_$offset = $4kb_of_data) > >> that are pretty inefficient to store and (depending on the > >> implementation) tend to break alignment. I don't think these > >> interfaces are targeted toward block-
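The enumeration problem Allen raises can be shown with two toy stores: one that keeps keys ordered (as a B-tree would) and one that is hash-addressed and must sort its whole key set before any range scan. Both classes are illustrative sketches, not real NVMKV or ZetaScale APIs:

```python
import bisect

class OrderedKV:
    """Keys kept sorted (stand-in for a B-tree): range enumeration for
    scrub is a cheap in-order walk of an already-sorted structure."""
    def __init__(self):
        self.keys, self.vals = [], {}
    def put(self, key, val):
        if key not in self.vals:
            bisect.insort(self.keys, key)
        self.vals[key] = val
    def range(self, lo, hi):
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.vals[k]) for k in self.keys[i:j]]

class HashKV(dict):
    """Hash-addressed store (NVMKV-style): enumeration has to sort the
    entire key set first -- the IOPS-or-DRAM cost Allen describes."""
    def range(self, lo, hi):
        return [(k, self[k]) for k in sorted(self) if lo <= k <= hi]

okv, hkv = OrderedKV(), HashKV()
for k in ("obj3", "obj1", "obj2", "zzz"):
    okv.put(k, len(k))
    hkv[k] = len(k)
print(okv.range("obj1", "obj9"))  # [('obj1', 4), ('obj2', 4), ('obj3', 4)]
```

Both return the same answer, but `HashKV.range` touches every key on every call; at scrub scale that full sort is exactly the maintenance-time degradation being argued about.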
Re: newstore direction
On 10/21/2015 08:53 PM, Allen Samuels wrote: Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Customers do control the pace at which they upgrade their machines, but we put out fixes at a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes that come out between major and minor releases are easy). If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (no promises of success, but people move if the win is big. If it is not, they can wait). ric
RE: newstore direction
I agree. My only point was that you still have to factor this time into the argument that by continuing to put NewStore on top of a file system you'll get to a stable system much sooner than the longer development path of doing your own raw storage allocator. IMO, once you factor that into the equation the "on top of an FS" path doesn't look like such a clear winner. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Thursday, October 22, 2015 10:17 AM To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 08:53 PM, Allen Samuels wrote: > Fixing the bug doesn't take a long time. Getting it deployed is where the > delay is. Many companies standardize on a particular release of a particular > distro. Getting them to switch to a new release -- even a "bug fix" point > release -- is a major undertaking that often is a complete roadblock. Just my > experience. YMMV. > Customers do control the pace that they upgrade their machines, but we put out fixes on a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes come out between major and minor releases are easy). If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait). ric PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. 
Re: newstore direction
On Tue, 20 Oct 2015, Ric Wheeler wrote: > > Now: > > 1 io to write a new file > >1-2 ios to sync the fs journal (commit the inode, alloc change) > >(I see 2 journal IOs on XFS and only 1 on ext4...) > > 1 io to commit the rocksdb journal (currently 3, but will drop to > >1 with xfs fix and my rocksdb change) > > I think that might be too pessimistic - the number of discrete IO's sent down > to a spinning disk makes much less impact on performance than the number of > fsync()'s, since the IO's all land in the write cache. Some newer spinning > drives have a non-volatile write cache, so even an fsync() might not end up > doing the expensive data transfer to the platter. True, but in XFS's case at least the file data and journal are not colocated, so it's 2 seeks for the new file write+fdatasync and another for the rocksdb journal commit. Of course, with a deep queue, we're doing lots of these, so there'd be fewer journal commits on both counts, but the lower bound on latency of a single write is still 3 seeks, and that bound is pretty critical when you also have network round trips and replication (worst out of 2) on top. > It would be interesting to get the timings on the IO's you see to measure the > actual impact. I observed this with the journaling workload for rocksdb, but I assume the journaling behavior is the same regardless of what is being journaled. For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and blktrace showed an IO to the file, and 2 IOs to the journal. I believe the first one is the record for the inode update, and the second is the journal 'commit' record (though I forget how I decided that). My guess is that XFS is being extremely careful about journal integrity here and not writing the commit record until it knows that the preceding records landed on stable storage. For ext4, the latency was about ~20ms, and blktrace showed the IO to the file and then a single journal IO.
When I made the rocksdb change to overwrite an existing, prewritten file, the latency dropped to ~10ms on ext4, and blktrace showed a single IO as expected. (XFS still showed the 2 journal commit IOs, but Dave just posted the fix for that on the XFS list today.) > Plumbing for T10 DIF/DIX already exists; what is missing is the normal block > device that handles them (not enterprise SAS/disk array class) Yeah... which unfortunately means that unless the cheap drives suddenly start shipping with DIF/DIX support we'll need to do the checksums ourselves. This is probably a good thing anyway, as it doesn't constrain our choice of checksum or checksum granularity, and will still work with other storage devices (ssds, nvme, etc.). sage
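The latency measurement Sage describes can be reproduced with a few lines of userspace code. Below is a minimal sketch (the scratch path /tmp/newstore-lat-test.dat is an illustrative assumption); pairing it with blktrace on the backing device shows the per-IO breakdown discussed above:

```python
import os
import time

def append_fdatasync_latency(path, size=4096, iters=8):
    """Append `size` bytes then fdatasync, returning per-op latencies in ms."""
    buf = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    lat = []
    try:
        for _ in range(iters):
            t0 = time.monotonic()
            os.write(fd, buf)
            os.fdatasync(fd)  # durability point: data plus minimal metadata
            lat.append((time.monotonic() - t0) * 1000.0)
    finally:
        os.close(fd)
    return lat

lats = append_fdatasync_latency("/tmp/newstore-lat-test.dat")
print("per-append latency (ms):", ["%.2f" % l for l in lats])
```

On a spinning disk with the FS journal on the same device, the numbers should land in the tens-of-milliseconds range Sage reports; on SSD they will be far lower.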
RE: newstore direction
I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface: essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed in customer environments). Another example: Sage has just had to substantially rework the journaling code of rocksDB. In short, as you can tell, I'm full-throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore.
That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (since NewStore is effectively moving the per-file inode into the kv database -- don't forget the checksums that Sage wants to add :)) this performance delta swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030 | M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storage object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction.
That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb > changes land... the kv commit is currently 2-3). So two people are > managing metadata, here: the fs managing the file metadata (with its > own > journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. > > - On read we have to open files by name, which means traversing the > fs namespace. Newstore tries to keep it as flat and simple as > possible, but at a minimum it is a couple btree lookups. We'd love to > use open by handle (which would reduce this to 1 btree traversal), but > running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. > > - ...and file systems insist on updating mtime on writes, even when > it is an overwrite with no
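Allen's write-amplification comparison can be given rough intuition with a toy model: under leveled LSM compaction, amplification grows with the number of levels (roughly levels times size-ratio), while a B+-tree rewrites one leaf page per update regardless of tree size. The memtable size, fanout, and entry size below are illustrative assumptions, not ZetaScale or RocksDB parameters:

```python
import math

def lsm_write_amp(data_bytes, memtable_bytes=64 << 20, fanout=10):
    """Leveled-LSM model: each level's compaction rewrites ~fanout bytes
    per byte ingested, so total amp ~ levels * fanout, growing with data."""
    levels = max(1, math.ceil(math.log(data_bytes / memtable_bytes, fanout)))
    return levels * fanout

def btree_write_amp(page_bytes=4096, entry_bytes=256):
    """B+-tree model: an update rewrites one leaf page whatever the DB size."""
    return page_bytes / entry_bytes

for tb in (1, 10, 100):
    data = tb << 40
    print(f"{tb:>3} TB: LSM ~{lsm_write_amp(data)}x, B+tree ~{btree_write_amp():.0f}x")
```

The model is deliberately crude, but it shows the shape of the argument: LSM amplification keeps climbing as the database (and the per-object metadata stored in it) grows, while the B+-tree figure stays flat.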
RE: newstore direction
I doubt that NVMKV will be useful for two reasons: (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface; it won't run on standard SSDs. (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030 | M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Tuesday, October 20, 2015 6:20 AM To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com> Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy <somnath@sandisk.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/20/2015 07:30 AM, Sage Weil wrote: > On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: >> +1, nowadays K-V DBs care more about very small key-value pairs, say >> several bytes to a few KB, but in the SSD case we only care about 4KB or >> 8KB. In this way, NVMKV is a good design and it seems some of the SSD >> vendors are also trying to build this kind of interface; we had an >> NVM-L library but it is still under development. > > Do you have an NVMKV link? I see a paper and a stale github repo.. > not sure if I'm looking at the right thing. > > My concern with using a key/value interface for the object data is > that you end up with lots of key/value pairs (e.g., $inode_$offset = > $4kb_of_data) that is pretty inefficient to store and (depending on > the > implementation) tends to break alignment. I don't think these > interfaces are targeted toward block-sized/aligned payloads. Storing > just the metadata (block allocation map) w/ the kv api and storing the > data directly on a block/page interface makes more sense to me.
> > sage I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left. Nisha Talagala is now out at Parallel Systems for instance. http://pmem.io might be a better bet, though I haven't looked closely at it. Mark > > >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI >>> Sent: Tuesday, October 20, 2015 6:21 AM >>> To: Sage Weil; Somnath Roy >>> Cc: ceph-devel@vger.kernel.org >>> Subject: RE: newstore direction >>> >>> Hi Sage and Somnath, >>>In my humble opinion, there is another, more aggressive solution >>> than a raw block device based keyvalue store as the backend for the >>> objectstore. A new key value SSD device with transaction support would >>> be ideal to solve the issues. >>> First of all, it is a raw SSD device. Secondly, it provides a key value >>> interface directly from the SSD. Thirdly, it can provide transaction >>> support; consistency will be guaranteed by the hardware device. It >>> pretty much satisfies all of the objectstore's needs without any extra >>> overhead, since there is not any extra layer in between the device and >>> the objectstore. >>> Either way, I strongly support having CEPH's own data format >>> instead of relying on a filesystem. >>> >>>Regards, >>>James >>> >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>> ow...@vger.kernel.org] On Behalf Of Sage Weil >>> Sent: Monday, October 19, 2015 1:55 PM >>> To: Somnath Roy >>> Cc: ceph-devel@vger.kernel.org >>> Subject: RE: newstore direction >>> >>> On Mon, 19 Oct 2015, Somnath Roy wrote: >>>> Sage, >>>> I fully support that. If we want to saturate SSDs , we need to get >>>> rid of this filesystem overhead (which I am in process of measuring). >>>> Also, it will be good if we can eliminate the dependency on the k/v >>>> dbs (for storing allocators and all). The reason is the unknown >>>> write amps they causes.
>>> >>> My hope is to keep behind the KeyValueDB interface (and/or change >>> it as >>> appropriate) so that other backends can be easily swapped in (e.g. a >>> btree- based one for high-end flash). >>> >>> sage >>> >>> >>>> >>>> Thanks & Regards >>>> Somnath >>>> >>>> >>>> -Original Message- >>>> From: ceph-devel-ow...@vger.kernel.org >>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behal
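Allen's second objection -- no in-order enumeration -- can be made concrete with a toy ordered KV: deep scrub needs to walk all objects of a placement group in key order, which is a prefix/range scan, something a hash-addressed store like NVMKV cannot offer. The pg/object key naming below is hypothetical, purely for illustration:

```python
import bisect

class OrderedKV:
    """Toy ordered KV: keys kept sorted so prefix/range scans are cheap."""
    def __init__(self):
        self.keys, self.vals = [], {}

    def put(self, k, v):
        if k not in self.vals:
            bisect.insort(self.keys, k)  # maintain sorted key order
        self.vals[k] = v

    def scan(self, lo, hi):
        """Yield (key, value) for lo <= key < hi, in key order."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        for k in self.keys[i:j]:
            yield k, self.vals[k]

kv = OrderedKV()
for pg, obj in [("1.a", "obj3"), ("1.a", "obj1"), ("1.b", "obj2")]:
    kv.put(f"{pg}/{obj}", b"metadata")

# Deep scrub of pg 1.a: enumerate exactly its objects, in key order.
# "1.a0" works as an exclusive upper bound because '0' sorts just after '/'.
print([k for k, _ in kv.scan("1.a/", "1.a0")])
```

A hash-only KV would force the scrubber to either track a separate object list (a second index to keep consistent) or scan every key, which is exactly the cost an ordered store avoids.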
Re: newstore direction
On 10/21/2015 06:06 AM, Allen Samuels wrote: I agree that moving newStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface: essentially you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (File Systems) and can never be tightly-connected to the other one. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. It is very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS will be widely deployed in customer environments). It is not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS.
Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them; other distros do as well. I have never seen a "bug" take a couple of years to hit users. Regards, Ric Another example: Sage has just had to substantially rework the journaling code of rocksDB. In short, as you can tell, I'm full-throated in favor of going down the optimal route. Internally at Sandisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (levelDB/RocksDB). LSM trees experience exponential increase in write amplification (cost of an insert) as the amount of data under management increases. B+tree write-amplification is nearly constant independent of the size of data under management. As the KV database gets larger (since NewStore is effectively moving the per-file inode into the kv database -- don't forget the checksums that Sage wants to add :)) this performance delta swamps all others. (2) Having a KV and a file-system causes a double lookup. This costs CPU time and disk accesses to page in data structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than does a B-tree, which is a good argument for keeping the KV module pluggable.
Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030 | M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storage object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fs
Re: newstore direction
On 10/21/2015 04:22 AM, Orit Wasserman wrote: On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote: On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storage object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. - XFS is (probably) never going to give us data checksums, which we want desperately.
What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks? If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata. The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on-disk file system, which is always easier to start than it is to get to a stable, production-ready state. The best performance is still on a block device (SAN). File systems simplify operational tasks, which is worth the performance penalty for a database. I think in a storage system this is not the case. In many cases they can use their own file system that is tailored for the database. You will have to trust me on this as the Red Hat person who spoke to pretty much all of our key customers about local file systems and storage - customers all have migrated over to using normal file systems under Oracle/DB2. Typically, they use XFS or ext4. I don't know of anyone running non-standard file systems, and I have only seen one account running on a raw block store in 8 years :) If you have a pre-allocated file and write using O_DIRECT, your IO path is identical in terms of IO's sent to the device. If we are causing additional IO's, then we really need to spend some time talking to the local file system gurus about this in detail. I can help with that conversation.
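On the checksum question, doing them in the object store is straightforward in principle. Here is a hedged sketch of per-chunk crc32 at a self-chosen granularity (4KB here -- the granularity being ours to pick is exactly the flexibility discussed in this thread; crc32 itself is just an illustrative choice, not the proposed algorithm):

```python
import zlib

CHUNK = 4096  # checksum granularity is a free design parameter

def checksum_blocks(data, chunk=CHUNK):
    """Return one crc32 per `chunk`-sized slice of the object data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def verify_blocks(data, sums, chunk=CHUNK):
    """Return indices of chunks whose stored crc32 no longer matches."""
    return [n for n, (i, s) in enumerate(zip(range(0, len(data), chunk), sums))
            if zlib.crc32(data[i:i + chunk]) != s]

obj = bytes(range(256)) * 64               # 16 KB object -> 4 chunks
sums = checksum_blocks(obj)                # stored alongside the extent map
assert verify_blocks(obj, sums) == []      # clean read verifies
corrupt = obj[:5000] + b"X" + obj[5001:]   # flip a byte inside chunk 1
print(verify_blocks(corrupt, sums))
```

The per-chunk sums would live in the kv metadata next to the allocation map, so a deep scrub reads the data once and compares against them; finer chunks localize the damage at the cost of more metadata.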
I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have. Wins: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). - No concern about mtime getting in the way - Faster reads (no fs lookup) - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now. Problems: - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those
Re: newstore direction
On Tue, 2015-10-20 at 14:31 -0400, Ric Wheeler wrote: > On 10/19/2015 03:49 PM, Sage Weil wrote: > > The current design is based on two simple ideas: > > > > 1) a key/value interface is a better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb changes > > land... the kv commit is currently 2-3). So two people are managing > > metadata, here: the fs managing the file metadata (with its own > > journal) and the kv backend (with its journal). > > If all of the fsync()'s fall into the same backing file system, are you sure > that each fsync() takes the same time? Depending on the local FS > implementation > of course, but the order of issuing those fsync()'s can effectively make some > of > them no-ops. > > > > > - On read we have to open files by name, which means traversing the fs > > namespace. Newstore tries to keep it as flat and simple as possible, but > > at a minimum it is a couple btree lookups. We'd love to use open by > > handle (which would reduce this to 1 btree traversal), but running > > the daemon as ceph and not root makes that hard... > > This seems like a pretty low hurdle to overcome. > > > > > - ...and file systems insist on updating mtime on writes, even when it is > > an overwrite with no allocation changes. (We don't care about mtime.) > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > brainfreeze. > > Are you using O_DIRECT? Seems like there should be some enterprisey database > tricks that we can use here.
> > > > - XFS is (probably) never going to give us data checksums, which we > > want desperately. > > What is the goal of having the file system do the checksums? How strong do > they > need to be and what size are the chunks? > > If you update this on each IO, this will certainly generate more IO (each > write > will possibly generate at least one other write to update that new checksum). > > > > > But what's the alternative? My thought is to just bite the bullet and > > consume a raw block device directly. Write an allocator, hopefully keep > > it pretty simple, and manage it in kv store along with all of our other > > metadata. > > The big problem with consuming block devices directly is that you ultimately > end > up recreating most of the features that you had in the file system. Even > enterprise databases like Oracle and DB2 have been migrating away from > running > on raw block devices in favor of file systems over time. In effect, you are > looking at making a simple on-disk file system, which is always easier to > start > than it is to get to a stable, production-ready state. The best performance is still on a block device (SAN). File systems simplify operational tasks, which is worth the performance penalty for a database. I think in a storage system this is not the case. In many cases they can use their own file system that is tailored for the database. > I think that it might be quicker and more maintainable to spend some time > working with the local file system people (XFS or other) to see if we can > jointly address the concerns you have. > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, one to commit our transaction (vs 4+ before). For overwrites, > > we'd have one io to do our write-ahead log (kv journal), then do > > the overwrite async (vs 4+ before).
> > > > - No concern about mtime getting in the way > > > > - Faster reads (no fs lookup) > > > > - Similarly sized metadata for most objects. If we assume most objects > > are not fragmented, then the metadata to store the block offsets is about > > the same size as the metadata to store the filenames we have now. > > > > Problems: > > > > - We have to size the kv backend storage (probably still an XFS > > partition) vs the block storage. Maybe we do this anyway (put metadata on > > SSD!) so it won't matter. But what happens when we are storing gobs of > > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > > a different pool and those aren't currently fungible. > > > > - We have to write and maintain an allocator. I'm still optimistic this > > can be reasonably simple, especially for the flash case (where > > fragmentation isn't such an issue as long as our blocks are reasonably > > sized). For disk we may need to be moderately clever. > > > > - We'll need a fsck to ensure our internal metadata is consistent. The > > good news is it'll just need to validate
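To give a sense of scale for the "reasonably simple" allocator mentioned above, a deliberately naive first-fit bitmap allocator fits in a few dozen lines. The 4KB block size and first-fit policy are illustrative choices, not a proposal for the real design:

```python
class BitmapAllocator:
    """Naive first-fit block allocator over a fixed-size device."""
    def __init__(self, device_bytes, block=4096):
        self.block = block
        self.free = [True] * (device_bytes // block)

    def alloc(self, nbytes):
        need = -(-nbytes // self.block)  # ceil-div to whole blocks
        run = 0
        for i, f in enumerate(self.free):
            run = run + 1 if f else 0
            if run == need:              # found a contiguous free extent
                start = i - need + 1
                self.free[start:i + 1] = [False] * need
                return start * self.block  # byte offset on the device
        raise MemoryError("no contiguous extent available")

    def release(self, offset, nbytes):
        start = offset // self.block
        need = -(-nbytes // self.block)
        self.free[start:start + need] = [True] * need

a = BitmapAllocator(64 * 4096)
o1 = a.alloc(10000)   # 3 blocks
o2 = a.alloc(4096)    # 1 block, placed after o1
a.release(o1, 10000)
o3 = a.alloc(4096)    # first-fit reuses the freed space
print(o1, o2, o3)
```

The state here would of course have to be persisted in the kv store and rebuilt/validated by the fsck the thread mentions; the "moderately clever" disk version would add extent merging and locality-aware placement.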
RE: newstore direction
> -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 9:49 PM > > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storage object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > [..] > > But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata. This is pretty much reinventing the file system, but... I actually did something similar for my personal project (an e-mail client), moving from a maildir-like structure (each message was one file) to something resembling mbox (one large file per mail folder, containing pre-decoded structures for fast and easy access). And this worked out really well, especially with searches and bulk processing (filtering by body contents, and so on). I don't remember exact figures, but the performance benefit was at least an order of magnitude. If huge amounts of small-to-medium (0-128k) objects are the target, this is the way to go. The most serious issue was fragmentation. Since I actually put my box files on top of an actual FS (here: NTFS), low-level fragmentation was not a problem (each message was read and written in one fread/fwrite anyway). High-level fragmentation was an issue - each time a message was moved away, it still occupied space.
To combat this, I wrote a space reclaimer that moved messages within the box file (consolidated them) and maintained a bitmap of 4k free spaces, so I could re-use unused space without taking too much time iterating through messages and without calling the reclaimer. Also, the reclaimer was smart enough to not move messages one-by-one; instead it loaded up to n messages in at most n reads (in the common case it was less than that), wrote them in one call, and did its work until some space was actually reclaimed, instead of doing full garbage collection. The machinery was also aware of the fact that messages were (mostly) appended to the end of the box, so instead of blindly doing that, it moved the end-of-box pointer back once messages at the end of the box were deleted. The other issue was reliability. Obviously, I had the option of a secondary temp file, but still, everything above is doable without that. Benefits included reduced requirements for metadata storage. Instead of generating a unique ID (filename) for each message (apparently, the message-id header is not reliable in that regard), I just stored offset and size (8+4 bytes per message), which, for 300 thousand messages, calculated to just 3.5MB of memory and could be kept in RAM. I/O performance also improved due to a less random access pattern (messages were physically close to each other instead of being scattered all over the drive). For Ceph, benefits could be even greater. I can imagine faster deep scrubs that are way more efficient on spinning drives; efficient object storage (no per-object fragmentation and less disk-intensive object readahead, maybe with better support from hardware); possibly more reliability (when we fsync, we actually fsync - we don't get cheated by the underlying FS); and we could get it optimized for particular devices (for example, most SSDs suck like vacuum on I/Os below 4k, so we could enforce I/Os of at least 4k). Just my 0.02$.
With best regards / Pozdrawiam Piotr Dałek
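The scheme Piotr describes -- one big box file, an (offset, size) index kept in RAM, and reuse of freed space before growing the file -- can be sketched as follows. BytesIO stands in for the on-disk box file; a real version would fsync on commit and run the consolidating reclaimer he describes (note this naive sketch also leaks the tail of any partially reused hole, which the reclaimer would reclaim):

```python
import io

class BoxFile:
    """Toy mbox-style packed store: append records, index by (offset, size)."""
    def __init__(self):
        self.f = io.BytesIO()   # stands in for the on-disk box file
        self.index = {}         # record id -> (offset, size)
        self.holes = []         # freed (offset, size) extents

    def put(self, rid, data):
        for n, (off, size) in enumerate(self.holes):
            if size >= len(data):   # reuse a hole instead of appending
                del self.holes[n]
                break
        else:
            self.f.seek(0, io.SEEK_END)
            off = self.f.tell()     # no hole fits: append at end of box
        self.f.seek(off)
        self.f.write(data)
        self.index[rid] = (off, len(data))

    def get(self, rid):
        off, size = self.index[rid]
        self.f.seek(off)
        return self.f.read(size)

    def delete(self, rid):
        self.holes.append(self.index.pop(rid))  # space becomes reusable

box = BoxFile()
box.put("m1", b"a" * 100)
box.put("m2", b"b" * 50)
box.delete("m1")
box.put("m3", b"c" * 80)   # lands in m1's old slot
print(box.index["m3"], box.get("m3")[:3])
```

At 12 bytes of index per record this matches Piotr's arithmetic: 300k records is ~3.5MB of RAM, comfortably resident.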
Re: newstore direction
On Tue, 20 Oct 2015, Haomai Wang wrote: > On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil wrote: > > The current design is based on two simple ideas: > > > > 1) a key/value interface is a better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb changes > > land... the kv commit is currently 2-3). So two people are managing > > metadata, here: the fs managing the file metadata (with its own > > journal) and the kv backend (with its journal). > > > > - On read we have to open files by name, which means traversing the fs > > namespace. Newstore tries to keep it as flat and simple as possible, but > > at a minimum it is a couple btree lookups. We'd love to use open by > > handle (which would reduce this to 1 btree traversal), but running > > the daemon as ceph and not root makes that hard... > > > > - ...and file systems insist on updating mtime on writes, even when it is > > an overwrite with no allocation changes. (We don't care about mtime.) > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > brainfreeze. > > > > - XFS is (probably) never going to give us data checksums, which we > > want desperately. > > > > But what's the alternative? My thought is to just bite the bullet and > > consume a raw block device directly. Write an allocator, hopefully keep > > it pretty simple, and manage it in kv store along with all of our other > > metadata. > > This is really a tough decision.
Although the idea of making a block device based > objectstore has never left my mind since two years ago. > > We would be much more concerned about the effectiveness of space > utilization compared to a local fs, the bugs, and the time consumed > building a tiny local filesystem. I'm a little afraid we would get stuck in > > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, one to commit our transaction (vs 4+ before). For overwrites, > > we'd have one io to do our write-ahead log (kv journal), then do > > the overwrite async (vs 4+ before). > > Compared to filejournal, it seemed keyvaluedb didn't play well in the WAL > area in my perf tests. With this change it is close to parity: https://github.com/facebook/rocksdb/pull/746 > > - No concern about mtime getting in the way > > > > - Faster reads (no fs lookup) > > > > - Similarly sized metadata for most objects. If we assume most objects > > are not fragmented, then the metadata to store the block offsets is about > > the same size as the metadata to store the filenames we have now. > > > > Problems: > > > > - We have to size the kv backend storage (probably still an XFS > > partition) vs the block storage. Maybe we do this anyway (put metadata on > > SSD!) so it won't matter. But what happens when we are storing gobs of > > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > > a different pool and those aren't currently fungible. > > > > - We have to write and maintain an allocator. I'm still optimistic this > > can be reasonably simple, especially for the flash case (where > > fragmentation isn't such an issue as long as our blocks are reasonably > > sized). For disk we may need to be moderately clever. > > > > - We'll need a fsck to ensure our internal metadata is consistent. The > > good news is it'll just need to validate what we have stored in the kv > > store.
> > > > Other thoughts: > > > > - We might want to consider whether dm-thin or bcache or other block > > layers might help us with elasticity of file vs block areas. > > > > - Rocksdb can push colder data to a second directory, so we could have a > > fast ssd primary area (for wal and most metadata) and a second hdd > > directory for stuff it has to push off. Then have a conservative amount > > of file space on the hdd. If our block fills up, use the existing file > > mechanism to put data there too. (But then we have to maintain both the > > current kv + file approach and not go all-in on kv + block.) > > A complex way... > > Actually I would like to employ a FileStore2 impl, which means we still > use FileJournal (or similar). But we need to employ more memory to > keep metadata/xattrs and use aio+dio to flush to disk. A userspace > pagecache would need to be implemented. Then we could skip the journal for > full writes; because OSD PGs are isolated, we could make a barrier for a > single PG when skipping the journal. @Sage Are there other concerns about > filestore skipping the journal? > > In a word, I like the model that filestore owns, but
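The two-IO write path proposed in this thread (object data written to unused space on a raw device, then a single kv commit for the extent metadata) can be sketched as a toy model. This is illustrative only, not Ceph code: the trivial allocator, the dict standing in for rocksdb, and all names here are invented for the sketch.

```python
# Toy sketch of the raw-block write path: IO 1 puts data in free blocks,
# IO 2 commits the extent map as one kv "transaction".
import json
import os
import tempfile

BLOCK = 4096

class ToyBlockStore:
    def __init__(self, dev_path, dev_blocks=256):
        self.fd = os.open(dev_path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, dev_blocks * BLOCK)
        self.free = list(range(dev_blocks))   # trivial free-block allocator
        self.kv = {}                          # stands in for rocksdb

    def write(self, name, data):
        nblocks = (len(data) + BLOCK - 1) // BLOCK
        extents = [self.free.pop(0) for _ in range(nblocks)]
        for i, blk in enumerate(extents):     # IO 1: data to unused space
            os.pwrite(self.fd, data[i * BLOCK:(i + 1) * BLOCK], blk * BLOCK)
        os.fdatasync(self.fd)                 # data must be durable first
        # IO 2: commit object metadata (extent map) in one kv transaction
        self.kv[name] = json.dumps({"len": len(data), "extents": extents})

    def read(self, name):
        meta = json.loads(self.kv[name])      # no fs namespace traversal
        out = b"".join(os.pread(self.fd, BLOCK, blk * BLOCK)
                       for blk in meta["extents"])
        return out[:meta["len"]]

store = ToyBlockStore(tempfile.mktemp())
store.write("obj1", b"hello newstore")
assert store.read("obj1") == b"hello newstore"
```

Reads go straight from the kv-held extent map to the device, which is the "faster reads (no fs lookup)" win listed above.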
RE: newstore direction
On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: > +1, nowadays K-V DB care more about very small key-value pairs, say > several bytes to a few KB, but in SSD case we only care about 4KB or > 8KB. In this way, NVMKV is a good design and seems some of the SSD > vendor are also trying to build this kind of interface, we had an NVM-L > library but still under development. Do you have an NVMKV link? I see a paper and a stale github repo.. not sure if I'm looking at the right thing. My concern with using a key/value interface for the object data is that you end up with lots of key/value pairs (e.g., $inode_$offset = $4kb_of_data) that are pretty inefficient to store and (depending on the implementation) tend to break alignment. I don't think these interfaces are targeted toward block-sized/aligned payloads. Storing just the metadata (block allocation map) w/ the kv api and storing the data directly on a block/page interface makes more sense to me. sage > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > > Sent: Tuesday, October 20, 2015 6:21 AM > > To: Sage Weil; Somnath Roy > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > Hi Sage and Somnath, > > In my humble opinion, There is another more aggressive solution than raw > > block device base keyvalue store as backend for objectstore. The new key > > value SSD device with transaction support would be ideal to solve the > > issues. > > First of all, it is raw SSD device. Secondly , It provides key value > > interface > > directly from SSD. Thirdly, it can provide transaction support, consistency > > will > > be guaranteed by hardware device. It pretty much satisfied all of > > objectstore > > needs without any extra overhead since there is not any extra layer in > > between device and objectstore. > >Either way, I strongly support to have CEPH own data format instead of > > relying on filesystem.
> > > > Regards, > > James > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > > ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 1:55 PM > > To: Somnath Roy > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > > Sage, > > > I fully support that. If we want to saturate SSDs , we need to get > > > rid of this filesystem overhead (which I am in process of measuring). > > > Also, it will be good if we can eliminate the dependency on the k/v > > > dbs (for storing allocators and all). The reason is the unknown write > > > amps they causes. > > > > My hope is to keep behing the KeyValueDB interface (and/more change it as > > appropriate) so that other backends can be easily swapped in (e.g. a btree- > > based one for high-end flash). > > > > sage > > > > > > > > > > Thanks & Regards > > > Somnath > > > > > > > > > -Original Message- > > > From: ceph-devel-ow...@vger.kernel.org > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > > Sent: Monday, October 19, 2015 12:49 PM > > > To: ceph-devel@vger.kernel.org > > > Subject: newstore direction > > > > > > The current design is based on two simple ideas: > > > > > > 1) a key/value interface is better way to manage all of our internal > > > metadata (object metadata, attrs, layout, collection membership, > > > write-ahead logging, overlay data, etc.) > > > > > > 2) a file system is well suited for storage object data (as files). > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > > few > > > things: > > > > > > - We currently write the data to the file, fsync, then commit the kv > > > transaction. That's at least 3 IOs: one for the data, one for the fs > > > journal, one for the kv txn to commit (at least once my rocksdb > > > changes land... the kv commit is currently 2-3). 
So two people are > > > managing metadata, here: the fs managing the file metadata (with its > > > own > > > journal) and the kv backend (with its journal). > > > > > > - On read we have to open files by name, which means traversing the fs > > namespace. Newstore tries to keep it as flat and simple as possible, but > > at a > > minim
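Sage's metadata-overhead point (one kv pair per 4KB chunk vs. one extent record per object) can be made concrete with rough arithmetic. The key format and byte counts below are assumptions for illustration, not measurements.

```python
# Rough comparison: kv-pair-per-chunk metadata vs. a single extent map.
OBJ_SIZE = 4 * 1024 * 1024                     # assume a 4MB RADOS object
CHUNK = 4096

# Scheme A: one kv pair per 4KB chunk, keyed like "$inode_$offset"
num_keys = OBJ_SIZE // CHUNK                   # 1024 key/value pairs
example_key = "inode123_0000000000"            # hypothetical key format
key_overhead = num_keys * len(example_key)     # key bytes alone, per object

# Scheme B: kv stores only the allocation map; data goes on raw block
extents = [(0, OBJ_SIZE)]                      # one contiguous (offset, len)
map_overhead = 16 * len(extents)               # ~16 bytes per extent record

assert num_keys == 1024
assert key_overhead // map_overhead > 1000     # ~3 orders of magnitude apart
```

For an unfragmented object, the extent map stays constant-size no matter how large the object grows, which is the "similarly sized metadata" win from the original proposal.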
RE: newstore direction
On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote: > Hi Sage and Somnath, > In my humble opinion, There is another more aggressive solution than > raw block device base keyvalue store as backend for objectstore. The new > key value SSD device with transaction support would be ideal to solve > the issues. First of all, it is raw SSD device. Secondly , It provides > key value interface directly from SSD. Thirdly, it can provide > transaction support, consistency will be guaranteed by hardware device. > It pretty much satisfied all of objectstore needs without any extra > overhead since there is not any extra layer in between device and > objectstore. Are you talking about open channel SSDs? Or something else? Everything I'm familiar with that is currently shipping is exposing a vanilla block interface (conventional SSDs) that hides all of that or NVMe (which isn't much better). If there is a low-level KV interface we can consume that would be great--especially if we can glue it to our KeyValueDB abstract API. Even so, we need to make sure that the object *data* also has an efficient API we can utilize that efficiently handles block-sized/aligned data. sage >Either way, I strongly support to have CEPH own data format instead > of relying on filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM > To: Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > Sage, > > I fully support that. If we want to saturate SSDs , we need to get > > rid of this filesystem overhead (which I am in process of measuring). > > Also, it will be good if we can eliminate the dependency on the k/v > > dbs (for storing allocators and all). The reason is the unknown write > > amps they causes. 
> > My hope is to keep behind the KeyValueDB interface (and/or change it as > appropriate) so that other backends can be easily swapped in (e.g. a > btree-based one for high-end flash). > > sage > > > > > > Thanks & Regards > > Somnath > > > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 12:49 PM > > To: ceph-devel@vger.kernel.org > > Subject: newstore direction > > > > The current design is based on two simple ideas: > > > > 1) a key/value interface is a better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storing object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb > > changes land... the kv commit is currently 2-3). So two people are > > managing metadata, here: the fs managing the file metadata (with its > > own > > journal) and the kv backend (with its journal). > > > > - On read we have to open files by name, which means traversing the fs > > namespace. Newstore tries to keep it as flat and simple as possible, but > > at a minimum it is a couple btree lookups. We'd love to use open by handle > > (which would reduce this to 1 btree traversal), but running the daemon as > > ceph and not root makes that hard... > > > > - ...and file systems insist on updating mtime on writes, even when it is > > an overwrite with no allocation changes. (We don't care about mtime.) > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > brainfreeze.
> > > > - XFS is (probably) never going to give us data checksums, which we > > want desperately. > > > > But what's the alternative? My thought is to just bite the bullet and > > consume a raw block device directly. Write an allocator, hopefully keep it > > pretty simple, and manage it in kv store along with all of our other > > metadata. > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, one to commit our transaction (vs 4+ before). For overwrites, we'd > > have one io to do our write-ahead log
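The swappable KeyValueDB abstraction Sage mentions can be sketched roughly as follows. Ceph's actual KeyValueDB C++ interface is richer than this; the class names and the tuple-based transaction format here are hypothetical, chosen only to show the shape of a backend-agnostic, transaction-oriented surface that rocksdb, a btree store for flash, or a KV SSD could each implement.

```python
# Minimal sketch of a swappable key/value backend interface.
from abc import ABC, abstractmethod

class KeyValueDB(ABC):
    """Hypothetical, simplified stand-in for a pluggable kv interface."""
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def submit_transaction(self, ops): ...

class MemDB(KeyValueDB):
    """In-memory backend, the kind of thing used for mock tests."""
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def submit_transaction(self, ops):
        # ops is a list like ("set", key, value) or ("rm", key); a real
        # backend would apply the batch atomically and durably.
        for op in ops:
            if op[0] == "set":
                self.data[op[1]] = op[2]
            elif op[0] == "rm":
                self.data.pop(op[1], None)

db = MemDB()
db.submit_transaction([("set", b"onode:obj1", b"extent-map"),
                       ("set", b"wal:0001", b"overwrite-record")])
assert db.get(b"onode:obj1") == b"extent-map"
```

Keeping callers behind an interface like this is what lets the backend be swapped (rocksdb today, something btree-based for high-end flash later) without touching the objectstore logic.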
Re: newstore direction
On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata, here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. - XFS is (probably) never going to give us data checksums, which we want desperately. What is the goal of having the file system do the checksums? How strong do they need to be and what size are the chunks?
If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in kv store along with all of our other metadata. The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state. I think that it might be quicker and more maintainable to spend some time working with the local file system people (XFS or other) to see if we can jointly address the concerns you have. Wins: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). - No concern about mtime getting in the way - Faster reads (no fs lookup) - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now. Problems: - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those aren't currently fungible. - We have to write and maintain an allocator. 
I'm still optimistic this can be reasonably simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonably sized). For disk we may need to be moderately clever. - We'll need a fsck to ensure our internal metadata is consistent. The good news is it'll just need to validate what we have stored in the kv store. Other thoughts: - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas. - Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for wal and most metadata) and a second hdd directory for stuff it has to push off. Then have a conservative amount of file space on the hdd. If our block fills up, use the existing file mechanism to put data there too. (But then we have to maintain both the current kv + file approach and not go all-in on kv + block.) Thoughts? sage -- I really hate the
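The fsync batching pattern discussed in this thread (issue a batch of writes with no fsync, then make one sync pass, paying the worst-case latency once per batch rather than once per file) looks roughly like this. A minimal sketch, assuming a local filesystem where earlier fsyncs can make later ones into near no-ops by flushing the shared fs journal.

```python
# Phase 1: queue all writes without syncing. Phase 2: one fsync pass.
import os
import tempfile

def write_batch(dirpath, files):
    fds = []
    for name, data in files:                 # phase 1: writes only, no sync
        fd = os.open(os.path.join(dirpath, name),
                     os.O_WRONLY | os.O_CREAT, 0o600)
        os.write(fd, data)
        fds.append(fd)
    for fd in fds:                           # phase 2: a single sync pass;
        os.fsync(fd)                         # later fsyncs can be cheap if
        os.close(fd)                         # earlier ones flushed the journal

d = tempfile.mkdtemp()
write_batch(d, [("a", b"1"), ("b", b"2"), ("c", b"3")])
assert open(os.path.join(d, "b"), "rb").read() == b"2"
```

As noted earlier in the thread, ordered writes (if filesystems supported them) would let the fsyncs disappear entirely; with fsync the batch still pays one full cache-destage, but only one.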
Re: newstore direction
On Tue, Oct 20, 2015 at 6:19 AM, Mark Nelson <mnel...@redhat.com> wrote: > On 10/20/2015 07:30 AM, Sage Weil wrote: >> >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: >>> >>> +1, nowadays K-V DB care more about very small key-value pairs, say >>> several bytes to a few KB, but in SSD case we only care about 4KB or >>> 8KB. In this way, NVMKV is a good design and seems some of the SSD >>> vendor are also trying to build this kind of interface, we had a NVM-L >>> library but still under development. >> >> >> Do you have an NVMKV link? I see a paper and a stale github repo.. not >> sure if I'm looking at the right thing. >> >> My concern with using a key/value interface for the object data is that >> you end up with lots of key/value pairs (e.g., $inode_$offset = >> $4kb_of_data) that is pretty inefficient to store and (depending on the >> implementation) tends to break alignment. I don't think these interfaces >> are targeted toward block-sized/aligned payloads. Storing just the >> metadata (block allocation map) w/ the kv api and storing the data >> directly on a block/page interface makes more sense to me. >> >> sage > > > I get the feeling that some of the folks that were involved with nvmkv at > Fusion IO have left. Nisha Talagala is now out at Parallel Systems for > instance. http://pmem.io might be a better bet, though I haven't looked > closely at it. > IMO pmem.io is more suited for SCM (Storage Class Memory) than for SSDs. If Newstore is targeted towards production deployments (eventually replacing FileStore someday) then IMO I agree with Sage, i.e. rely on a file system for doing block allocation.
-Neo > Mark > > >> >> >>>> -Original Message- >>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI >>>> Sent: Tuesday, October 20, 2015 6:21 AM >>>> To: Sage Weil; Somnath Roy >>>> Cc: ceph-devel@vger.kernel.org >>>> Subject: RE: newstore direction >>>> >>>> Hi Sage and Somnath, >>>>In my humble opinion, There is another more aggressive solution than >>>> raw >>>> block device base keyvalue store as backend for objectstore. The new key >>>> value SSD device with transaction support would be ideal to solve the >>>> issues. >>>> First of all, it is raw SSD device. Secondly , It provides key value >>>> interface >>>> directly from SSD. Thirdly, it can provide transaction support, >>>> consistency will >>>> be guaranteed by hardware device. It pretty much satisfied all of >>>> objectstore >>>> needs without any extra overhead since there is not any extra layer in >>>> between device and objectstore. >>>> Either way, I strongly support to have CEPH own data format instead >>>> of >>>> relying on filesystem. >>>> >>>>Regards, >>>>James >>>> >>>> -Original Message- >>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>>> ow...@vger.kernel.org] On Behalf Of Sage Weil >>>> Sent: Monday, October 19, 2015 1:55 PM >>>> To: Somnath Roy >>>> Cc: ceph-devel@vger.kernel.org >>>> Subject: RE: newstore direction >>>> >>>> On Mon, 19 Oct 2015, Somnath Roy wrote: >>>>> >>>>> Sage, >>>>> I fully support that. If we want to saturate SSDs , we need to get >>>>> rid of this filesystem overhead (which I am in process of measuring). >>>>> Also, it will be good if we can eliminate the dependency on the k/v >>>>> dbs (for storing allocators and all). The reason is the unknown write >>>>> amps they causes. >>>> >>>> >>>> My hope is to keep behing the KeyValueDB interface (and/more change it >>>> as >>>> appropriate) so that other backends can be easily swapped in (e.g. 
a >>>> btree- >>>> based one for high-end flash). >>>> >>>> sage >>>> >>>> >>>>> >>>>> Thanks & Regards >>>>> Somnath >>>>> >>>>> >>>>> -Original Message- >>>>> From: ceph-devel-ow...@vger.kernel.org >>>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil >>>>> Sent: Monday, October 19, 2015 12:49 PM >>
Re: newstore direction
On Tue, 20 Oct 2015, Ric Wheeler wrote: > On 10/19/2015 03:49 PM, Sage Weil wrote: > > The current design is based on two simple ideas: > > > > 1) a key/value interface is better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb changes > > land... the kv commit is currently 2-3). So two people are managing > > metadata, here: the fs managing the file metadata (with its own > > journal) and the kv backend (with its journal). > > If all of the fsync()'s fall into the same backing file system, are you sure > that each fsync() takes the same time? Depending on the local FS > implementation of course, but the order of issuing those fsync()'s can > effectively make some of them no-ops. Surely, yes, but the fact remains we are maintaining two journals: one internal to the fs that manages the allocation metadata, and one layered on top that handles the kv store's write stream. The lower bound on any write is 3 IOs (unless we're talking about a COW fs). > > - On read we have to open files by name, which means traversing the fs > > namespace. Newstore tries to keep it as flat and simple as possible, but > > at a minimum it is a couple btree lookups. We'd love to use open by > > handle (which would reduce this to 1 btree traversal), but running > > the daemon as ceph and not root makes that hard... > > This seems like a a pretty low hurdle to overcome. I wish you luck convincing upstream to allow unprivileged access to open_by_handle or the XFS ioctl. 
:) But even if we had that, any object access requires multiple metadata lookups: one in our kv db, and a second to get the inode for the backing file. Again, there's an unnecessary lower bound on the number of IOs needed to access a cold object. > > - ...and file systems insist on updating mtime on writes, even when it is > > an overwrite with no allocation changes. (We don't care about mtime.) > > O_NOCMTIME patches exist but it is hard to get these past the kernel > > brainfreeze. > > Are you using O_DIRECT? Seems like there should be some enterprisey database > tricks that we can use here. It's not about the data path, but avoiding the useless bookkeeping the file system is doing that we don't want or need. See the recent reception of Zach's O_NOCMTIME patches on linux-fsdevel: http://marc.info/?t=14309496981=1=2 I'm generally an optimist when it comes to introducing new APIs upstream, but I still found this to be an unbelievably frustrating exchange. > > - XFS is (probably) never going to give us data checksums, which we > > want desperately. > > What is the goal of having the file system do the checksums? How strong do > they need to be and what size are the chunks? > > If you update this on each IO, this will certainly generate more IO (each > write will possibly generate at least one other write to update that new > checksum). Not if we keep the checksums with the allocation metadata, in the onode/inode, which we're already doing an IO to persist. But whether that is practical depends on the granularity (4KB or 16K or 128K or ...), which may in turn depend on the object (RBD block that'll service random 4K reads and writes? or RGW fragment that is always written sequentially?). I'm highly skeptical we'd ever get anything from a general-purpose file system that would work well here (if anything at all). > > But what's the alternative? My thought is to just bite the bullet and > > consume a raw block device directly.
Write an allocator, hopefully keep > > it pretty simple, and manage it in kv store along with all of our other > > metadata. > > The big problem with consuming block devices directly is that you ultimately > end up recreating most of the features that you had in the file system. Even > enterprise databases like Oracle and DB2 have been migrating away from running > on raw block devices in favor of file systems over time. In effect, you are > looking at making a simple on disk file system which is always easier to start > than it is to get back to a stable, production ready state. This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had everything we were implementing and more: mainly, copy on write and data checksums. But in practice the fact that it is general purpose means it targets very different workloads and APIs than what we need. Now that I've realized the POSIX file namespace is a bad fit for what we need and opted to manage that directly, things are
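Sage's point about keeping checksums with the allocation metadata can be sketched as follows. This is a toy model: the Onode layout is invented for illustration, and zlib.crc32 stands in for the crc32c a real implementation would more likely use. The key idea is that each extent record carries its data checksum, so persisting the checksum rides along with the metadata IO that happens anyway.

```python
# Per-extent checksums stored in the (hypothetical) onode's extent map,
# verified on every read to detect on-device corruption.
import zlib

class Onode:
    def __init__(self):
        self.extents = []   # list of (offset, length, checksum)

def write_extent(device, onode, offset, data):
    device[offset:offset + len(data)] = data
    # checksum is recorded in the same metadata structure we must
    # persist anyway, so it costs no extra IO of its own
    onode.extents.append((offset, len(data), zlib.crc32(data)))

def read_extent(device, onode, i):
    off, length, csum = onode.extents[i]
    data = bytes(device[off:off + length])
    if zlib.crc32(data) != csum:
        raise IOError("checksum mismatch: data corrupted on device")
    return data

dev = bytearray(1 << 16)                  # stands in for the raw device
node = Onode()
write_extent(dev, node, 4096, b"payload")
assert read_extent(dev, node, 0) == b"payload"
dev[4096] ^= 0xFF                         # simulate on-disk corruption
try:
    read_extent(dev, node, 0)
    assert False, "corruption should have been detected"
except IOError:
    pass
```

The granularity question from the thread maps to how large each extent (and thus each checksummed unit) is allowed to grow: small extents suit random 4K RBD reads, large ones suit sequentially written RGW fragments.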
Re: newstore direction
Adding to this, On Tue, 2015-10-20 at 05:34 -0700, Sage Weil wrote: > On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote: > > Hi Sage and Somnath, > > In my humble opinion, There is another more aggressive solution than > > raw block device base keyvalue store as backend for objectstore. The new > > key value SSD device with transaction support would be ideal to solve > > the issues. First of all, it is raw SSD device. Secondly , It provides > > key value interface directly from SSD. Thirdly, it can provide > > transaction support, consistency will be guaranteed by hardware device. > > It pretty much satisfied all of objectstore needs without any extra > > overhead since there is not any extra layer in between device and > > objectstore. > > Are you talking about open channel SSDs? Or something else? Everything > I'm familiar with that is currently shipping is exposing a vanilla block > interface (conventional SSDs) that hides all of that or NVMe (which isn't > much better). > > If there is a low-level KV interface we can consume that would be > great--especially if we can glue it to our KeyValueDB abstract API. Even > so, we need to make sure that the object *data* also has an efficient API > we can utilize that efficiently handles block-sized/aligned data. If there's a way to efficiently utilize more generic NVRAM-based block devices for quick metadata ops such that payload data can fly without much delay, I'd be quite happy. Also, a current concern of mine is backups in some fashion of the metadata, given risk for (human configuration error||device malfunction)&&(cluster wide power outage). Some type of flushing to underlying consistent media, and/or snapshot-like backups. As long as the constructs aren't too exotic, perhaps this could be addressed using standard Linux FS or device mapper code (bcache, or other) Not sure how popular journals on NVRAM is. But here's one user at least. 
/M > sage > > > >Either way, I strongly support to have CEPH own data format instead > > of relying on filesystem. > > > > Regards, > > James > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 1:55 PM > > To: Somnath Roy > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > > Sage, > > > I fully support that. If we want to saturate SSDs , we need to get > > > rid of this filesystem overhead (which I am in process of measuring). > > > Also, it will be good if we can eliminate the dependency on the k/v > > > dbs (for storing allocators and all). The reason is the unknown write > > > amps they causes. > > > > My hope is to keep behing the KeyValueDB interface (and/more change it as > > appropriate) so that other backends can be easily swapped in (e.g. a > > btree-based one for high-end flash). > > > > sage > > > > > > > > > > Thanks & Regards > > > Somnath > > > > > > > > > -Original Message- > > > From: ceph-devel-ow...@vger.kernel.org > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > > Sent: Monday, October 19, 2015 12:49 PM > > > To: ceph-devel@vger.kernel.org > > > Subject: newstore direction > > > > > > The current design is based on two simple ideas: > > > > > > 1) a key/value interface is better way to manage all of our internal > > > metadata (object metadata, attrs, layout, collection membership, > > > write-ahead logging, overlay data, etc.) > > > > > > 2) a file system is well suited for storage object data (as files). > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > > few > > > things: > > > > > > - We currently write the data to the file, fsync, then commit the kv > > > transaction. 
That's at least 3 IOs: one for the data, one for the fs > > > journal, one for the kv txn to commit (at least once my rocksdb > > > changes land... the kv commit is currently 2-3). So two people are > > > managing metadata, here: the fs managing the file metadata (with its > > > own > > > journal) and the kv backend (with its journal). > > > > > > - On read we have to open files by name, which means traversing the fs > > > namespace. Newstore tries to keep it as flat and simple as possibl
Re: newstore direction
On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil wrote: > On Tue, 20 Oct 2015, Ric Wheeler wrote: >> The big problem with consuming block devices directly is that you ultimately >> end up recreating most of the features that you had in the file system. Even >> enterprise databases like Oracle and DB2 have been migrating away from >> running >> on raw block devices in favor of file systems over time. In effect, you are >> looking at making a simple on disk file system which is always easier to >> start >> than it is to get back to a stable, production ready state. > > This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had > everything we were implementing and more: mainly, copy on write and data > checksums. But in practice the fact that it is general purpose means it > targets very different workloads and APIs than what we need. Try 7 years since ebofs... That's one of my concerns, though. You ditched ebofs once already because it had metastasized into an entire FS, and had reached its limits of maintainability. What makes you think a second time through would work better? :/ On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil wrote: > - 2 IOs for most: one to write the data to unused space in the block > device, one to commit our transaction (vs 4+ before). For overwrites, > we'd have one io to do our write-ahead log (kv journal), then do > the overwrite async (vs 4+ before). I can't work this one out. If you're doing one write for the data and one for the kv journal (which is on another filesystem), how does the commit sequence work that it's only 2 IOs instead of the same 3 we already have? Or are you planning to ditch the LevelDB/RocksDB store for our journaling and just use something within the block layer? If we do want to go down this road, we shouldn't need to write an allocator from scratch.
I don't remember exactly which ones it is but we've read/seen at least a few storage papers where people have reused existing allocators — I think the one from ext2? And somebody managed to get it running in userspace. Of course, then we also need to figure out how to get checksums on the block data, since if we're going to put in the effort to reimplement this much of the stack we'd better get our full data integrity guarantees along with it! On Tue, Oct 20, 2015 at 1:00 PM, Sage Weil wrote: > On Tue, 20 Oct 2015, John Spray wrote: >> On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote: >> > - We have to size the kv backend storage (probably still an XFS >> > partition) vs the block storage. Maybe we do this anyway (put metadata on >> > SSD!) so it won't matter. But what happens when we are storing gobs of >> > rgw index data or cephfs metadata? Suddenly we are pulling storage out of >> > a different pool and those aren't currently fungible. >> >> This is the concerning bit for me -- the other parts one "just" has to >> get the code right, but this problem could linger and be something we >> have to keep explaining to users indefinitely. It reminds me of cases >> in other systems where users had to make an educated guess about inode >> size up front, depending on whether you're expecting to efficiently >> store a lot of xattrs. >> >> In practice it's rare for users to make these kinds of decisions well >> up-front: it really needs to be adjustable later, ideally >> automatically. That could be pretty straightforward if the KV part >> was stored directly on block storage, instead of having XFS in the >> mix. I'm not quite up with the state of the art in this area: are >> there any reasonable alternatives for the KV part that would consume >> some defined range of a block device from userspace, instead of >> sitting on top of a filesystem? > > I agree: this is my primary concern with the raw block approach. 
> > There are some KV alternatives that could consume block, but the problem > would be similar: we need to dynamically size up or down the kv portion of > the device. > > I see two basic options: > > 1) Wire into the Env abstraction in rocksdb to provide something just > smart enough to let rocksdb work. It isn't much: named files (not that > many--we could easily keep the file table in ram), always written > sequentially, to be read later with random access. All of the code is > written around abstractions of SequentialFileWriter so that everything > posix is neatly hidden in env_posix (and there are various other env > implementations for in-memory mock tests etc.). This seems like the obviously correct move to me? Except we might want to include the rocksdb store on flash instead of hard drives, which means maybe we do want some unified storage system which can handle multiple physical storage devices as a single piece of storage space. (Not that any of those exist in "almost done" hell, or that we're going through requirements expansion or anything!) -Greg
Re: newstore direction
On 10/20/2015 03:44 PM, Sage Weil wrote: On Tue, 20 Oct 2015, Ric Wheeler wrote: On 10/19/2015 03:49 PM, Sage Weil wrote: The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. Surely, yes, but the fact remains we are maintaining two journals: one internal to the fs that manages the allocation metadata, and one layered on top that handles the kv store's write stream. The lower bound on any write is 3 IOs (unless we're talking about a COW fs). The way storage devices work means that if we can batch these in some way, we might get 3 IO's that land in the cache (even for spinning drives) and one that is followed by a cache flush. The first three IO's are quite quick, you don't need to write through to the platter. The cost is mostly in the fsync() call which waits until storage destages the cache to the platter. With SSD's, we have some different considerations. - On read we have to open files by name, which means traversing the fs namespace. 
Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. I wish you luck convincing upstream to allow unprivileged access to open_by_handle or the XFS ioctl. :) But even if we had that, any object access requires multiple metadata lookups: one in our kv db, and a second to get the inode for the backing file. Again, there's an unnecessary lower bound on the number of IOs needed to access a cold object. We should dig into what this actually means when you can do open by handle. If you cache the inode (i.e., skip the directory traversal), you still need to figure out the mapping back to an actual block on the storage device. Not clear to me that you need more IO's with the file system doing this or by having a btree on disk - both will require IO. - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. Are you using O_DIRECT? Seems like there should be some enterprisey database tricks that we can use here. It's not about the data path, but avoiding the useless bookkeeping the file system is doing that we don't want or need. See the recent reception of Zach's O_NOCMTIME patches on linux-fsdevel: http://marc.info/?t=14309496981=1=2 I'm generally an optimist when it comes to introducing new APIs upstream, but I still found this to be an unbelievably frustrating exchange. We should talk more about this with the local FS people. Might be other ways to solve this. - XFS is (probably) never going to give us data checksums, which we want desperately. What is the goal of having the file system do the checksums? 
How strong do they need to be and what size are the chunks? If you update this on each IO, this will certainly generate more IO (each write will possibly generate at least one other write to update that new checksum). Not if we keep the checksums with the allocation metadata, in the onode/inode, which we're already doing an IO to persist. But whether that is practical depends on the granularity (4KB or 16K or 128K or ...), which may in turn depend on the object (RBD block that'll service random 4K reads and writes? or RGW fragment that is always written sequentially?). I'm highly skeptical we'd ever get anything from a general-purpose file system that would work well here (if anything at all). XFS (or device mapper) could also store checksums per block. I think that the T10 DIF/DIX bits work for enterprise databases (again, bypassing the file system). Might be interesting to see if we could put the checksums into dm-thin. But what's the alternative? My thought is to just bite the bullet and consume a raw
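The idea of keeping per-chunk checksums with the allocation metadata in the onode, so that verifying a read costs no extra IO, can be sketched in a few lines. This is only an illustration of the scheme being debated: the 4 KB chunk size and the use of crc32 are placeholder assumptions, not anything the thread settles on.

```python
# Hedged sketch: per-chunk crcs computed at write time and stored in the
# same onode record we already persist; reads recompute and compare.
import zlib

CHUNK = 4096  # illustrative granularity -- the 4K vs 16K vs 128K tradeoff is open

def chunk_crcs(data):
    """Per-chunk crc32s, to ride along with the extent metadata in the onode."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, crcs):
    """On read, recompute the chunk crcs and compare against the stored ones."""
    return chunk_crcs(data) == crcs
```

A coarser chunk size shrinks the metadata (good for RGW's sequential writes) at the cost of reading whole chunks to verify small random reads (bad for RBD's 4K IOs), which is exactly the granularity question raised above.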
Re: newstore direction
On Tue, 20 Oct 2015, Gregory Farnum wrote: > On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil wrote: > > On Tue, 20 Oct 2015, Ric Wheeler wrote: > >> The big problem with consuming block devices directly is that you > >> ultimately > >> end up recreating most of the features that you had in the file system. > >> Even > >> enterprise databases like Oracle and DB2 have been migrating away from > >> running > >> on raw block devices in favor of file systems over time. In effect, you > >> are > >> looking at making a simple on disk file system which is always easier to > >> start > >> than it is to get back to a stable, production ready state. > > > > This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had > > everything we were implementing and more: mainly, copy on write and data > > checksums. But in practice the fact that it's general purpose means it > > targets very different workloads and APIs than what we need. > > Try 7 years since ebofs... Sigh... > That's one of my concerns, though. You ditched ebofs once already > because it had metastasized into an entire FS, and had reached its > limits of maintainability. What makes you think a second time through > would work better? :/ A fair point, and I've given this some thought: 1) We know a *lot* more about our workload than I did in 2005. The things I was worrying about then (fragmentation, mainly) are much easier to address now, where we have hints from rados and understand what the write patterns look like in practice (randomish 4k-128k ios for rbd, sequential writes for rgw, and the cephfs wildcard). 2) Most of the ebofs effort was around doing copy-on-write btrees (with checksums) and orchestrating commits. Here our job is *vastly* simplified by assuming the existence of a transactional key/value store. If you look at newstore today, we're already half-way through dealing with the complexity of doing allocations... 
we're essentially "allocating" blocks that are 1 MB files on XFS, managing that metadata, and overwriting or replacing those blocks on write/truncate/clone. By the time we add in an allocator (get_blocks(len), free_block(offset, len)) and rip out all the file handling fiddling (like fsync workqueues, file id allocator, file truncation fiddling, etc.) we'll probably have something working with about the same amount of code we have now. (Of course, that'll grow as we get more sophisticated, but that'll happen either way.) > On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil wrote: > > - 2 IOs for most: one to write the data to unused space in the block > > device, one to commit our transaction (vs 4+ before). For overwrites, > > we'd have one io to do our write-ahead log (kv journal), then do > > the overwrite async (vs 4+ before). > > I can't work this one out. If you're doing one write for the data and > one for the kv journal (which is on another filesystem), how does the > commit sequence work that it's only 2 IOs instead of the same 3 we > already have? Or are you planning to ditch the LevelDB/RocksDB store > for our journaling and just use something within the block layer?

Now:
  1 io to write a new file
  1-2 ios to sync the fs journal (commit the inode, alloc change)
      (I see 2 journal IOs on XFS and only 1 on ext4...)
  1 io to commit the rocksdb journal (currently 3, but will drop to 1 with xfs fix and my rocksdb change)

With block:
  1 io to write to block device
  1 io to commit to rocksdb journal

> If we do want to go down this road, we shouldn't need to write an > allocator from scratch. I don't remember exactly which ones it is but > we've read/seen at least a few storage papers where people have reused > existing allocators -- I think the one from ext2? And somebody managed > to get it running in userspace. Maybe, but the real win is when we combine the allocator state update with our kv transaction. 
Even if we adopt an existing algorithm we'll need to do some significant rejiggering to persist it in the kv store. My thought is start with something simple that works (e.g., linear sweep over free space, simple interval_set<>-style freelist) and once it works look at existing state of the art for a clever v2. BTW, I suspect a modest win here would be to simply use the collection/pg as a hint for storing related objects. That's the best indicator we have for aligned lifecycle (think PG migrations/deletions vs flash erase blocks). Good luck plumbing that through XFS... > Of course, then we also need to figure out how to get checksums on the > block data, since if we're going to put in the effort to reimplement > this much of the stack we'd better get our full data integrity > guarantees along with it! YES! Here I think we should make judicious use of the rados hints. For example, rgw always writes complete objects, so we can have coarse granularity crcs and only pay for very small reads (that have
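The "simple interval_set<>-style freelist" with a linear first-fit sweep that Sage describes as a starting point might look roughly like this. A Python sketch only; the real version would be a C++ interval_set persisted in the kv store, and the names below are invented.

```python
# Hedged sketch of a first-fit extent allocator over a sorted freelist of
# (offset -> length) free extents, with coalescing on release.

class FreelistAllocator:
    def __init__(self, device_size):
        # start with one free extent covering the whole device
        self.free = {0: device_size}  # offset -> length

    def allocate(self, length):
        """Linear sweep over free extents; first fit wins."""
        for off in sorted(self.free):
            flen = self.free[off]
            if flen >= length:
                del self.free[off]
                if flen > length:
                    self.free[off + length] = flen - length  # keep the tail
                return off
        raise MemoryError("no free extent of %d bytes" % length)

    def release(self, offset, length):
        """Return an extent and merge adjacent free extents."""
        self.free[offset] = length
        merged = {}
        for off in sorted(self.free):
            if merged:
                last = max(merged)
                if last + merged[last] == off:   # contiguous with previous
                    merged[last] += self.free[off]
                    continue
            merged[off] = self.free[off]
        self.free = merged
```

The win Sage points at is that `self.free` would be mutated inside the same kv transaction that commits the object metadata, so allocator state and onode state can never disagree after a crash.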
Re: newstore direction
On Tue, 20 Oct 2015, John Spray wrote: > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote: > > - We have to size the kv backend storage (probably still an XFS > > partition) vs the block storage. Maybe we do this anyway (put metadata on > > SSD!) so it won't matter. But what happens when we are storing gobs of > > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > > a different pool and those aren't currently fungible. > > This is the concerning bit for me -- the other parts one "just" has to > get the code right, but this problem could linger and be something we > have to keep explaining to users indefinitely. It reminds me of cases > in other systems where users had to make an educated guess about inode > size up front, depending on whether you're expecting to efficiently > store a lot of xattrs. > > In practice it's rare for users to make these kinds of decisions well > up-front: it really needs to be adjustable later, ideally > automatically. That could be pretty straightforward if the KV part > was stored directly on block storage, instead of having XFS in the > mix. I'm not quite up with the state of the art in this area: are > there any reasonable alternatives for the KV part that would consume > some defined range of a block device from userspace, instead of > sitting on top of a filesystem? I agree: this is my primary concern with the raw block approach. There are some KV alternatives that could consume block, but the problem would be similar: we need to dynamically size up or down the kv portion of the device. I see two basic options: 1) Wire into the Env abstraction in rocksdb to provide something just smart enough to let rocksdb work. It isn't much: named files (not that many--we could easily keep the file table in ram), always written sequentially, to be read later with random access. 
All of the code is written around abstractions of SequentialFileWriter so that everything posix is neatly hidden in env_posix (and there are various other env implementations for in-memory mock tests etc.). 2) Use something like dm-thin to sit between the raw block device and XFS (for rocksdb) and the block device consumed by newstore. As long as XFS doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb files in their entirety) we can fstrim and size down the fs portion. If we similarly make newstore's allocator stick to large blocks only we would be able to size down the block portion as well. Typical dm-thin block sizes seem to range from 64KB to 512KB, which seems reasonable enough to me. In fact, we could likely just size the fs volume at something conservatively large (like 90%) and rely on -o discard or periodic fstrim to keep its actual utilization in check. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: newstore direction
On Tue, Oct 20, 2015 at 11:31 AM, Ric Wheeler wrote: > On 10/19/2015 03:49 PM, Sage Weil wrote: >> >> The current design is based on two simple ideas: >> >> 1) a key/value interface is a better way to manage all of our internal >> metadata (object metadata, attrs, layout, collection membership, >> write-ahead logging, overlay data, etc.) >> >> 2) a file system is well suited for storing object data (as files). >> >> So far 1 is working out well, but I'm questioning the wisdom of #2. A few >> things: >> >> - We currently write the data to the file, fsync, then commit the kv >> transaction. That's at least 3 IOs: one for the data, one for the fs >> journal, one for the kv txn to commit (at least once my rocksdb changes >> land... the kv commit is currently 2-3). So two people are managing >> metadata, here: the fs managing the file metadata (with its own >> journal) and the kv backend (with its journal). > > > If all of the fsync()'s fall into the same backing file system, are you sure > that each fsync() takes the same time? Depending on the local FS > implementation of course, but the order of issuing those fsync()'s can > effectively make some of them no-ops. > >> >> - On read we have to open files by name, which means traversing the fs >> namespace. Newstore tries to keep it as flat and simple as possible, but >> at a minimum it is a couple btree lookups. We'd love to use open by >> handle (which would reduce this to 1 btree traversal), but running >> the daemon as ceph and not root makes that hard... > > > This seems like a pretty low hurdle to overcome. > >> >> - ...and file systems insist on updating mtime on writes, even when it >> is >> an overwrite with no allocation changes. (We don't care about mtime.) >> O_NOCMTIME patches exist but it is hard to get these past the kernel >> brainfreeze. > > > Are you using O_DIRECT? Seems like there should be some enterprisey database > tricks that we can use here. 
> >> >> - XFS is (probably) never going to give us data checksums, which >> we >> want desperately. > > > What is the goal of having the file system do the checksums? How strong do > they need to be and what size are the chunks? > > If you update this on each IO, this will certainly generate more IO (each > write will possibly generate at least one other write to update that new > checksum). > >> >> But what's the alternative? My thought is to just bite the bullet and >> consume a raw block device directly. Write an allocator, hopefully keep >> it pretty simple, and manage it in kv store along with all of our other >> metadata. > > > The big problem with consuming block devices directly is that you ultimately > end up recreating most of the features that you had in the file system. Even > enterprise databases like Oracle and DB2 have been migrating away from > running on raw block devices in favor of file systems over time. In effect, > you are looking at making a simple on disk file system which is always > easier to start than it is to get back to a stable, production ready state. > > I think that it might be quicker and more maintainable to spend some time > working with the local file system people (XFS or other) to see if we can > jointly address the concerns you have. > >> >> Wins: >> >> - 2 IOs for most: one to write the data to unused space in the block >> device, one to commit our transaction (vs 4+ before). For overwrites, >> we'd have one io to do our write-ahead log (kv journal), then do >> the overwrite async (vs 4+ before). >> >> - No concern about mtime getting in the way >> >> - Faster reads (no fs lookup) >> >> - Similarly sized metadata for most objects. If we assume most objects >> are not fragmented, then the metadata to store the block offsets is about >> the same size as the metadata to store the filenames we have now. 
>> >> Problems: >> >> - We have to size the kv backend storage (probably still an XFS >> partition) vs the block storage. Maybe we do this anyway (put metadata on >> SSD!) so it won't matter. But what happens when we are storing gobs of >> rgw index data or cephfs metadata? Suddenly we are pulling storage out of >> a different pool and those aren't currently fungible. >> >> - We have to write and maintain an allocator. I'm still optimistic this >> can be reasonably simple, especially for the flash case (where >> fragmentation isn't such an issue as long as our blocks are reasonably >> sized). For disk we may need to be moderately clever. >> >> - We'll need a fsck to ensure our internal metadata is consistent. The >> good news is it'll just need to validate what we have stored in the kv >> store. >> >> Other thoughts: >> >> - We might want to consider whether dm-thin or bcache or other block >> layers might help us with elasticity of file vs block areas. >> >> - Rocksdb can push colder data to a second directory, so we could have a >> fast ssd primary area (for wal and most metadata) and a
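The "2 IOs for most" win from the list above can be sketched end to end: write the data into unused space on the device, then record both the allocation and the object metadata in a single kv commit. The classes and the trivial bump allocator below are invented stand-ins for the real block device and rocksdb, purely to show the sequencing.

```python
# Hedged sketch of the two-IO write path: (1) data lands in never-overwritten
# free space, (2) one transactional kv commit covers onode + allocation.

class BlockStoreSketch:
    def __init__(self, size, block=4096):
        self.device = bytearray(size)   # stands in for the raw block device
        self.kv = {}                    # stands in for rocksdb
        self.next_free = 0              # trivial bump allocator, sketch only
        self.block = block

    def write_object(self, name, data):
        # IO #1: write the data to unused space (never an in-place overwrite)
        off = self.next_free
        self.device[off:off + len(data)] = data
        blocks = (len(data) + self.block - 1) // self.block
        self.next_free = off + blocks * self.block
        # IO #2: one atomic kv commit recording extent + length for the object
        self.kv["onode/" + name] = (off, len(data))

    def read_object(self, name):
        off, length = self.kv["onode/" + name]
        return bytes(self.device[off:off + length])
```

If the crash happens between the two IOs, the data write is simply unreferenced free space, which is why no fs journal is needed in between; that is the IO the current file-based path cannot avoid.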
RE: newstore direction
Varada, Hopefully it will answer your question too. It is going to be a new type of key value device rather than a traditional hard drive based OSD device. It will have its own storage stack rather than the traditional block based storage stack. I have to admit it is a little bit more aggressive than the block based approach. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI Sent: Tuesday, October 20, 2015 1:33 PM To: Sage Weil Cc: Somnath Roy; ceph-devel@vger.kernel.org Subject: RE: newstore direction Hi Sage, Sorry for confusing you. SSDs with key value interfaces are still under development by several vendors. It has a totally different design approach than Open Channel SSD. I met Matias several months ago and discussed possibilities to have key value interface support with Open Channel SSD. I am not following the progress since then. If Matias is in this group, he can definitely give us better explanations. Here is his presentation for key value support with open channel SSD for your reference. http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf Regards, James -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, October 20, 2015 5:34 AM To: James (Fei) Liu-SSI Cc: Somnath Roy; ceph-devel@vger.kernel.org Subject: RE: newstore direction On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote: > Hi Sage and Somnath, > In my humble opinion, there is another more aggressive solution than a > raw block device based keyvalue store as a backend for objectstore. The > new key value SSD device with transaction support would be ideal to > solve the issues. First of all, it is a raw SSD device. Secondly, it > provides a key value interface directly from the SSD. Thirdly, it can > provide transaction support; consistency will be guaranteed by the hardware > device. 
> It pretty much satisfies all of objectstore's needs without any extra > overhead since there is not any extra layer in between the device and > objectstore. Are you talking about open channel SSDs? Or something else? Everything I'm familiar with that is currently shipping is exposing a vanilla block interface (conventional SSDs) that hides all of that or NVMe (which isn't much better). If there is a low-level KV interface we can consume that would be great--especially if we can glue it to our KeyValueDB abstract API. Even so, we need to make sure that the object *data* also has an efficient API we can utilize that efficiently handles block-sized/aligned data. sage > Either way, I strongly support having Ceph's own data format instead > of relying on a filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM > To: Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > Sage, > > I fully support that. If we want to saturate SSDs, we need to get > > rid of this filesystem overhead (which I am in the process of measuring). > > Also, it will be good if we can eliminate the dependency on the k/v > > dbs (for storing allocators and all). The reason is the unknown > > write amps they cause. > > My hope is to keep behind the KeyValueDB interface (and/or change it > as > appropriate) so that other backends can be easily swapped in (e.g. a > btree-based one for high-end flash). 
> > sage > > > > > > Thanks & Regards > > Somnath > > > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 12:49 PM > > To: ceph-devel@vger.kernel.org > > Subject: newstore direction > > > > The current design is based on two simple ideas: > > > > 1) a key/value interface is better way to manage all of our > > internal metadata (object metadata, attrs, layout, collection > > membership, write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storage object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. > > A few > > things: > > > > - We currently write the data to the file, fsync, then commit the > > kv transaction. That's at least 3 IOs: one for the data, one for > > the fs journal, one for the kv txn to commit (at least once my > > rocksdb changes land... the kv commit is currently 2-3). So two > > people are managing metadata
RE: newstore direction
On Tue, 20 Oct 2015, James (Fei) Liu-SSI wrote: > Hi Sage, >Sorry for confusing you. SSDs with key value interfaces are still > under development by several vendors. It has totally different design > approach than Open Channel SSD. I met Matias several months ago and > discussed about possibilities to have key value interface support with > Open Channel SSD . I am not following the progress since then. If Matias > is in this group, He will definitely can give us better explanations. > Here is his presentation for key value support with open channel SSD for > your reference. > > http://events.linuxfoundation.org/sites/events/files/slides/LightNVM-Vault2015.pdf Ok cool. I saw Matias' talk at Vault and was very pleased to see that there is some real effort to get away from black box FTLs. And I am eagerly awaiting the arrival of SSDs with a kv interface... open channel especially, but even proprietary devices exposing kv would be an improvement over proprietary devices exposing block. :) sage > > > Regards, > James > > -Original Message- > From: Sage Weil [mailto:sw...@redhat.com] > Sent: Tuesday, October 20, 2015 5:34 AM > To: James (Fei) Liu-SSI > Cc: Somnath Roy; ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, James (Fei) Liu-SSI wrote: > > Hi Sage and Somnath, > > In my humble opinion, There is another more aggressive solution than > > raw block device base keyvalue store as backend for objectstore. The > > new key value SSD device with transaction support would be ideal to > > solve the issues. First of all, it is raw SSD device. Secondly , It > > provides key value interface directly from SSD. Thirdly, it can > > provide transaction support, consistency will be guaranteed by hardware > > device. > > It pretty much satisfied all of objectstore needs without any extra > > overhead since there is not any extra layer in between device and > > objectstore. > > Are you talking about open channel SSDs? Or something else? 
Everything I'm > familiar with that is currently shipping is exposing a vanilla block > interface (conventional SSDs) that hides all of that or NVMe (which isn't > much better). > > If there is a low-level KV interface we can consume that would be > great--especially if we can glue it to our KeyValueDB abstract API. Even so, > we need to make sure that the object *data* also has an efficient API we can > utilize that efficiently handles block-sized/aligned data. > > sage > > > >Either way, I strongly support to have CEPH own data format instead > > of relying on filesystem. > > > > Regards, > > James > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 1:55 PM > > To: Somnath Roy > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > > Sage, > > > I fully support that. If we want to saturate SSDs , we need to get > > > rid of this filesystem overhead (which I am in process of measuring). > > > Also, it will be good if we can eliminate the dependency on the k/v > > > dbs (for storing allocators and all). The reason is the unknown > > > write amps they causes. > > > > My hope is to keep behing the KeyValueDB interface (and/more change it > > as > > appropriate) so that other backends can be easily swapped in (e.g. a > > btree-based one for high-end flash). 
> > > > sage > > > > > > > > > > Thanks & Regards > > > Somnath > > > > > > > > > -Original Message- > > > From: ceph-devel-ow...@vger.kernel.org > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > > Sent: Monday, October 19, 2015 12:49 PM > > > To: ceph-devel@vger.kernel.org > > > Subject: newstore direction > > > > > > The current design is based on two simple ideas: > > > > > > 1) a key/value interface is better way to manage all of our > > > internal metadata (object metadata, attrs, layout, collection > > > membership, write-ahead logging, overlay data, etc.) > > > > > > 2) a file system is well suited for storage object data (as files). > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. > > > A few > > > things: > > > > > > - We curren
Re: newstore direction
We mostly assumed that sort-of transactional file systems, perhaps hosted in user space, were the most tractable trajectory. I have seen newstore and keyvalue store as essentially congruent approaches using database primitives (and I am interested in what you make of Russell Sears). I'm skeptical of any hope of keeping things "simple." Like Martin downthread, most systems I have seen (filers, ZFS) make use of a fast, durable commit log and then flex out...something else. -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-707-0660 fax. 734-769-8938 cel. 734-216-5309 - Original Message - > From: "Sage Weil" <sw...@redhat.com> > To: "John Spray" <jsp...@redhat.com> > Cc: "Ceph Development" <ceph-devel@vger.kernel.org> > Sent: Tuesday, October 20, 2015 4:00:23 PM > Subject: Re: newstore direction > > On Tue, 20 Oct 2015, John Spray wrote: > > On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil <sw...@redhat.com> wrote: > > > - We have to size the kv backend storage (probably still an XFS > > > partition) vs the block storage. Maybe we do this anyway (put metadata > > > on > > > SSD!) so it won't matter. But what happens when we are storing gobs of > > > rgw index data or cephfs metadata? Suddenly we are pulling storage out > > > of > > > a different pool and those aren't currently fungible. > > > > This is the concerning bit for me -- the other parts one "just" has to > > get the code right, but this problem could linger and be something we > > have to keep explaining to users indefinitely. It reminds me of cases > > in other systems where users had to make an educated guess about inode > > size up front, depending on whether you're expecting to efficiently > > store a lot of xattrs. > > > > In practice it's rare for users to make these kinds of decisions well > > up-front: it really needs to be adjustable later, ideally > > automatically. 
That could be pretty straightforward if the KV part > > was stored directly on block storage, instead of having XFS in the > > mix. I'm not quite up with the state of the art in this area: are > > there any reasonable alternatives for the KV part that would consume > > some defined range of a block device from userspace, instead of > > sitting on top of a filesystem? > > I agree: this is my primary concern with the raw block approach. > > There are some KV alternatives that could consume block, but the problem > would be similar: we need to dynamically size up or down the kv portion of > the device. > > I see two basic options: > > 1) Wire into the Env abstraction in rocksdb to provide something just > smart enough to let rocksdb work. It isn't much: named files (not that > many--we could easily keep the file table in ram), always written > sequentially, to be read later with random access. All of the code is > written around abstractions of SequentialFileWriter so that everything > posix is neatly hidden in env_posix (and there are various other env > implementations for in-memory mock tests etc.). > > 2) Use something like dm-thin to sit between the raw block device and XFS > (for rocksdb) and the block device consumed by newstore. As long as XFS > doesn't fragment horrifically (it shouldn't, given we *always* write ~4mb > files in their entirety) we can fstrim and size down the fs portion. If > we similarly make newstores allocator stick to large blocks only we would > be able to size down the block portion as well. Typical dm-thin block > sizes seem to range from 64KB to 512KB, which seems reasonable enough to > me. In fact, we could likely just size the fs volume at something > conservatively large (like 90%) and rely on -o discard or periodic fstrim > to keep its actual utilization in check. 
> > sage
Re: newstore direction
On 10/20/2015 05:47 PM, Sage Weil wrote: On Tue, 20 Oct 2015, Gregory Farnum wrote: On Tue, Oct 20, 2015 at 12:44 PM, Sage Weil wrote: On Tue, 20 Oct 2015, Ric Wheeler wrote: The big problem with consuming block devices directly is that you ultimately end up recreating most of the features that you had in the file system. Even enterprise databases like Oracle and DB2 have been migrating away from running on raw block devices in favor of file systems over time. In effect, you are looking at making a simple on disk file system which is always easier to start than it is to get back to a stable, production ready state. This was why we abandoned ebofs ~4 years ago... btrfs had arrived and had everything we were implementing and more: mainly, copy on write and data checksums. But in practice the fact that it's general purpose means it targets very different workloads and APIs than what we need. Try 7 years since ebofs... Sigh... That's one of my concerns, though. You ditched ebofs once already because it had metastasized into an entire FS, and had reached its limits of maintainability. What makes you think a second time through would work better? :/ A fair point, and I've given this some thought: 1) We know a *lot* more about our workload than I did in 2005. The things I was worrying about then (fragmentation, mainly) are much easier to address now, where we have hints from rados and understand what the write patterns look like in practice (randomish 4k-128k ios for rbd, sequential writes for rgw, and the cephfs wildcard). 2) Most of the ebofs effort was around doing copy-on-write btrees (with checksums) and orchestrating commits. Here our job is *vastly* simplified by assuming the existence of a transactional key/value store. If you look at newstore today, we're already half-way through dealing with the complexity of doing allocations... 
we're essentially "allocating" blocks that are 1 MB files on XFS, managing that metadata, and overwriting or replacing those blocks on write/truncate/clone. By the time we add in an allocator (get_blocks(len), free_block(offset, len)) and rip out all the file handling fiddling (like fsync workqueues, file id allocator, file truncation fiddling, etc.) we'll probably have something working with about the same amount of code we have now. (Of course, that'll grow as we get more sophisticated, but that'll happen either way.) On Mon, Oct 19, 2015 at 12:49 PM, Sage Weil wrote: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). I can't work this one out. If you're doing one write for the data and one for the kv journal (which is on another filesystem), how does the commit sequence work such that it's only 2 IOs instead of the same 3 we already have? Or are you planning to ditch the LevelDB/RocksDB store for our journaling and just use something within the block layer? Now:
1 io to write a new file
1-2 ios to sync the fs journal (commit the inode, alloc change) (I see 2 journal IOs on XFS and only 1 on ext4...)
1 io to commit the rocksdb journal (currently 3, but will drop to 1 with the xfs fix and my rocksdb change)
I think that might be too pessimistic - the number of discrete IOs sent down to a spinning disk makes much less impact on performance than the number of fsync()s, since the IOs all land in the write cache. Some newer spinning drives have a non-volatile write cache, so even an fsync() might not end up doing the expensive data transfer to the platter. It would be interesting to get the timings on the IOs you see to measure the actual impact.
With block:
1 io to write to block device
1 io to commit to rocksdb journal
If we do want to go down this road, we shouldn't need to write an allocator from scratch. I don't remember exactly which one it was, but we've read/seen at least a few storage papers where people have reused existing allocators? I think the one from ext2? And somebody managed to get it running in userspace. Maybe, but the real win is when we combine the allocator state update with our kv transaction. Even if we adopt an existing algorithm we'll need to do some significant rejiggering to persist it in the kv store. My thought is to start with something simple that works (e.g., linear sweep over free space, simple interval_set<>-style freelist) and once it works look at the existing state of the art for a clever v2. BTW, I suspect a modest win here would be to simply use the collection/pg as a hint for storing related objects. That's the best indicator we have for aligned lifecycle (think PG migrations/deletions vs flash erase blocks). Good luck plumbing that through XFS... Of course, then we also need to figure out how to get checksums on the block data,
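The "combine the allocator state update with our kv transaction" point is the crux of the two-IO claim, and it can be sketched end to end. Below is a toy model in Python (not Ceph code — the `bytearray` stands in for the raw block device, a dict for the transactional kv store, and all names are hypothetical): data is written into unused space first, then a single kv commit atomically records both the object's extent and the allocator change, so a crash between the two steps leaks nothing.

```python
class BlockStore:
    """Toy model of the proposed scheme: object data goes straight to
    the raw device; all metadata (object extents + free list) lives in
    a single transactional kv store."""

    def __init__(self, size, block=4096):
        self.dev = bytearray(size)       # stands in for the raw block device
        self.block = block               # allocation unit
        self.kv = {"free": [(0, size)]}  # simulated transactional kv store

    def _allocate(self, length):
        """First-fit: find a free extent big enough, rounded up to blocks."""
        need = -(-length // self.block) * self.block
        for i, (off, ln) in enumerate(self.kv["free"]):
            if ln >= need:
                return i, off, need
        raise IOError("ENOSPC")

    def write(self, name, data):
        i, off, need = self._allocate(len(data))
        # IO #1: data lands in unused space; no metadata is touched yet,
        # so a crash here leaks nothing (the extent is still marked free).
        self.dev[off:off + len(data)] = data
        # IO #2: one atomic kv commit carrying BOTH the object's extent
        # and the allocator update -- the "combine the allocator state
        # update with our kv transaction" win.
        txn = dict(self.kv)
        fo, fl = txn["free"][i]
        rest = [(fo + need, fl - need)] if fl > need else []
        txn["free"] = txn["free"][:i] + rest + txn["free"][i + 1:]
        txn["obj:" + name] = (off, len(data))
        self.kv = txn  # commit point

    def read(self, name):
        off, length = self.kv["obj:" + name]
        return bytes(self.dev[off:off + length])
```

For an overwrite, the same kv commit would instead carry the write-ahead log entry, with the in-place write deferred, matching the "wal then async overwrite" path described above.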
Re: newstore direction
On 10/20/2015 07:30 AM, Sage Weil wrote: On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: +1. Nowadays K-V DBs care more about very small key-value pairs, say several bytes to a few KB, but in the SSD case we only care about 4KB or 8KB. In this way, NVMKV is a good design, and it seems some of the SSD vendors are also trying to build this kind of interface; we had an NVM-L library but it is still under development. Do you have an NVMKV link? I see a paper and a stale github repo.. not sure if I'm looking at the right thing. My concern with using a key/value interface for the object data is that you end up with lots of key/value pairs (e.g., $inode_$offset = $4kb_of_data) that are pretty inefficient to store and (depending on the implementation) tend to break alignment. I don't think these interfaces are targeted toward block-sized/aligned payloads. Storing just the metadata (block allocation map) w/ the kv api and storing the data directly on a block/page interface makes more sense to me. sage I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left. Nisha Talagala is now out at Parallel Systems, for instance. http://pmem.io might be a better bet, though I haven't looked closely at it. Mark -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI Sent: Tuesday, October 20, 2015 6:21 AM To: Sage Weil; Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction Hi Sage and Somnath, In my humble opinion, there is another, more aggressive solution than a raw-block-device-based key/value store as the backend for the objectstore: a new key/value SSD device with transaction support would be ideal to solve these issues. First of all, it is a raw SSD device. Secondly, it provides a key/value interface directly from the SSD. Thirdly, it can provide transaction support; consistency will be guaranteed by the hardware device.
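Sage's `$inode_$offset = $4kb_of_data` concern can be put in numbers: keeping the data itself in the kv store means one key/value pair per 4 KB of data, while keeping only the block allocation map in the kv store needs a single extent entry for an unfragmented object. A rough illustration (the key scheme below is hypothetical, just to count entries):

```python
OBJECT_SIZE = 4 * 1024 * 1024   # a typical 4 MB rados object
BLOCK = 4096
INODE = 42                      # hypothetical object id

# Scheme 1: data lives in the kv store, one pair per 4 KB block.
kv_data_keys = ["%d_%d" % (INODE, off) for off in range(0, OBJECT_SIZE, BLOCK)]

# Scheme 2: the kv store holds only the allocation map; data goes to
# raw blocks.  An unfragmented object needs one (offset, length) entry.
kv_extent_keys = ["extent_%d" % INODE]

# 1024 kv pairs (plus per-pair overhead, plus broken alignment) vs 1.
print(len(kv_data_keys), len(kv_extent_keys))
```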
It pretty much satisfies all of the objectstore's needs without any extra overhead, since there is not any extra layer in between the device and the objectstore. Either way, I strongly support having Ceph's own data format instead of relying on a filesystem. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 1:55 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction On Mon, 19 Oct 2015, Somnath Roy wrote: Sage, I fully support that. If we want to saturate SSDs, we need to get rid of this filesystem overhead (which I am in the process of measuring). Also, it will be good if we can eliminate the dependency on the k/v dbs (for storing allocators and all). The reason is the unknown write amplification they cause. My hope is to stay behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash). sage Thanks & Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 12:49 PM To: ceph-devel@vger.kernel.org Subject: newstore direction The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3).
So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple of btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. - XFS is (probably) never going to give us data checksums, which we want desperately. But what's the alternative? My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata. Wins: - 2 IOs for most: one to write the
RE: newstore direction
Sage, I fully support that. If we want to saturate SSDs, we need to get rid of this filesystem overhead (which I am in the process of measuring). Also, it will be good if we can eliminate the dependency on the k/v dbs (for storing allocators and all). The reason is the unknown write amplification they cause. Thanks & Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 12:49 PM To: ceph-devel@vger.kernel.org Subject: newstore direction The current design is based on two simple ideas: 1) a key/value interface is a better way to manage all of our internal metadata (object metadata, attrs, layout, collection membership, write-ahead logging, overlay data, etc.) 2) a file system is well suited for storing object data (as files). So far 1 is working out well, but I'm questioning the wisdom of #2. A few things: - We currently write the data to the file, fsync, then commit the kv transaction. That's at least 3 IOs: one for the data, one for the fs journal, one for the kv txn to commit (at least once my rocksdb changes land... the kv commit is currently 2-3). So two people are managing metadata here: the fs managing the file metadata (with its own journal) and the kv backend (with its journal). - On read we have to open files by name, which means traversing the fs namespace. Newstore tries to keep it as flat and simple as possible, but at a minimum it is a couple of btree lookups. We'd love to use open by handle (which would reduce this to 1 btree traversal), but running the daemon as ceph and not root makes that hard... - ...and file systems insist on updating mtime on writes, even when it is an overwrite with no allocation changes. (We don't care about mtime.) O_NOCMTIME patches exist but it is hard to get these past the kernel brainfreeze. - XFS is (probably) never going to give us data checksums, which we want desperately. But what's the alternative?
My thought is to just bite the bullet and consume a raw block device directly. Write an allocator, hopefully keep it pretty simple, and manage it in the kv store along with all of our other metadata. Wins: - 2 IOs for most: one to write the data to unused space in the block device, one to commit our transaction (vs 4+ before). For overwrites, we'd have one io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ before). - No concern about mtime getting in the way - Faster reads (no fs lookup) - Similarly sized metadata for most objects. If we assume most objects are not fragmented, then the metadata to store the block offsets is about the same size as the metadata to store the filenames we have now. Problems: - We have to size the kv backend storage (probably still an XFS partition) vs the block storage. Maybe we do this anyway (put metadata on SSD!) so it won't matter. But what happens when we are storing gobs of rgw index data or cephfs metadata? Suddenly we are pulling storage out of a different pool and those aren't currently fungible. - We have to write and maintain an allocator. I'm still optimistic this can be reasonably simple, especially for the flash case (where fragmentation isn't such an issue as long as our blocks are reasonably sized). For disk we may need to be moderately clever. - We'll need a fsck to ensure our internal metadata is consistent. The good news is it'll just need to validate what we have stored in the kv store. Other thoughts: - We might want to consider whether dm-thin or bcache or other block layers might help us with elasticity of file vs block areas. - Rocksdb can push colder data to a second directory, so we could have a fast ssd primary area (for wal and most metadata) and a second hdd directory for stuff it has to push off. Then have a conservative amount of file space on the hdd. If our block fills up, use the existing file mechanism to put data there too.
(But then we have to maintain both the current kv + file approach and not go all-in on kv + block.) Thoughts? sage
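The "write and maintain an allocator" problem Sage hopes will stay "reasonably simple" can be illustrated with the kind of interval_set<>-style freelist mentioned later in the thread: free space kept as sorted, non-adjacent (offset, length) extents, allocation as a linear first-fit sweep, and release re-coalescing neighbours. A minimal sketch in Python (not Ceph code; the `get_blocks`/`free_block` names echo the interface floated elsewhere in the thread):

```python
import bisect

class IntervalFreelist:
    """interval_set<>-style freelist sketch: sorted, non-adjacent
    (offset, length) extents; first-fit allocation, coalescing free."""

    def __init__(self, size):
        self.extents = [(0, size)]  # sorted by offset

    def get_blocks(self, length):
        """Linear sweep: carve `length` bytes out of the first fitting extent."""
        for i, (off, ln) in enumerate(self.extents):
            if ln >= length:
                if ln == length:
                    del self.extents[i]
                else:
                    self.extents[i] = (off + length, ln - length)
                return off
        raise MemoryError("no free extent large enough")

    def free_block(self, offset, length):
        """Return an extent to the freelist, merging adjacent neighbours."""
        i = bisect.bisect(self.extents, (offset, 0))
        # merge with the previous extent if it ends exactly at `offset`
        if i > 0 and self.extents[i - 1][0] + self.extents[i - 1][1] == offset:
            i -= 1
            offset, length = self.extents[i][0], self.extents[i][1] + length
            del self.extents[i]
        # merge with the following extent if it starts exactly at the end
        if i < len(self.extents) and offset + length == self.extents[i][0]:
            length += self.extents[i][1]
            del self.extents[i]
        self.extents.insert(i, (offset, length))
```

Persisting this in the kv store would mean serializing the extent list (or a delta of it) into the same transaction that records the object metadata, which is where the "significant rejiggering" of any off-the-shelf allocator would come in.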
Re: newstore direction
I think there is a lot that can be gained by Ceph managing a raw block device. As I mentioned on ceph-users, I've given this some thought, and a lot of optimizations could be done that are conducive to storing objects. I didn't think, however, to bypass VFS altogether by opening the raw device directly, but this would make things simpler, as you don't have to program things for VFS that don't make sense. Some of my thoughts were to employ a hashing algorithm for inode lookup (CRUSH-like). Is there a good use case for listing a directory? We may need to keep a list for deletion, but there may be a better way to handle this. Is there a need to do snapshots at the block layer if operations can be atomic? Is there a real advantage to having an allocation as small as 4K, or does it make sense to use something like 512K? I'm interested in how this might pan out. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Oct 19, 2015 at 1:49 PM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our
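Robert's 4K-vs-512K allocation question can be put in numbers: the cost of a larger allocation unit is internal fragmentation in each object's final, partially filled unit, so the overhead depends entirely on the object-size mix. A quick back-of-the-envelope calculation (the size mix below is hypothetical):

```python
def frag_overhead(object_sizes, alloc_unit):
    """Bytes lost to internal fragmentation: each object's final
    allocation unit is only partially filled by the object's tail."""
    waste = 0
    for size in object_sizes:
        tail = size % alloc_unit
        if tail:
            waste += alloc_unit - tail
    return waste

# e.g. a thousand 10 KiB objects (think small rgw-ish objects)
sizes = [10 * 1024] * 1000
print(frag_overhead(sizes, 4 * 1024))    # total waste with 4 KiB units
print(frag_overhead(sizes, 512 * 1024))  # total waste with 512 KiB units
```

With 4 KiB units each 10 KiB object wastes 2 KiB; with 512 KiB units it wastes 502 KiB, so the right unit hinges on how many small objects the store must hold versus how much extent metadata it can afford for large ones.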
RE: newstore direction
On Mon, 19 Oct 2015, Somnath Roy wrote: > Sage, > I fully support that. If we want to saturate SSDs, we need to get rid > of this filesystem overhead (which I am in the process of measuring). Also, > it will be good if we can eliminate the dependency on the k/v dbs (for > storing allocators and all). The reason is the unknown write > amplification they cause. My hope is to stay behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash). sage
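Sage's remark that the fsck will "just need to validate what we have stored in the kv store" suggests what that check could look like. A sketch (assuming, hypothetically, that the kv store holds per-object extent lists plus a freelist): every byte of the device must be covered exactly once, by either an object extent or a free extent.

```python
def fsck(device_size, extents, freelist):
    """Validate kv-stored metadata for a raw-block objectstore.

    extents:  {object_name: [(offset, length), ...]}
    freelist: [(offset, length), ...]
    Raises ValueError on any gap, overlap, or unaccounted tail.
    """
    spans = [(off, ln, name)
             for name, exts in extents.items() for off, ln in exts]
    spans += [(off, ln, "<free>") for off, ln in freelist]
    spans.sort()
    cursor = 0
    for off, ln, owner in spans:
        # off > cursor is a gap (leaked space); off < cursor is an
        # overlap (two owners claim the same bytes) -- both are corruption.
        if off != cursor:
            raise ValueError("gap or overlap at offset %d (next: %s)"
                             % (cursor, owner))
        cursor = off + ln
    if cursor != device_size:
        raise ValueError("device tail %d..%d unaccounted for"
                         % (cursor, device_size))
    return True
```

Because the check never reads the data blocks, it runs at metadata speed; adding the per-block checksums discussed elsewhere in the thread would be the deeper (and slower) second pass.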
Re: newstore direction
On 10/19/2015 09:49 PM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storing object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb changes > land... the kv commit is currently 2-3). So two people are managing > metadata here: the fs managing the file metadata (with its own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but > at a minimum it is a couple of btree lookups. We'd love to use open by > handle (which would reduce this to 1 btree traversal), but running > the daemon as ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is > an overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep > it pretty simple, and manage it in the kv store along with all of our > other metadata. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block > device, one to commit our transaction (vs 4+ before).
For overwrites, > we'd have one io to do our write-ahead log (kv journal), then do > the overwrite async (vs 4+ before). > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. If we assume most objects > are not fragmented, then the metadata to store the block offsets is about > the same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > a different pool and those aren't currently fungible. > > - We have to write and maintain an allocator. I'm still optimistic this > can be reasonably simple, especially for the flash case (where > fragmentation isn't such an issue as long as our blocks are reasonably > sized). For disk we may need to be moderately clever. > > - We'll need a fsck to ensure our internal metadata is consistent. The > good news is it'll just need to validate what we have stored in the kv > store. > > Other thoughts: > > - We might want to consider whether dm-thin or bcache or other block > layers might help us with elasticity of file vs block areas. > I've been using bcache for a while now in production, and it has helped a lot: Intel SSDs with GPT, the first few partitions as journals, and then one big partition for bcache.
/dev/bcache0  2.8T  264G  2.5T  10%  /var/lib/ceph/osd/ceph-60
/dev/bcache1  2.8T  317G  2.5T  12%  /var/lib/ceph/osd/ceph-61
/dev/bcache2  2.8T  303G  2.5T  11%  /var/lib/ceph/osd/ceph-62
/dev/bcache3  2.8T  316G  2.5T  12%  /var/lib/ceph/osd/ceph-63
/dev/bcache4  2.8T  167G  2.6T   6%  /var/lib/ceph/osd/ceph-64
/dev/bcache5  2.8T  295G  2.5T  11%  /var/lib/ceph/osd/ceph-65
The maintainers of bcache have also presented bcachefs: https://lkml.org/lkml/2015/8/21/22 "checksumming, compression: currently only zlib is supported for compression, and for checksumming there's crc32c and a 64 bit checksum." Wouldn't that be something that could be leveraged here? Consuming a raw block device seems like re-inventing the wheel to me. I might be wrong though. I have no idea how stable bcachefs is, but it might be worth looking into. > - Rocksdb can push colder data to a second directory, so we could have a > fast ssd primary area (for wal and most metadata) and a second hdd > directory for stuff it has to push off. Then have a conservative amount > of file space on the hdd. If our block fills up, use the existing file > mechanism to put data there too. (But then we have to maintain both the > current kv + file approach and not go all-in on kv + block.) > > Thoughts? > sage
RE: newstore direction
Hi Sage, If we are managing the raw device, does it make sense to have a key/value store manage the whole space? Keeping the allocator's metadata separately might cause other consistency problems. Getting an fsck for that implementation can be tougher; we might have to have strict crc computations on the data, and we have to manage the sanity of the DB managing them. If we can have a common mechanism keeping data and metadata in the same key/value store, it will improve performance. We have integrated a custom-made key/value store which works on a raw device as the key/value store backend, and we have observed better bandwidth utilization and iops. Reads/writes can be faster, and no fs lookup is needed. We have tools like fsck to take care of the consistency of the DB. Couple of comments inline. Thanks, Varada > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Tuesday, October 20, 2015 1:19 AM > To: ceph-devel@vger.kernel.org > Subject: newstore direction > > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal metadata > (object metadata, attrs, layout, collection membership, write-ahead logging, > overlay data, etc.) > > 2) a file system is well suited for storing object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one > for the kv txn to commit (at least once my rocksdb changes land... the kv > commit is currently 2-3). So two people are managing metadata here: the fs > managing the file metadata (with its own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace.
Newstore tries to keep it as flat and simple as possible, but at a > minimum it is a couple of btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is an > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and consume > a raw block device directly. Write an allocator, hopefully keep it pretty > simple, and manage it in the kv store along with all of our other metadata. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ > before). > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. If we assume most objects are > not fragmented, then the metadata to store the block offsets is about the > same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of rgw > index data or cephfs metadata? Suddenly we are pulling storage out of a > different pool and those aren't currently fungible. [Varada Kari] Ideally, if we can manage the raw device as a key/value store indirection that manages both metadata and data, we can benefit from faster lookups and writes (if the KV store supports batched, atomic, transactional writes).
SSDs might suffer more write amplification if we put only the metadata there; if we can make this part (the KV store dealing with the raw device) also handle small writes, we can avoid write amplification and get better throughput from the device. > - We have to write and maintain an allocator. I'm still optimistic this can > be > reasonably simple, especially for the flash case (where fragmentation isn't > such an issue as long as our blocks are reasonably sized). For disk we may > need to be moderately clever. > [Varada Kari] Yes. If the writes are aligned to the flash programmable page size, that will not cause any issues. But writes smaller than the programmable page size will cause internal fragmentation, and repeated overwrites to the same page will cause more write amplification. > - We'll need a fsck to ensure our internal metadata is consistent. The good > news is it'll just need to validate what we have stored in the kv store. > > Other thoughts: > > - We might want to consider whether dm-thin or bcache or other block > layers
RE: newstore direction
Hi Sage and Somnath, In my humble opinion, there is another, more aggressive solution than a raw-block-device-based key/value store as the backend for the objectstore. A new key/value SSD device with transaction support would be ideal to solve these issues. First of all, it is a raw SSD device. Secondly, it provides a key/value interface directly from the SSD. Thirdly, it can provide transaction support; consistency will be guaranteed by the hardware device. It pretty much satisfies all of the objectstore's needs without any extra overhead, since there is no extra layer between the device and the objectstore. Either way, I strongly support having Ceph's own data format instead of relying on a filesystem. Regards, James -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, October 19, 2015 1:55 PM To: Somnath Roy Cc: ceph-devel@vger.kernel.org Subject: RE: newstore direction On Mon, 19 Oct 2015, Somnath Roy wrote: > Sage, > I fully support that. If we want to saturate SSDs , we need to get > rid of this filesystem overhead (which I am in process of measuring). > Also, it will be good if we can eliminate the dependency on the k/v > dbs (for storing allocators and all). The reason is the unknown write > amps they cause. My hope is to keep behind the KeyValueDB interface (and/or change it as appropriate) so that other backends can be easily swapped in (e.g. a btree-based one for high-end flash). sage > > Thanks & Regards > Somnath > > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 12:49 PM > To: ceph-devel@vger.kernel.org > Subject: newstore direction > > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) 
> > 2) a file system is well suited for storing object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb > changes land... the kv commit is currently 2-3). So two people are > managing metadata, here: the fs managing the file metadata (with its > own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at > a minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is an > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep it > pretty simple, and manage it in kv store along with all of our other metadata. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs 4+ > before). > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. 
If we assume most objects are > not fragmented, then the metadata to store the block offsets is about the > same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put > metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of rgw > index data or cephfs metadata? Suddenly we are pulling storage out of a > different pool and those aren't currently fungible. > > - We have to write and maintain an allocator. I'm still optimistic this can > be reasonably simple, especially for the flash case (where fragmentation isn't > such an issue as long as our blocks are reasonably sized). For disk we may
Re: newstore direction
On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote: > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > a different pool and those aren't currently fungible. This is the concerning bit for me -- the other parts one "just" has to get the code right, but this problem could linger and be something we have to keep explaining to users indefinitely. It reminds me of cases in other systems where users had to make an educated guess about inode size up front, depending on whether you're expecting to efficiently store a lot of xattrs. In practice it's rare for users to make these kinds of decisions well up-front: it really needs to be adjustable later, ideally automatically. That could be pretty straightforward if the KV part was stored directly on block storage, instead of having XFS in the mix. I'm not quite up with the state of the art in this area: are there any reasonable alternatives for the KV part that would consume some defined range of a block device from userspace, instead of sitting on top of a filesystem? John -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
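John's question -- a KV engine consuming a defined range of a block device from userspace -- can be illustrated with a toy sketch. This is not any real engine: `RegionKV` is a made-up name, a temp file stands in for the raw device, and a real implementation would need O_DIRECT with aligned buffers, crash-safe commit records, and compaction.

```python
# Toy sketch of a userspace KV engine writing directly into a fixed byte
# range of a block device, with no filesystem underneath (hypothetical;
# illustrative only).
import os
import struct
import tempfile

class RegionKV:
    """Append-only log of (key, value) records inside [base, base+size)."""
    HDR = struct.Struct("<II")  # key length, value length

    def __init__(self, fd, base, size):
        self.fd, self.base, self.size = fd, base, size
        self.head = 0  # next free byte within the region

    def put(self, key, value):
        rec = self.HDR.pack(len(key), len(value)) + key + value
        assert self.head + len(rec) <= self.size, "region full"
        os.pwrite(self.fd, rec, self.base + self.head)
        self.head += len(rec)

    def scan(self):
        """Replay the log; the last write for a key wins."""
        out, pos = {}, 0
        while pos < self.head:
            klen, vlen = self.HDR.unpack(
                os.pread(self.fd, self.HDR.size, self.base + pos))
            pos += self.HDR.size
            key = os.pread(self.fd, klen, self.base + pos); pos += klen
            val = os.pread(self.fd, vlen, self.base + pos); pos += vlen
            out[key] = val
        return out

# A temp file stands in for the raw device here.
fd = os.open(tempfile.mkstemp()[1], os.O_RDWR)
os.ftruncate(fd, 1 << 20)
kv = RegionKV(fd, base=4096, size=65536)   # KV region starts at 4 KiB
kv.put(b"object/1", b"extent 0+8192")
kv.put(b"object/1", b"extent 8192+4096")   # overwrite: newer record wins
print(kv.scan()[b"object/1"])              # -> b'extent 8192+4096'
```

The point is that nothing here requires a filesystem: the engine only needs `pread`/`pwrite` at offsets it manages itself, which is why resizing the KV region later becomes a bookkeeping problem rather than a partitioning one.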
RE: newstore direction
+1 -- nowadays K-V DBs care more about very small key-value pairs, say several bytes to a few KB, but in the SSD case we only care about 4KB or 8KB. In this way, NVMKV is a good design, and it seems some of the SSD vendors are also trying to build this kind of interface; we had an NVM-L library but it is still under development. > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > Sent: Tuesday, October 20, 2015 6:21 AM > To: Sage Weil; Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi Sage and Somnath, > In my humble opinion, there is another, more aggressive solution than a raw > block device based keyvalue store as backend for objectstore. The new key > value SSD device with transaction support would be ideal to solve the > issues. > First of all, it is a raw SSD device. Secondly, it provides a key value interface > directly from the SSD. Thirdly, it can provide transaction support; consistency > will > be guaranteed by the hardware device. It pretty much satisfies all of objectstore > needs without any extra overhead since there is not any extra layer in > between device and objectstore. > Either way, I strongly support having Ceph's own data format instead of > relying on a filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM > To: Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > Sage, > > I fully support that. If we want to saturate SSDs , we need to get > > rid of this filesystem overhead (which I am in process of measuring). > > Also, it will be good if we can eliminate the dependency on the k/v > > dbs (for storing allocators and all). The reason is the unknown write > > amps they cause. 
> > My hope is to keep behind the KeyValueDB interface (and/or change it as > appropriate) so that other backends can be easily swapped in (e.g. a btree- > based one for high-end flash). > > sage > > > > > > Thanks & Regards > > Somnath > > > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 12:49 PM > > To: ceph-devel@vger.kernel.org > > Subject: newstore direction > > > > The current design is based on two simple ideas: > > > > 1) a key/value interface is a better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storing object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb > > changes land... the kv commit is currently 2-3). So two people are > > managing metadata, here: the fs managing the file metadata (with its > > own > > journal) and the kv backend (with its journal). > > > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at a > minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > > > - ...and file systems insist on updating mtime on writes, even when it is an > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. 
> > > > - XFS is (probably) never going to give us data checksums, which we > want desperately. > > > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep it > pretty simple, and manage it in kv store along with all of our other metadata. > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have one > io to do our write-ahead log (kv journal), then do the overwrite async (vs
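What the objectstore would need from a transactional KV SSD (or an NVMKV-style library) can be sketched as an interface: a batch of puts and deletes that the device applies atomically, so consistency is guaranteed below the objectstore. The names below (`Transaction`, `AtomicKVDevice`) are hypothetical, and the in-memory "device" models only the atomicity, not the hardware.

```python
# Sketch of the interface a transactional KV device would need to expose
# for an objectstore backend (hypothetical API, not a real driver): a batch
# of operations applied all-or-nothing at a single commit point.
class Transaction:
    def __init__(self):
        self.ops = []

    def put(self, key, value):
        self.ops.append(("put", key, value))

    def delete(self, key):
        self.ops.append(("del", key, None))

class AtomicKVDevice:
    """In-memory stand-in for a KV device with atomic batch commit."""
    def __init__(self):
        self.data = {}

    def submit(self, txn):
        staged = dict(self.data)        # stage every op, then swap:
        for op, key, value in txn.ops:  # a crash before the swap leaves
            if op == "put":             # the old state fully intact
                staged[key] = value
            else:
                staged.pop(key, None)
        self.data = staged              # the single atomic "commit point"

dev = AtomicKVDevice()
t = Transaction()
t.put(b"meta/obj1", b"size=4096")
t.put(b"data/obj1/0", b"...payload...")
t.delete(b"wal/seq/42")
dev.submit(t)                           # one commit, no fsync choreography
print(dev.data[b"meta/obj1"])           # -> b'size=4096'
```

If the hardware provided this, the write-ahead log, the allocator state, and the object metadata could all ride in one batch, which is exactly the "no extra layer" appeal James describes.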
RE: newstore direction
There is something like http://pmem.io/nvml/libpmemobj/ to adapt NVMe to transactional object storage, but it definitely needs some more work. > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Varada Kari > Sent: Tuesday, October 20, 2015 10:33 AM > To: James (Fei) Liu-SSI; Sage Weil; Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi James, > > Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ? > If SCSI OSD is what you are mentioning, the drive has to support all the OSD > functionality mentioned by T10. > If not, we have to implement the same functionality in the kernel or have a > wrapper in user space to convert them to read/write calls. This seems like more > effort. > > Varada > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > > Sent: Tuesday, October 20, 2015 3:51 AM > > To: Sage Weil <sw...@redhat.com>; Somnath Roy > > <somnath@sandisk.com> > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > Hi Sage and Somnath, > > In my humble opinion, there is another, more aggressive solution > > than a raw block device based keyvalue store as backend for objectstore. > > The new key value SSD device with transaction support would be ideal > > to solve the issues. First of all, it is a raw SSD device. Secondly, it > > provides a key value interface directly from the SSD. Thirdly, it can > > provide transaction support; consistency will be guaranteed by the > > hardware device. It pretty much satisfies all of objectstore needs > > without any extra overhead since there is not any extra layer in between > device and objectstore. > > Either way, I strongly support having Ceph's own data format instead > > of relying on a filesystem. 
> > > > Regards, > > James > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > > ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 1:55 PM > > To: Somnath Roy > > Cc: ceph-devel@vger.kernel.org > > Subject: RE: newstore direction > > > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > > Sage, > > > I fully support that. If we want to saturate SSDs , we need to get > > > rid of this filesystem overhead (which I am in process of measuring). > > > Also, it will be good if we can eliminate the dependency on the k/v > > > dbs (for storing allocators and all). The reason is the unknown > > > write amps they cause. > > > > My hope is to keep behind the KeyValueDB interface (and/or change it > > as > > appropriate) so that other backends can be easily swapped in (e.g. a > > btree- based one for high-end flash). > > > > sage > > > > > > > > > > Thanks & Regards > > > Somnath > > > > > > > > > -Original Message- > > > From: ceph-devel-ow...@vger.kernel.org > > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > > Sent: Monday, October 19, 2015 12:49 PM > > > To: ceph-devel@vger.kernel.org > > > Subject: newstore direction > > > > > > The current design is based on two simple ideas: > > > > > > 1) a key/value interface is a better way to manage all of our > > > internal metadata (object metadata, attrs, layout, collection > > > membership, write-ahead logging, overlay data, etc.) > > > > > > 2) a file system is well suited for storing object data (as files). > > > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. > > > A few > > > things: > > > > > > - We currently write the data to the file, fsync, then commit the > > > kv transaction. That's at least 3 IOs: one for the data, one for > > > the fs journal, one for the kv txn to commit (at least once my > > > rocksdb changes land... the kv commit is currently 2-3). 
So two > > > people are managing metadata, here: the fs managing the file > > > metadata (with its own > > > journal) and the kv backend (with its journal). > > > > > > - On read we have to open files by name, which means traversing the > > > fs > > namespace. Newstore tries to keep it as flat and simple as possible, > > but at a minimum it is a couple btree lookups. We'd love to use open > > by handle
RE: newstore direction
Hi James, Are you mentioning SCSI OSD (http://www.t10.org/drafts.htm#OSD_Family) ? If SCSI OSD is what you are mentioning, the drive has to support all the OSD functionality mentioned by T10. If not, we have to implement the same functionality in the kernel or have a wrapper in user space to convert them to read/write calls. This seems like more effort. Varada > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI > Sent: Tuesday, October 20, 2015 3:51 AM > To: Sage Weil <sw...@redhat.com>; Somnath Roy > <somnath@sandisk.com> > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > Hi Sage and Somnath, > In my humble opinion, there is another, more aggressive solution than a raw > block device based keyvalue store as backend for objectstore. The new key > value SSD device with transaction support would be ideal to solve the > issues. First of all, it is a raw SSD device. Secondly, it provides a key value > interface directly from the SSD. Thirdly, it can provide transaction support; > consistency will be guaranteed by the hardware device. It pretty much satisfies > all of objectstore needs without any extra overhead since there is not any > extra layer in between device and objectstore. > Either way, I strongly support having Ceph's own data format instead of > relying on a filesystem. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Monday, October 19, 2015 1:55 PM > To: Somnath Roy > Cc: ceph-devel@vger.kernel.org > Subject: RE: newstore direction > > On Mon, 19 Oct 2015, Somnath Roy wrote: > > Sage, > > I fully support that. If we want to saturate SSDs , we need to get > > rid of this filesystem overhead (which I am in process of measuring). > > Also, it will be good if we can eliminate the dependency on the k/v > > dbs (for storing allocators and all). 
The reason is the unknown write > > amps they cause. > > My hope is to keep behind the KeyValueDB interface (and/or change it as > appropriate) so that other backends can be easily swapped in (e.g. a btree- > based one for high-end flash). > > sage > > > > > > Thanks & Regards > > Somnath > > > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > > Sent: Monday, October 19, 2015 12:49 PM > > To: ceph-devel@vger.kernel.org > > Subject: newstore direction > > > > The current design is based on two simple ideas: > > > > 1) a key/value interface is a better way to manage all of our internal > > metadata (object metadata, attrs, layout, collection membership, > > write-ahead logging, overlay data, etc.) > > > > 2) a file system is well suited for storing object data (as files). > > > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > > few > > things: > > > > - We currently write the data to the file, fsync, then commit the kv > > transaction. That's at least 3 IOs: one for the data, one for the fs > > journal, one for the kv txn to commit (at least once my rocksdb > > changes land... the kv commit is currently 2-3). So two people are > > managing metadata, here: the fs managing the file metadata (with its > > own > > journal) and the kv backend (with its journal). > > > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but at a > minimum it is a couple btree lookups. We'd love to use open by handle > (which would reduce this to 1 btree traversal), but running the daemon as > ceph and not root makes that hard... > > > > - ...and file systems insist on updating mtime on writes, even when it is an > overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. 
> > > > - XFS is (probably) never going to give us data checksums, which we > want desperately. > > > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep it > pretty simple, and manage it in kv store along with all of our other metadata. > > > > Wins: > > > > - 2 IOs for most: one to write the data to unused space in the block > > device, > one to commit our transaction (vs 4+ before). For overwrites, we'd have
Re: newstore direction
On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storing object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb changes > land... the kv commit is currently 2-3). So two people are managing > metadata, here: the fs managing the file metadata (with its own > journal) and the kv backend (with its journal). > > - On read we have to open files by name, which means traversing the fs > namespace. Newstore tries to keep it as flat and simple as possible, but > at a minimum it is a couple btree lookups. We'd love to use open by > handle (which would reduce this to 1 btree traversal), but running > the daemon as ceph and not root makes that hard... > > - ...and file systems insist on updating mtime on writes, even when it is > an overwrite with no allocation changes. (We don't care about mtime.) > O_NOCMTIME patches exist but it is hard to get these past the kernel > brainfreeze. > > - XFS is (probably) never going to give us data checksums, which we > want desperately. > > But what's the alternative? My thought is to just bite the bullet and > consume a raw block device directly. Write an allocator, hopefully keep > it pretty simple, and manage it in kv store along with all of our other > metadata. This is really a tough decision. The idea of a block-device-based objectstore has never left my mind these past two years. 
We would be much more concerned about the efficiency of space utilization compared to a local fs, about bugs, and about the time consumed building even a tiny local filesystem; I'm a little afraid we would get stuck. > > Wins: > > - 2 IOs for most: one to write the data to unused space in the block > device, one to commit our transaction (vs 4+ before). For overwrites, > we'd have one io to do our write-ahead log (kv journal), then do > the overwrite async (vs 4+ before). Compared to FileJournal, the key/value DB doesn't seem to play well in the WAL area, from my perf results. > > - No concern about mtime getting in the way > > - Faster reads (no fs lookup) > > - Similarly sized metadata for most objects. If we assume most objects > are not fragmented, then the metadata to store the block offsets is about > the same size as the metadata to store the filenames we have now. > > Problems: > > - We have to size the kv backend storage (probably still an XFS > partition) vs the block storage. Maybe we do this anyway (put metadata on > SSD!) so it won't matter. But what happens when we are storing gobs of > rgw index data or cephfs metadata? Suddenly we are pulling storage out of > a different pool and those aren't currently fungible. > > - We have to write and maintain an allocator. I'm still optimistic this > can be reasonably simple, especially for the flash case (where > fragmentation isn't such an issue as long as our blocks are reasonably > sized). For disk we may need to be moderately clever. > > - We'll need a fsck to ensure our internal metadata is consistent. The > good news is it'll just need to validate what we have stored in the kv > store. > > Other thoughts: > > - We might want to consider whether dm-thin or bcache or other block > layers might help us with elasticity of file vs block areas. > > - Rocksdb can push colder data to a second directory, so we could have a > fast ssd primary area (for wal and most metadata) and a second hdd > directory for stuff it has to push off. 
Then have a conservative amount > of file space on the hdd. If our block fills up, use the existing file > mechanism to put data there too. (But then we have to maintain both the > current kv + file approach and not go all-in on kv + block.) A complex way... Actually I would like to employ a FileStore2 implementation, which means we still use FileJournal (or the like), but we keep more metadata/xattrs in memory and use AIO+DIO to flush to disk. A userspace page cache would need to be implemented. Then we can skip the journal for full writes; because the OSD has PG isolation, we could make a barrier for a single PG when skipping the journal. @Sage Are there other concerns about FileStore skipping the journal? In a word, I like the model that FileStore owns, but we need a big refactor of the existing implementation. Sorry to disturb the thread. > > Thoughts? > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Best Regards, Wheat --
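The 2-IO write path from Sage's "Wins" list -- data to unused space plus one kv commit for fresh writes, kv WAL plus async apply for overwrites -- can be sketched as follows. All names here are illustrative, not newstore code, and a real store would make the kv commit durable before acking the write.

```python
# Sketch of the proposed write path (hypothetical, heavily simplified):
# a fresh write costs 2 IOs (data to unused space + one kv txn commit),
# while an overwrite logs the bytes in the kv WAL first and applies them
# to the block device asynchronously.
class NewStoreSketch:
    def __init__(self):
        self.block = bytearray(1 << 16)   # stand-in for the raw device
        self.kv = {}                      # stand-in for the kv backend
        self.next_free = 0                # trivial bump allocator
        self.ios = 0

    def write_new(self, obj, data):
        off = self.next_free
        self.next_free += len(data)
        self.block[off:off + len(data)] = data; self.ios += 1  # IO 1: data
        self.kv[obj] = (off, len(data));        self.ios += 1  # IO 2: kv txn
        return off

    def overwrite(self, obj, data):
        # IO 1: write-ahead log entry in the kv journal; the write is
        # durable (and can be acked) once this commits.
        self.kv[("wal", obj)] = data; self.ios += 1
        # Later, asynchronously: apply to the device, drop the WAL entry.
        off, _ = self.kv[obj]
        self.block[off:off + len(data)] = data; self.ios += 1
        del self.kv[("wal", obj)]

store = NewStoreSketch()
store.write_new("obj1", b"hello")
print(store.ios)                 # -> 2 IOs for a fresh write
store.overwrite("obj1", b"HELLO")
print(bytes(store.block[0:5]))   # -> b'HELLO'
```

The overwrite path is where Wheat's concern bites: the WAL put lands in the kv backend rather than a dedicated FileJournal, so the kv store's own write behavior determines how well this performs.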