RE: Notes from a discussion of a design to allow EC overwrites
This scheme fundamentally relies on the temporary objects "gracefully" transitioning into being portions of full-up long-term durable objects. This means that if the allocation size for a temporary object significantly mismatches the size of the mutation (partial stripe write), you're creating a problem that's proportional to the mismatch size. So either NewStore is able to efficiently allocate small chunks, or you have some kind of background cleanup process that reclaims that space (i.e., a "straightener"). The right choice depends on having a shared notion of the operational profile that you're trying to optimize for. The fundamental question becomes: are you going to optimize for small-block random writes? In my experience this is a key use-case in virtually every customer's evaluation scenario. I believe we MUST make this case reasonably efficient. It seems to me that the lowest-complexity "fix" for the problem is to teach NewStore to have two different allocation sizes (big and small :)). Naturally the allocator becomes more complex. Worst case, you're now left with the garbage collection problem. 
Which I suspect could be punted to a subsequent release (i.e., I'm out of large blocks, but there's plenty of fragmented available space -- this can happen, but it's a pretty pathological case which becomes rarer and rarer as you scale out) Allen Samuels Software Architect, Emerging Storage Solutions 2880 Junction Avenue, Milpitas, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Samuel Just [mailto:sj...@redhat.com] Sent: Friday, November 13, 2015 7:39 AM To: Sage Weil <sw...@redhat.com> Cc: ceph-devel@vger.kernel.org; Allen Samuels <allen.samu...@sandisk.com>; Durgin, Josh <jdur...@redhat.com>; Farnum, Gregory <gfar...@redhat.com> Subject: Re: Notes from a discussion a design to allow EC overwrites Lazily persisting the intermediate entries would certainly also work, but there's an argument that it needlessly adds to the write transaction. Actually, we probably want to avoid having small writes be full stripe writes -- with an 8+3 code the difference between modifying a single stripelet and modifying the full stripe is 4 writes vs 11 writes. It means that during peering, any log we can find (particularly the shortest one) from the most recent active interval isn't an upper bound on writes committed to the client (the (actingset.size() - M - 1)th one is?) -- we'd have to think carefully about the implications of that. 
-Sam On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sw...@redhat.com> wrote: > On Thu, 12 Nov 2015, Samuel Just wrote: >> I was present for a discussion about allowing EC overwrites and >> thought it would be good to summarize it for the list: >> >> Commit Protocol: >> 1) client sends write to primary >> 2) primary reads in partial stripes needed for partial stripe >> overwrites from replicas >> 3) primary sends prepares to participating replicas and queues its >> own prepare locally >> 4) once all prepares are complete, primary sends a commit to the >> client >> 5) primary sends applies to all participating replicas >> >> When we get the prepare, we write out a temp object with the data to >> be written. On apply, we use an objectstore primitive to atomically >> move those extents into the actual object. The log entry contains >> the name/id for the temp object so it can be applied on apply or removed on >> rollback. > > Currently we assume that temp objects are/can be cleared out on restart. > This will need to change. And we'll need to be careful that they get > cleaned out when peering completes (and the rollforward/rollback > decision is made). > > If the stripes are small, then the objectstore primitive may not > actually be that efficient. I'd suggest also hinting that the temp > object will be swapped later, so that the backend can, if it's small, > store it in a cheap temporary location in the expectation that it will get > rewritten later. > (In particular, the newstore allocation chunk is currently targeting > 512kb, and this will only be efficient with narrow stripes, so it'll > just get double-written. We'll want to keep the temp value in the kv > store [log, hopefully] and not bother to allocate disk and rewrite > it.) > >> Each log entry contains a list of the shard ids modified. 
During >> peering, we use the same protocol for choosing the authoritative log >> for the existing EC pool, except that we first take the longest >> candidate log and use it to extend shorter logs until they hit an entry they >> should have witnessed, but didn't. >> >> Implicit in the above scheme is the fact that if an ob
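The prepare/apply flow in the quoted protocol can be sketched in miniature. This is a toy Python model, not the Ceph code: all names here (ShardStore, client_write, etc.) are invented stand-ins, and a plain dict update stands in for the objectstore's atomic move-extents primitive.

```python
# Toy model of the prepare/apply commit protocol summarized above.
# Hypothetical names; the dict update models the atomic extent move.

class ShardStore:
    """One participating shard: durable objects plus prepared temp objects."""
    def __init__(self):
        self.objects = {}   # object name -> bytearray
        self.temps = {}     # temp id -> (name, offset, data)

    def prepare(self, temp_id, name, offset, data):
        # Step 3: persist the mutation as a temp object; cheap to roll back.
        self.temps[temp_id] = (name, offset, bytes(data))

    def apply(self, temp_id):
        # Step 5: atomically move the temp extents into the actual object.
        name, offset, data = self.temps.pop(temp_id)
        obj = self.objects.setdefault(name, bytearray())
        if len(obj) < offset + len(data):
            obj.extend(b"\0" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

    def rollback(self, temp_id):
        # The log entry names the temp object so rollback can just drop it.
        self.temps.pop(temp_id, None)

def client_write(shards, temp_id, name, offset, data):
    # Steps 1-4: prepare everywhere; once all prepares complete, the
    # write is committed to the client, then applies go out (step 5).
    for s in shards:
        s.prepare(temp_id, name, offset, data)
    committed = True   # ack to client happens at this point
    for s in shards:
        s.apply(temp_id)
    return committed
```

Note that the commit to the client (step 4) happens strictly after all prepares but before any apply, which is what makes the rollback path meaningful during peering.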
RE: Question about how rebuild works.
So the current algorithm optimizes for minimum period of cluster degradation at the expense of degrading MTTDL. So in the 3x replication case, the MTTR (for data with two failures) is somewhere between 1x and 2x the MTTR of a single failure -- depending on the phase alignment of the first and second rebuild. The average case would be 1.5x, and the MTTDL varies inversely, i.e., this behavior cuts the MTTDL in half. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum Sent: Friday, November 06, 2015 8:53 AM To: Samuel Just <sj...@redhat.com> Cc: ceph-devel <ceph-devel@vger.kernel.org> Subject: Re: Question about how rebuild works. Yeah, I'm more concerned about individual object durability. This seems like a good way (in ongoing flapping or whatever) for objects at the tail end of a PG to never get properly replicated even as we expend lots of IO repeatedly recovering earlier objects which are better-replicated. :/ Perhaps min_size et al make this a moot point, but...I don't think so. Haven't worked it all the way through. -Greg On Fri, Nov 6, 2015 at 8:48 AM, Samuel Just <sj...@redhat.com> wrote: > Nope, it's worse, there could be arbitrary interleavings of backfilled and > unbackfilled portions on any particular incomplete osd. We'd need a > backfilled_regions field with a type like map<hobject_t, hobject_t> > mapping backfilled regions begin->end. It's pretty tedious, but > doable provided that we bound how large the mapping gets. I'm > skeptical about how large an effect this would actually have on > overall durability (how frequent is this case?). Once Allen does the > math, we'll have a better idea :) -Sam > > On Fri, Nov 6, 2015 at 8:43 AM, Gregory Farnum <gfar...@redhat.com> wrote: >> Argh, I guess I was wrong. 
Sorry for the misinformation, all! :( >> >> If we were to try and do this, Sam, do you have any idea how much it >> would take? Presumably we'd have to add a backfill_begin marker to >> bookend with last_backfill_started, and then everywhere we send over >> object ops we'd have to compare against both of those values. But I'm >> not sure how many sites that's likely to be, what other kinds of >> paths rely on last_backfill_started, or if I'm missing something. >> -Greg >> >> On Fri, Nov 6, 2015 at 8:30 AM, Samuel Just <sj...@redhat.com> wrote: >>> What it actually does is rebuild 3 until it catches up with 2 and >>> then it rebuilds them in parallel (to minimize reads). Optimally, >>> we'd start 3 from where 2 left off and then circle back, but we'd >>> have to complicate the metadata we use to track backfill. >>> -Sam -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
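Sam's proposed backfilled_regions -- a begin->end map of backfilled ranges -- can be sketched as a small interval map. This is an illustrative Python model only: hobject_t is reduced to a plain comparable value, the class name is invented, and as Sam notes a real implementation would have to bound the map's size.

```python
import bisect

class BackfillIntervals:
    """Sketch of the proposed backfilled_regions: a sorted map of
    begin -> end (end exclusive) over backfilled object ranges.
    hobject_t is modeled here as any comparable value (e.g. an int)."""

    def __init__(self):
        self.begins = []  # sorted interval starts
        self.ends = {}    # begin -> end

    def add(self, begin, end):
        # Naive insert followed by a merge pass over overlapping or
        # adjacent intervals; this is what bounds the map's size.
        i = bisect.bisect_left(self.begins, begin)
        self.begins.insert(i, begin)
        self.ends[begin] = max(end, self.ends.get(begin, end))
        merged_b, merged_e, out = None, None, []
        for b in self.begins:
            e = self.ends[b]
            if merged_b is None:
                merged_b, merged_e = b, e
            elif b <= merged_e:
                merged_e = max(merged_e, e)
            else:
                out.append((merged_b, merged_e))
                merged_b, merged_e = b, e
        out.append((merged_b, merged_e))
        self.begins = [b for b, _ in out]
        self.ends = {b: e for b, e in out}

    def contains(self, obj):
        # An object is backfilled iff it falls inside some interval;
        # this is the check recovery would do before sending ops.
        i = bisect.bisect_right(self.begins, obj) - 1
        return i >= 0 and obj < self.ends[self.begins[i]]
```

With this shape, "start 3 from where 2 left off and then circle back" just adds a second disjoint interval that eventually merges with the first.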
RE: newstore direction
How would this kind of split affect small transactions? Will each split be separately transactionally consistent or is there some kind of meta-transaction that synchronizes each of the splits? Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just Sent: Friday, October 23, 2015 8:42 AM To: James (Fei) Liu-SSI <james@ssi.samsung.com> Cc: Sage Weil <sw...@redhat.com>; Ric Wheeler <rwhee...@redhat.com>; Orit Wasserman <owass...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction Since the changes which moved the pg log and the pg info into the pg object space, I think it's now the case that any transaction submitted to the objectstore updates a disjoint range of objects determined by the sequencer. It might be easier to exploit that parallelism if we control allocation and allocation-related metadata. We could split the store into N pieces which partition the pg space (one additional one for the meta sequencer?) with one rocksdb instance for each. Space could then be parcelled out in large pieces (small frequency of global allocation decisions) and managed more finely within each partition. The main challenge would be avoiding internal fragmentation of those, but at least defragmentation can be managed on a per-partition basis. Such parallelism is probably necessary to exploit the full throughput of some ssds. -Sam On Thu, Oct 22, 2015 at 10:42 AM, James (Fei) Liu-SSI <james@ssi.samsung.com> wrote: > Hi Sage and other fellow cephers, > I truly share the pains with you all about filesystems while I am working > on the objectstore to improve the performance. As mentioned, there is nothing > wrong with filesystems. 
It's just that Ceph, as one use case, needs more support > than filesystems will provide in the near future, whatever the reasons. > > There are many techniques popping up which can help to improve the > performance of the OSD. A user-space driver (DPDK from Intel) is one of them. It > not only gives you the storage allocator, it also gives you thread > scheduling support, CPU affinity, NUMA friendliness, and polling, which might > fundamentally change the performance of the objectstore. It should not be hard > to improve CPU utilization 3x~5x, with higher IOPS etc. > I totally agree that the goal of FileStore is to give enough support for > filesystems with either solution 1, 1b, or 2. In my humble opinion, the new > design goal of the objectstore should focus on giving the best performance for > the OSD with new techniques. These two goals are not going to conflict with each > other. They are just for different purposes, to make Ceph not only more > stable but also better. > > Scylla, mentioned by Orit, is a good example. > > Thanks all. > > Regards, > James > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil > Sent: Thursday, October 22, 2015 5:50 AM > To: Ric Wheeler > Cc: Orit Wasserman; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > On Wed, 21 Oct 2015, Ric Wheeler wrote: >> You will have to trust me on this as the Red Hat person who spoke to >> pretty much all of our key customers about local file systems and >> storage - customers all have migrated over to using normal file systems >> under Oracle/DB2. >> Typically, they use XFS or ext4. I don't know of any non-standard >> file systems and have only seen one account running on a raw block >> store in 8 years >> :) >> >> If you have a pre-allocated file and write using O_DIRECT, your IO >> path is identical in terms of IO's sent to the device. 
>> >> If we are causing additional IO's, then we really need to spend some >> time talking to the local file system gurus about this in detail. I >> can help with that conversation. > > If the file is truly preallocated (that is, prewritten with zeros... > fallocate doesn't help here because the extents are marked unwritten), > then > sure: there is very little change in the data path. > > But at that point, what is the point? This only works if you have one (or a > few) huge files and the user space app already has all the complexity of a > filesystem-like thing (with its own internal journal, allocators, garbage > collection, etc.). Do they just do this to ease administrative tasks like > backup? > > > This is the fundamental tradeoff: > > 1) We have a file per object. We fsync like crazy and the fact that the
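On Allen's question upthread about transactional consistency of the splits: under Sam's observation that each transaction touches a disjoint object range determined by its sequencer (i.e. a single pg), every transaction lands in exactly one shard, so per-shard atomicity suffices and no cross-shard meta-transaction is needed. A hypothetical Python sketch of that routing (PartitionedStore and KVShard are invented names; each KVShard stands in for one rocksdb instance):

```python
import hashlib

class KVShard:
    """Stand-in for one rocksdb instance; commit() is atomic per shard."""
    def __init__(self):
        self.data = {}

    def commit(self, txn):
        # In the real store this would be an all-or-nothing batch write.
        self.data.update(txn)

class PartitionedStore:
    """Sketch of the proposed split: N shards partitioning the pg space.
    Because each ObjectStore transaction only touches one pg, it maps to
    exactly one shard, and shards never need to coordinate a commit."""
    def __init__(self, n):
        self.shards = [KVShard() for _ in range(n)]

    def shard_for(self, pgid):
        # Stable hash of the pg id picks the owning shard/partition.
        h = hashlib.md5(pgid.encode()).digest()
        return self.shards[h[0] % len(self.shards)]

    def submit(self, pgid, txn):
        self.shard_for(pgid).commit(txn)
```

Allocation would follow the same partitioning: each shard manages its own fine-grained space within large extents handed out by a rarely-consulted global allocator.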
RE: newstore direction
I am pushing internally to open-source ZetaScale. Recent events may or may not affect that trajectory -- stay tuned. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: Wednesday, October 21, 2015 10:45 PM To: Allen Samuels <allen.samu...@sandisk.com>; Ric Wheeler <rwhee...@redhat.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 05:06 AM, Allen Samuels wrote: > I agree that moving newStore to raw block is going to be a significant > development effort. But the current scheme of using a KV store combined with > a normal file system is always going to be problematic (FileStore or > NewStore). This is caused by the transactional requirements of the > ObjectStore interface, essentially you need to make transactionally > consistent updates to two indexes, one of which doesn't understand > transactions (File Systems) and can never be tightly-connected to the other > one. > > You'll always be able to make this "loosely coupled" approach work, but it > will never be optimal. The real question is whether the performance > difference of a suboptimal implementation is something that you can live with > compared to the longer gestation period of the more optimal implementation. > Clearly, Sage believes that the performance difference is significant or he > wouldn't have kicked off this discussion in the first place. > > While I think we can all agree that writing a full-up KV and raw-block > ObjectStore is a significant amount of work. I will offer the case that the > "loosely couple" scheme may not have as much time-to-market advantage as it > appears to have. 
One example: NewStore performance is limited due to bugs in > XFS that won't be fixed in the field for quite some time (it'll take at least > a couple of years before a patched version of XFS will be widely deployed in > customer environments). > > Another example: Sage has just had to substantially rework the journaling > code of rocksDB. > > In short, as you can tell, I'm full-throated in favor of going down the > optimal route. > > Internally at Sandisk, we have a KV store that is optimized for flash (it's > called ZetaScale). We have extended it with a raw block allocator just as > Sage is now proposing to do. Our internal performance measurements show a > significant advantage over the current NewStore. That performance advantage > stems primarily from two things: Has there been any discussion regarding open-sourcing ZetaScale? > > (1) ZetaScale uses a B+-tree internally rather than an LSM tree > (levelDB/RocksDB). LSM trees experience exponential increase in write > amplification (cost of an insert) as the amount of data under management > increases. B+tree write-amplification is nearly constant independent of the > size of data under management. As the KV database gets larger (Since newStore > is effectively moving the per-file inode into the kv data base. Don't forget > checksums that Sage wants to add :)) this performance delta swamps all > others. > (2) Having a KV and a file-system causes a double lookup. This costs CPU time > and disk accesses to page in data structure indexes, metadata efficiency > decreases. > > You can't avoid (2) as long as you're using a file system. > > Yes an LSM tree performs better on HDD than does a B-tree, which is a good > argument for keeping the KV module pluggable. 
> > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler > Sent: Tuesday, October 20, 2015 11:32 AM > To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > On 10/19/2015 03:49 PM, Sage Weil wrote: >> The current design is based on two simple ideas: >> >>1) a key/value interface is a better way to manage all of our >> internal metadata (object metadata, attrs, layout, collection >> membership, write-ahead logging, overlay data, etc.) >> >>2) a file system is well suited for storage of object data (as files). >> >> So far 1 is working out well, but I'm questioning the wisdom of #2. >> A few >> things: >> >>- We currently write the data to the file, fsync, then commit the >> kv transaction. That's
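Allen's point (1) about write amplification can be made concrete with rough back-of-envelope models. These are illustrative formulas under stated assumptions (leveled compaction with a fixed fanout; a fixed B+tree page size), not measurements of RocksDB or ZetaScale; the parameter values are hypothetical.

```python
import math

def lsm_write_amp(data_gb, memtable_gb=0.25, fanout=10):
    """Rough leveled-compaction model: each level is `fanout` times larger
    than the previous one, and a key is rewritten about `fanout` times per
    level it migrates through, so write amp grows with the level count --
    i.e., it grows as the data under management grows."""
    levels = max(1, math.ceil(math.log(data_gb / memtable_gb, fanout)))
    return levels * fanout

def btree_write_amp(page_kb=4, value_bytes=64):
    """A B+tree rewrites one page per small update regardless of how much
    data the tree holds, so its write amp is essentially constant."""
    return (page_kb * 1024) / value_bytes
```

Under this model a 10 GB store needs ~2 LSM levels and a 1 TB store ~4, so the LSM insert cost grows with capacity while the B+tree cost is flat, which is the shape of the argument (whether the crossover favors one or the other also depends on value sizes and the media, per the HDD caveat above).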
RE: newstore direction
One of the biggest changes that flash is making in the storage world is the way basic trade-offs in storage management software architecture are being affected. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs you rapidly see that the storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP. When Sandisk first started working with Ceph (Dumpling) the design of librados and the OSD led to a situation where the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been re-written (more than once!), and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly. Because we didn't want to completely re-write the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code. I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future. 
Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Martin Millnert [mailto:mar...@millnert.se] Sent: Thursday, October 22, 2015 6:20 AM To: Mark Nelson <mnel...@redhat.com> Cc: Ric Wheeler <rwhee...@redhat.com>; Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction Adding 2c On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote: > My thought is that there is some inflection point where the userland > kvstore/block approach is going to be less work, for everyone I think, > than trying to quickly discover, understand, fix, and push upstream > patches that sometimes only really benefit us. I don't know if we've > truly hit that that point, but it's tough for me to find flaws with > Sage's argument. Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread: In the networking world, there's been development on memory mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic. Granted, storage IO operations may be much heavier in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs. However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency. 
(And really, high-performance random IO characteristics approach networking's per-packet handling characteristics). Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case reduces dependency on code that works as best as possible for the general case, and allows for very purpose-built code to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also removes the dependency on users waiting for the next distro release before being able to take up the benefits of improvements to the storage code. A random google came up with related data on where "doing something way different" /can/ have significant benefits: http://phunq.net/pipermail/tux3/2015-April/002147.ht
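The "thread per CPU core" model Allen describes can be sketched as shard-per-core run-to-completion loops: each core exclusively owns a partition of the state and polls its own queue, so the IO path needs no locks and no per-IOP context switch. A minimal single-threaded Python illustration of the sharding idea (real implementations pin one such loop per core; all names are invented):

```python
from collections import deque

class CoreShard:
    """Run-to-completion loop for one core: the shard owns its slice of
    state outright, so operations on it never need locks."""
    def __init__(self):
        self.queue = deque()
        self.state = {}

    def poll(self):
        # Drain everything currently queued, synchronously -- the
        # polling stand-in for an event loop pinned to this core.
        while self.queue:
            op, key, val = self.queue.popleft()
            if op == "write":
                self.state[key] = val

class ShardedService:
    """Thread-per-core sketch: each IO is routed to the single shard
    that owns its key, instead of waking a thread per IOP."""
    def __init__(self, ncores):
        self.shards = [CoreShard() for _ in range(ncores)]

    def submit(self, key, val):
        self.shards[hash(key) % len(self.shards)].queue.append(("write", key, val))

    def poll_all(self):
        for s in self.shards:
            s.poll()
```

The contrast with thread-per-IOP is that here the only cross-core traffic is the enqueue; all state manipulation happens on the owning core's loop.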
RE: newstore direction
Fixing the bug doesn't take a long time. Getting it deployed is where the delay is. Many companies standardize on a particular release of a particular distro. Getting them to switch to a new release -- even a "bug fix" point release -- is a major undertaking that often is a complete roadblock. Just my experience. YMMV. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Wednesday, October 21, 2015 8:24 PM To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 06:06 AM, Allen Samuels wrote: > I agree that moving newStore to raw block is going to be a significant > development effort. But the current scheme of using a KV store combined with > a normal file system is always going to be problematic (FileStore or > NewStore). This is caused by the transactional requirements of the > ObjectStore interface, essentially you need to make transactionally > consistent updates to two indexes, one of which doesn't understand > transactions (File Systems) and can never be tightly-connected to the other > one. > > You'll always be able to make this "loosely coupled" approach work, but it > will never be optimal. The real question is whether the performance > difference of a suboptimal implementation is something that you can live with > compared to the longer gestation period of the more optimal implementation. > Clearly, Sage believes that the performance difference is significant or he > wouldn't have kicked off this discussion in the first place. I think that we need to work with the existing stack - measure and do some collaborative analysis - before we throw out decades of work. 
Very hard to understand why the local file system is a barrier for performance in this case when it is not an issue in existing enterprise applications. We need some deep analysis with some local file system experts thrown in to validate the concerns. > > While I think we can all agree that writing a full-up KV and raw-block > ObjectStore is a significant amount of work. I will offer the case that the > "loosely couple" scheme may not have as much time-to-market advantage as it > appears to have. One example: NewStore performance is limited due to bugs in > XFS that won't be fixed in the field for quite some time (it'll take at least > a couple of years before a patched version of XFS will be widely deployed at > customer environments). Not clear what bugs you are thinking of or why you think fixing bugs will take a long time to hit the field in XFS. Red Hat has most of the XFS developers on staff and we actively backport fixes and ship them, other distros do as well. Never seen a "bug" take a couple of years to hit users. Regards, Ric > > Another example: Sage has just had to substantially rework the journaling > code of rocksDB. > > In short, as you can tell, I'm full throated in favor of going down the > optimal route. > > Internally at Sandisk, we have a KV store that is optimized for flash (it's > called ZetaScale). We have extended it with a raw block allocator just as > Sage is now proposing to do. Our internal performance measurements show a > significant advantage over the current NewStore. That performance advantage > stems primarily from two things: > > (1) ZetaScale uses a B+-tree internally rather than an LSM tree > (levelDB/RocksDB). LSM trees experience exponential increase in write > amplification (cost of an insert) as the amount of data under management > increases. B+tree write-amplification is nearly constant independent of the > size of data under management. 
As the KV database gets larger (Since newStore > is effectively moving the per-file inode into the kv data base. Don't forget > checksums that Sage wants to add :)) this performance delta swamps all > others. > (2) Having a KV and a file-system causes a double lookup. This costs CPU time > and disk accesses to page in data structure indexes, metadata efficiency > decreases. > > You can't avoid (2) as long as you're using a file system. > > Yes an LSM tree performs better on HDD than does a B-tree, which is a good > argument for keeping the KV module pluggable. > > > Allen Samuels > Software Architect, Fellow, Systems and Software Solutions > > 2880 Junction Avenue, San Jose, CA 95134 > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org]
RE: newstore direction
Actually, range queries are an important part of the performance story, and random read speed doesn't really solve the problem. When you're doing a scrub, you need to enumerate the objects in a specific order on multiple nodes -- so that they can compare the contents of their stores in order to determine if data cleaning needs to take place. If you don't have in-order enumeration in your basic data structure (which NVMKV doesn't have) then you're forced to sort the directory before you can respond to an enumeration. That sort will either consume huge amounts of IOPS OR huge amounts of DRAM. Regardless of the choice, you'll see a significant degradation of performance while the scrub is ongoing -- which is one of the biggest problems with clustered systems (expensive and extensive maintenance operations). Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com] Sent: Thursday, October 22, 2015 1:10 AM To: Mark Nelson <mnel...@redhat.com>; Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com> Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy <somnath@sandisk.com>; ceph-devel@vger.kernel.org Subject: RE: newstore direction We did evaluate whether NVMKV could be implemented on non-FusionIO SSDs, i.e. re-inventing an NVMKV; the final conclusion sounds like it's not hard with persistent memory (which will be available soon). But yeah, NVMKV will not work if no PM is present -- persisting the hashing table to SSD is not practical. Range query seems not a very big issue, as the random read performance of today's SSDs is more than enough; I mean, even if we break all sequential IO into random (typically 70-80K IOPS, which is ~300MB/s), the performance is still good enough. 
Anyway, I think for the high-IOPS case it's hard for the consumer to play well with SSDs from different vendors; it would be better to leave it to the SSD vendor, something like OpenStack Cinder's structure: a vendor has the responsibility to maintain their driver for Ceph and take care of the performance. > -Original Message- > From: Mark Nelson [mailto:mnel...@redhat.com] > Sent: Wednesday, October 21, 2015 9:36 PM > To: Allen Samuels; Sage Weil; Chen, Xiaoxi > Cc: James (Fei) Liu-SSI; Somnath Roy; ceph-devel@vger.kernel.org > Subject: Re: newstore direction > > Thanks Allen! The devil is always in the details. Know of anything > else that looks promising? > > Mark > > On 10/21/2015 05:06 AM, Allen Samuels wrote: > > I doubt that NVMKV will be useful for two reasons: > > > > (1) It relies on the unique sparse-mapping addressing capabilities > > of the FusionIO VSL interface, it won't run on standard SSDs > > (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no > range operations on keys). This is pretty much required for deep scrubbing. > > > > > > Allen Samuels > > Software Architect, Fellow, Systems and Software Solutions > > > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com > > > > -Original Message- > > From: ceph-devel-ow...@vger.kernel.org > > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson > > Sent: Tuesday, October 20, 2015 6:20 AM > > To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi > > <xiaoxi.c...@intel.com> > > Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy > > <somnath@sandisk.com>; ceph-devel@vger.kernel.org > > Subject: Re: newstore direction > > > > On 10/20/2015 07:30 AM, Sage Weil wrote: > >> On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: > >>> +1, nowadays K-V DBs care more about very small key-value pairs, > >>> +say > >>> several bytes to a few KB, but in the SSD case we only care about 4KB > >>> or 8KB. 
In this way, NVMKV is a good design, and it seems some of the > >>> SSD vendors are also trying to build this kind of interface; we had > >>> an NVM-L library but it's still under development. >> > >> Do you have an NVMKV link? I see a paper and a stale github repo.. > >> not sure if I'm looking at the right thing. >> > >> My concern with using a key/value interface for the object data is > >> that you end up with lots of key/value pairs (e.g., $inode_$offset > >> = > >> $4kb_of_data) that are pretty inefficient to store and (depending on > >> the > >> implementation) tend to break alignment. I don't think these > >> interfaces are targeted toward block-
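Allen's scrub argument comes down to this: with an ordered index (B+tree or LSM), enumerating the next chunk of objects is a cheap range scan, while a hash-only store such as NVMKV must sort its whole key set before answering any enumeration. A small Python illustration (function names invented; string keys stand in for object names):

```python
import bisect

def scrub_enumerate_ordered(sorted_keys, start, count):
    """With an ordered index, a scrub chunk is a range scan: seek to
    `start` and read the next `count` keys. Cost ~ O(log n + count)."""
    i = bisect.bisect_left(sorted_keys, start)
    return sorted_keys[i:i + count]

def scrub_enumerate_hashed(hash_table, start, count):
    """With a hash-only store (the NVMKV limitation), every enumeration
    must first sort the full key set -- O(n log n) in IOPS or DRAM on
    each scrub chunk, which is the degradation described above."""
    all_keys = sorted(hash_table.keys())
    i = bisect.bisect_left(all_keys, start)
    return all_keys[i:i + count]
```

Both return the same chunk, which is the point: the hash store can produce correct answers, but only by paying the sort on every call while the scrub walks the PG.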
RE: newstore direction
I agree. My only point was that you still have to factor this time into the argument that by continuing to put NewStore on top of a file system you'll get to a stable system much sooner than the longer development path of doing your own raw storage allocator. IMO, once you factor that into the equation the "on top of an FS" path doesn't look like such a clear winner. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Ric Wheeler [mailto:rwhee...@redhat.com] Sent: Thursday, October 22, 2015 10:17 AM To: Allen Samuels <allen.samu...@sandisk.com>; Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/21/2015 08:53 PM, Allen Samuels wrote: > Fixing the bug doesn't take a long time. Getting it deployed is where the > delay is. Many companies standardize on a particular release of a particular > distro. Getting them to switch to a new release -- even a "bug fix" point > release -- is a major undertaking that often is a complete roadblock. Just my > experience. YMMV. > Customers do control the pace that they upgrade their machines, but we put out fixes on a very regular pace. A lot of customers will get fixes without having to qualify a full new release (i.e., fixes come out between major and minor releases are easy). If someone is deploying a critical server for storage, then it falls back on the storage software team to help guide them and encourage them to update when needed (and no promises of success, but people move if the win is big. If it is not, they can wait). ric PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. 
RE: newstore direction
I agree that moving NewStore to raw block is going to be a significant development effort. But the current scheme of using a KV store combined with a normal file system is always going to be problematic (FileStore or NewStore). This is caused by the transactional requirements of the ObjectStore interface: essentially, you need to make transactionally consistent updates to two indexes, one of which doesn't understand transactions (the file system) and can never be tightly coupled to the other. You'll always be able to make this "loosely coupled" approach work, but it will never be optimal. The real question is whether the performance difference of a suboptimal implementation is something that you can live with compared to the longer gestation period of the more optimal implementation. Clearly, Sage believes that the performance difference is significant or he wouldn't have kicked off this discussion in the first place. While I think we can all agree that writing a full-up KV and raw-block ObjectStore is a significant amount of work, I will offer the case that the "loosely coupled" scheme may not have as much time-to-market advantage as it appears to have. One example: NewStore performance is limited due to bugs in XFS that won't be fixed in the field for quite some time (it'll take at least a couple of years before a patched version of XFS is widely deployed in customer environments). Another example: Sage has just had to substantially rework the journaling code of RocksDB. In short, as you can tell, I'm full-throated in favor of going down the optimal route. Internally at SanDisk, we have a KV store that is optimized for flash (it's called ZetaScale). We have extended it with a raw block allocator just as Sage is now proposing to do. Our internal performance measurements show a significant advantage over the current NewStore. 
That performance advantage stems primarily from two things: (1) ZetaScale uses a B+-tree internally rather than an LSM tree (LevelDB/RocksDB). LSM trees see write amplification (the cost of an insert) climb steeply as the amount of data under management increases, while B+-tree write amplification is nearly constant, independent of the size of the data under management. As the KV database gets larger (since NewStore is effectively moving the per-file inode into the KV database, and don't forget the checksums that Sage wants to add :)), this performance delta swamps all others. (2) Having both a KV store and a file system causes a double lookup. This costs CPU time and disk accesses to page in data-structure indexes, and metadata efficiency decreases. You can't avoid (2) as long as you're using a file system. Yes, an LSM tree performs better on HDD than a B+-tree does, which is a good argument for keeping the KV module pluggable. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ric Wheeler Sent: Tuesday, October 20, 2015 11:32 AM To: Sage Weil <sw...@redhat.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/19/2015 03:49 PM, Sage Weil wrote: > The current design is based on two simple ideas: > > 1) a key/value interface is a better way to manage all of our internal > metadata (object metadata, attrs, layout, collection membership, > write-ahead logging, overlay data, etc.) > > 2) a file system is well suited for storing object data (as files). > > So far 1 is working out well, but I'm questioning the wisdom of #2. A > few > things: > > - We currently write the data to the file, fsync, then commit the kv > transaction. 
That's at least 3 IOs: one for the data, one for the fs > journal, one for the kv txn to commit (at least once my rocksdb > changes land... the kv commit is currently 2-3). So two parties are > managing metadata here: the fs managing the file metadata (with its > own > journal) and the kv backend (with its journal). If all of the fsync()'s fall into the same backing file system, are you sure that each fsync() takes the same time? Depending on the local FS implementation of course, but the order of issuing those fsync()'s can effectively make some of them no-ops. > > - On read we have to open files by name, which means traversing the > fs namespace. Newstore tries to keep it as flat and simple as > possible, but at a minimum it is a couple btree lookups. We'd love to > use open by handle (which would reduce this to 1 btree traversal), but > running the daemon as ceph and not root makes that hard... This seems like a pretty low hurdle to overcome. > > - ...and file systems insist on updating mtime on writes, even when > it is an overwrite with no
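Allen's LSM-vs-B+tree point above can be made concrete with a toy write-amplification model. This is purely illustrative: the memtable size, fanout, and page/value sizes below are my assumptions, not measurements of RocksDB or ZetaScale.

```python
import math

def lsm_write_amp(data_size, memtable=64 << 20, fanout=10):
    """Leveled LSM: a key is rewritten roughly `fanout` times at each
    level it passes through, and the level count grows with data size."""
    levels = max(1, math.ceil(math.log(data_size / memtable, fanout)))
    return 1 + levels * fanout          # 1 for the WAL, fanout per level

def btree_write_amp(page=4096, value=100):
    """B+-tree: an insert dirties one leaf page (plus occasional splits),
    independent of how big the tree is."""
    return page / value                 # rewrite a 4K page for ~100 bytes

for gb in (10, 100, 1000):
    print(gb, "GB  LSM:", lsm_write_amp(gb << 30),
          " B+tree:", round(btree_write_amp()))
```

The point survives the crudeness of the model: the LSM figure keeps growing as the data under management grows, while the B+-tree figure does not depend on data size at all.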
RE: newstore direction
I doubt that NVMKV will be useful, for two reasons: (1) It relies on the unique sparse-mapping addressing capabilities of the FusionIO VSL interface, so it won't run on standard SSDs. (2) NVMKV doesn't provide any form of in-order enumeration (i.e., no range operations on keys). This is pretty much required for deep scrubbing. Allen Samuels Software Architect, Fellow, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Tuesday, October 20, 2015 6:20 AM To: Sage Weil <sw...@redhat.com>; Chen, Xiaoxi <xiaoxi.c...@intel.com> Cc: James (Fei) Liu-SSI <james@ssi.samsung.com>; Somnath Roy <somnath@sandisk.com>; ceph-devel@vger.kernel.org Subject: Re: newstore direction On 10/20/2015 07:30 AM, Sage Weil wrote: > On Tue, 20 Oct 2015, Chen, Xiaoxi wrote: >> +1, nowadays K-V DBs care more about very small key-value pairs, say >> several bytes to a few KB, but in the SSD case we only care about 4KB or >> 8KB. In this way, NVMKV is a good design, and it seems some of the SSD >> vendors are also trying to build this kind of interface; we have an >> NVM-L library, but it is still under development. > > Do you have an NVMKV link? I see a paper and a stale github repo.. > not sure if I'm looking at the right thing. > > My concern with using a key/value interface for the object data is > that you end up with lots of key/value pairs (e.g., $inode_$offset = > $4kb_of_data) that are pretty inefficient to store and (depending on > the > implementation) tend to break alignment. I don't think these > interfaces are targeted toward block-sized/aligned payloads. Storing > just the metadata (block allocation map) w/ the kv api and storing the > data directly on a block/page interface makes more sense to me. 
> > sage I get the feeling that some of the folks that were involved with nvmkv at Fusion IO have left. Nisha Talagala is now out at Parallel Systems, for instance. http://pmem.io might be a better bet, though I haven't looked closely at it. Mark > > >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>> ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI >>> Sent: Tuesday, October 20, 2015 6:21 AM >>> To: Sage Weil; Somnath Roy >>> Cc: ceph-devel@vger.kernel.org >>> Subject: RE: newstore direction >>> >>> Hi Sage and Somnath, >>>In my humble opinion, there is another, more aggressive solution >>> than a raw-block-device-based keyvalue store as a backend for >>> objectstore: a new key-value SSD device with transaction support would >>> be ideal to solve these issues. >>> First of all, it is a raw SSD device. Secondly, it provides a key-value >>> interface directly from the SSD. Thirdly, it can provide transaction >>> support; consistency will be guaranteed by the hardware device. It >>> pretty much satisfies all of objectstore's needs without any extra >>> overhead, since there is no extra layer in between the device and >>> objectstore. >>> Either way, I strongly support having Ceph's own data format >>> instead of relying on a filesystem. >>> >>>Regards, >>>James >>> >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- >>> ow...@vger.kernel.org] On Behalf Of Sage Weil >>> Sent: Monday, October 19, 2015 1:55 PM >>> To: Somnath Roy >>> Cc: ceph-devel@vger.kernel.org >>> Subject: RE: newstore direction >>> >>> On Mon, 19 Oct 2015, Somnath Roy wrote: >>>> Sage, >>>> I fully support that. If we want to saturate SSDs, we need to get >>>> rid of this filesystem overhead (which I am in the process of measuring). >>>> Also, it will be good if we can eliminate the dependency on the k/v >>>> dbs (for storing allocators and all). The reason is the unknown >>>> write amps they cause. 
>>> >>> My hope is to keep this behind the KeyValueDB interface (and/or change >>> it as >>> appropriate) so that other backends can be easily swapped in (e.g. a >>> btree-based one for high-end flash). >>> >>> sage >>> >>> >>>> >>>> Thanks & Regards >>>> Somnath >>>> >>>> >>>> -Original Message- >>>> From: ceph-devel-ow...@vger.kernel.org >>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behal
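Allen's objection that NVMKV lacks in-order enumeration is easy to see in miniature: a hash-addressed store can only do point lookups, while deep scrubbing wants an ordered walk over a key range. A sketch, with illustrative class and key names rather than Ceph code:

```python
import bisect

class HashKV:
    """Point lookups only -- the NVMKV-style model (illustrative)."""
    def __init__(self): self.d = {}
    def put(self, k, v): self.d[k] = v
    def get(self, k): return self.d.get(k)
    # no range_scan: key order is destroyed by hashing

class OrderedKV:
    """Sorted keys -- supports the range scans a scrubber needs."""
    def __init__(self): self.keys, self.d = [], {}
    def put(self, k, v):
        if k not in self.d:
            bisect.insort(self.keys, k)
        self.d[k] = v
    def range_scan(self, lo, hi):
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.d[k]) for k in self.keys[i:j]]

db = OrderedKV()
for name in ("obj.0003", "obj.0001", "obj.0002", "rbd.0001"):
    db.put(name, b"data")
# enumerate all objects under the "obj." prefix in order, as a scrubber would
scanned = [k for k, _ in db.range_scan("obj.", "obj.\xff")]
```

The `HashKV` class simply has no operation a scrubber could use short of knowing every key in advance, which is exactly the missing piece Allen points out.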
RE: loadable objectstore
Yes, I'm referring to the C++ vtable. Allen Samuels Software Architect, Emerging Storage Solutions 2880 Junction Avenue, Milpitas, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com] Sent: Monday, September 14, 2015 9:48 AM To: Allen Samuels <allen.samu...@sandisk.com>; Varada Kari <varada.k...@sandisk.com>; Sage Weil <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic Dachary <l...@dachary.org> Cc: ceph-devel <ceph-devel@vger.kernel.org> Subject: RE: loadable objectstore Hi Allen, I am not exactly sure what the vtable is. Is the vtable the same as the vtable from the C++ object concept? IMHO, the procedure linkage table is used to redirect position-independent function calls to the absolute location of a function, based on the ELF format spec[1]. The performance hit for a shared library might be negligible. There are very old articles discussing performance tests between shared vs static libs[2]. I am not following the latest compiler/linker technologies any more. Would be great to know of any new updates. [1]http://www.skyfree.org/linux/references/ELF_Format.pdf [2]https://gcc.gnu.org/ml/gcc/2004-06/msg01956.html Regards, James -----Original Message- From: Allen Samuels [mailto:allen.samu...@sandisk.com] Sent: Saturday, September 12, 2015 1:35 PM To: Varada Kari; James (Fei) Liu-SSI; Sage Weil; Matt W. Benjamin; Loic Dachary Cc: ceph-devel Subject: RE: loadable objectstore Performance impact after initialization will be zero. All of the call sequences are done as vtable dynamic dispatches on the global ObjectStore instance. For this type of call sequence it doesn't matter whether the library is dynamically or statically linked; the calls are the same (a simple indirection through the vtbl, which is loaded from a known constant offset in the object). 
Allen Samuels Chief Software Architect, Emerging Storage Solutions 951 SanDisk Drive, Milpitas, CA 95035 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari Sent: Friday, September 11, 2015 9:34 PM To: James (Fei) Liu-SSI <james@ssi.samsung.com>; Sage Weil <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic Dachary <l...@dachary.org> Cc: ceph-devel <ceph-devel@vger.kernel.org> Subject: RE: loadable objectstore Hi James, Please find the responses inline. varada > -Original Message- > From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com] > Sent: Saturday, September 12, 2015 12:13 AM > To: Varada Kari <varada.k...@sandisk.com>; Sage Weil > <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic > Dachary <l...@dachary.org> > Cc: ceph-devel <ceph-devel@vger.kernel.org> > Subject: RE: loadable objectstore > > Hi Varada, > Got a chance to go through the code. Great job. It is much cleaner. > Several > questions: > 1. What do you think about the performance impact with the new > implementation? Such as dynamic library vs static link? [Varada Kari] Haven't measured the performance yet; there will be some hit due to static vs dynamic, but that shouldn't be a major degradation. I will hold off till we have some perf runs to figure that out. > 2. Could any vendor just provide a compiled dynamic binary library of > the objectstore interfaces for their own storage engine with the new factory > framework? [Varada Kari] That was one of the design motives for this change. Yes, any backend adhering to the object store interfaces can integrate with the OSD. All they need to do is provide a factory interface and the required version and init functionality, in addition to all the required object store interfaces. 
> > Regards, > James > > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Varada Kari > Sent: Friday, September 11, 2015 3:28 AM > To: Sage Weil; Matt W. Benjamin; Loic Dachary > Cc: ceph-devel > Subject: RE: loadable objectstore > > Hi Sage/ Matt, > > I have submitted the pull request based on wip-plugin branch for the > object store factory implementation at https://github.com/ceph/ceph/pull/5884 > . > Haven't rebased to the master yet. Working on rebase and including new > store in the factory implementation. Please have a look and let me > know your comments. Will submit a rebased PR soon with new store integration. > > Thanks, > Varada > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Varada Kari > Sent: Friday, July 03, 2015 7:31 PM > To: Sage
RE: loadable objectstore
Performance impact after initialization will be zero. All of the call sequences are done as vtable dynamic dispatches on the global ObjectStore instance. For this type of call sequence it doesn't matter whether the library is dynamically or statically linked; the calls are the same (a simple indirection through the vtbl, which is loaded from a known constant offset in the object). Allen Samuels Chief Software Architect, Emerging Storage Solutions 951 SanDisk Drive, Milpitas, CA 95035 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Varada Kari Sent: Friday, September 11, 2015 9:34 PM To: James (Fei) Liu-SSI <james@ssi.samsung.com>; Sage Weil <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic Dachary <l...@dachary.org> Cc: ceph-devel <ceph-devel@vger.kernel.org> Subject: RE: loadable objectstore Hi James, Please find the responses inline. varada > -Original Message- > From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com] > Sent: Saturday, September 12, 2015 12:13 AM > To: Varada Kari <varada.k...@sandisk.com>; Sage Weil > <s...@newdream.net>; Matt W. Benjamin <m...@cohortfs.com>; Loic > Dachary <l...@dachary.org> > Cc: ceph-devel <ceph-devel@vger.kernel.org> > Subject: RE: loadable objectstore > > Hi Varada, > Got a chance to go through the code. Great job. It is much cleaner. > Several > questions: > 1. What do you think about the performance impact with the new > implementation? Such as dynamic library vs static link? [Varada Kari] Haven't measured the performance yet; there will be some hit due to static vs dynamic, but that shouldn't be a major degradation. I will hold off till we have some perf runs to figure that out. > 2. Could any vendor just provide a compiled dynamic binary library of the > objectstore interfaces for their own storage engine with the new factory framework? [Varada Kari] That was one of the design motives for this change. 
Yes, any backend adhering to the object store interfaces can integrate with the OSD. All they need to do is provide a factory interface and the required version and init functionality, in addition to all the required object store interfaces. > > Regards, > James > > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Varada Kari > Sent: Friday, September 11, 2015 3:28 AM > To: Sage Weil; Matt W. Benjamin; Loic Dachary > Cc: ceph-devel > Subject: RE: loadable objectstore > > Hi Sage/ Matt, > > I have submitted the pull request based on the wip-plugin branch for the object > store factory implementation at https://github.com/ceph/ceph/pull/5884 . > Haven't rebased to master yet. Working on the rebase and including new > store in the factory implementation. Please have a look and let me know > your comments. Will submit a rebased PR soon with new store integration. > > Thanks, > Varada > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Varada Kari > Sent: Friday, July 03, 2015 7:31 PM > To: Sage Weil <s...@newdream.net>; Adam Crume > <adamcr...@gmail.com> > Cc: Loic Dachary <l...@dachary.org>; ceph-devel de...@vger.kernel.org>; Matt W. Benjamin <m...@cohortfs.com> > Subject: RE: loadable objectstore > > Hi All, > > Not able to make much progress after making common a shared object > along with object store. > Compilation of the test binaries is failing with > "./.libs/libceph_filestore.so: > undefined reference to `tracepoint_dlopen'". > > CXXLDceph_streamtest > ./.libs/libceph_filestore.so: undefined reference to `tracepoint_dlopen' > collect2: error: ld returned 1 exit status > make[3]: *** [ceph_streamtest] Error 1 > > But libfilestore.so is linked with lttng-ust. 
> > src/.libs$ ldd libceph_filestore.so > libceph_keyvaluestore.so.1 => /home/varada/obs-factory/plugin- > work/src/.libs/libceph_keyvaluestore.so.1 (0x7f5e50f5) > libceph_os.so.1 => /home/varada/obs-factory/plugin- > work/src/.libs/libceph_os.so.1 (0x7f5e4f93a000) > libcommon.so.1 => /home/varada/ obs-factory/plugin- > work/src/.libs/libcommon.so.1 (0x7f5e4b5df000) > liblttng-ust.so.0 => /usr/lib/x86_64-linux-gnu/liblttng-ust.so.0 > (0x7f5e4b179000) > liblttng-ust-tracepoint.so.0 => > /usr/lib/x86_64-linux-gnu/liblttng-ust- > tracepoint.so.0 (0x7f5e4a021000) > liburcu-bp.so.1 => /usr/lib/liburcu-bp.so.1 (0x7f5e49e1a000) > liburcu-cds.so.1 => /usr/lib/lib
RE: Inline dedup/compression
I was referring strictly to compression. Dedupe is a whole 'nother issue. I agree that dedupe on a per-OSD basis isn't interesting. It needs to be done at the pool level (or higher). Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Chaitanya Huilgol Sent: Thursday, August 20, 2015 9:43 PM To: Allen Samuels; Haomai Wang Cc: James (Fei) Liu-SSI; ceph-devel Subject: RE: Inline dedup/compression Hi, The original idea of dedupe was to make it cluster-wide. If we go with a filestore- or keyvalue-store-based dedupe/compression, then isn't it localized to the OSD? W.r.t. the Ceph architecture of object distribution, won't the probability of objects with the same/similar data landing on the same OSD be pretty low? Regards, Chaitanya -Original Message- From: Allen Samuels Sent: Friday, August 21, 2015 9:07 AM To: Haomai Wang Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel Subject: RE: Inline dedup/compression XFS shouldn't have any trouble with the holes scheme. I don't know BTRFS as well, but I doubt it's significantly different. If we assume that the logical address space of a file is broken up into fixed-size chunks on fixed-size boundaries (presumably a power of 2), then the implementation is quite straightforward. Picking the chunk size will be a key issue for performance. Unfortunately, there are competing desires. For best space utilization, you'll want the chunk size to be large, because on average you'll lose 1/2 of a file system sector/block for each chunk of compressed data. For best R/W performance, you'll want the chunk size to be small, because logically the file I/O size is equal to a chunk, i.e., on a write you might have to read the corresponding chunk, decompress it, insert the new data, and recompress it. 
This gets super duper ugly on FileStore because you can't afford to crash during the re-write update and risk a partially updated chunk (this will give you garbage when you decompress it). This means that you'll have to log the entire chunk even if you're only re-writing a small portion of it. Hence the desire to make the chunk size small. I'm not as familiar with NewStore, but I don't think it's fundamentally much better. Basically any form of sub-chunk write operation stinks in performance. Sub-chunk read operations aren't too bad unless the chunk size is ridiculously large. For best compression ratios, you'll want the chunk size to be at least equal to the history size, if not 2 or 3 times larger (the history size is 64K when using zlib; snappy's is 32K, or 64K for the latest version). The partial-block write problem doesn't exist for RGW, and its objects are probably already compressed. This means you'll want to be able to convey the compression parameters to RADOS so that the backend knows what to do. I would add a per-file attribute that encodes the compression parameters: compression algorithm (zlib, snappy, ...) and chunksize. That would also provide backward compatibility and allow per-object compression diversity. Then you'd want to add verbiage to the individual access schemes to allow/disallow compression. For file systems you'd want that on a per-directory basis or, perhaps even better, a set of regular expressions. 
Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 20, 2015 8:01 PM To: Allen Samuels Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel Subject: Re: Inline dedup/compression sorry, should be this blog(http://mysqlserverteam.com/innodb-transparent-page-compression/) On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang haomaiw...@gmail.com wrote: I found a blog(http://mysqlserverteam.com/innodb-transparent-pageio-compression/ ) about mysql innodb transparent compression. It's surprising that InnoDB does it at a low level (just like filestore in ceph) and relies on the filesystem's file-hole feature. I'm very suspicious about the performance after storing lots of *small* hole files on an fs. If it's reliable, it would be easy for filestore/newstore to implement a similar feature. On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels allen.samu...@sandisk.com wrote: For non-overwriting, relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.), and in order for this to provide good deduplication, you'll need a block size that's equal to or smaller than the cluster size of the file system mounted on the block device. This means that your storage is now dominated by small chunks (probably 8K-ish) rather than
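The sub-chunk overwrite cost Allen describes can be sketched directly: a small write inside a compressed chunk forces a read-decompress-modify-recompress cycle on the whole chunk. This is an illustrative model (fixed 64K chunks, zlib), not FileStore/NewStore code:

```python
import zlib

CHUNK = 64 * 1024  # fixed chunk size on a power-of-two boundary (illustrative)

class CompressedFile:
    """Logical file stored as independently compressed fixed-size chunks."""
    def __init__(self, size):
        n = (size + CHUNK - 1) // CHUNK
        self.chunks = [zlib.compress(bytes(CHUNK)) for _ in range(n)]

    def write(self, offset, data):
        """Even a tiny write rewrites every chunk it touches in full."""
        touched = 0
        while data:
            idx, off = divmod(offset, CHUNK)
            raw = bytearray(zlib.decompress(self.chunks[idx]))  # read + decompress
            n = min(len(data), CHUNK - off)
            raw[off:off + n] = data[:n]                         # modify in place
            self.chunks[idx] = zlib.compress(bytes(raw))        # recompress + write back
            data, offset, touched = data[n:], offset + n, touched + 1
        return touched  # chunks rewritten for this I/O

f = CompressedFile(256 * 1024)
# a 512-byte write still costs one full 64K chunk round-trip
rewritten = f.write(1000, b"x" * 512)
```

A crash between the decompress and the write-back is exactly the torn-chunk hazard described above, which is why FileStore would have to log the whole chunk.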
RE: Inline dedup/compression
XFS shouldn't have any trouble with the holes scheme. I don't know BTRFS as well, but I doubt it's significantly different. If we assume that the logical address space of a file is broken up into fixed-size chunks on fixed-size boundaries (presumably a power of 2), then the implementation is quite straightforward. Picking the chunk size will be a key issue for performance. Unfortunately, there are competing desires. For best space utilization, you'll want the chunk size to be large, because on average you'll lose 1/2 of a file system sector/block for each chunk of compressed data. For best R/W performance, you'll want the chunk size to be small, because logically the file I/O size is equal to a chunk, i.e., on a write you might have to read the corresponding chunk, decompress it, insert the new data, and recompress it. This gets super duper ugly on FileStore because you can't afford to crash during the re-write update and risk a partially updated chunk (this will give you garbage when you decompress it). This means that you'll have to log the entire chunk even if you're only re-writing a small portion of it. Hence the desire to make the chunk size small. I'm not as familiar with NewStore, but I don't think it's fundamentally much better. Basically any form of sub-chunk write operation stinks in performance. Sub-chunk read operations aren't too bad unless the chunk size is ridiculously large. For best compression ratios, you'll want the chunk size to be at least equal to the history size, if not 2 or 3 times larger (the history size is 64K when using zlib; snappy's is 32K, or 64K for the latest version). The partial-block write problem doesn't exist for RGW, and its objects are probably already compressed. This means you'll want to be able to convey the compression parameters to RADOS so that the backend knows what to do. I would add a per-file attribute that encodes the compression parameters: compression algorithm (zlib, snappy, ...) and chunksize. 
That would also provide backward compatibility and allow per-object compression diversity. Then you'd want to add verbiage to the individual access schemes to allow/disallow compression. For file systems you'd want that on a per-directory basis or, perhaps even better, a set of regular expressions. Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Thursday, August 20, 2015 8:01 PM To: Allen Samuels Cc: Chaitanya Huilgol; James (Fei) Liu-SSI; ceph-devel Subject: Re: Inline dedup/compression sorry, should be this blog(http://mysqlserverteam.com/innodb-transparent-page-compression/) On Fri, Aug 21, 2015 at 10:51 AM, Haomai Wang haomaiw...@gmail.com wrote: I found a blog(http://mysqlserverteam.com/innodb-transparent-pageio-compression/ ) about mysql innodb transparent compression. It's surprising that InnoDB does it at a low level (just like filestore in ceph) and relies on the filesystem's file-hole feature. I'm very suspicious about the performance after storing lots of *small* hole files on an fs. If it's reliable, it would be easy for filestore/newstore to implement a similar feature. On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels allen.samu...@sandisk.com wrote: For non-overwriting, relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.), and in order for this to provide good deduplication, you'll need a block size that's equal to or smaller than the cluster size of the file system mounted on the block device. This means that your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC, since small objects are replicated rather than ECed). 
This will have a massive impact on backend storage I/O, as the basic data/metadata ratio is completely skewed (both for static storage and dynamic I/O count). Allen Samuels Software Architect, Emerging Storage Solutions 2880 Junction Avenue, Milpitas, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Chaitanya Huilgol Sent: Thursday, July 02, 2015 3:50 AM To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang Cc: ceph-devel Subject: RE: Inline dedup/compression Hi James et al., Here is an example for clarity: 1. Client writes object object.abcd 2. Based on the CRUSH rules, say OSD.a is the primary OSD which receives the write 3. OSD.a performs segmenting/fingerprinting, which can be static or dynamic, and generates a list of segments; object.abcd is now represented by a manifest object with the list of segment hashes and lengths [Header] [Seg1_sha, len
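Chaitanya's locality question can be checked with a quick simulation. The placement function below is a toy stand-in for CRUSH (uniform pseudo-random placement, my assumption, not the real algorithm): two copies of the same chunk written under different object names land on the same primary OSD with probability about 1/N.

```python
import hashlib

def primary_osd(key, num_osds):
    """Toy stand-in for CRUSH: uniform pseudo-random placement (assumption)."""
    h = hashlib.sha256(key).digest()
    return int.from_bytes(h[:8], "big") % num_osds

num_osds = 100
trials = 10000
# The same chunk data written as part of two different objects is placed
# by object name, so per-OSD dedupe only wins when both land on one OSD.
same = 0
for i in range(trials):
    chunk = str(i).encode()
    a = primary_osd(b"objA." + chunk, num_osds)
    b = primary_osd(b"objB." + chunk, num_osds)
    same += (a == b)
rate = same / trials  # expect roughly 1/num_osds
```

With 100 OSDs only about 1% of duplicate pairs are co-located, which is Chaitanya's point: per-OSD dedupe misses almost everything, so it has to happen at the pool level or higher.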
RE: Ceph Hackathon: More Memory Allocator Testing
It was a surprising result that the memory allocator is making such a large difference in performance. All of the recent work in fiddling with TCMalloc's and jemalloc's various knobs and switches has been an excellent example of group collaboration. But I think it's only a partial optimization of the underlying problem. The real take-away from this activity is that the code base is doing a LOT of memory allocation/deallocation, which is consuming substantial CPU time; regardless of how much we optimize the memory allocator, you can't get away from the fact that it macroscopically MATTERS. The better long-term solution is to reduce reliance on the general-purpose memory allocator and to implement strategies that are more specific to our usage model. What really needs to happen initially is to instrument the allocation/deallocation. Most likely we'll find that 80+% of the work is coming from just a few object classes, and it will be easy to create custom allocation strategies for those usages. This will lead to even higher performance that's much less sensitive to easy-to-misconfigure environmental factors, and the entire "tcmalloc vs. jemalloc -- oops, it uses more memory" discussion will go away. Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Wednesday, August 19, 2015 10:30 AM To: Alexandre DERUMIER Cc: Mark Nelson; ceph-devel Subject: RE: Ceph Hackathon: More Memory Allocator Testing Yes, it should be 1 per OSD... There is no doubt that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is relative to the number of threads running.. But, I don't know if the number of threads is a factor for jemalloc.. 
Thanks Regards Somnath -Original Message- From: Alexandre DERUMIER [mailto:aderum...@odiso.com] Sent: Wednesday, August 19, 2015 9:55 AM To: Somnath Roy Cc: Mark Nelson; ceph-devel Subject: Re: Ceph Hackathon: More Memory Allocator Testing I think that tcmalloc has a fixed size (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all processes. I think it is per tcmalloc instance loaded, so at least num_osds * num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a box. What is num_tcmalloc_instance? I think 1 osd process uses a defined TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES size? I'm saying that because I have exactly the same bug, client side, with librbd + tcmalloc + qemu + iothreads. When I define too many iothreads, I hit the bug directly (can reproduce 100%). It's like the thread_cache size is divided by the number of threads? - Original Message - From: Somnath Roy somnath@sandisk.com To: aderumier aderum...@odiso.com, Mark Nelson mnel...@redhat.com Cc: ceph-devel ceph-devel@vger.kernel.org Sent: Wednesday, 19 August 2015 18:27:30 Subject: RE: Ceph Hackathon: More Memory Allocator Testing I think that tcmalloc has a fixed size (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES) and shares it between all processes. I think it is per tcmalloc instance loaded, so at least num_osds * num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a box. Also, I think there is no point in increasing osd_op_threads as it is not in the IO path anymore.. Mark is using the default 5:2 for shard:thread per shard.. But, yes, it could be related to the number of threads the OSDs are using; need to understand how jemalloc works.. Also, there may be some tuning to reduce memory usage (?). 
Thanks Regards Somnath -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Alexandre DERUMIER Sent: Wednesday, August 19, 2015 9:06 AM To: Mark Nelson Cc: ceph-devel Subject: Re: Ceph Hackathon: More Memory Allocator Testing I was listening to today's meeting, and it seems that the blocker to making jemalloc the default is that it uses more memory per osd (around 300MB?), and some people could have boxes with 60 disks. I just wonder if the memory increase is related to the osd_op_num_shards/osd_op_threads value? It seems that at the hackathon, the bench was done on a super big cpu box (36 cores/72 threads), http://ceph.com/hackathon/2015-08-ceph-hammer-full-ssd.pptx with osd_op_threads = 32. I think that tcmalloc has a fixed cache size (TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and shares it between all threads. Maybe jemalloc allocates memory per thread. (I think people with 60-disk boxes don't use ssd, so low iops per osd, and they don't need a lot of threads per osd) - Mail original - De: aderumier aderum...@odiso.com À: Mark Nelson mnel...@redhat.com Cc: ceph-devel ceph-devel@vger.kernel.org Envoyé: Mercredi 19 Août 2015 16:01:28 Objet: Re: Ceph Hackathon: More Memory Allocator Testing Thanks Marc, Results are matching exactly what I
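For concreteness, the worst-case thread-cache footprint being discussed above can be sketched with a little arithmetic. The 60-disk box and the 32 MB per-instance cache size below are illustrative assumptions, not measured values:

```python
def tcmalloc_cache_budget(num_osds, cache_bytes_per_instance):
    """Worst-case thread-cache memory on one box: each OSD process loads
    one tcmalloc instance, and each instance gets its own
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES pool, shared by its threads."""
    num_tcmalloc_instances = 1  # one instance per OSD process
    return num_osds * num_tcmalloc_instances * cache_bytes_per_instance

# Hypothetical 60-disk box with a 32 MB cache per instance.
budget = tcmalloc_cache_budget(60, 32 * 1024 * 1024)
print(budget // (1024 * 1024), "MB")  # 1920 MB of thread cache box-wide
```

This is only the cache ceiling, of course; it says nothing about how the cache is divided among the threads inside one instance, which is the bug Alexandre describes.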
RE: The design of the eviction improvement
I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems. If I understand the proposals that have been discussed, each of them assumes an in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern, which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few more objects get added, then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in a way that doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 5:57 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Wang, Zhiqiang wrote: The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). 
Maybe tens of seconds... I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. Also, the ~100 MB lists are split among different OSD nodes. Perhaps it does not need such a long time to load them? I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab,position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately. Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model like the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes per object. Looks like for an object, the head and the snapshot version have the same hobject hash. Thus we have to use the hash pair instead of just the hobject hash. But I still have two questions if we use the hash pair to represent an object. 1) Does the hash pair uniquely identify an object? That's to say, is it possible for two objects to have the same hash pair? With two hashes, collisions would be rare but could happen 2) We need a way to get the full object name from the hash pair, so that we know what objects to evict. But it seems like we don't have a good way to do this? Ah, yeah -- I'm a little stuck in the current hitset view of things. I think we can either embed the full ghobject_t (which means we lose the fixed-size property, and the per-object overhead goes way up.. probably from ~24 bytes to more like 80 or 100). Or, we can enumerate objects starting at the (hobject_t) hash position to find the object. 
That's somewhat inefficient for FileStore (it'll list a directory of a hundred or so objects, probably, and iterate over them to find the right one), but for NewStore it will be quite fast (NewStore has all objects sorted into keys in rocksdb, so we just start listing at the right offset). Usually we'll get the object right off, unless there are hobject_t hash collisions (already reasonably rare since it's a 2^32 space for the pool). Given that, I would lean toward the 2-hash fixed-sized records (of these 2 options)... sage
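Sage's 6x4=24-byte fixed-size record could be sketched as a packed struct. This is only an illustration of the layout; the field names are made up here, not taken from any Ceph code:

```python
import struct

# Fixed-size LRU record per the sketch above: two u32 hashes identify the
# object (hobject_t hash + a second hash of the rest of the name), and the
# prev/next links are (slab, position) u32 pairs -- 6 x 4 = 24 bytes.
RECORD = struct.Struct('<6I')

def pack_record(h1, h2, prev, nxt):
    """prev and nxt are (slab, position) pairs."""
    return RECORD.pack(h1, h2, prev[0], prev[1], nxt[0], nxt[1])

def unpack_record(buf):
    h1, h2, ps, pp, ns, np_ = RECORD.unpack(buf)
    return h1, h2, (ps, pp), (ns, np_)

rec = pack_record(0xDEADBEEF, 0x12345678, (0, 41), (1, 7))
assert len(rec) == 24  # matches the 24-bytes-per-object estimate
```

Because every record is the same size, a slab is just a byte array and a (slab, position) link resolves with simple arithmetic, which is what makes the "load big chunks and use them immediately" property work.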
RE: The design of the eviction improvement
Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage. Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 11:51 AM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems. If I understand the proposals that have been discussed, each of them assumes in in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in way that's doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. 
How about this: The basic mapping we want is object - atime. We keep a simple LRU of the top N objects in memory with the object-atime values. When an object is accessed, it is moved or added to the top of the list. Periodically, or when the LRU size reaches N * (1.x), we flush: - write the top N items to a compact object that can be quickly loaded - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object - atime fashion When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold. The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check. sage Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 5:57 AM To: Wang, Zhiqiang Cc: sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Wang, Zhiqiang wrote: The part that worries me now is the speed with which we can load and manage such a list. Assuming it is several hundred MB, it'll take a while to load that into memory and set up all the pointers (assuming a conventional linked list structure). Maybe tens of seconds... I'm thinking of maintaining the lists at the PG level. That's to say, we have an active/inactive list for every PG. We can load the lists in parallel during rebooting. 
Also, the ~100 MB lists are split among different OSD nodes. Perhaps it does not need such long time to load them? I wonder if instead we should construct some sort of flat model where we load slabs of contiguous memory, 10's of MB each, and have the next/previous pointers be a (slab,position) pair. That way we can load it into memory in big chunks, quickly, and be able to operate on it (adjust links) immediately. Another thought: currently we use the hobject_t hash only instead of the full object name. We could continue to do the same, or we could do a hash pair (hobject_t hash + a different hash of the rest of the object) to keep the representation compact. With a model lke the above, that could get the object representation down to 2 u32's. A link could be a slab + position (2 more u32's), and if we have prev + next that'd be just 6x4=24 bytes
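The LRU-plus-LSM scheme Sage sketches above (hot top-N object->atime pairs in memory, the cold tail spilled to leveldb/rocksdb in batches) might look roughly like this; a plain dict stands in for the k/v store, and the 1.25 slack factor is an illustrative choice for the "N * (1.x)" threshold:

```python
from collections import OrderedDict

class AtimeLRU:
    """Sketch: keep the hottest N object->atime pairs in memory; once the
    LRU grows past N * 1.x, spill the coldest tail to the k/v store."""

    def __init__(self, n, slack=0.25):
        self.n = n
        self.high = int(n * (1 + slack))
        self.lru = OrderedDict()   # most recently used at the end
        self.kv = {}               # stand-in for leveldb/rocksdb

    def access(self, obj, atime):
        self.lru.pop(obj, None)
        self.lru[obj] = atime      # move/add to the hot end
        if len(self.lru) > self.high:
            self.flush()

    def flush(self):
        # Batch-write the coldest (N .. N*1.x) records to the k/v store.
        while len(self.lru) > self.n:
            obj, atime = self.lru.popitem(last=False)  # coldest first
            self.kv[obj] = atime

    def atime_of(self, obj):
        # The in-memory value wins when present: it is always newer.
        return self.lru.get(obj, self.kv.get(obj))
```

The agent would then iterate the k/v key range exactly as it enumerates objects today, preferring the in-memory atime when one exists, so the hottest objects never touch the LSM at all.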
RE: The design of the eviction improvement
Yes the cost of the insertions with the current scheme is probably prohibitive. Wouldn't it approach the same amount of time as just having atime turned on in the file system? My concern about the memory is mostly that we ensure whatever algorithm is selected degrades gracefully when you get high counts of small objects. I agree that paying $ for RAM that translates into actual performance isn't really a problem. It really boils down to your workload and access pattern. Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Wednesday, July 22, 2015 2:53 PM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: Don't we need to double-index the data structure? We need it indexed by atime for the purposes of eviction, but we need it indexed by object name for the purposes of updating the list upon a usage. If you use the same approach the agent uses now (iterate over items, evict/trim anything in bottom end of observed age distribution) you can get away without the double-index. Iterating over the LSM should be quite cheap. I'd be more worried about the cost of the insertions. I'm also not sure the simplistic approach below can be generalized to something like 2Q (and certainly not something like MQ). Maybe... On the other hand, I'm not sure it is the end of the world if at the end of the day the memory requirements for a cache-tier OSD are higher and inversely proportional to the object size. We can make the OSD flush/evict more aggressively if the memory utilization (due to a high object count) gets out of hand as a safety mechanism. Paying a few extra $$ for RAM isn't the end of the world I'm guessing when the performance payoff is significant... 
sage Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Wednesday, July 22, 2015 11:51 AM To: Allen Samuels Cc: Wang, Zhiqiang; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: RE: The design of the eviction improvement On Wed, 22 Jul 2015, Allen Samuels wrote: I'm very concerned about designing around the assumption that objects are ~1MB in size. That's probably a good assumption for block and HDFS dominated systems, but likely a very poor assumption about many object and file dominated systems. If I understand the proposals that have been discussed, each of them assumes in in-memory data structure with an entry per object (the exact size of the entry varies with the different proposals). Under that assumption, I have another concern which is the lack of graceful degradation as the object counts grow and the in-memory data structures get larger. Everything seems fine until just a few objects get added then the system starts to page and performance drops dramatically (likely) to the point where Linux will start killing OSDs. What's really needed is some kind of way to extend the lists into storage in way that's doesn't cause a zillion I/O operations. I have some vague idea that some data structure like the LSM mechanism ought to be able to accomplish what we want. Some amount of the data structure (the most likely to be used) is held in DRAM [and backed to storage for restart] and the least likely to be used is flushed to storage with some mechanism that allows batched updates. How about this: The basic mapping we want is object - atime. We keep a simple LRU of the top N objects in memory with the object-atime values. When an object is accessed, it is moved or added to the top of the list. 
Periodically, or when the LRU size reaches N * (1.x), we flush: - write the top N items to a compact object that can be quickly loaded - write our records for the oldest items (N .. N*1.x) to leveldb/rocksdb in a simple object - atime fashion When the agent runs, we just walk across that key range of the db the same way we currently enumerate objects. For each record we use either the stored atime or the value in the in-memory LRU (it'll need to be dual-indexed by both a list and a hash map), whichever is newer. We can use the same histogram estimation approach we do now to determine if the object in question is below the flush/evict threshold. The LSM does the work of sorting/compacting the atime info, while we avoid touching it at all for the hottest objects to keep the amount of work it has to do in check. sage
RE: The design of the eviction improvement
This seems much better than the current mechanism. Do you have an estimate of the memory consumption of the two lists? (In terms of bytes/object?) Allen Samuels Software Architect, Systems and Software Solutions 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wang, Zhiqiang Sent: Monday, July 20, 2015 1:47 AM To: Sage Weil; sj...@redhat.com; ceph-devel@vger.kernel.org Subject: The design of the eviction improvement Hi all, This is a follow-up of one of the CDS sessions at http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction. We discussed the drawbacks of the current eviction algorithm and several ways to improve it. Seems like the LRU variant is the right way to go. I came up with some design points after the CDS, and want to discuss them with you. It is an approximate 2Q algorithm, combining some benefits of the clock algorithm, similar to what the linux kernel does for the page cache. # Design points: ## LRU lists - Maintain LRU lists at the PG level. The SharedLRU and SimpleLRU implementations in the current code have a max_size, which limits the max number of elements in the list. This mostly looks like an MRU, though the name implies they are LRUs. Since the object size may vary in a PG, it's not possible to calculate the total number of objects which the cache tier can hold ahead of time. We need a new LRU implementation with no limit on the size. - Two lists for each PG: active and inactive Objects are first put into the inactive list when they are accessed, and moved between these two lists based on some criteria. Object flags: active, referenced, unevictable, dirty. 
- When an object is accessed: 1) If it's in neither of the lists, it's put on the top of the inactive list 2) If it's in the inactive list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the inactive list. 3) If it's in the inactive list, and the referenced flag is set, the referenced flag is cleared, and it's removed from the inactive list, and put on top of the active list. 4) If it's in the active list, and the referenced flag is not set, the referenced flag is set, and it's moved to the top of the active list. 5) If it's in the active list, and the referenced flag is set, it's moved to the top of the active list. - When selecting objects to evict: 1) Objects at the bottom of the inactive list are selected to evict. They are removed from the inactive list. 2) If the number of the objects in the inactive list becomes low, some of the objects at the bottom of the active list are moved to the inactive list. For those objects which have the referenced flag set, they are given one more chance in the active list. They are moved to the top of the active list with the referenced flag cleared. For those objects which don't have the referenced flag set, they are moved to the inactive list, with the referenced flag set, so that they can be quickly promoted to the active list when necessary. ## Combine flush with eviction - When evicting an object, if it's dirty, it's flushed first. After flushing, it's evicted. If not dirty, it's evicted directly. - This means that we won't have separate activities and won't set different ratios for flush and evict. Is there a need to do so? - Number of objects to evict at a time. 'evict_effort' acts as the priority, which is used to calculate the number of objects to evict. ## LRU lists Snapshotting - The two lists are snapshotted and persisted periodically. - Only one copy needs to be saved. The old copy is removed when persisting the lists. 
The saved lists are used to restore the LRU lists when the OSD reboots. Any comments/feedback are welcome.
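Zhiqiang's access and eviction rules above can be sketched directly. This is an illustrative model only (an OrderedDict per list, with the "top" of each list at the end), not the proposed PG-level implementation:

```python
from collections import OrderedDict

class TwoQ:
    """Sketch of the 2Q rules above: each OrderedDict maps
    object -> referenced flag, most recent entry at the end."""

    def __init__(self):
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def access(self, obj):
        if obj in self.active:                 # rules 4 and 5: to the top,
            self.active.pop(obj)               # referenced flag ends up set
            self.active[obj] = True
        elif obj in self.inactive:
            if self.inactive[obj]:             # rule 3: promote, clear flag
                self.inactive.pop(obj)
                self.active[obj] = False
            else:                              # rule 2: set flag, to the top
                self.inactive.pop(obj)
                self.inactive[obj] = True
        else:                                  # rule 1: new, flag unset
            self.inactive[obj] = False

    def evict_one(self):
        """Evict from the bottom of the inactive list, refilling it from
        the bottom of the active list when it runs dry."""
        if not self.inactive:
            self.refill()
        obj, _ref = self.inactive.popitem(last=False)
        return obj

    def refill(self):
        while self.active:
            obj, ref = self.active.popitem(last=False)
            if ref:                            # one more chance: back on top
                self.active[obj] = False       # of active, flag cleared
            else:                              # demote with flag set, so it
                self.inactive[obj] = True      # can be promoted quickly
                return
```

Flushing dirty objects before eviction and the 'evict_effort' batching would sit on top of `evict_one()`; they don't change the list mechanics.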
RE: Inline dedup/compression
For non-overwriting relatively large objects, this scheme works fine. Unfortunately the real use-case for deduplication is block storage with virtualized infrastructure (eliminating duplicate operating system files and applications, etc.) and in order for this to provide good deduplication, you'll need a block size that's equal to or smaller than the cluster size of the file system mounted on the block device. Meaning that your storage is now dominated by small chunks (probably 8K-ish) rather than the relatively large 4M stripes that are used today (this will also kill EC since small objects are replicated rather than ECed). This will have a massive impact on backend storage I/O as the basic data/metadata ratio is completely skewed (both for static storage and dynamic I/O count). Allen Samuels Software Architect, Emerging Storage Solutions 2880 Junction Avenue, Milpitas, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Chaitanya Huilgol Sent: Thursday, July 02, 2015 3:50 AM To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang Cc: ceph-devel Subject: RE: Inline dedup/compression Hi James et al., Here is an example for clarity, 1. Client Writes object object.abcd 2. Based on the crush rules, say OSD.a is the primary OSD which receives the write 3. OSD.a performs segmenting/fingerprinting which can be static or dynamic and generates a list of segments, the object.abcd is now represented by a manifest object with the list of segment hash and len [Header] [Seg1_sha, len] [Seg2_sha, len] ... [Seg3_sha, len] 4. OSD.a writes each segment as a new object in the cluster with object name reserved_dedupe_prefixsha 5. The dedupe object write is treated differently from regular object writes, If the object is present then an object reference count is incremented and the object is not overwritten - this forms the basis of the dedupe logic. Multiple objects with one or more same constituent segments start sharing the segment objects. 6. 
Once all the segments are successfully written, the object 'object.abcd' is now just a stub object with the segment manifest as described above and goes through a regular object write sequence Partial writes on objects will be complicated, - Partially affected segments will have to be read and segmentation logic has to be run from first to last affected segment boundaries - New segments will be written - Old overwritten segments have to be deleted - Write merged manifest of the object All this will need the protection of the PG lock. Also an additional journaling mechanism will be needed to recover from cases where the osd goes down before writing all the segments. Since this is quite a lot of processing, a better use case for this dedupe mechanism would be in the data tiering model with object redirects. The manifest object fits quite well into the object redirects scheme of things; the idea is that, when an object is moved out of the base tier, you have an option to create a dedupe stub object and write individual segments into the cold backend tier with a rados plugin. Remaining responses inline. Regards, Chaitanya -Original Message- From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com] Sent: Wednesday, July 01, 2015 4:00 AM To: Chaitanya Huilgol; Allen Samuels; Haomai Wang Cc: ceph-devel Subject: RE: Inline dedup/compression Hi Chaitanya, Very interesting thoughts. I am not sure whether I get all of them or not. Here are several questions for the solution you provided; might be a little bit detailed. Regards, James - Dedupe is set as a pool property Write: - Write arrives at the primary OSD/pg [James] Does the OSD/PG mean PG Backend over here? [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush - not the specific OSD component - Data is segmented (rabin/static) and secure hash computed [James] Which component in the OSD is going to do the data segmentation and hash computation? 
[Chaitanya] If partial writes are not supported then this could be done before acquiring the PG lock, else we need the protection of the PG lock. Probably in the do_request() path? - A manifest is created with the offset/len/hash for all the segments [James] The manifest is going to be part of the xattrs of the object? Where are you going to save the manifest? [Chaitanya] The manifest is a stub object with the constituent segments list - OSD/pg sends rados write with a special name __known__prefixsecure hash for all segments [James] What do you mean by Rados Write? Where do all the segments with secure hash signatures get written to? [Chaitanya] All segments are unique objects with the above mentioned naming scheme, they get written back into the cluster as a regular client rados object write - PG receiving dedup write will: 1. check for object presence and create object if not present 2. If object is already present, then a reference count
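The write and read paths in the steps above can be modeled in a few lines. In this sketch a plain dict stands in for the cluster, the prefix name is hypothetical, static (fixed-size) segmentation is used, and the atomic check-and-increment on the receiving PG is elided:

```python
import hashlib

DEDUPE_PREFIX = "reserved_dedupe_prefix."   # illustrative naming scheme

def write_dedup(cluster, name, data, seg_size=8192):
    """Segment, fingerprint, write each segment as its own object named
    by its hash (bumping a refcount on a hit), then write the stub."""
    manifest = []
    for off in range(0, len(data), seg_size):
        seg = data[off:off + seg_size]
        sha = hashlib.sha256(seg).hexdigest()
        key = DEDUPE_PREFIX + sha
        if key in cluster:
            cluster[key]["refs"] += 1       # dedupe hit: share the segment
        else:
            cluster[key] = {"refs": 1, "data": seg}
        manifest.append((sha, len(seg)))
    cluster[name] = {"manifest": manifest}  # stub object with segment list

def read_dedup(cluster, name):
    """Read path: fetch the manifest, read each segment, coalesce."""
    manifest = cluster[name]["manifest"]
    return b"".join(cluster[DEDUPE_PREFIX + sha]["data"]
                    for sha, _len in manifest)
```

Two objects that share an 8K block end up with manifests pointing at the same segment object with refs == 2, which is exactly the sharing step 5 describes.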
RE: Inline dedup/compression
This covers the read and write, what about the delete? One of the major issues with Dedupe, whether global or local is to address the inherent ref-counting associated with sharing of pieces of storage. Allen Samuels Software Architect, Emerging Storage Solutions 2880 Junction Avenue, Milpitas, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Chaitanya Huilgol Sent: Monday, June 29, 2015 11:20 PM To: James (Fei) Liu-SSI; Haomai Wang Cc: ceph-devel Subject: RE: Inline dedup/compression Below is an alternative idea at a very high level around dedup with ceph without a need of centralized hash index, - Dedupe is set as a pool property Write: - Write arrives at the primary OSD/pg - Data is segmented (rabin/static) and secure hash computed - A manifest is created with the offset/len/hash for all the segments - OSD/pg sends rados write with a special name __known__prefixsecure hash for all segments - PG receiving dedup write will: 1. check for object presence and create object if not present 2. 
If object is already present, then a reference count is incremented (check and increment needs to be atomic) - Response is received by original primary PG for all segments - Primary PG writes the manifest to local and replicas or EC members - Response sent to client Read: - Read received at primary PG - Reads manifest object - sends reads for each segment object __know_prefixsecure hash - coalesces all the responses to build the required data - Responds to client Pros: No need of a centralized hash index, so in line with ceph's no-bottleneck philosophy Cons: Some PGs may get overloaded due to frequently occurring segment patterns Latency and increased traffic on the network Regards, Chaitanya -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei) Liu-SSI Sent: Tuesday, June 30, 2015 2:25 AM To: Haomai Wang Cc: ceph-devel Subject: RE: Inline dedup/compression Hi Haomai, Thanks for moving the idea forward. Regarding the compression: if we do compression at the client level, it is not global, and the compression is only applied to the local client, am I right? I think there are pros and cons to the two solutions and we can get into more details for each solution. I really like your idea for dedupe on the OSD side, by the way. Let me think more about it. Regards, James -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Friday, June 26, 2015 8:55 PM To: James (Fei) Liu-SSI Cc: ceph-devel Subject: Re: Inline dedup/compression On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI james@ssi.samsung.com wrote: Hi Haomai, Thanks for your response as always. I agree compression is a comparably easier task but still very challenging in terms of implementation, no matter where we implement it. Client side (RBD, RGW, or CephFS) or PG would be a little bit better place for implementation in terms of efficiency and cost reduction, before the data is duplicated to other OSDs. 
It has two reasons: 1. Keeping data consistency among OSDs in one PG 2. Saving computing resources IMHO, the compression should be accomplished before the replication comes into play at the pool level. However, we can also have a second level of compression in the local objectstore. In terms of the unit size of compression, it really depends on the workload and in which layer we implement it. About inline deduplication, it will dramatically increase the complexity if we bring replication and Erasure Coding into consideration. However, before we talk about implementation, it would be great if we can understand the pros and cons of implementing inline dedupe/compression. We all understand the benefits of dedupe/compression. However, the side effects are a performance hit and the need for more computing resources. It would be great if we can understand the problems from 30,000 feet for the whole picture of Ceph. Please correct me if I am wrong. Actually we may have some tricks to reduce the performance hit of things like compression. As Joe mentioned, we can compress slave pg data to avoid the performance hit, but it may increase the complexity of recovery and pg remap things. Another detailed implementation option: if we begin to compress data from the messenger, the osd thread and pg thread won't access data for a normal client op, so maybe we can make it parallel with pg processing. The journal thread will get the compressed data at last. The effectiveness of compression is also a concern; doing compression in rados may not get the best compression result. If we can do compression in libcephfs, librbd and radosgw and keep rados unaware of compression, it may be simpler and we can get file/block/object level compression. It should be better? About
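On Allen's delete question: the usual answer is refcounted segments, roughly as below. This is a self-contained sketch; the prefix and manifest layout are assumptions, and a real check-and-decrement would need to be atomic on the PG that owns the segment object:

```python
DEDUPE_PREFIX = "reserved_dedupe_prefix."   # hypothetical naming scheme

def delete_dedup(cluster, name):
    """Drop an object's stub, then decrement each constituent segment's
    refcount; a shared segment survives until its count reaches zero."""
    stub = cluster.pop(name)
    for sha, _length in stub["manifest"]:
        key = DEDUPE_PREFIX + sha
        cluster[key]["refs"] -= 1
        if cluster[key]["refs"] == 0:
            del cluster[key]                # last reference gone: reclaim

# Two stubs sharing one segment: deleting the first must not reclaim it.
cluster = {
    DEDUPE_PREFIX + "aa": {"refs": 2, "data": b"shared"},
    "obj1": {"manifest": [("aa", 6)]},
    "obj2": {"manifest": [("aa", 6)]},
}
delete_dedup(cluster, "obj1")
assert cluster[DEDUPE_PREFIX + "aa"]["refs"] == 1
delete_dedup(cluster, "obj2")
assert DEDUPE_PREFIX + "aa" not in cluster
```

The hard part in a distributed setting is that the decrement races with concurrent writes that are incrementing the same count, which is why the check-and-modify has to be serialized on the segment's PG.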
RE: Regarding key/value interface
Another thing we're looking into is compression. The intersection of compression and object striping (fracturing) is interesting. Is the striping variable on a per-object basis? Allen Samuels Chief Software Architect, Emerging Storage Solutions 951 SanDisk Drive, Milpitas, CA 95035 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Thursday, September 11, 2014 6:55 PM To: Somnath Roy Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; ceph-devel@vger.kernel.org Subject: RE: Regarding key/value interface On Fri, 12 Sep 2014, Somnath Roy wrote: Makes perfect sense Sage.. Regarding striping of file data, you are saying the KeyValue interface will do the following for me? 1. Say, in case of an rbd image of order 4 MB, for a write request coming to the Key/Value interface, it will chunk the object (say the full 4MB) into smaller sizes (configurable?) and stripe it as multiple key/value pairs? 2. Also, while reading it will take care of accumulating them and sending it back. Precisely. A smarter thing we might want to make it do in the future would be to take a 4 KB write and create a new key that logically overwrites part of the larger, say, 1MB key, and apply it on read. And maybe give up and rewrite the entire 1MB stripe after too many small overwrites have accumulated. Something along those lines to reduce the cost of small IOs to large objects. sage Thanks Regards Somnath -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, September 11, 2014 6:31 PM To: Somnath Roy Cc: Haomai Wang (haomaiw...@gmail.com); ceph-us...@lists.ceph.com; ceph-devel@vger.kernel.org Subject: Re: Regarding key/value interface Hi Somnath, On Fri, 12 Sep 2014, Somnath Roy wrote: Hi Sage/Haomai, If I have a key/value backend that supports transactions, range queries (and I don't need any explicit caching etc.) 
and I want to replace filestore (and leveldb omap) with it, which interface would you recommend deriving from: ObjectStore directly, or KeyValueDB? I have already integrated this backend by deriving from the ObjectStore interfaces earlier (pre-key/value-interface days) but haven't tested thoroughly enough to see what functionality is broken (basic RGW/RBD functionality is working fine). Basically, I want to know the advantages (and disadvantages) of deriving from the new key/value interfaces. Also, what state is it in? Is it feature complete, supporting all the ObjectStore interfaces like clone and so on? Everything is supported, I think, except perhaps for some IO hints that don't make sense in a k/v context. The big things that you get by using KeyValueStore and plugging into the lower-level interface are: - striping of file data across keys - efficient clone - a zillion smaller methods that aren't conceptually difficult to implement but are tedious to do. The other nice thing about reusing this code is that you can use a leveldb or rocksdb backend as a reference for testing or performance or whatever. The main thing that will be a challenge going forward, I predict, is making storage of the object byte payload in key/value pairs efficient. I think KeyValueStore is doing some simple striping, but it will suffer for small overwrites (like 512-byte or 4k writes from an RBD). There are probably some pretty simple heuristics and tricks that can be done to mitigate the most common patterns, but there is no simple solution since the backends generally don't support partial value updates (I assume yours doesn't either?). But any work done here will benefit the other backends too, so that would be a win. sage PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above.
If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
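Sage's overlay-key idea from this thread can be sketched roughly as follows (the sizes, the flattening threshold, and the in-memory dict standing in for the real k/v backend are all illustrative assumptions, not actual KeyValueStore code):

```python
STRIPE = 1 << 20      # 1 MB base stripe keys (example size)
MAX_OVERLAYS = 16     # rewrite the whole stripe once this many overlays pile up

class StripedKV:
    """Small writes become overlay keys that logically overwrite part of a
    large stripe key; reads apply the overlays in order; after too many
    small overwrites accumulate, the stripe is rewritten in full."""

    def __init__(self):
        self.db = {}              # stands in for the backend's big stripe keys
        self.overlays = {}        # stripe index -> list of (offset, data)

    def write(self, off, data):
        key = off // STRIPE
        self.overlays.setdefault(key, []).append((off % STRIPE, data))
        if len(self.overlays[key]) > MAX_OVERLAYS:
            self._flatten(key)

    def read_stripe(self, key):
        # start from the base stripe (or zeros) and apply overlays in order
        buf = bytearray(self.db.get(key, b"\0" * STRIPE))
        for o, d in self.overlays.get(key, []):
            buf[o:o + len(d)] = d
        return bytes(buf)

    def _flatten(self, key):
        # too many small overwrites: give up and rewrite the entire stripe
        self.db[key] = self.read_stripe(key)
        self.overlays[key] = []
```

The read path shows the cost being traded: each read must merge the overlay list, which is why flattening after `MAX_OVERLAYS` bounds the work.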
RE: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose
You talk about resetting the object map on a restart after a crash -- I assume you mean rebuilding; how long will this take? --- The true mystery of the world is the visible, not the invisible. Oscar Wilde (1854 - 1900) Allen Samuels Chief Software Architect, Emerging Storage Solutions 951 SanDisk Drive, Milpitas, CA 95035 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang Sent: Thursday, June 05, 2014 12:43 AM To: Wido den Hollander Cc: Sage Weil; Josh Durgin; ceph-devel@vger.kernel.org Subject: Re: [Feature]Proposal for adding a new flag named shared to support performance and statistic purpose On Thu, Jun 5, 2014 at 3:25 PM, Wido den Hollander w...@42on.com wrote: On 06/05/2014 09:01 AM, Haomai Wang wrote: Hi, Previously I sent a mail about the difficulty of rbd snapshot size statistics. The main solution is using an object map to store the changes. The problem is that we can't handle concurrent modification by multiple clients. The lack of an object map (like the pointer map in qcow2) causes many problems in librbd, such as clone depth: a deep clone chain causes remarkable latency, and each level of clone wrapping roughly doubles it. I am considering a tradeoff between multi-client and single-client support in librbd. In practice, most volumes/images are used by VMs, where only one client will access/modify the image. We shouldn't make shared images possible at the cost of making most use cases bad. So we could add a new flag called shared when creating an image. If shared is false, librbd will maintain an object map for each image. The object map is considered durable: each image_close call will store the map into rados. If the client crashes and fails to dump the object map, the next client to open the image will consider the object map out of date and reset it. Why not flush out the object map every X period?
Assume a client runs for weeks or months and you would keep that map in memory all the time since the image is never closed. Yes, a periodic job is also a good alternative. We can easily see the advantages of this feature: 1. Avoid the clone performance problem 2. Make snapshot statistics possible 3. Improve librbd operation performance, including read and copy-on-write operations. What do you think of the above? More feedback is appreciated! -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on -- Best Regards, Wheat
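A rough sketch of the object-map lifecycle being discussed, combining the dirty-flag-on-open scheme with Wido's periodic flush (the storage dict, flush interval, and method names are hypothetical, not librbd's API):

```python
import time

class ObjectMap:
    """Per-image object map for non-shared images.

    The map is marked dirty while a client holds the image open, flushed
    periodically and on clean close; an open that finds the dirty flag set
    must rebuild the map by listing the image's objects."""

    FLUSH_INTERVAL = 30.0  # seconds; illustrative value

    def __init__(self, store):
        self.store = store                 # stands in for rados-backed storage
        self.exists = set()                # object indexes known to exist
        self.last_flush = time.monotonic()

    def open(self, list_objects):
        if self.store.get("dirty"):
            # previous client crashed: the persisted map is untrustworthy,
            # so rebuild it from an (expensive) object listing
            self.exists = set(list_objects())
        else:
            self.exists = set(self.store.get("map", []))
        self.store["dirty"] = True

    def mark_written(self, obj_index):
        self.exists.add(obj_index)
        if time.monotonic() - self.last_flush > self.FLUSH_INTERVAL:
            self.flush()               # Wido's periodic persistence

    def flush(self):
        self.store["map"] = sorted(self.exists)
        self.last_flush = time.monotonic()

    def close(self):
        self.flush()
        self.store["dirty"] = False    # clean close: map is trustworthy
```

Allen's question maps to the `list_objects` branch: the rebuild cost on an unclean open is a full object listing, which is what needs to be bounded.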
RE: RBD thoughts
Ok, now I think I understand. Essentially, you have a write-ahead log + lazy application of the log to the backend + code that correctly deals with the RAW hazard (same as Cassandra, FileStore, LevelDB, etc.). Correct? So every block write is done three times: once for the replication journal, once in the FileStore journal, and once in the target file system. Correct? Also, if I understand the architecture, you'll be moving the data over the network at least one more time (* # of replicas). Correct? This seems VERY expensive in system resources, though I agree it's a simpler implementation task. --- Never put off until tomorrow what you can do the day after tomorrow. Mark Twain Allen Samuels Chief Software Architect, Emerging Storage Solutions 951 SanDisk Drive, Milpitas, CA 95035 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Sage Weil [mailto:s...@inktank.com] Sent: Wednesday, May 07, 2014 9:24 AM To: Allen Samuels Cc: ceph-devel@vger.kernel.org Subject: RE: RBD thoughts On Wed, 7 May 2014, Allen Samuels wrote: Sage wrote: Allen wrote: I was looking over the CDS for Giant and was paying particular attention to the rbd journaling stuff. Asynchronous geo-replication for block devices is really a key for enterprise deployment and this is the foundational element of that. It's an area that we are keenly interested in and would be willing to devote development resources toward. It wasn't clear from the recording whether this was just musings or would actually be development for Giant, but when you get your head above water w.r.t. the acquisition I'd like to investigate how we (SanDisk) could help turn this into a real project. IMO, this is MUCH more important than CephFS stuff for penetrating enterprises.
The blueprint suggests the creation of an additional journal for the block device; this journal would track metadata changes and potentially record overwritten data (without the overwritten data you can only sync to snapshots, which will be reasonable functionality for some use-cases). It seems to me that this probably doesn't work too well. Wouldn't it be the case that you really want to commit to the journal AND to the block device atomically? That's really problematic with the current RADOS design, as the separate journal would be in a separate PG from the target block and likely on a separate OSD. Now you have all sorts of cases of crashes/updates where the journal and the target block are out of sync. The idea is to make it a write-ahead journal, which avoids any need for atomicity. The writes are streamed to the journal, and applied to the rbd image proper only after they commit there. Since block operations are effectively idempotent (you can replay the journal from any point and the end result is always the same), the recovery case is pretty simple. Who is responsible for the block device part of the commit? If it's the RBD code rather than the OSD, then I think there's a dangerous failure case where the journal commits, then the client crashes, and the journal-based replication system ends up replicating the last (un-performed) write operation. If it's the OSDs that are responsible, then this is not an issue. The idea is to use the usual set of write-ahead journaling tricks: we write first to the journal, then to the device, and lazily update a pointer indicating which journal events have been applied. After a crash, the new client will reapply anything in the journal after that point to ensure the device is in sync. While the device is in active use, we'd need to track which writes have not yet been applied to the device so we can delay a read following a recent write until it is applied.
(This should be very rare, given that the file system sitting on top of the device is generally doing all sorts of caching.) This only works, of course, for use-cases where there is a single active writer for the device. That means it's usable for local file systems like ext3/4 and xfs, but not for something like ocfs2. Similarly, I don't think the snapshot limitation is there; you can simply note the journal offset, then copy the image (in a racy way), and then replay the journal from that position to capture the recent updates. w.r.t. snapshots and the non-old-data-preserving journaling mode, how will you deal with the race between reading the head of the journal and reading the data referenced by that head of the journal, which could be overwritten by a write operation before you can actually read it? Oh, I think I'm using different terminology. I'm assuming that the journal includes the *new* data (a la data=journal mode for ext*). We talked a bit at CDS about an optional separate journal with overwritten data so that you could 'rewind' activity
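The write-ahead journaling tricks Sage describes -- journal first, lazy application, an applied pointer, replay after crash, and forcing application before a read of recently written data -- can be sketched as a toy in-memory model (not the librbd design; all names are illustrative):

```python
class RBDJournal:
    """Writes commit to the journal first and are applied to the image
    lazily; 'applied' records how far application has progressed. After a
    crash, everything past the pointer is replayed; since block writes are
    idempotent, replaying from any point yields the same final image."""

    def __init__(self, image_size):
        self.journal = []          # list of (offset, data) entries
        self.applied = 0           # index of first unapplied journal entry
        self.image = bytearray(image_size)

    def write(self, off, data):
        # commit to the journal first; only then is the write acked
        self.journal.append((off, data))

    def apply_some(self, n=1):
        # lazy application of committed entries to the image proper, in order
        for off, data in self.journal[self.applied:self.applied + n]:
            self.image[off:off + len(data)] = data
        self.applied = min(self.applied + n, len(self.journal))

    def read(self, off, length):
        # a read following a recent write must wait for (here: force)
        # application of all unapplied entries covering the image
        self.apply_some(len(self.journal) - self.applied)
        return bytes(self.image[off:off + length])

    def recover(self):
        # after a crash: reapply everything past the applied pointer
        self.apply_some(len(self.journal) - self.applied)
```

The racy-snapshot trick corresponds to noting `applied`, copying `image`, and replaying `journal` from that index against the copy.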
RE: RBD thoughts
The extra network move that I was referring to would be local, i.e., from the node containing the write-ahead journal to the nodes containing the destination objects. I wasn't counting any geo-replication; that would be yet another network move. --- Now I know what a statesman is; he's a dead politician. We need more statesmen. Bob Edwards Allen Samuels Chief Software Architect, Emerging Storage Solutions 951 SanDisk Drive, Milpitas, CA 95035 T: +1 408 801 7030| M: +1 408 780 6416 allen.samu...@sandisk.com -Original Message- From: Sage Weil [mailto:s...@inktank.com] Sent: Wednesday, May 07, 2014 12:33 PM To: Allen Samuels Cc: ceph-devel@vger.kernel.org Subject: RE: RBD thoughts On Wed, 7 May 2014, Allen Samuels wrote: Ok, now I think I understand. Essentially, you have a write-ahead log + lazy application of the log to the backend + code that correctly deals with the RAW hazard (same as Cassandra, FileStore, LevelDB, etc.). Correct? Right. So every block write is done three times, once for the replication journal, once in the FileStore journal and once in the target file system. Correct? More than that, actually. With the FileStore backend, every write is done 2x. The rbd journal would be on top of rados objects, so that's 2*2. But that cost goes away with an improved backend that doesn't need a journal (like the kv backend or f2fs). Also, if I understand the architecture, you'll be moving the data over the network at least one more time (* # of replicas). Correct? Right; this would be mirrored in the target cluster, probably in another data center. This seems VERY expensive in system resources, though I agree it's a simpler implementation task. It's certainly not free. :) sage
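Sage's write-amplification arithmetic from this exchange can be made explicit (a toy calculation; the per-backend write counts are the ones quoted in the thread, and the function name is just illustrative):

```python
def local_writes_per_client_write(backend_writes_per_op: int,
                                  rbd_journal: bool,
                                  replicas: int) -> int:
    """Physical writes per client write on the source cluster.

    FileStore does 2 writes per op (FileStore journal + file system); an
    rbd journal layered on rados turns one client write into two rados ops
    (journal entry + image write), so FileStore + rbd journal = 2 * 2 = 4
    per replica. A journal-less backend drops backend_writes_per_op to 1.
    """
    rados_ops = 2 if rbd_journal else 1
    return rados_ops * backend_writes_per_op * replicas

# With FileStore (2 writes/op), an rbd journal, and 3x replication:
# 2 * 2 * 3 = 12 physical writes per client write on the source cluster,
# before counting the mirrored target cluster at all.
```

This is why Sage notes the cost "goes away" with a backend that needs no journal: the `backend_writes_per_op` factor drops from 2 to 1.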