You could establish a lower timestamp bound and buffer transaction state on the coordinator, then make the commit an operation that only applies if none of the partitions involved has been changed at a more recent timestamp. You could also implement MVCC, either in the storage layer or, for some period of time, by buffering commits on each replica before applying them.
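
To make the first option concrete, here is a rough sketch under stated assumptions. Every name in it (Partition, InteractiveTxn, lower_bound_ts, the in-memory store) is hypothetical and is not Accord's API; timestamps are plain integers, and the whole commit check runs in one process rather than as a single transaction agreed across replicas.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical partition: a value plus the timestamp of its last write."""
    value: int = 0
    last_write_ts: int = 0


@dataclass
class InteractiveTxn:
    """Coordinator-side state for one interactive transaction (illustrative)."""
    store: dict
    lower_bound_ts: int                       # lower timestamp bound fixed at begin
    reads: set = field(default_factory=set)
    writes: dict = field(default_factory=dict)

    def read(self, key):
        # Serve our own buffered write first, otherwise read the partition.
        self.reads.add(key)
        if key in self.writes:
            return self.writes[key]
        return self.store.get(key, Partition()).value

    def write(self, key, value):
        # Nothing reaches the partitions until commit; the coordinator buffers it.
        self.writes[key] = value

    def commit(self, commit_ts):
        # The commit is one conditional operation: it applies only if no
        # partition we touched was written after our lower bound.
        touched = self.reads | set(self.writes)
        for key in touched:
            if self.store.get(key, Partition()).last_write_ts > self.lower_bound_ts:
                return False  # conflict: caller retries with a fresh lower bound
        for key, value in self.writes.items():
            self.store[key] = Partition(value, commit_ts)
        return True


# Two transactions start from the same lower bound and race on partition "a".
store = {"a": Partition(10, 1), "b": Partition(20, 1)}
t1 = InteractiveTxn(store, lower_bound_ts=5)
t2 = InteractiveTxn(store, lower_bound_ts=5)
t1.write("a", t1.read("a") + 1)
t2.write("a", t2.read("a") + 5)
first_ok = t1.commit(commit_ts=6)    # applies: "a" was last written at ts 1
second_ok = t2.commit(commit_ts=7)   # rejected: "a" was rewritten at ts 6
```

The losing transaction has to be retried by the client, which is the usual cost of optimistic concurrency control. The MVCC variant would instead let reads be served as of the lower bound rather than from the latest value.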
> On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> How are interactive transactions possible with Accord?
>
> On Tue, Sep 21, 2021 at 11:56 PM bened...@apache.org <bened...@apache.org> wrote:
>
>> Could you explain why you believe this trade-off is necessary? We can support full SQL just fine with Accord, and I hope that we eventually do so.
>>
>> This domain is incredibly complex, so it is easy to reach wrong conclusions. I would invite you again to propose a system for discussion that you think offers something Accord is unable to, and that you consider desirable, and we can work from there.
>>
>> To pre-empt some possible discussions, I am not aware of anything we cannot do with Accord that we could do with either Calvin or Spanner. Interactive transactions are possible on top of Accord, as are transactions with an unknown read/write set. In each case the only cost is that they would use optimistic concurrency control, which is no worse than the Spanner derivatives anyway (which I have to assume is your benchmark in this regard). I do not expect to deliver either functionality initially, but Accord takes us most of the way there for both.
>>
>>
>> From: Jonathan Ellis <jbel...@gmail.com>
>> Date: Wednesday, 22 September 2021 at 05:36
>> To: dev <dev@cassandra.apache.org>
>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>> Right, I'm looking for exactly a discussion on the high level goals. Instead of saying "here's the goals and we ruled out X because Y" we should start with a discussion around, "Approach A allows X and W, approach B allows Y and Z" and decide together what the goals should be and what we are willing to trade to get those goals, e.g., are we willing to give up global strict serializability to get the ability to support full SQL. Both of these are nice to have!
>>
>> On Tue, Sep 21, 2021 at 9:52 PM bened...@apache.org <bened...@apache.org> wrote:
>>
>>> Hi Jonathan,
>>>
>>> These other systems are incompatible with the goals of the CEP. I do discuss them (besides 2PC) in both the whitepaper and the CEP, and will summarise that discussion below. A true and accurate comparison of these other systems is essentially intractable, as there are complex subtleties to each flavour, and those who are interested would be better served by performing their own research.
>>>
>>> I think it is more productive to focus on what we want to achieve as a community. If you believe the goals of this CEP are wrong for the project, let’s focus on that. If you want to compare and contrast specific facets of alternative systems that you consider to be preferable in some dimension, let’s do that here or in a Q&A as proposed by Joey.
>>>
>>> The relevant goals are that we:
>>>
>>>
>>> 1. Guarantee strict serializable isolation on commodity hardware
>>> 2. Scale to any cluster size
>>> 3. Achieve optimal latency
>>>
>>> The approach taken by Spanner derivatives is rejected by (1) because they guarantee only Serializable isolation (they additionally fail (3)). From watching talks by YugaByte, and inferring from Cockroach’s panic-cluster-death under clock skew, this is clearly considered by everyone to be undesirable but necessary to achieve scalability.
>>>
>>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its sequencing layer requires a global leader process for the cluster, which is incompatible with Cassandra’s scalability requirements. It additionally fails (3) for global clients.
>>>
>>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a Spanner clone for its multi-key transaction functionality, not 2PC.
>>>
>>> Systems such as RAMP with even weaker isolation are not considered for the simple reason that they do not even claim to meet (1).
>>>
>>> If we want to additionally offer weaker isolation levels than Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra is likely able to support multiple distinct transaction layers that operate independently. I would encourage you to file a CEP to explore how we can meet these distinct use cases, but I consider them to be niche. I expect that a majority of our user base desire strict serializable isolation, and certainly no less than serializable isolation, to augment the existing weaker isolation offered by quorum reads and writes.
>>>
>>> I would tangentially note that we are not an AP database under normal recommended operation. A minority in any network partition cannot reach QUORUM, so under recommended usage we are a high-availability leaderless CP database.
>>>
>>>
>>> From: Jonathan Ellis <jbel...@gmail.com>
>>> Date: Tuesday, 21 September 2021 at 23:45
>>> To: dev <dev@cassandra.apache.org>
>>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>>> Benedict, thanks for taking the lead in putting this together. Since Cassandra is the only relevant database today designed around a leaderless architecture, it's quite likely that we'll be better served with a custom transaction design instead of trying to retrofit one from CP systems.
>>>
>>> The whitepaper here is a good description of the consensus algorithm itself as well as its robustness and stability characteristics, and its comparison with other state-of-the-art consensus algorithms is very useful.
>>> In the context of Cassandra, where a consensus algorithm is only part of what will be implemented, I'd like to see a more complete evaluation of the transactional side of things as well, including performance characteristics as well as the types of transactions that can be supported and at least a general idea of what it would look like applied to Cassandra. This will allow the PMC to make a more informed decision about what tradeoffs are best for the entire long-term project of first supplementing and ultimately replacing LWT.
>>>
>>> (Allowing users to mix LWT and AP Cassandra operations against the same rows was probably a mistake, so in contrast with LWT we’re not looking for something fast enough for occasional use but rather something within a reasonable factor of AP operations, appropriate to being the only way to interact with tables declared as such.)
>>>
>>> Besides Accord, this should cover
>>>
>>> - Calvin and FaunaDB
>>> - A Spanner derivative (no opinion on whether that should be Cockroach or Yugabyte, I don’t think it’s necessary to cover both)
>>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect there is more public information about MongoDB)
>>> - RAMP
>>>
>>> Here’s an example of what I mean:
>>>
>>> =Calvin=
>>>
>>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order transactions, then replicas execute the transactions independently with no further coordination. No SPOF. Transactions are batched by each sequencer to keep this from becoming a bottleneck.
>>>
>>> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed of four reads and four writes, so this is effectively 2M reads and 2M writes as we normally measure them in C*.
>>>
>>> Calvin supports mixed read/write transactions, but because the transaction execution logic requires knowing all partition keys in advance to ensure that all replicas can reproduce the same results with no coordination, reads against non-PK predicates must be done ahead of time (transparently, by the server) to determine the set of keys, and this must be retried if the set of rows affected is updated before the actual transaction executes.
>>>
>>> Batching and global consensus add latency -- 100ms in the Calvin paper and apparently about 50ms in FaunaDB. Glass half full: all transactions (including multi-partition updates) are equally performant in Calvin since the coordination is handled up front in the sequencing step. Glass half empty: even single-row reads and writes have to pay the full coordination cost. Fauna has optimized this away for reads but I am not aware of a description of how they changed the design to allow this.
>>>
>>> Functionality and limitations: since the entire transaction must be known in advance to allow coordination-less execution at the replicas, Calvin cannot support interactive transactions at all. FaunaDB mitigates this by allowing server-side logic to be included, but a Calvin approach will never be able to offer SQL compatibility.
>>>
>>> Guarantees: Calvin transactions are strictly serializable. There is no additional complexity or performance hit to generalizing to multiple regions, apart from the speed of light. And since Calvin is already paying a batching latency penalty, this is less painful than for other systems.
>>>
>>> Application to Cassandra: B-. Distributed transactions are handled by the sequencing and scheduling layers, which are leaderless, and Calvin’s requirements for the storage layer are easily met by C*.
>>> But Calvin also requires a global consensus protocol and LWT is almost certainly not sufficiently performant, so this would require ZK or etcd (reasonable for a library approach but not for replacing LWT in C* itself), or an implementation of Accord. I don’t believe Calvin would require additional table-level metadata in Cassandra.
>>>
>>> On Sun, Sep 5, 2021 at 9:33 AM bened...@apache.org <bened...@apache.org> wrote:
>>>
>>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
>>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
>>>> Prototype: https://github.com/belliottsmith/accord
>>>>
>>>> Hi everyone, I’d like to propose this CEP for adoption by the community.
>>>>
>>>> Cassandra has benefitted from LWTs for many years, but application developers that want to ensure consistency for complex operations must either accept the scalability bottleneck of serializing all related state through a single partition, or layer a complex state machine on top of the database. These are sophisticated and costly activities that our users should not be expected to undertake. Since distributed databases are beginning to offer distributed transactions with fewer caveats, it is past time for Cassandra to do so as well.
>>>>
>>>> This CEP proposes the use of several novel techniques that build upon research (that followed EPaxos) to deliver (non-interactive) general purpose distributed transactions. The approach is outlined in the wikipage and in more detail in the linked whitepaper.
>>>> Importantly, by adopting this approach we will be the _only_ distributed database to offer global, scalable, strict serializable transactions in one wide area round-trip. This would represent a significant improvement in the state of the art, both in the academic literature and in commercial or open source offerings.
>>>>
>>>> This work has been partially realised in a prototype. This partial prototype has been verified against Jepsen.io’s Maelstrom library and dedicated in-tree strict serializability verification tools, but much work remains for the work to be production capable and integrated into Cassandra.
>>>>
>>>> I propose including the prototype in the project as a new source repository, to be developed as a standalone library for integration into Cassandra. I hope the community sees the important value proposition of this proposal, and will adopt the CEP after this discussion, so that the library and its integration into Cassandra can be developed in parallel and with the involvement of the wider community.
>>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced