Re: [DISCUSS] CEP-15: General Purpose Transactions

Blake Eggleston Wed, 29 Sep 2021 21:47:49 -0700

You could establish a lower timestamp bound and buffer transaction state on the 
coordinator, then make the commit an operation that only applies if none of the 
partitions involved has been modified at a more recent timestamp. You could 
also implement MVCC, either in the storage layer or, for some period of time, by 
buffering commits on each replica before applying them.
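
A minimal sketch of the first option, to make the shape concrete. It assumes a
single logical clock and an atomic check-and-apply step (presumably what the
underlying transaction would provide). All names here (Store, InteractiveTxn,
commit_if_unchanged) are hypothetical and are not part of Accord or Cassandra:

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical logical clock; a real system would use its own timestamps.
_clock = count(1)


@dataclass
class Partition:
    data: dict = field(default_factory=dict)
    last_modified: int = 0  # timestamp of the most recent applied write


class Store:
    def __init__(self, names):
        self.partitions = {name: Partition() for name in names}

    def commit_if_unchanged(self, lower_bound, touched, writes):
        """Apply writes only if no touched partition changed after lower_bound.

        Assumed to run atomically; here that is trivially true because the
        sketch is single-threaded.
        """
        if any(self.partitions[p].last_modified > lower_bound for p in touched):
            return False
        ts = next(_clock)
        for (p, key), value in writes.items():
            part = self.partitions[p]
            part.data[key] = value
            part.last_modified = ts
        return True


class InteractiveTxn:
    """Coordinator-side buffer for one interactive transaction."""

    def __init__(self, store):
        self.store = store
        self.lower_bound = next(_clock)  # lower timestamp bound
        self.touched = set()             # partitions read or written
        self.writes = {}                 # buffered (partition, key) -> value

    def read(self, partition, key):
        self.touched.add(partition)
        if (partition, key) in self.writes:  # read your own buffered write
            return self.writes[(partition, key)]
        return self.store.partitions[partition].data.get(key)

    def write(self, partition, key, value):
        self.touched.add(partition)
        self.writes[(partition, key)] = value

    def commit(self):
        return self.store.commit_if_unchanged(
            self.lower_bound, self.touched, self.writes)
```

A client would retry from the start when commit() returns False. The check is
deliberately conservative: any later change to a touched partition aborts the
transaction, even if it did not affect the keys actually read.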


> On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <[email protected]> wrote:
> 
> How are interactive transactions possible with Accord?
> 
> 
> 
> On Tue, Sep 21, 2021 at 11:56 PM [email protected] <[email protected]>
> wrote:
> 
>> Could you explain why you believe this trade-off is necessary? We can
>> support full SQL just fine with Accord, and I hope that we eventually do so.
>> 
>> This domain is incredibly complex, so it is easy to reach wrong
>> conclusions. I would invite you again to propose a system for discussion
>> that you think offers something Accord is unable to, and that you consider
>> desirable, and we can work from there.
>> 
>> To pre-empt some possible discussions, I am not aware of anything we
>> cannot do with Accord that we could do with either Calvin or Spanner.
>> Interactive transactions are possible on top of Accord, as are transactions
>> with an unknown read/write set. In each case the only cost is that they
>> would use optimistic concurrency control, which is no worse than the Spanner
>> derivatives anyway (which I have to assume is your benchmark in this
>> regard). I do not expect to deliver either functionality initially, but
>> Accord takes us most of the way there for both.
>> 
>> 
>> From: Jonathan Ellis <[email protected]>
>> Date: Wednesday, 22 September 2021 at 05:36
>> To: dev <[email protected]>
>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>> Right, I'm looking for exactly a discussion on the high level goals.
>> Instead of saying "here's the goals and we ruled out X because Y" we should
>> start with a discussion around, "Approach A allows X and W, approach B
>> allows Y and Z" and decide together what the goals should be and what
>> we are willing to trade to get those goals, e.g., are we willing to give up
>> global strict serializability to get the ability to support full SQL.  Both
>> of these are nice to have!
>> 
>> On Tue, Sep 21, 2021 at 9:52 PM [email protected] <[email protected]>
>> wrote:
>> 
>>> Hi Jonathan,
>>> 
>>> These other systems are incompatible with the goals of the CEP. I do
>>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
>>> summarise that discussion below. A true and accurate comparison of these
>>> other systems is essentially intractable, as there are complex subtleties
>>> to each flavour, and those who are interested would be better served by
>>> performing their own research.
>>> 
>>> I think it is more productive to focus on what we want to achieve as a
>>> community. If you believe the goals of this CEP are wrong for the project,
>>> let’s focus on that. If you want to compare and contrast specific facets of
>>> alternative systems that you consider to be preferable in some dimension,
>>> let’s do that here or in a Q&A as proposed by Joey.
>>> 
>>> The relevant goals are that we:
>>> 
>>> 
>>>  1.  Guarantee strict serializable isolation on commodity hardware
>>>  2.  Scale to any cluster size
>>>  3.  Achieve optimal latency
>>> 
>>> The approach taken by Spanner derivatives is rejected by (1) because they
>>> guarantee only Serializable isolation (they additionally fail (3)). From
>>> watching talks by YugaByte, and inferring from Cockroach’s
>>> panic-cluster-death under clock skew, this is clearly considered by
>>> everyone to be undesirable but necessary to achieve scalability.
>>> 
>>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
>>> sequencing layer requires a global leader process for the cluster, which is
>>> incompatible with Cassandra’s scalability requirements. It additionally
>>> fails (3) for global clients.
>>> 
>>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
>>> Spanner clone for its multi-key transaction functionality, not 2PC.
>>> 
>>> Systems such as RAMP with even weaker isolation are not considered for the
>>> simple reason that they do not even claim to meet (1).
>>> 
>>> If we want to additionally offer weaker isolation levels than
>>> Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra
>>> is likely able to support multiple distinct transaction layers that operate
>>> independently. I would encourage you to file a CEP to explore how we can
>>> meet these distinct use cases, but I consider them to be niche. I expect
>>> that a majority of our user base desire strict serializable isolation, and
>>> certainly no less than serializable isolation, to augment the existing
>>> weaker isolation offered by quorum reads and writes.
>>> 
>>> I would tangentially note that we are not an AP database under normal
>>> recommended operation. A minority in any network partition cannot reach
>>> QUORUM, so under recommended usage we are a high-availability leaderless CP
>>> database.
>>> 
>>> 
>>> From: Jonathan Ellis <[email protected]>
>>> Date: Tuesday, 21 September 2021 at 23:45
>>> To: dev <[email protected]>
>>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>>> Benedict, thanks for taking the lead in putting this together. Since
>>> Cassandra is the only relevant database today designed around a leaderless
>>> architecture, it's quite likely that we'll be better served with a custom
>>> transaction design instead of trying to retrofit one from CP systems.
>>> 
>>> The whitepaper here is a good description of the consensus algorithm itself
>>> as well as its robustness and stability characteristics, and its comparison
>>> with other state-of-the-art consensus algorithms is very useful.  In the
>>> context of Cassandra, where a consensus algorithm is only part of what will
>>> be implemented, I'd like to see a more complete evaluation of the
>>> transactional side of things as well, including performance characteristics
>>> as well as the types of transactions that can be supported and at least a
>>> general idea of what it would look like applied to Cassandra. This will
>>> allow the PMC to make a more informed decision about what tradeoffs are
>>> best for the entire long-term project of first supplementing and ultimately
>>> replacing LWT.
>>> 
>>> (Allowing users to mix LWT and AP Cassandra operations against the same
>>> rows was probably a mistake, so in contrast with LWT we’re not looking for
>>> something fast enough for occasional use but rather something within a
>>> reasonable factor of AP operations, appropriate to being the only way to
>>> interact with tables declared as such.)
>>> 
>>> Besides Accord, this should cover
>>> 
>>> - Calvin and FaunaDB
>>> - A Spanner derivative (no opinion on whether that should be Cockroach or
>>> Yugabyte, I don’t think it’s necessary to cover both)
>>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
>>> there is more public information about MongoDB)
>>> - RAMP
>>> 
>>> Here’s an example of what I mean:
>>> 
>>> =Calvin=
>>> 
>>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
>>> transactions, then replicas execute the transactions independently with no
>>> further coordination.  No SPOF.  Transactions are batched by each sequencer
>>> to keep this from becoming a bottleneck.
>>> 
>>> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
>>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
>>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
>>> of four reads and four writes, so this is effectively 2M reads and 2M
>>> writes as we normally measure them in C*.
>>> 
>>> Calvin supports mixed read/write transactions, but because the transaction
>>> execution logic requires knowing all partition keys in advance to ensure
>>> that all replicas can reproduce the same results with no coordination,
>>> reads against non-PK predicates must be done ahead of time (transparently,
>>> by the server) to determine the set of keys, and this must be retried if
>>> the set of rows affected is updated before the actual transaction executes.
>>> 
>>> Batching and global consensus add latency -- 100ms in the Calvin paper and
>>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
>>> (including multi-partition updates) are equally performant in Calvin since
>>> the coordination is handled up front in the sequencing step.  Glass half
>>> empty: even single-row reads and writes have to pay the full coordination
>>> cost.  Fauna has optimized this away for reads but I am not aware of a
>>> description of how they changed the design to allow this.
>>> 
>>> Functionality and limitations: since the entire transaction must be known
>>> in advance to allow coordination-less execution at the replicas, Calvin
>>> cannot support interactive transactions at all.  FaunaDB mitigates this by
>>> allowing server-side logic to be included, but a Calvin approach will never
>>> be able to offer SQL compatibility.
>>> 
>>> Guarantees: Calvin transactions are strictly serializable.  There is no
>>> additional complexity or performance hit to generalizing to multiple
>>> regions, apart from the speed of light.  And since Calvin is already paying
>>> a batching latency penalty, this is less painful than for other systems.
>>> 
>>> Application to Cassandra: B-.  Distributed transactions are handled by the
>>> sequencing and scheduling layers, which are leaderless, and Calvin’s
>>> requirements for the storage layer are easily met by C*.  But Calvin also
>>> requires a global consensus protocol and LWT is almost certainly not
>>> sufficiently performant, so this would require ZK or etcd (reasonable for a
>>> library approach but not for replacing LWT in C* itself), or an
>>> implementation of Accord.  I don’t believe Calvin would require additional
>>> table-level metadata in Cassandra.
>>> 
>>> On Sun, Sep 5, 2021 at 9:33 AM [email protected] <[email protected]>
>>> wrote:
>>> 
>>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
>>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
>>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
>>>> Prototype: https://github.com/belliottsmith/accord
>>>> 
>>>> Hi everyone, I’d like to propose this CEP for adoption by the community.
>>>> 
>>>> Cassandra has benefitted from LWTs for many years, but application
>>>> developers that want to ensure consistency for complex operations must
>>>> either accept the scalability bottleneck of serializing all related state
>>>> through a single partition, or layer a complex state machine on top of the
>>>> database. These are sophisticated and costly activities that our users
>>>> should not be expected to undertake. Since distributed databases are
>>>> beginning to offer distributed transactions with fewer caveats, it is past
>>>> time for Cassandra to do so as well.
>>>> 
>>>> This CEP proposes the use of several novel techniques that build upon
>>>> research (that followed EPaxos) to deliver (non-interactive) general
>>>> purpose distributed transactions. The approach is outlined in the wikipage
>>>> and in more detail in the linked whitepaper. Importantly, by adopting this
>>>> approach we will be the _only_ distributed database to offer global,
>>>> scalable, strict serializable transactions in one wide area round-trip.
>>>> This would represent a significant improvement in the state of the art,
>>>> both in the academic literature and in commercial or open source offerings.
>>>> 
>>>> This work has been partially realised in a prototype. This partial
>>>> prototype has been verified against Jepsen.io’s Maelstrom library and
>>>> dedicated in-tree strict serializability verification tools, but much work
>>>> remains for the work to be production capable and integrated into
>>>> Cassandra.
>>>> 
>>>> I propose including the prototype in the project as a new source
>>>> repository, to be developed as a standalone library for integration into
>>>> Cassandra. I hope the community sees the important value proposition of
>>>> this proposal, and will adopt the CEP after this discussion, so that the
>>>> library and its integration into Cassandra can be developed in parallel and
>>>> with the involvement of the wider community.
>>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>> 
> 
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced


