Benedict, thanks for taking the lead in putting this together. Since
Cassandra is the only relevant database today designed around a leaderless
architecture, it's quite likely that we'll be better served with a custom
transaction design instead of trying to retrofit one from CP systems.

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.

Batching and global consensus adds latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

On Sun, Sep 5, 2021 at 9:33 AM bened...@apache.org <bened...@apache.org>
wrote:

> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> Whitepaper:
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> <
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >
> Prototype: https://github.com/belliottsmith/accord
>
> Hi everyone, I’d like to propose this CEP for adoption by the community.
>
> Cassandra has benefitted from LWTs for many years, but application
> developers that want to ensure consistency for complex operations must
> either accept the scalability bottleneck of serializing all related state
> through a single partition, or layer a complex state machine on top of the
> database. These are sophisticated and costly activities that our users
> should not be expected to undertake. Since distributed databases are
> beginning to offer distributed transactions with fewer caveats, it is past
> time for Cassandra to do so as well.
>
> This CEP proposes the use of several novel techniques that build upon
> research (that followed EPaxos) to deliver (non-interactive) general
> purpose distributed transactions. The approach is outlined in the wikipage
> and in more detail in the linked whitepaper. Importantly, by adopting this
> approach we will be the _only_ distributed database to offer global,
> scalable, strict serializable transactions in one wide area round-trip.
> This would represent a significant improvement in the state of the art,
> both in the academic literature and in commercial or open source offerings.
>
> This work has been partially realised in a prototype. This partial
> prototype has been verified against Jepsen.io’s Maelstrom library and
> dedicated in-tree strict serializability verification tools, but much work
> remains for the work to be production capable and integrated into Cassandra.
>
> I propose including the prototype in the project as a new source
> repository, to be developed as a standalone library for integration into
> Cassandra. I hope the community sees the important value proposition of
> this proposal, and will adopt the CEP after this discussion, so that the
> library and its integration into Cassandra can be developed in parallel and
> with the involvement of the wider community.
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Reply via email to