Re: [DISCUSS] CEP-15: General Purpose Transactions

Joseph Lynch Sat, 09 Oct 2021 07:11:32 -0700

> With the proposal hitting the one-month mark, the contributors are interested 
> in gauging the developer community's response to the proposal.


I support this proposal. From what I can understand, this proposal
moves us towards having the building blocks we need to correctly
deliver some of the most often requested features in Cassandra. For
example it seems to unlock: batches that actually work, registers that
offer fast compare and swap, global secondary indices that can be
correctly maintained, and more. Therefore, given the benefit to the
community, I support working towards a foundation that will allow
us to build solutions in Cassandra that pay the cost of consensus
closer to mutation time instead of lazily at read/repair time.

I think the feedback in this thread around interface (what statements
will this facilitate and how will the library integrate with Cassandra
itself), performance (how fast will these transactions be, will we
offer bounded stale reads, etc ...), and implementation (how does this
compare/contrast with other consensus approaches) has been
informative, but at this point I think it makes sense to start trying
to make incremental progress towards a functional integration to
discover any remaining areas for improvement.

Cheers and thank you!
-Joey



On Thu, Oct 7, 2021 at 10:51 AM C. Scott Andreas <sc...@paradoxica.net> wrote:
>
> Hi Jonathan,
>
> Following up on my message yesterday as it looks like our replies may have 
> crossed en route.
>
> Thanks for bumping your message from earlier in our discussion. I believe we 
> have addressed most of these questions on the thread, in addition to offering 
> a presentation on this and related work at ApacheCon, a discussion hosted 
> following that presentation at ApacheCon, and in ASF Slack. Contributors have 
> further offered an opportunity to discuss specific questions via 
> videoconference if it helps to speak live. I'd be happy to do so as well.
>
> Since your original message, discussion has covered a lot of ground on the 
> related databases you've mentioned:
> – Henrik has shared expertise related to MongoDB and its implementation.
> – You've shared an overview of Calvin.
> – Alex Miller has helped us review the work relative to other Paxos 
> algorithms and identified a few great enhancements to incorporate.
> – The paper discusses related approaches in FoundationDB, CockroachDB, and 
> Yugabyte.
> – Subsequent discussion has contrasted the implementation to DynamoDB, Google 
> Cloud BigTable, and Google Cloud Spanner (noting specifically that the 
> protocol achieves Spanner's 1x round-trip without requiring specialized 
> hardware).
>
> In my reply yesterday, I've attempted to crystallize what becomes possible 
> via CQL: one-shot multi-partition transactions in the first implementation 
> and a 4x latency reduction on writes / 2x latency reduction on reads relative 
> to today; along with the ability to build upon this work to enable 
> interactive transactions in the future.
>
> I believe we've exercised the questions you've raised and am grateful for the 
> ground we've covered. If you have further questions that are difficult to 
> exercise via email, please let me know if you'd like to arrange a call 
> (open-invite); we'd be happy to discuss live as well.
>
> With the proposal hitting the one-month mark, the contributors are interested 
> in gauging the developer community's response to the proposal. We warrant our 
> ability to focus durably on the project; execute this development on ASF JIRA 
> in collaboration with other contributors; engage with members of the 
> developer and user community on feedback, enhancements, and bugs; and intend 
> to deliver it to completion at a standard of readiness suitable for production 
> transactional systems of record.
>
> Thanks,
>
> – Scott
>
> On Oct 6, 2021, at 8:25 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:
>
>
>
> Hi folks,
>
> Thanks for discussion on this proposal, and also to Benedict who’s been 
> fielding questions on the list!
>
> I’d like to restate the goals and problem statement captured by this proposal 
> and frame context.
>
> Today, lightweight transactions limit users to transacting over a single 
> partition. This unit of atomicity has a very low upper limit in terms of the 
> amount of data that can be CAS’d over; and doing so leads many to design 
> contorted data models to cram different types of data into one partition for 
> the purposes of being able to CAS over it. We propose that Cassandra can and 
> should be extended to remove this limit, enabling users to issue one-shot 
> transactions that CAS over multiple keys – including CAS batches, which may 
> modify multiple keys.
>
> To enable this, the CEP authors have designed a novel, leaderless Paxos-based 
> protocol unique to Cassandra, offered a proof of its correctness, a 
> whitepaper outlining it in detail, along with a prototype implementation to 
> incubate development, and integrated it with Maelstrom from jepsen.io to 
> validate linearizability as more specific test infrastructure is developed. 
> This rigor is remarkable, and I’m thrilled to see such a degree of investment 
> in the area.
>
> Even users who do not require the capability to transact across partition 
> boundaries will benefit. The protocol reduces message/WAN round-trips by 4x 
> on writes (4 → 1) and 2x on reads (2 → 1) in the common case against today’s 
> baseline. These latency improvements coupled with the enhanced flexibility of 
> what can be transacted over in Cassandra enable new classes of applications 
> to use the database.
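As a rough illustration of the round-trip arithmetic above, the sketch below converts the quoted round-trip counts (4 → 1 for writes, 2 → 1 for reads) into network-bound latency. The 80 ms WAN round-trip time and the name `network_latency_ms` are assumptions made for this example; they do not come from the proposal.

```python
# Assumed WAN round-trip time in milliseconds (illustrative, not from the CEP).
WAN_RTT_MS = 80

# Round-trip counts quoted in the message: today's baseline vs. the proposed protocol.
ROUND_TRIPS = {
    "write": {"today": 4, "proposed": 1},
    "read": {"today": 2, "proposed": 1},
}


def network_latency_ms(round_trips: int, rtt_ms: float = WAN_RTT_MS) -> float:
    """Network-bound latency only; processing and disk time are ignored."""
    return round_trips * rtt_ms


for op, counts in ROUND_TRIPS.items():
    before = network_latency_ms(counts["today"])
    after = network_latency_ms(counts["proposed"])
    print(f"{op}: {before:.0f} ms -> {after:.0f} ms ({before / after:.0f}x reduction)")
```

The ratio is independent of the assumed round-trip time; only the absolute figures change with it.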
>
> In particular, 1xRTT read/write transactions across partitions enable 
> Cassandra to be thought of not just as a strongly consistent database, but 
> even a transactional database - a mode many may even prefer to use by 
> default. Given this capability, Apache Cassandra has an opportunity to become 
> one of the few databases – or perhaps the only one – in the industry that can store 
> multiple petabytes of data in a single database; replicate it across many 
> regions; and allow users to transact over any subset of it. These are 
> capabilities that can be met by no other system I’m aware of on the market. 
> Dynamo’s transactions are single-DC. Google Cloud BigTable does not support 
> transactions. Spanner, Aurora, CloudSQL, and RDS have far lower scalability 
> limits or require specialized hardware, etc.
>
> This is an incredible opportunity for Apache Cassandra - to surpass the 
> scalability and transactional capability of some of the most advanced systems 
> in our industry - and to do so in open source, where anyone can download and 
> deploy the software to achieve this without cost; and for students and 
> researchers to learn from and build upon as well (a team from UT-Austin has 
> already reached out to this effect).
>
> As Benedict and Blake noted, the scope of what’s captured in this proposal is 
> also not terminal. While the first implementation may extend today’s CAS 
> semantics to multiple partitions with lower latency, the foundation is 
> suitable to build interactive transactions as well — which would be 
> remarkable and is something that I hadn’t considered myself at the onset of 
> this project.
>
> To that end, the CEP proposes the protocol, offers a validated 
> implementation, and the initial capability of extending today’s 
> single-partition transactions to multi-partition; while providing the 
> flexibility to build upon this work further.
>
> A simple example of what becomes possible when this work lands and is 
> integrated might be:
>
> –––
> BEGIN BATCH
> UPDATE tbl1 SET value1 = newValue1 WHERE partitionKey = k1
> UPDATE tbl2 SET value2 = newValue2 WHERE partitionKey = k2 AND conditionValue = someCondition
> APPLY BATCH
> –––
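The intended semantics of the batch above can be sketched in a single process: every condition is checked first, and the updates are applied only if all conditions hold. This is a minimal model of the all-or-nothing behaviour only, not of the distributed protocol; the in-memory `store` layout and the function `apply_conditional_batch` are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]                 # (table, partition key)
Store = Dict[Key, Dict[str, Any]]     # hypothetical in-memory stand-in for tables


def apply_conditional_batch(
    store: Store,
    conditions: List[Tuple[Key, str, Any]],   # (key, column, expected value)
    updates: List[Tuple[Key, str, Any]],      # (key, column, new value)
) -> bool:
    """Apply every update only if every condition holds; otherwise change nothing."""
    for key, column, expected in conditions:
        if store.get(key, {}).get(column) != expected:
            return False
    for key, column, value in updates:
        store.setdefault(key, {})[column] = value
    return True


store: Store = {
    ("tbl1", "k1"): {"value1": "old1"},
    ("tbl2", "k2"): {"value2": "old2", "conditionValue": "someCondition"},
}
applied = apply_conditional_batch(
    store,
    conditions=[(("tbl2", "k2"), "conditionValue", "someCondition")],
    updates=[
        (("tbl1", "k1"), "value1", "newValue1"),
        (("tbl2", "k2"), "value2", "newValue2"),
    ],
)
print(applied, store[("tbl1", "k1")]["value1"], store[("tbl2", "k2")]["value2"])
```

If the condition on `tbl2` fails, neither table is modified, which is the property that today requires both rows to live in one partition.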
>
> I understand that this query is present in the CEP and my intent isn’t to 
> recommend that folks reread it if they’ve given a careful reading already. 
> But I do think it’s important to elaborate upon what becomes possible when 
> this query can be issued.
>
> Users of Cassandra who have designed data models that cram many types of data 
> into a single partition for the purposes of atomicity no longer need to. They 
> can design their applications with appropriate schemas that wouldn’t leave 
> Codd holding his nose. They’re no longer pushed into antipatterns that result 
> in these partitions becoming huge and potentially unreadable. Cassandra 
> doesn’t become fully relational in this CEP - but it becomes possible and 
> even easy to design applications that transact across tables that mimic a 
> large amount of relational functionality. And for users who are content to 
> transact over a single table, they’ll find those transactions become up to 4x 
> faster than today due to the protocol’s reduction in round-trips. The library’s 
> loose coupling to Apache Cassandra and ability to be incubated out-of-tree 
> also enables other applications to take advantage of the protocol and is a 
> nice step toward bringing modularity to the project. There are a lot of good 
> things happening here.
>
> I know I’m listed as an author - but figured I should go on record to say “I 
> support this CEP.” :)
>
> Thanks,
>
> – Scott
>
> On Oct 6, 2021, at 8:05 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>
> The problem that I keep pointing out is that you've created this CEP for
> Accord without first getting consensus that the goals and the tradeoffs it
> makes to achieve those goals (and that it will impose on future work around
> transactions) are the right ones for Cassandra long term.
>
> At this point I'm done repeating myself. For the convenience of anyone
> following this thread intermittently, I'll quote my first reply on this
> thread to illustrate the kind of discussion I'd like to have.
>
> -----
>
> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful. In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
>
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP
>
> Here’s an example of what I mean:
>
> =Calvin=
>
> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> transactions, then replicas execute the transactions independently with no
> further coordination. No SPOF. Transactions are batched by each sequencer
> to keep this from becoming a bottleneck.
>
> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed
> of four reads and four writes, so this is effectively 2M reads and 2M
> writes as we normally measure them in C*.
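The conversion from transactions to reads and writes above is simple arithmetic, spelled out here. The inputs are the figures quoted in the message; the per-machine number is derived in this sketch and is not a figure reported in the thread.

```python
# Figures quoted in the message for the Calvin paper's TPC-C New Order result.
TXNS_PER_SECOND = 500_000
MACHINES = 100
READS_PER_TXN = 4
WRITES_PER_TXN = 4

reads_per_second = TXNS_PER_SECOND * READS_PER_TXN
writes_per_second = TXNS_PER_SECOND * WRITES_PER_TXN

# Derived here for illustration only: combined operations per machine per second.
ops_per_machine = (reads_per_second + writes_per_second) // MACHINES

print(f"reads/s:  {reads_per_second:,}")
print(f"writes/s: {writes_per_second:,}")
print(f"ops/s per machine: {ops_per_machine:,}")
```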
>
> Calvin supports mixed read/write transactions, but because the transaction
> execution logic requires knowing all partition keys in advance to ensure
> that all replicas can reproduce the same results with no coordination,
> reads against non-PK predicates must be done ahead of time (transparently,
> by the server) to determine the set of keys, and this must be retried if
> the set of rows affected is updated before the actual transaction executes.
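The resolve-then-retry behaviour described above can be sketched as a loop: a reconnaissance read turns the non-key predicate into a concrete key set, and the transaction runs only if that set is still current at execution time. This is a single-process illustration under stated assumptions; `matching_keys`, `run_with_reconnaissance`, and the `between` hook (which simulates a concurrent writer) are hypothetical and are not taken from Calvin.

```python
from typing import Callable, Dict, FrozenSet, Optional

Row = Dict[str, int]
Rows = Dict[str, Row]


def matching_keys(rows: Rows, predicate: Callable[[Row], bool]) -> FrozenSet[str]:
    """Reconnaissance read: resolve a non-key predicate to a concrete key set."""
    return frozenset(k for k, row in rows.items() if predicate(row))


def run_with_reconnaissance(
    rows: Rows,
    predicate: Callable[[Row], bool],
    update: Callable[[Row], None],
    between: Optional[Callable[[int], None]] = None,
    max_attempts: int = 5,
) -> int:
    """Resolve keys, then execute only if the key set is unchanged; else retry."""
    for attempt in range(1, max_attempts + 1):
        keys = matching_keys(rows, predicate)
        if between is not None:
            between(attempt)              # hook simulating concurrent writers
        if matching_keys(rows, predicate) != keys:
            continue                      # key set went stale; redo the reconnaissance
        for k in keys:
            update(rows[k])
        return attempt
    raise RuntimeError("key set kept changing")


rows: Rows = {"r1": {"qty": 0}, "r2": {"qty": 5}}


def concurrent_writer(attempt: int) -> None:
    if attempt == 1:
        rows["r2"]["qty"] = 0             # changes which rows match the predicate


attempts = run_with_reconnaissance(
    rows,
    predicate=lambda row: row["qty"] == 0,
    update=lambda row: row.update(qty=10),
    between=concurrent_writer,
)
print(attempts, rows)
```

The first attempt is abandoned because the matching set grew between the reconnaissance read and execution; the second attempt sees a stable set and applies the update to both rows.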
>
> Batching and global consensus add latency -- 100ms in the Calvin paper and
> apparently about 50ms in FaunaDB. Glass half full: all transactions
> (including multi-partition updates) are equally performant in Calvin since
> the coordination is handled up front in the sequencing step. Glass half
> empty: even single-row reads and writes have to pay the full coordination
> cost. Fauna has optimized this away for reads but I am not aware of a
> description of how they changed the design to allow this.
>
> Functionality and limitations: since the entire transaction must be known
> in advance to allow coordination-less execution at the replicas, Calvin
> cannot support interactive transactions at all. FaunaDB mitigates this by
> allowing server-side logic to be included, but a Calvin approach will never
> be able to offer SQL compatibility.
>
> Guarantees: Calvin transactions are strictly serializable. There is no
> additional complexity or performance hit to generalizing to multiple
> regions, apart from the speed of light. And since Calvin is already paying
> a batching latency penalty, this is less painful than for other systems.
>
> Application to Cassandra: B-. Distributed transactions are handled by the
> sequencing and scheduling layers, which are leaderless, and Calvin’s
> requirements for the storage layer are easily met by C*. But Calvin also
> requires a global consensus protocol and LWT is almost certainly not
> sufficiently performant, so this would require ZK or etcd (reasonable for a
> library approach but not for replacing LWT in C* itself), or an
> implementation of Accord. I don’t believe Calvin would require additional
> table-level metadata in Cassandra.
>
> On Wed, Oct 6, 2021 at 9:53 AM bened...@apache.org <bened...@apache.org>
> wrote:
>
> The problem with dropping a patch on Jira is that there is no opportunity
> to point out problems, either with the fundamental approach or with the
> specific implementation. So please point out some problems I can engage
> with!
>
>
> From: Jonathan Ellis <jbel...@gmail.com>
> Date: Wednesday, 6 October 2021 at 15:48
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Oct 6, 2021 at 9:21 AM bened...@apache.org <bened...@apache.org>
> wrote:
>
> > The goals of the CEP are stated clearly, and these were the goals we had
> > going into the (multi-month) research project we undertook before proposing
> > this CEP. These goals are necessarily value judgements, so we cannot expect
> > that everyone will agree that they are optimal.
> >
>
> Right, so I'm saying that this is exactly the most important thing to get
> consensus on, and creating a CEP for a protocol to achieve goals that you
> have not discussed with the community is the CEP equivalent of dropping a
> patch on Jira without discussing its goals either.
>
> That's why our conversations haven't gone anywhere, because I keep saying
> "we need to discuss the goals and tradeoffs", and I'll give an example of what
> I mean, and you keep addressing the examples (sometimes very shallowly, "it
> would be possible to X" or "Y could be done as an optimization") while
> ignoring the request to open a discussion around the big picture.
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>

