[ https://issues.apache.org/jira/browse/CASSANDRA-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476301#comment-17476301 ]
Benedict Elliott Smith commented on CASSANDRA-17164:
----------------------------------------------------

{quote}Is there any location where the algorithm is described in detail?{quote}

The documentation that is present sort of assumes you are familiar with the prior implementation of Paxos, and we pepper in justifications at each place where a novel approach is taken. I will try to put together some overview markdown documentation over the coming week. In the meantime:

{quote}the use of voting quorum that is not selected only among the replicas that accepted a ballot{quote}

I think this is actually the typical way of implementing Classic Paxos, even though Lamport's paper seems to suggest you must only contact the nodes that responded to the prepare (there may be something else specific about his formulation that necessitates this, I forget, as I dislike his writings on the topic). This is corroborated by [Heidi Howard's dissertation|https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf], which was the easiest place I could find a straightforward formulation of Classic Paxos besides that of Lamport. See Algorithm 3 on Page 30.

{quote}the use of "most recent commit" as a voting session identifier{quote}

I don't quite follow what you mean by this, as this is not limited to "most recent commit", but a ballot maps directly to the instance id of Classic Paxos; it just avoids pre-splitting the range of integers.

{quote}the sharing of ballot numbers between sessions and rejection/acceptance based solely on ballot numbers which may belong to a different voting session{quote}

Could you explain what you are referring to here? I think this is all standard stuff for Paxos; we're again just recording the most recently used instance number for each register.

{quote}advancing voting sessions without committing empty proposals{quote}

The final commit phase is only required to ensure any "decree" (decision) is disseminated.
If we have proposed that no decree be made, there is nothing to disseminate, and nothing to complete if another transaction encounters it. This is in some ways an artefact of a feature of Cassandra's implementation: we initiate a Paxos round without knowing whether it will do anything, though I suppose this feature would be present for read-only operations anyway.

{quote}replicas skipping voting sessions because of stale participant refresh{quote}

There are two possible things you could mean by this. I think you are referring to the situation where we send a commit and then continue with the ballot we have already prepared? In which case I'm not sure this really conflicts with any formulation I've seen; these tend to gloss over the handling of {_}commit{_}. I think it may arise solely from the particulars of Cassandra: we are not updating a register but agreeing a delta, and we only disseminate this to any majority (which may differ from the one that received any prior delta), so we must ensure that each _Commit_ is witnessed by a majority so that the complete register state may be constructed from any majority. In normal formulations the register is overwritten, so I don't think the _Commit_ even needs to be received if it is superseded by another {_}Commit{_}, and I think many formulations ignore it entirely as a result.

Anyway, the justification seems pretty straightforward: if any other command were to supersede us we would fail the _Accept_ phase, and if not then by updating the _MostRecentCommit_ register we know precisely what the register state is on the node, and it is equivalent to having received this response in the first place, so we may proceed safely.

{quote}read vs write promises{quote}

This is just a very simple formulation of operation commutativity. We linearise writes with writes and reads, but we do not linearise reads with each other since they are commutative.
So a read operation consults only the write registers but updates the read registers, whereas a write updates the write registers and consults both.

{quote}the handling of range movements{quote}

Fair, this is quite complex, and we should have already put in an overview here. In simple terms, each node tracks those operations that have been witnessed but are not known to have committed. Each node is able to coordinate the completion of these operations, either by invalidating them, committing them, or witnessing something newer. By performing this on a majority of nodes we are able to ensure that all operations that may have reached a decision prior to this mechanism being invoked are now committed to a majority of nodes in their base table. By performing this after a node becomes pending but before streaming begins we ensure that a new node was either already participating in any operation and will be informed of it, or that it will receive its data via bootstrap.

{quote}state expiration{quote}

Using the same mechanism as described above, each range has a global lower bound on ballots that are not known to have committed on a majority of nodes, and will discount any incomplete operations with a lower ballot. Therefore the data associated with these ballots can all be expunged. This requires regular paxos repairs to be run, which can either occur as part of incremental / regular repair, or be scheduled separately. In practice this means much faster expiration, and that users who enable this can use ANY commit consistency level.

We also need to provide some NEWS information explaining all of this.

Does that at least get you moving forward, while I work on a more comprehensive overview?
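
To make the read vs write promise rule above concrete, here is a minimal Python sketch. It is an illustration only, not Cassandra's implementation: the `PromiseRegister` class, its field names, the method names, and the use of plain integers as ballots are all assumptions made for the example, and it covers only the promise check, not accept or commit.

```python
from dataclasses import dataclass


@dataclass
class PromiseRegister:
    """Hypothetical per-register promise state held by one replica."""

    promised_read: int = 0   # highest ballot promised to any read
    promised_write: int = 0  # highest ballot promised to any write

    def prepare_read(self, ballot: int) -> bool:
        # Reads commute with each other, so a read consults only the
        # write register when deciding whether it may proceed.
        if ballot <= self.promised_write:
            return False
        # The read still records its ballot so that older writes are
        # rejected afterwards.
        self.promised_read = max(self.promised_read, ballot)
        return True

    def prepare_write(self, ballot: int) -> bool:
        # Writes are linearised with both reads and writes, so a write
        # consults both registers.
        if ballot <= max(self.promised_read, self.promised_write):
            return False
        # A write updates only the write register.
        self.promised_write = ballot
        return True
```

With this sketch, a read at ballot 5 followed by a read at ballot 3 both succeed, since reads never reject one another. A write at ballot 4 is then rejected because a newer read has been promised, while a write at ballot 6 succeeds and causes any later read at ballot 6 or lower to be rejected.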
> CEP-14: Paxos Improvements
> --------------------------
>
>                 Key: CASSANDRA-17164
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17164
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Consistency/Repair
>            Reporter: Benedict Elliott Smith
>            Assignee: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.1
>
> This ticket encompasses work for [CEP-14|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements].