[ 
https://issues.apache.org/jira/browse/CASSANDRA-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476301#comment-17476301
 ] 

Benedict Elliott Smith commented on CASSANDRA-17164:
----------------------------------------------------

{quote}Is there any location where the algorithm is described in detail?
{quote}
The documentation that is present sort of assumes you are familiar with the 
prior implementation of Paxos, and we pepper in justifications at each place 
where a novel approach is taken. I will try to put together some overview 
markdown documentation over the coming week. In the meantime:
{quote}the use of voting quorum that is not selected only among the replicas 
that accepted a ballot
{quote}
I think this is actually the typical way of implementing Classic Paxos, even 
though Lamport's paper seems to suggest you must only contact the nodes that 
responded to the prepare (there may be something else specific about his 
formulation that necessitates this, I forget, as I dislike his writings on the 
topic). This is corroborated by [Heidi Howard's 
dissertation|https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf], which 
was the easiest place I could find a straightforward formulation of Classic 
Paxos besides that of Lamport. See Algorithm 3 on Page 30.
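
As a rough illustration of why the accept phase may use any majority rather than only the replicas that answered the prepare, the sketch below enumerates every majority of a small cluster and checks that each pair shares at least one node. The node names and helper functions are hypothetical and are not taken from the Cassandra code base.

```python
from itertools import combinations


def majorities(nodes):
    """Return every subset of `nodes` that forms a strict majority."""
    quorum = len(nodes) // 2 + 1
    return [
        set(subset)
        for size in range(quorum, len(nodes) + 1)
        for subset in combinations(nodes, size)
    ]


def all_majorities_intersect(nodes):
    """True if every pair of majorities shares at least one node."""
    quorums = majorities(nodes)
    return all(a & b for a in quorums for b in quorums)


# Hypothetical three-node cluster: the prepare was answered by one
# majority and the accept by a different one.
nodes = ["A", "B", "C"]
prepare_quorum = {"A", "B"}
accept_quorum = {"B", "C"}

# The two quorums differ, yet they share at least one replica.
overlap = prepare_quorum & accept_quorum
```

Because any two majorities overlap, at least one replica in the accept quorum has seen the promise, whichever majority happens to respond.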
{quote}the use of "most recent commit" as a voting session identifier
{quote}
I don't quite follow what you mean by this, as this is not limited to "most 
recent commit", but a ballot directly maps to the instance id of classic paxos, 
it just avoids pre-splitting the range of integers.
{quote}the sharing of ballot numbers between sessions and rejection/acceptance 
based solely on ballot numbers which may belong to a different voting session
{quote}
Could you explain what you are referring to here? I think this is all standard 
stuff for Paxos, we're again just recording the most recently used instance 
number for each register.
{quote}advancing voting sessions without committing empty proposals
{quote}
The final commit phase is only required to ensure any "decree" (decision) is 
disseminated. If we have proposed that no decree be made, there is nothing to 
disseminate, and nothing to complete if another transaction encounters it. This 
is in some ways an artefact of a feature of Cassandra's implementation, namely that 
we initiate a paxos round without knowing if it will do anything, though this 
feature would I suppose be present for read-only operations anyway.
{quote}replicas skipping voting sessions because of stale participant refresh
{quote}
There are two possible things you might mean by this. I think you are referring to the 
situation where we send a commit and then continue with the ballot we have 
already prepared? In which case I'm not sure this is really in conflict with 
any formulation I've seen, which tends to gloss over handling of _Commit_, 
and I think may arise solely from the particulars of Cassandra - we are not 
updating a register, but are agreeing a delta, and only disseminate this to any 
majority (that may be different from the one that received any prior delta), 
and so we must ensure that each _Commit_ is witnessed by a majority so that the 
complete register state may be constructed from any majority. In normal 
formulations the register is overwritten, so I don't think the _Commit_ even 
needs to be received if it is superseded by another _Commit_, and I think 
many formulations ignore it entirely as a result.

Anyway, the justification seems pretty straightforward: if any other command were to 
supersede us we would fail the _Accept_ phase, and if not then by updating the 
_MostRecentCommit_ register we know precisely what the register state is on the 
node, and it is equivalent to having received this response in the first place, 
so we may proceed safely.
{quote}read vs write promises
{quote}
This is just a very simple formulation of operation commutativity. We linearise 
writes with writes and reads, but we do not linearise reads with each other 
since they are commutative. So any read operation only consults the write 
registers, but updates the read registers, whereas writes update the write 
registers and consult both.
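
A minimal sketch of that rule, assuming integer ballots and a single pair of registers per key on each replica. The class and method names are hypothetical, and the real implementation tracks considerably more state than this.

```python
from dataclasses import dataclass


@dataclass
class PromiseRegisters:
    """Hypothetical per-key promise state held by one replica."""

    write_promise: int = 0  # highest ballot promised to a write
    read_promise: int = 0   # highest ballot promised to a read

    def promise_read(self, ballot: int) -> bool:
        # Reads commute with each other, so a read consults only the
        # write register, then updates the read register.
        if ballot <= self.write_promise:
            return False
        self.read_promise = max(self.read_promise, ballot)
        return True

    def promise_write(self, ballot: int) -> bool:
        # Writes must be ordered against both reads and writes, so a
        # write consults both registers, then updates the write register.
        if ballot <= max(self.write_promise, self.read_promise):
            return False
        self.write_promise = ballot
        return True
```

With this shape, a read at ballot 3 is still promised after a read at ballot 5, since neither touched the write register, whereas a write at ballot 4 is rejected because it must be ordered after the read at ballot 5.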
{quote}the handling of range movements
{quote}
Fair, this is quite complex, and we should have already put in an overview 
here. In simple terms, each node tracks those operations that have been 
witnessed but are not known to have committed. Each node is able to coordinate 
the completion of these operations, either by invalidating them, committing 
them, or witnessing something newer. By performing this on a majority of nodes 
we are able to ensure that all operations that may have reached a decision 
prior to this mechanism being invoked are now committed to a majority of nodes 
in their base table. By performing this after a node becomes pending but before 
streaming begins we ensure that a new node was either already participating in 
any operation and will be informed of it, or that it will receive its data via 
bootstrap.
{quote}state expiration
{quote}
Using the same mechanism as described above, each range has a global lower 
bound on ballots that are not known to have committed on a majority of nodes, 
and will discount any incomplete operations with a lower ballot. Therefore the 
data associated with these ballots can all be expunged. This requires regular 
paxos repairs to be run, which can either occur as part of incremental / 
regular repair, or be scheduled separately. In practice this means much faster 
expiration, and that users who enable this can use ANY commit consistency 
level. We also need to provide some NEWS information explaining all of this.
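
To make the expiration rule concrete, here is a small sketch that assumes integer ballots and takes the lower bound as an input (how paxos repair establishes that bound is not modelled here). The function name and the shape of the state map are hypothetical.

```python
def expunge_below(paxos_state, lower_bound):
    """Split per-ballot state into (kept, expunged) by a lower bound.

    `paxos_state` maps ballot -> stored data for one range on one node.
    Ballots below `lower_bound` are discounted, so their data can go.
    """
    kept = {b: v for b, v in paxos_state.items() if b >= lower_bound}
    expunged = {b: v for b, v in paxos_state.items() if b < lower_bound}
    return kept, expunged


# Hypothetical state for one range, with a bound of 20 assumed to have
# been established by a completed paxos repair.
state = {10: "delta-a", 15: "delta-b", 20: "delta-c", 25: "delta-d"}
kept, expunged = expunge_below(state, lower_bound=20)
```

Everything below the bound is dropped in one pass, which is why regular paxos repairs translate into much faster expiration.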

Does that at least get you moving forward, while I work on a more comprehensive 
overview?

> CEP-14: Paxos Improvements
> --------------------------
>
>                 Key: CASSANDRA-17164
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17164
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Consistency/Repair
>            Reporter: Benedict Elliott Smith
>            Assignee: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.1
>
>
> This ticket encompasses work for [CEP-14|
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
