https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230331#comment-14230331

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

Since it looks like the performance improvements from epaxos could be worth the 
(substantial) added complexity, I’ve been thinking through the problems caused 
by the need to garbage collect instances, and by repair causing inconsistencies 
when it sends data from ‘the future’.

For repair, the only thing I’ve thought of that would work 100% of the time 
would be to count executed instances for a partition, and to send that count 
along with the repair request. If the remote count is higher than the local 
count, we know for sure that the remote node has data from the future, and 
repair for that partition should be deferred.
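
As a rough sketch of that check (the class and names below are made up for 
illustration, not anything in the codebase), the local node just compares the 
count shipped with the repair request against its own:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-partition execution counting; none of
// these names come from the actual codebase.
public class RepairGuard
{
    // executed-instance count per partition, maintained locally
    private final Map<String, Long> executionCounts = new ConcurrentHashMap<>();

    public void recordExecution(String partitionKey)
    {
        executionCounts.merge(partitionKey, 1L, Long::sum);
    }

    // True if repair for this partition should be deferred: the remote
    // replica has executed more instances than we have, so its data is
    // from 'the future' relative to us.
    public boolean shouldDefer(String partitionKey, long remoteCount)
    {
        long localCount = executionCounts.getOrDefault(partitionKey, 0L);
        return remoteCount > localCount;
    }
}
{code}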

For garbage collection, we’ll need to support a failure recovery mode that 
works without all historical instances. We also need a way to quickly determine 
whether a prepare phase will suffice, or whether an epaxos repair type 
operation is needed to bring a node up to speed.

Breaking the continuous execution space of partition ranges into discrete 
epochs would give us a relatively straightforward way of solving all of these 
problems. Each partition range will have its own epoch number. At a given 
instance count threshold, time threshold, or event, epaxos will run an epoch 
increment instance, which will take every active instance in its partition 
range as a dependency. Any instance executed before the epoch instance belongs 
to the old epoch; any instance executed after it belongs to the new one.
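
A minimal sketch of what that could look like, assuming per-partition-range 
state and an instance-count trigger (EpochInstance, the threshold, and the 
rest are invented names, not a concrete proposal for the implementation):

{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Illustrative sketch only; these types and the trigger are assumptions.
class EpochInstance
{
    final long newEpoch;
    final Set<UUID> dependencies; // every active instance in the partition range

    EpochInstance(long newEpoch, Set<UUID> dependencies)
    {
        this.newEpoch = newEpoch;
        this.dependencies = dependencies;
    }
}

class PartitionRangeState
{
    private long epoch = 0;
    private long executionsThisEpoch = 0;
    private final Set<UUID> activeInstances = new HashSet<>();
    private static final long INSTANCE_THRESHOLD = 10_000; // example trigger

    void onInstanceExecuted(UUID id)
    {
        activeInstances.remove(id);
        executionsThisEpoch++;
    }

    boolean shouldIncrementEpoch()
    {
        return executionsThisEpoch >= INSTANCE_THRESHOLD;
    }

    // The epoch instance depends on every active instance in the range,
    // so it can't execute before they do: anything executed before it
    // belongs to the old epoch, anything after it to the new one.
    EpochInstance createIncrementInstance()
    {
        epoch++;
        executionsThisEpoch = 0;
        return new EpochInstance(epoch, new HashSet<>(activeInstances));
    }
}
{code}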

How this would solve the outstanding problems:

Garbage Collection: Any instance from 2 or more epochs ago can be deleted. 
Although an epoch increment instance doesn’t prevent dependencies on the 
previous epoch, it does prevent dependencies on anything older than that.
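
The deletion rule itself reduces to an epoch-distance check, something like 
this sketch:

{code:java}
// Sketch: the safe-to-delete rule for an executed instance. Instances
// from the previous epoch can still be dependencies of in-flight
// instances; anything 2 or more epochs old cannot be.
final class EpochGc
{
    static boolean canGarbageCollect(long instanceEpoch, long currentEpoch)
    {
        return currentEpoch - instanceEpoch >= 2;
    }
}
{code}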

Repair: Counting executions allows us to determine if repair data is from the 
future, and epochs let us scope execution counts to an epoch. If the epoch has 
incremented twice without new executions for a partition, the bookkeeping data 
for that partition can be deleted. This gives us a race free way to delete old 
execution counts, so we don’t keep bookkeeping data around forever.
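
Something along these lines, where each count is kept alongside the epoch it 
was last updated in (hypothetical names again):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch of epoch-scoped execution counts with the cleanup rule above;
// all names are made up for illustration.
class ExecutionBookkeeping
{
    // partition -> {epoch of last execution, count within that epoch}
    private final Map<String, long[]> counts = new HashMap<>();

    void recordExecution(String partition, long currentEpoch)
    {
        long[] entry = counts.computeIfAbsent(partition, k -> new long[]{currentEpoch, 0});
        if (entry[0] != currentEpoch)
        {
            entry[0] = currentEpoch; // new epoch: restart the scoped count
            entry[1] = 0;
        }
        entry[1]++;
    }

    // Two epoch increments with no new executions for a partition means
    // its counts can't race with any in-flight repair, so drop them.
    void pruneIdlePartitions(long currentEpoch)
    {
        counts.values().removeIf(entry -> currentEpoch - entry[0] >= 2);
    }
}
{code}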

Failure recovery: Using epochs makes the choice between prepare and failure 
recovery unambiguous. If a node is missing instances from 2 or more epochs 
ago, it will need to run failure recovery; otherwise, prepare phases will 
work. Additionally, using an epaxos instance as the mechanism for incrementing 
epochs guarantees that a given instance has been executed once the epoch has 
been incremented twice.
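
The decision comes down to the same epoch-distance comparison (sketch, with 
invented names):

{code:java}
// Sketch of the prepare-vs-recovery decision. If a node is missing
// instances from 2 or more epochs ago, its peers may have already
// garbage collected them, so a prepare phase can't fill the gap.
final class RecoveryDecision
{
    enum Action { PREPARE, FAILURE_RECOVERY }

    static Action choose(long oldestMissingEpoch, long currentEpoch)
    {
        return currentEpoch - oldestMissingEpoch >= 2
             ? Action.FAILURE_RECOVERY
             : Action.PREPARE;
    }
}
{code}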

> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is 
> that Multi-paxos requires leader election and hence a period of unavailability 
> when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than multi-paxos, 
> (2) is particularly useful across multiple datacenters, and (3) allows any 
> node to act as coordinator: 
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to 
> implement it.


