[jira] [Commented] (CASSANDRA-6246) EPaxos

Blake Eggleston (JIRA) Tue, 11 Nov 2014 18:20:06 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207557#comment-14207557
 ]


Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

Thanks Sankalp. Since my last post, I've been cleaning things up and improving 
the tests. Sorry for the delay pushing it up.

I also found a problem in the execution phase that was slowing things down. 
Epaxos is now 40% faster than the existing implementation in uncontended 
workloads, and 20x faster in contended workloads.

Here are the performance numbers: 
https://docs.google.com/spreadsheets/d/1inBuO5bxo_b36jnTn5Ff9UCOhnMGLcx6EyNxp2nFM_Q/edit?usp=sharing

bq. 1) In the DependencyManager, we might want to keep the last executed 
instance otherwise we won't know if the next one depends on the previous one or 
we have missed any in between. 

 Instances only become eligible for eviction when they’ve been both executed 
and acknowledged. An executed instance will be a dependency of at least one 
additional instance before being evicted from the manager.
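
As a rough illustration of that eviction rule, here is a minimal sketch. It assumes "acknowledged" means that a later instance has recorded this one as a dependency. The class EvictionSketch and its methods are hypothetical names for illustration, not the patch's actual DependencyManager API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of the eviction rule; not the patch's DependencyManager.
class EvictionSketch {
    private static final class Entry {
        boolean executed;      // the instance has been applied locally
        boolean acknowledged;  // assumption: a later instance recorded it as a dependency
    }

    private final Map<UUID, Entry> tracked = new HashMap<>();

    // Registers a new instance and returns the ids it depends on.
    // Every id handed out as a dependency is marked acknowledged.
    Set<UUID> addInstance(UUID id) {
        Set<UUID> deps = new HashSet<>(tracked.keySet());
        deps.remove(id);
        for (UUID dep : deps) {
            tracked.get(dep).acknowledged = true;
        }
        tracked.putIfAbsent(id, new Entry());
        return deps;
    }

    void markExecuted(UUID id) {
        Entry entry = tracked.get(id);
        if (entry != null) {
            entry.executed = true;
        }
    }

    // Evicts only when both flags are set; returns whether the id was removed.
    boolean evictIfEligible(UUID id) {
        Entry entry = tracked.get(id);
        if (entry != null && entry.executed && entry.acknowledged) {
            tracked.remove(id);
            return true;
        }
        return false;
    }

    boolean isTracked(UUID id) {
        return tracked.containsKey(id);
    }
}
```

Under this assumption, an executed instance stays tracked until the next instance has picked it up as a dependency, so the ordering between the two is never lost.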

{quote}
2) You might want to create java packages and move files there. For example in 
repair code, org.apache.cassandra.repair.messages where we keep all the Request 
Responses. We can do the same for verb handler, etc. 
3) We should add the new verbs to DatabaseDescriptor.getTimeout(). Otherwise 
they will use the default timeout. I fixed this for current paxos 
implementation in CASSANDRA-7752
4) PreacceptResponse.failure can also accept missingInstances in the 
constructor. You can make it final and not volatile. 
{quote}

I'll look into these

bq. 5) ExecutionSorter.getOrder(). Here if condition uncommitted.size() == 0 is 
always true. Also loadedScc is empty as we don't insert into it. 

Ids are put into uncommitted in the addInstance method, so its size won’t 
always equal 0. Good catch on loadedScc though; I’ll get that fixed.

bq. 6) In ExecuteTask.run(), Instance toExecute = 
state.loadInstance(toExecuteId); should be within the try as we are holding a 
lock. 

fixed in the cleaned up code
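
For reference, the pattern the review asks for looks roughly like the sketch below: everything done after acquiring the lock, including the load, sits inside the try so that the finally block always releases it. LockedLoadSketch and loadUnderLock are hypothetical names, and the use of a ReentrantLock is an assumption; the patch's actual locking may differ.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of the lock/try/finally pattern; not the patch's ExecuteTask.
class LockedLoadSketch {
    final ReentrantLock lock = new ReentrantLock();

    // Runs the loader while holding the lock. Because the load happens inside
    // the try, an exception thrown by it still releases the lock.
    <T> T loadUnderLock(Supplier<T> loader) {
        lock.lock();
        try {
            return loader.get();
        } finally {
            lock.unlock();
        }
    }
}
```

If the load ran between lock() and try, an exception thrown by it would leave the lock held.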

bq. 7) EpaxosState.commitCallbacks could be a multi map. 

agreed, I'll update

{quote}
8) In Instance.java, successors, noop and fastPathPossible are not used. We can 
also get rid of Instance.applyRemote() method.
14) ParticipantInfo.endpoints will not be required once we remove the 
Epaxos.getSuccessors()
{quote}

successors and noop will be used in the prepare and execute phases 
respectively; fastPathImpossible should be removed though. 

bq. 9) PreacceptCallback.ballot need not be an instance variable as we set 
completed=true after we set it. 

agreed, I'll update

{quote}
10) PreacceptResponse.missingInstance is not required as it can be calculated 
on the leader in the PreacceptCallback. 
11) EpaxosState.accept(). We can filter out the skipPlaceholderPredicate when 
we calculated missingInstances in PreacceptCallback.getAcceptDecision()
{quote}

Missing instances are sent both ways. When a node responds to a preaccept 
message, if it believes the leader is missing an instance, it will include it 
in its response. Once the leader has received all the responses, if it thinks 
any of the replicas are missing instances, it will send them along.
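
A minimal sketch of that two-way exchange, assuming each side detects gaps with a plain set difference over dependency ids. The names below are hypothetical, and the real messages carry instances rather than bare ids.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch; not the patch's PreacceptResponse/PreacceptCallback.
class MissingInstanceSketch {
    // Ids present in `theirs` but absent from `mine`.
    static Set<UUID> difference(Set<UUID> theirs, Set<UUID> mine) {
        Set<UUID> result = new HashSet<>(theirs);
        result.removeAll(mine);
        return result;
    }

    // Replica side: ids the replica believes the leader is missing,
    // which it would include in its preaccept response.
    static Set<UUID> missingOnLeader(Set<UUID> leaderDeps, Set<UUID> replicaDeps) {
        return difference(replicaDeps, leaderDeps);
    }

    // Leader side, after all responses are in: for each replica, the ids
    // seen anywhere in the round that this replica did not report.
    static Map<String, Set<UUID>> missingOnReplicas(Set<UUID> leaderDeps,
                                                    Map<String, Set<UUID>> replicaDeps) {
        Set<UUID> union = new HashSet<>(leaderDeps);
        for (Set<UUID> deps : replicaDeps.values()) {
            union.addAll(deps);
        }
        Map<String, Set<UUID>> missing = new HashMap<>();
        for (Map.Entry<String, Set<UUID>> entry : replicaDeps.entrySet()) {
            Set<UUID> gap = difference(union, entry.getValue());
            if (!gap.isEmpty()) {
                missing.put(entry.getKey(), gap);
            }
        }
        return missing;
    }
}
```

In this sketch neither direction can be derived from the other alone, which is why the response carries the replica's view and the leader computes the per-replica gaps only after collecting every response.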

{quote}
12) PreacceptCallback.getAcceptDecision() We don't need to calculate missingIds 
if accept is going to be false in AcceptDecision. 
13) ParticipantInfo.remoteEndpoints. Here we are not doing any isAlive check 
and just sending messages to all remote endpoints. 
{quote}

I'll fix

bq. 15) Accept is sent to live local endpoints and to all remote endpoints. In 
AcceptCallback, I think we should count response from only local endpoints 

fixed in cleaned up code

bq. 16) When we execute the instance in ExecuteTask, what if we crash after 
executing the instance but before recording it.

Saving the best for last I see :)
The existing implementation has this problem as well. Cassandra doesn't have a 
way to mutate multiple keyspaces with a single commit log entry (that I've 
found). We could collect the mutations from the actual cas write, the 
dependency manager update, and the instances update and hold off on applying 
them until the very end, but that only makes the problem less likely.

Speaking of which, the default of not waiting for an fsync before considering a 
write successful is a more serious problem for paxos/epaxos, since a paxos node 
forgetting its state can cause inconsistencies.

I'll give this and your timeline consistency question some more thought.

> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is 
> that Multi-paxos requires leader election and hence, a period of 
> unavailability when the leader dies.
> EPaxos is a Paxos variant that requires (1) less messages than multi-paxos, 
> (2) is particularly useful across multiple datacenters, and (3) allows any 
> node to act as coordinator: 
> http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to 
> implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
