[ 
https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586371#comment-13586371
 ] 

Cristian Opris commented on CASSANDRA-5062:
-------------------------------------------

AFAICT from the Spinnaker paper, they only require ZK for fault-tolerant leader
election, failure detection and possibly cluster membership (the lower-right
box in the diagram in 4.1). The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient, particularly in stable operation scenarios.
I believe Zab effectively devolves into atomic broadcast (not even 2PC) with a
stable leader. So you can normally do writes with a single roundtrip, just like
now.

2. There is a difference between what I described above and what Spinnaker
does. I believe they elect a leader for the entire replica group, while my
description assumes one full Paxos instance per row write. I'm not fully clear
atm how this would work, but I believe even that can be optimized to a single
roundtrip per write in normal operation (I believe it's in one of Google's
papers that they piggyback the commit on the next proposal, for example).

Off the top of my head: the coordinator assumes one of the replicas is the most
up-to-date and attempts to use it as leader. The replica starts a Paxos round,
attaching the write payload. If the proposal is accepted by a majority, the
replica can send the commit, opportunistically attaching further proposals to
it. If the Paxos round fails (or a number of rounds fail), it's likely the
replica is behind on many rows, so the coordinator switches to another replica.
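
To make the flow above a bit more concrete, here is a rough in-memory sketch in
Python. Everything in it is an assumption for illustration only: the names
(Replica, paxos_round, coordinate_write), the per-row integer ballots and the
max_rounds retry limit are made up, there is no networking, and the
piggybacking of further proposals on the commit is left out. It is not how
Cassandra or Spinnaker implement this.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica keeping per-row Paxos state in memory."""
    name: str
    up: bool = True
    promised: dict = field(default_factory=dict)   # row key -> highest ballot promised
    accepted: dict = field(default_factory=dict)   # row key -> (ballot, value) in progress
    committed: dict = field(default_factory=dict)  # row key -> last committed value

    def prepare(self, key, ballot):
        # Promise only if this ballot is newer than anything seen for the row.
        if not self.up or ballot <= self.promised.get(key, 0):
            return None
        self.promised[key] = ballot
        return self.accepted.get(key, (0, None))

    def accept(self, key, ballot, value):
        # Reject proposals older than the latest promise.
        if not self.up or ballot < self.promised.get(key, 0):
            return False
        self.promised[key] = ballot
        self.accepted[key] = (ballot, value)
        return True

    def commit(self, key, value):
        if self.up:
            self.committed[key] = value
            self.accepted.pop(key, None)  # the round for this row is finished


def paxos_round(leader, replicas, key, value):
    """Run one round led by `leader`; return the committed value, or None on failure."""
    majority = len(replicas) // 2 + 1
    # A leader that is behind picks a ballot that is too low, so the round fails.
    ballot = leader.promised.get(key, 0) + 1
    promises = [p for p in (r.prepare(key, ballot) for r in replicas) if p is not None]
    if len(promises) < majority:
        return None
    # Paxos rule: if a value was already accepted somewhere, finish that one first.
    prior_ballot, prior_value = max(promises, key=lambda p: p[0])
    chosen = prior_value if prior_ballot > 0 else value
    acks = sum(r.accept(key, ballot, chosen) for r in replicas)
    if acks < majority:
        return None
    for r in replicas:
        r.commit(key, chosen)
    return chosen


def coordinate_write(replicas, key, value, max_rounds=2):
    """Try each live replica as leader in turn; return the leader's name, or None."""
    for leader in replicas:
        if not leader.up:
            continue
        for _ in range(max_rounds):
            if paxos_round(leader, replicas, key, value) == value:
                return leader.name
        # Repeated failures: assume this replica is behind and try the next one.
    return None


if __name__ == "__main__":
    nodes = [Replica("a"), Replica("b"), Replica("c")]
    print(coordinate_write(nodes, "row1", "payload"))
```

Note this sketch still does the prepare and accept phases separately, so it is
two roundtrips plus the commit. Getting down to a single roundtrip would need
the stable-leader optimization mentioned in point 1.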

Now this is all preliminary as I haven't fully thought this through, but I think
it's definitely worth investigating. While it may be a complicated protocol, it
has significant performance advantages over locks. Just count how many
roundtrips you'd need in the "wait chain" algorithm. Not to mention handling
expired/orphan locks.


                
> Support CAS
> -----------
>
>                 Key: CASSANDRA-5062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 2.0
>
>
> "Strong" consistency is not enough to prevent race conditions.  The classic 
> example is user account creation: we want to ensure usernames are unique, so 
> we only want to signal account creation success if nobody else has created 
> the account yet.  But naive read-then-write allows clients to race and both 
> think they have a green light to create.
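
The race in the issue description can be reproduced with a small single-process
Python sketch. The names (AccountStore, naive_race, insert_if_absent) are
hypothetical, and the local lock is only a stand-in for whatever mechanism
ends up providing the atomic check-and-set across replicas.

```python
import threading


class AccountStore:
    """Hypothetical username table; a stand-in for a replicated row store."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def read(self, username):
        return self._rows.get(username)

    def write(self, username, owner):
        self._rows[username] = owner

    def insert_if_absent(self, username, owner):
        """Check and write as one atomic step; True only for the first caller."""
        with self._lock:
            if username in self._rows:
                return False
            self._rows[username] = owner
            return True


def naive_race(store, username):
    """Replay the bad interleaving: both clients read before either one writes."""
    seen_by_a = store.read(username)
    seen_by_b = store.read(username)
    a_ok = b_ok = False
    if seen_by_a is None:
        store.write(username, "client-a")
        a_ok = True
    if seen_by_b is None:
        store.write(username, "client-b")  # silently overwrites client-a
        b_ok = True
    return a_ok, b_ok


if __name__ == "__main__":
    print(naive_race(AccountStore(), "alice"))
    store = AccountStore()
    print(store.insert_if_absent("alice", "client-a"),
          store.insert_if_absent("alice", "client-b"))
```

With read-then-write both clients are told the account was created; with the
atomic operation only the first caller succeeds.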
