Thanks a lot for the info!

I see, the Paxos protocol used in the code today is actually the
"single-decree Synod" protocol, which votes on only one value. The scope of
the implementation is only the CAS operation (which contains a read and a
write), not a generic transaction (which could contain arbitrarily many
operations); a generic transaction would need the multi-decree protocol.
CAS is able to work on top of the single-decree Synod because the read is
essentially "sandwiched"/bounded between the prepare/promise and the
propose/accept phases, so that no other ballot can get in between (the
check at
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L273
enforces this).
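
To make that ordering concrete, here is a minimal single-process sketch of
the idea (this is NOT the actual StorageProxy code; the in-memory Acceptor,
the long ballots and all names below are made up for illustration). The
only point is that the read happens under the promised ballot, so the
accept fails if any other proposal sneaked in after the read:

    import java.util.Objects;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Toy model of a compare-and-set riding on one single-decree Paxos round.
     * One in-memory "acceptor" stands in for a quorum of replicas.
     */
    public class CasOverSynodSketch {

        // Stand-in for time-based ballots: strictly increasing longs.
        static final AtomicLong BALLOTS = new AtomicLong();

        static class Acceptor {
            long promisedBallot = -1;   // highest ballot we promised to
            long acceptedBallot = -1;   // ballot/value a later proposer would have to
            String acceptedValue;       // finish first (scenario 2 in DuyHai's mail)
            String committedValue;      // last committed value

            synchronized boolean promise(long ballot) {
                if (ballot < promisedBallot) return false; // a higher ballot holds the promise
                promisedBallot = ballot;
                return true;
            }

            synchronized String read() { return committedValue; }

            synchronized boolean accept(long ballot, String value) {
                if (ballot != promisedBallot) return false; // another proposer prepared meanwhile
                acceptedBallot = ballot;
                acceptedValue = value;
                return true;
            }

            synchronized void commit(String value) { committedValue = value; }
        }

        /** CAS(expected -> newValue); returns true only if the write was applied. */
        static boolean cas(Acceptor acceptor, String expected, String newValue) {
            long ballot = BALLOTS.incrementAndGet();
            if (!acceptor.promise(ballot)) return false;          // lost prepare/promise, retry later
            String current = acceptor.read();                      // the "sandwiched" read
            if (!Objects.equals(current, expected)) return false;  // condition failed, nothing proposed
            if (!acceptor.accept(ballot, newValue)) return false;  // a higher ballot got in between
            acceptor.commit(newValue);                             // make the accepted value visible
            return true;
        }

        public static void main(String[] args) {
            Acceptor a = new Acceptor();
            System.out.println(cas(a, null, "created"));  // true: applied
            System.out.println(cas(a, null, "created"));  // false: a value already exists
        }
    }

In the real implementation the "acceptor" is a quorum of replicas and the
ballots are time-based UUIDs, but the ordering constraint is the same.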

On Mon, Aug 3, 2015 at 4:29 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "you seem to be suggesting that the "other operations on the same
> partition key have to wait" because Paxos grouped the first series
> together, which have to be committed in the same order, before all other
> operations, essentially ___serializing___ the operations (with guaranteed
> same order)." --> No, the implementation does not group any Paxos
> operation together. And when I mentioned (INSERT, UPDATE, DELETE ...) I
> didn't mean a group of operations, just an individual INSERT or UPDATE or
> DELETE operation that can occur at any moment.
>
> Indeed there are 3 scenarios for dueling proposals P1 & P2:
>
> 1. P1 has not been accepted yet and P2 has a higher ballot than P1: P1
> will abort, sleep for a random amount of time and re-propose later, in
> order to give P2 a chance to complete its Paxos round.
>
> 2. P1 has been accepted (Propose/Accept phase successful) and P2 has a
> higher ballot than P1: the coordinator that issued P2 has to commit P1
> first before re-proposing P2.
>
> 3. P2 has a lower ballot than P1: P2 is rejected at the Prepare/Promise
> phase.
>
> "I guess Cassandra must be doing something to prevent "the second guy
> injecting his operation before DELETE" in the above scenario" --> Since
> there is no grouping of Paxos operations (not to be confused with a BATCH
> statement containing one Paxos operation!), C* does nothing to prevent a
> second client from issuing a Paxos operation between the UPDATE and the
> DELETE.
>
> If you're interested, you can look into the source code here:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L202
>
> The Javadoc is also interesting to read because it briefly explains the
> semantics.
>
> On Mon, Aug 3, 2015 at 11:32 AM, Yang <teddyyyy...@gmail.com> wrote:
>
>> thanks for your answer DuyHai.
>>
>> I understand Paxos, but I think your description is missing one
>> important point: in the example you gave, "a series of ongoing operations
>> (INSERT, UPDATE, DELETE ...)", you seem to be suggesting that the "other
>> operations on the same partition key have to wait" because Paxos grouped
>> the first series together, which have to be committed in the same order,
>> before all other operations, essentially ___serializing___ the operations
>> (with guaranteed same order).
>>
>> but Paxos ONLY guarantees the order of operations as they are proposed.
>> Paxos itself cannot control when an operation is proposed. For example,
>> in the above sequence INSERT, UPDATE, DELETE, ... the second guy is fully
>> allowed to propose his operation (say another UPDATE) before DELETE is
>> proposed, and hence get the latest ballot number (smaller than that for
>> DELETE), so the final committed sequence is INSERT, UPDATE,
>> op_from_another_guy, DELETE ...
>>
>> I guess Cassandra must be doing something to prevent "the second guy
>> injecting his operation before DELETE" in the above scenario; that seems
>> to be some transaction manager which is not yet clearly described in the
>> slides you gave.
>>
>> if that is correct, my point is: if we let the above transaction manager
>> work with the standard replication protocol, don't we also get
>> transaction behavior?
>>
>> On Mon, Aug 3, 2015 at 12:14 AM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> "what is the fundamental difference between the standard replication
>>> protocol and Paxos that prevents us from implementing a 2-pc on top of
>>> the standard protocol?"
>>>
>>> --> for a more detailed description of Paxos, look here:
>>> http://www.slideshare.net/doanduyhai/distributed-algorithms-for-big-data-geecon/41
>>>
>>> Long story short, when there is an ongoing operation (INSERT, UPDATE,
>>> DELETE, ...) on a particular partition key with Paxos, any other
>>> concurrent operation on the same partition key will have to wait until
>>> the ongoing operation commits.
>>>
>>> If the ongoing operation is validated by Paxos but fails before being
>>> able to commit (after the accept phase in the diagram), then any
>>> subsequent operation on this partition key will commit this stalled
>>> operation before starting its own.
>>>
>>> On Mon, Aug 3, 2015 at 4:30 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>
>>>> this link
>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>>>> talks about linearizable consistency and lightweight transactions.
>>>>
>>>> but I am still not completely following it, just based on the article
>>>> itself.
>>>>
>>>> the standard replication protocol in Cassandra does establish a total
>>>> order (based on the client timestamp, though that can be wrong/unfair),
>>>> so in the case of the example mentioned in the article, "if 2 people
>>>> try to create the same account", yes, if both of them just brute-force
>>>> write, ultimately we will have one winner, the one who provided the
>>>> higher timestamp (this is done consistently across all replicas).
>>>>
>>>> what really matters in the above situation is the ability to group the
>>>> 2 operations "check existing account" and "create account" together
>>>> and run them in an atomic way, so we need something like a 2-phase
>>>> commit.
>>>>
>>>> I guess what is not clear from that article is: what is the
>>>> fundamental difference between the standard replication protocol and
>>>> Paxos that prevents us from implementing a 2-pc on top of the standard
>>>> protocol?
>>>>
>>>> Thanks!
>>>> yang
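
Coming back to the account-creation example at the bottom of the thread:
with a lightweight transaction the existence check and the insert are
evaluated inside a single Paxos round, so two racing clients cannot both
win. A rough sketch using the DataStax Java driver (contact point,
keyspace, table and column names are all made up; error handling omitted):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CreateAccountExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace");

            // Condition + write go through one Paxos (LWT) round on the partition key.
            ResultSet rs = session.execute(
                "INSERT INTO accounts (name, email) "
                + "VALUES ('yang', 'yang@example.com') IF NOT EXISTS");

            // The first row of the result says whether the conditional write was applied.
            Row result = rs.one();
            if (result.getBool("[applied]")) {
                System.out.println("account created");
            } else {
                System.out.println("account already exists, someone else won the race");
            }

            cluster.close();
        }
    }

With the plain (non-LWT) write path both clients would "succeed" and the
higher timestamp would silently win, which is exactly the race described
above.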