Thanks a lot for the info!

I see,  the paxos protocol used now in the code is actually the
"single-decree synod"  protocol, which votes on only one value.

the scope of the implementation is only the CAS operation (which contains a
read and write), not a generic txn (which could contain arbitrarily many
operations).

 for the generic txn the multi-degree protocol would be needed. here CAS is
able to work on top of the synod because the read is essentially
"sandwitched/bounded" between the proposal and accept, so that no other
ballot can get in between (the line
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L273
checks this).

On Mon, Aug 3, 2015 at 4:29 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "you seem to be suggesting that the "other operations on the same
> partition key have to wait" because Paxos grouped the first series
> together, which have to be committed in the same order , before all other
> operations, essentially ___serializing___ the operations (with guaranteed
> same order)."  --> No, the implementation does not group any Paxos
> operation together. And when I said about (INSERT, UPDATE, DELETE ...) I
> didn't mean a group of operations, just individual INSERT or UPDATE or
> DELETE operation that can occur at any moment.
>
> Indeed there are 3 scenarios for dueling proposals P1 & P2:
>
> 1. P1 has not been accepted yet and P2 has higher ballot than P1, then P1
> will abort, sleep for a random amount of time and re-propose later. This is
> in order to give P2 a chance to complete its Paxos round.
>
> 2. P1 has been accepted (phase Propose/Accept successful) and P2 has
> higher ballot than P1, then the coordinator that issued P2 has to commit P1
> first before re-proposing P2
>
> 3. P2 has lower ballot than P1, then P2 is rejected at Prepare/Promise
> phase
>
> "I guess Cassandra must be doing something to prevent "the second guy
> injecting his operation before DELETE" in the above scenario" --> Since
> there is no grouping of Paxos operations (not to be confused with BATCH
> statement with one Paxos operation!), C* does nothing to prevent a second
> client to issue a PAXOS between the UPDATE and DELETE.
>
> If you're interested, you can look into the source code here:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L202
>
> The Javadoc is also interesting to read because it explains briefly the
> semantics
>
>
>
> On Mon, Aug 3, 2015 at 11:32 AM, Yang <teddyyyy...@gmail.com> wrote:
>
>> thanks for your answer DuyHai.
>>
>> I understand Paxos. but I think your description seems missing one
>> important point: in the example you gave, "a series of ongoing operation
>> (INSERT, UPDATE , DELETE ...) " you seem to be suggesting that the "other
>> operations on the same partition key have to wait" because Paxos grouped
>> the first series together, which have to be committed in the same order ,
>> before all other operations, essentially ___serializing___ the operations
>> (with guaranteed same order).
>>
>> but Paxos ONLY guarantees order of the operations as they are proposed.
>> Paxos itself can not control when a operation is proposed. for example in
>> the above sequence . INSERT, UPDATE , DELETE,.... the second guy is fully
>> allowed to propose his operation (say another UPDATE) before DELETE is
>> proposed, and hence get the latest ballot number (smaller than that for
>> DELETE), so the final committed sequence is INSERT UPDATE
>> op_from_another_guy, DELETE ......
>>
>> I guess Cassandra must be doing something to prevent "the second guy
>> injecting his operation before DELETE" in the above scenario, that seems to
>> be some transaction manager which is not yet clearly described in the
>> slides u gave.
>>
>> if that is correct,
>> my point is, if we let  the above transaction manager work with the
>> standard replication protocol, don't we also get transaction behavior?
>>
>>
>> On Mon, Aug 3, 2015 at 12:14 AM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> "what is the fundamental difference between the standard replication
>>> protocol and Paxos that prevents us from implementing a 2-pc on top of the
>>> standard protocol?"
>>>
>>> --> for a more detailed description of Paxos, look here:
>>> http://www.slideshare.net/doanduyhai/distributed-algorithms-for-big-data-geecon/41
>>>
>>> Long story short, when there is an ongoing operation (INSERT, UPDATE,
>>> DELETE, ...) on a particular partition key with Paxos, any other concurrent
>>> operation on the same partition key will have to wait until the ongoing
>>> operation commits.
>>>
>>> If the ongoing operation is validated by Paxos but fails before being
>>> able to commit (after the accept phase in the diagram), then any subsequent
>>> operation on this partition key will commit this stalled operation before
>>> starting its own.
>>>
>>>
>>>
>>> On Mon, Aug 3, 2015 at 4:30 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>
>>>> this link
>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>>>> talks about linearizable consistency and lightweight transactions.
>>>>
>>>> but I am still not completely following it , just based on the article
>>>> itself.
>>>>
>>>> the standard replication protocol in Cassandra does establish a total
>>>> order (based on client TS, though that can be wrong/unfair),  so in the
>>>> case of the example mentioned in the article "if 2 people try to create the
>>>> same account', yes if both of them just brute-force write, ultimately we
>>>> will have one winner, who provided the higher TS (this is done consistently
>>>> across all replicas).
>>>>
>>>> what really matters in the above situation is the ability to group the
>>>> 2 operations "check existing account" and "create account" together and run
>>>> them in an atomic way.  so we need something like a 2-phase commit.
>>>>
>>>> I guess what is not clear from that article is , what is the
>>>> fundamental difference between the standard replication protocol and Paxos
>>>> that prevents us from implementing a 2-pc on top of the standard protocol?
>>>>
>>>> Thanks!
>>>> yang
>>>>
>>>
>>>
>>
>

Reply via email to