Hi,
No, it's not going to be in 3.11.x. The earliest release it could make it into is 4.0.

Ariel

On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
> Hi Ariel,
>
> Can we really expect the fix in 3.11.x, as the ticket https://issues.apache.org/jira/browse/CASSANDRA-6246[1] says?
>
> Thanks,
> kant
>
> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>> Hi,
>>
>> That would work, and it would help a lot with the dueling-proposer issue.
>>
>> A lot of the leader election schemes are designed to reduce the number of round trips, not just to address the dueling-proposer issue. Those schemes see downtime when the leader fails, because the leader is there for correctness. Just adding an affinity for a specific proposer is probably a free lunch.
>>
>> I don't think you can group keys, because Paxos proposals are per partition; that is why we get linear scale-out for Paxos. I don't believe it's linearizable across multiple partitions. You can use the clustering key and deterministically pick one of the live replicas for that clustering key: sort the list of replicas by IP, hash the clustering key, and use the hash as an index into the list of replicas.
>>
>> Batching is of limited usefulness because we only use Paxos for CAS, I think? So in a batch, by definition, all but one operation will fail the CAS. This is something where a distinguished coordinator could help, by failing the rest of the contending requests more cheaply than it currently does.
>>
>> Ariel
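For what it's worth, a minimal sketch of that proposer-affinity computation could look like the following. The ProposerAffinity class and pickProposer method are hypothetical names, not Cassandra internals, and it assumes the coordinator already knows at least one live replica for the key:

    import java.net.InetAddress;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public final class ProposerAffinity {
        // Deterministically pick a preferred proposer for a key: sort the
        // live replicas by IP so every coordinator computes the same order,
        // hash the key, and use the hash as an index into the sorted list.
        public static InetAddress pickProposer(List<InetAddress> liveReplicas, ByteBuffer key) {
            List<InetAddress> sorted = new ArrayList<>(liveReplicas);
            sorted.sort(Comparator.comparing(InetAddress::getHostAddress));
            // ByteBuffer.hashCode() is content-based; floorMod keeps the
            // index non-negative for negative hash codes.
            int index = Math.floorMod(key.hashCode(), sorted.size());
            return sorted.get(index);
        }
    }

As long as all coordinators agree on the live set, contending clients converge on the same proposer without any election; if that node dies, the live set changes and a new proposer falls out of the same computation.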
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>> Hi,
>>>>
>>>> Classic Paxos doesn't have a leader. There are variants on the original Lamport approach that elect a leader (or some other variation, like Mencius) to improve throughput, latency, and performance under contention. Cassandra implements the approach from the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware of. There is no distinguished proposer (leader).
>>>>
>>>> That paper does go on to discuss electing a distinguished proposer, but that was never done for C*. I believe it's not considered a good fit for C* philosophically.
>>>>
>>>> Ariel
>>>>
>>>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>>>> @Ariel Weisberg EPaxos looks very interesting, since it doesn't seem to need a designated leader for C*. But I am assuming the Paxos implemented today for LWTs requires leader election, and if so, don't we need an odd number of nodes, racks, or DCs to satisfy the N = 2F + 1 constraint to tolerate F failures? I understand it is not needed when not using LWTs, since Cassandra is a masterless system.
>>>>>
>>>>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>> Thanks Ariel! Yes, I knew there are many variations and optimizations of Paxos. I just wanted to see if we had any plans to improve the existing Paxos implementation, and it is great to see the work is in progress! I am going to follow that ticket and read up on the references pointed to in it.
>>>>>>
>>>>>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Cassandra's implementation of Paxos doesn't implement many optimizations that would drastically improve throughput and latency. You need consensus, but it doesn't have to be exorbitantly expensive and fall over under any kind of contention.
>>>>>>>
>>>>>>> For instance, you could implement EPaxos (https://issues.apache.org/jira/browse/CASSANDRA-6246[2]), batch multiple operations into the same Paxos round, have an affinity for a specific proposer for a specific partition, implement asynchronous commit, use a more efficient implementation of the Paxos log, and maybe other things.
>>>>>>>
>>>>>>> Ariel
>>>>>>>
>>>>>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>>>>>>> Hi Kant,
>>>>>>>>
>>>>>>>> If you read the published papers about Paxos, you will most probably recognize that there is no way to "do it better". This is a conceptual consequence of the nature of distributed systems plus the CAP theorem. If you want A and P in the triangle, then C is very expensive. C* is built mostly for A and P, with tunable C. ACID databases are a completely different thing, as they are mostly either not partition tolerant, not highly available, or not scalable (in a distributed manner; not speaking of "monolithic super servers").
>>>>>>>>
>>>>>>>> There is no free lunch ...
>>>>>>>>
>>>>>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:
>>>>>>>>> "That's the safety blanket everyone wants but is extremely expensive, especially in Cassandra."
>>>>>>>>>
>>>>>>>>> Yes, LWTs are expensive. Are there any plans to make this better?
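To make the cost being discussed concrete, this is roughly what an LWT looks like from the 3.x Java driver. The keyspace, table, and values are made up for illustration:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class LwtExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("ks");
            // The IF clause turns this write into a compare-and-set: the
            // coordinator runs a Paxos round for the partition instead of
            // a plain quorum write, which is where the extra cost comes from.
            ResultSet rs = session.execute(
                "UPDATE accounts SET balance = 90 WHERE id = 42 IF balance = 100");
            if (!rs.wasApplied()) {
                // The condition failed; the row's current values are
                // returned so the caller can decide whether to retry.
            }
            cluster.close();
        }
    }

Under contention, all but one of the concurrent rounds for a partition fail the CAS, which is the batching limitation Ariel mentions above.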
>>>>>>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>> Hi Jon,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your response. I am well aware that LWW != LWT, but I was asking about LWW with respect to LWTs, which I believe you answered. So thanks much!
>>>>>>>>>>
>>>>>>>>>> kant
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <jonathan.had...@gmail.com> wrote:
>>>>>>>>>>> LWT != Last Write Wins. They are totally different.
>>>>>>>>>>>
>>>>>>>>>>> LWTs give you (assuming you also read at SERIAL) "atomic consistency", meaning you are able to perform operations atomically and in isolation. That's the safety blanket everyone wants, but it is extremely expensive, especially in Cassandra. The "lightweight" part, btw, may be a little optimistic, especially if a key is under contention. With regard to the "last write" part you're asking about: with LWT, Cassandra provides the timestamp and manages it as part of the ballot, and it is always increasing. See org.apache.cassandra.service.ClientState#getTimestampForPaxos. From the code:
>>>>>>>>>>>
>>>>>>>>>>>  * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
>>>>>>>>>>>  * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
>>>>>>>>>>>  * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
>>>>>>>>>>>  * accepted. So in practice, which timestamp we use for a given proposal doesn't affect correctness, but it does
>>>>>>>>>>>  * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
>>>>>>>>>>>  * new proposal will just get rejected).
>>>>>>>>>>>
>>>>>>>>>>> Effectively, paxos removes the ability to use custom timestamps and addresses clock variance by rejecting ballots with timestamps less than what was last seen. You can learn more by reading through the other comments and code in that file.
>>>>>>>>>>>
>>>>>>>>>>> Last write wins is a free-for-all that guarantees you *nothing* except that the timestamp is used as a tiebreaker. Here we acknowledge things like the speed of light as being a real problem that isn't going away anytime soon. This problem is sometimes addressed with event sourcing rather than mutating in place.
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> Jon
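A toy version of the rule that comment describes might look like this. It is a sketch only, not the actual ClientState implementation:

    import java.util.concurrent.atomic.AtomicLong;

    public final class BallotTimestamps {
        private final AtomicLong lastMicros = new AtomicLong();

        // Return a timestamp strictly greater than anything handed out
        // before, even if the wall clock stalls or steps backwards.
        public long nextBallotMicros() {
            while (true) {
                long now = System.currentTimeMillis() * 1000;
                long last = lastMicros.get();
                long next = Math.max(now, last + 1);
                if (lastMicros.compareAndSet(last, next)) {
                    return next;
                }
            }
        }
    }

Correctness doesn't depend on the wall clock here; a lagging clock just makes proposals more likely to be rejected and retried, exactly as the comment says.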
>>>>>>>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>>>> @Justin I read this article: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0. It clearly says linearizable consistency can be achieved with LWTs. So should I assume that the linearizability described in that article is possible with LWTs plus synchronization of clocks through ntpd? Because LWTs also follow last write wins, don't they? Another question: do most production clusters set up ntpd? If so, how long does it take to sync? Any idea?
>>>>>>>>>>>>
>>>>>>>>>>>> @Michael Shuler Are you referring to something like TrueTime, as in https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf? Actually, I had never heard of setting up GPS modules or how they can help. Let me research that, but good point.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <mich...@pbandjelly.org> wrote:
>>>>>>>>>>>>> If you require the best precision you can get, setting up a pair of stratum 1 ntpd masters in each data center location with GPS modules is not terribly complex. Low latency and jitter on servers you manage. 140ms is a long way away network-wise, and I would suggest that was a poor choice of upstream (probably stratum 2 or 3) source.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you need as close as you can get, you'll probably need to do it yourself.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org[3])
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Kind regards, Michael
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>>>>>>>>>> Hi Justin,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are a bunch of issues w.r.t. synchronization of clocks when we used ntpd. Also, the time it took to sync the clocks was approx 140ms (don't quote me on that, though, because it was reported by our devops :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have multiple clients (for example, a bunch of microservices reading from Cassandra), and I am not sure how one can achieve linearizability by setting timestamps on the clients, since there is no total ordering across multiple clients.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <jus...@instaclustr.com> wrote:
>>>>>>>>>>>>>>> Hi Kant,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Clock synchronization is important - you should ensure that ntpd is properly configured on all nodes. If your particular use case is especially sensitive to out-of-order mutations, it is possible to set timestamps on the client side using the drivers: https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We use our own NTP cluster to reduce clock drift as much as possible, but public NTP servers are good enough for most uses: https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers, Justin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>>>>>>>> How does Cassandra achieve linearizability with "last write wins" (conflict resolution based on time-of-day clocks)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Relying on synchronized clocks is almost certainly non-linearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering, due to clock skew. Isn't it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
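Following Justin's link, client-side timestamps with the 3.1 Java driver look roughly like this. The contact point, keyspace, and query are placeholders, and the API names are recalled from the query_timestamps manual page linked above, so double-check them against your driver version:

    import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ClientTimestamps {
        public static void main(String[] args) {
            // Generate strictly increasing timestamps for every statement
            // issued from this process.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
                    .build();
            Session session = cluster.connect("ks");

            // Or pin an explicit timestamp (in microseconds) on one statement.
            SimpleStatement stmt = new SimpleStatement(
                "INSERT INTO events (id, payload) VALUES (1, 'x')");
            stmt.setDefaultTimestamp(System.currentTimeMillis() * 1000);
            session.execute(stmt);
            cluster.close();
        }
    }

Note that, as Kant points out above, this orders writes issued from a single client process; it does not give a total order across independent clients.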
>>> One thing that always bothered me: intelligent clients and the dynamic snitch are designed to route requests to the same node, to take advantage of cache pinning and the like. You would think that under these conditions one could naturally elect a "leader" for a "group" of keys that persists for a few hundred milliseconds and batches up the round trips for a number of operations. Maybe that is what the distinguished coordinator is, in some regards.

>> Links:
>> 1. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
>> 2. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
>> 3. http://pool.ntp.org/