Re: How does cassandra achieve Linearizability?

Michael Shuler Wed, 22 Feb 2017 20:05:32 -0800

I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was
a bulk move when removing the cassandra-3.X branch and the 3.x Jira
version. There are likely other new feature tickets that should really
say 4.x.


-- 
Kind regards,
Michael

On 02/22/2017 07:28 PM, Kant Kodali wrote:
> I hope that patch is reviewed as quickly as possible. We use LWTs
> heavily and are getting a throughput of 600 writes/sec, with each
> write being 1 KB in our case.
> 
> 
> 
> 
> 
> On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo <edlinuxg...@gmail.com
> <mailto:edlinuxg...@gmail.com>> wrote:
> 
> 
> 
>     On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg <ar...@weisberg.ws
>     <mailto:ar...@weisberg.ws>> wrote:
> 
>         Hi,
> 
>         No it's not going to be in 3.11.x. The earliest release it could
>         make it into is 4.0.
> 
>         Ariel
> 
>         On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>>         Hi Ariel,
>>
>>         Can we really expect the fix in 3.11.x as the
>>         ticket https://issues.apache.org/jira/browse/CASSANDRA-6246
>>         says?
>>
>>         Thanks,
>>         kant
>>
>>         On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg
>>         <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
>>
>>             Hi,
>>
>>             That would work and would help a lot with the dueling
>>             proposer issue.
>>
>>             A lot of the leader election stuff is designed to reduce
>>             the number of roundtrips and not just address the dueling
>>             proposer issue. Those will have downtime because it's
>>             there for correctness. Just adding an affinity for a
>>             specific proposer is probably a free lunch.
>>
>>             I don't think you can group keys because the Paxos
>>             proposals are per partition which is why we get linear
>>             scale out for Paxos. I don't believe it's linearizable
>>             across multiple partitions. You can use the clustering key
>>             and deterministically pick one of the live replicas for
>>             that clustering key. Sort the list of replicas by IP, hash
>>             the clustering key, use the hash as an index into the list
>>             of replicas.
>>
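
A minimal sketch of the selection described in the paragraph above: sort the live replicas by IP, hash the key, and use the hash as an index into the sorted list. The function name `pick_proposer` and the choice of MD5 as the stable hash are assumptions made for illustration; this is not Cassandra code.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, key):
    """Deterministically pick one live replica for a key (illustrative only)."""
    if not live_replicas:
        raise ValueError("no live replicas")
    # Sort by numeric IP value so every coordinator derives the same order,
    # regardless of the order in which it learned about the replicas.
    ordered = sorted(live_replicas, key=lambda ip: int(ipaddress.ip_address(ip)))
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so different processes would disagree on the index.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


replicas = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]
proposer = pick_proposer(replicas, "user:42")
```

Any coordinator that sees the same set of live replicas picks the same proposer for a given key. When the live set changes, the mapping shifts for some keys, so this only gives an affinity, not a guarantee of a single proposer.
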
>>             Batching is of limited usefulness because we only use
>>             Paxos for CAS I think? So in a batch by definition all but
>>             one will fail the CAS. This is something where a
>>             distinguished coordinator could help by failing the rest
>>             of the contending requests more inexpensively than it
>>             currently does.
>>
>>
>>             Ariel
>>
>>             On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>>
>>>
>>>             On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg
>>>             <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
>>>
>>>                 Hi,
>>>
>>>                 Classic Paxos doesn't have a leader. There are
>>>                 variants on the original Lamport approach that will
>>>                 elect a leader (or some other variation like Mencius)
>>>                 to improve throughput, latency, and performance under
>>>                 contention. Cassandra implements the approach from
>>>                 the beginning of "Paxos Made Simple"
>>>                 (https://goo.gl/SrP0Wb) with no additional
>>>                 optimizations that I am aware of. There is no
>>>                 distinguished proposer (leader).
>>>
>>>                 That paper does go on to discuss electing a
>>>                 distinguished proposer, but that was never done for
>>>                 C*. I believe it's not considered a good fit for C*
>>>                 philosophically.
>>>
>>>                 Ariel
>>>
>>>                 On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>>>                 @Ariel Weisberg EPaxos looks very interesting, as it
>>>>                 looks like it doesn't need any designated leader for
>>>>                 C*. But I am assuming the Paxos that is implemented
>>>>                 today for LWTs requires leader election, and if so,
>>>>                 don't we need an odd number of nodes or
>>>>                 racks or DCs to satisfy the N = 2F + 1 constraint to
>>>>                 tolerate F failures? I understand it is not needed
>>>>                 when not using LWTs, since Cassandra is a
>>>>                 masterless system.
>>>>
>>>>                 On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali
>>>>                 <k...@peernova.com <mailto:k...@peernova.com>> wrote:
>>>>
>>>>                     Thanks Ariel! Yes, I knew there are many
>>>>                     variations and optimizations of Paxos. I just
>>>>                     wanted to see if we had any plans for improving
>>>>                     the existing Paxos implementation, and it is
>>>>                     great to see the work is in progress! I am
>>>>                     going to follow that ticket and read up on the
>>>>                     references pointed to in it.
>>>>
>>>>
>>>>                     On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg
>>>>                     <ar...@weisberg.ws <mailto:ar...@weisberg.ws>>
>>>>                     wrote:
>>>>
>>>>                         Hi,
>>>>
>>>>                         Cassandra's implementation of Paxos doesn't
>>>>                         implement many optimizations that would
>>>>                         drastically improve throughput and latency.
>>>>                         You need consensus, but it doesn't have to
>>>>                         be exorbitantly expensive and fall over
>>>>                         under any kind of contention.
>>>>
>>>>                         For instance you could implement
>>>>                         EPaxos
>>>>                         (https://issues.apache.org/jira/browse/CASSANDRA-6246),
>>>>                         batch multiple operations into the same
>>>>                         Paxos round, have an affinity for a specific
>>>>                         proposer for a specific partition, implement
>>>>                         asynchronous commit, use a more efficient
>>>>                         implementation of the Paxos log, and maybe
>>>>                         other things. 
>>>>
>>>>
>>>>                         Ariel
>>>>
>>>>
>>>>
>>>>                         On Fri, Feb 10, 2017, at 05:31 AM, Benjamin
>>>>                         Roth wrote:
>>>>>                         Hi Kant,
>>>>>
>>>>>                         If you read the published papers about
>>>>>                         Paxos, you will most probably recognize
>>>>>                         that there is no way to "do it better".
>>>>>                         This is a conceptual thing due to the
>>>>>                         nature of distributed systems + the CAP
>>>>>                         theorem.
>>>>>                         If you want A+P in the triangle, then C is
>>>>>                         very expensive. C* is made for A+P mostly
>>>>>                         with tunable C. In ACID databases this is a
>>>>>                         completely different thing as they are
>>>>>                         mostly either not partition tolerant, not
>>>>>                         highly available or not scalable (in a
>>>>>                         distributed manner, not speaking of
>>>>>                         "monolithic super servers").
>>>>>
>>>>>                         There is no free lunch ...
>>>>>
>>>>>
>>>>>                         2017-02-10 11:09 GMT+01:00 Kant Kodali
>>>>>                         <k...@peernova.com <mailto:k...@peernova.com>>:
>>>>>
>>>>>                             "That’s the safety blanket everyone
>>>>>                             wants but is extremely expensive,
>>>>>                             especially in Cassandra."
>>>>>
>>>>>                             Yes, LWTs are expensive. Are there any
>>>>>                             plans to make this better?
>>>>>
>>>>>                             On Fri, Feb 10, 2017 at 12:17 AM, Kant
>>>>>                             Kodali <k...@peernova.com
>>>>>                             <mailto:k...@peernova.com>> wrote:
>>>>>
>>>>>                                 Hi Jon,
>>>>>
>>>>>                                 Thanks a lot for your response. I
>>>>>                                 am well aware that LWW != LWT,
>>>>>                                 but I was talking more in terms of
>>>>>                                 LWW with respect to LWTs, which
>>>>>                                 I believe you answered. So thanks much!
>>>>>
>>>>>
>>>>>                                 kant
>>>>>
>>>>>
>>>>>                                 On Thu, Feb 9, 2017 at 6:01 PM, Jon
>>>>>                                 Haddad <jonathan.had...@gmail.com
>>>>>                                 <mailto:jonathan.had...@gmail.com>>
>>>>>                                 wrote:
>>>>>
>>>>>                                     LWT != Last Write Wins.  They
>>>>>                                     are totally different.  
>>>>>
>>>>>                                     LWTs give you (assuming you
>>>>>                                     also read at SERIAL) “atomic
>>>>>                                     consistency”, meaning you are
>>>>>                                     able to perform operations
>>>>>                                     atomically and in isolation. 
>>>>>                                     That’s the safety blanket
>>>>>                                     everyone wants but is extremely
>>>>>                                     expensive, especially in
>>>>>                                     Cassandra.  The lightweight
>>>>>                                     part, btw, may be a little
>>>>>                                     optimistic, especially if a key
>>>>>                                     is under contention.  With
>>>>>                                     regard to the “last write” part
>>>>>                                     you’re asking about - w/ LWT
>>>>>                                     Cassandra provides the
>>>>>                                     timestamp and manages it as
>>>>>                                     part of the ballot, and it
>>>>>                                     always is increasing. 
>>>>>                                     See 
>>>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos. 
>>>>>                                     From the code:
>>>>>
>>>>>                                      * Returns a timestamp suitable
>>>>>                                     for paxos given the timestamp
>>>>>                                     of the last known commit (or in
>>>>>                                     progress update).
>>>>>                                      * Paxos ensures that the
>>>>>                                     timestamp it uses for commits
>>>>>                                     respects the serial order of
>>>>>                                     those commits. It does so
>>>>>                                      * by having each replica
>>>>>                                     reject any proposal whose
>>>>>                                     timestamp is not strictly
>>>>>                                     greater than the last proposal it
>>>>>                                      * accepted. So in practice,
>>>>>                                     which timestamp we use for a
>>>>>                                     given proposal doesn't affect
>>>>>                                     correctness but it does
>>>>>                                      * affect the chance of making
>>>>>                                     progress (if we pick a
>>>>>                                     timestamp lower than what has
>>>>>                                     been proposed before, our
>>>>>                                      * new proposal will just get
>>>>>                                     rejected).
>>>>>
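
A toy model of the rule in the code comment quoted above: each replica rejects any proposal whose timestamp is not strictly greater than the last one it accepted, so a proposer makes progress only by picking a timestamp above what it last saw. The `Replica` class and `next_timestamp` helper are hypothetical names for illustration, and plain integers stand in for ballots; this is not the logic of `ClientState#getTimestampForPaxos` itself.

```python
class Replica:
    """Toy replica that tracks the timestamp of the last accepted proposal."""

    def __init__(self):
        self.last_accepted = -1

    def accept(self, timestamp):
        # Reject anything that is not strictly greater than the last
        # accepted timestamp; equal timestamps are rejected too.
        if timestamp <= self.last_accepted:
            return False
        self.last_accepted = timestamp
        return True


def next_timestamp(now_micros, last_known):
    """Pick a timestamp above the last known commit or in-progress update."""
    # A slow local clock would otherwise produce a proposal that is
    # certain to be rejected, so never go at or below last_known.
    return max(now_micros, last_known + 1)


replica = Replica()
first = replica.accept(next_timestamp(1000, -1))     # accepted
stale = replica.accept(900)                          # rejected: clock behind
retry = replica.accept(next_timestamp(900, 1000))    # accepted at 1001
```

The choice of timestamp does not affect correctness in this model, only whether the proposal has a chance of being accepted, which matches the point made in the quoted comment.
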
>>>>>                                     Effectively paxos removes the
>>>>>                                     ability to use custom
>>>>>                                     timestamps and addresses clock
>>>>>                                     variance by rejecting ballots
>>>>>                                     with timestamps less than what
>>>>>                                     was last seen.  You can learn
>>>>>                                     more by reading through the
>>>>>                                     other comments and code in that
>>>>>                                     file. 
>>>>>
>>>>>                                     Last write wins is a free for
>>>>>                                     all that guarantees you
>>>>>                                     *nothing* except the timestamp
>>>>>                                     is used as a tiebreaker.  Here
>>>>>                                     we acknowledge things like the
>>>>>                                     speed of light as being a real
>>>>>                                     problem that isn’t going away
>>>>>                                     anytime soon.  This problem is
>>>>>                                     sometimes addressed with event
>>>>>                                     sourcing rather than mutating
>>>>>                                     in place.
>>>>>
>>>>>                                     Hope this helps.
>>>>>
>>>>>
>>>>>                                     Jon
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>                                     On Feb 9, 2017, at 5:21 PM,
>>>>>>                                     Kant Kodali <k...@peernova.com
>>>>>>                                     <mailto:k...@peernova.com>> wrote:
>>>>>>
>>>>>>                                     @Justin I read this article:
>>>>>>                                     http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>>>>>>                                     It clearly says
>>>>>>                                     linearizable consistency can
>>>>>>                                     be achieved with LWTs. So
>>>>>>                                     should I assume
>>>>>>                                     the linearizability in the
>>>>>>                                     context of the above article
>>>>>>                                     is possible with LWTs and
>>>>>>                                     synchronization of clocks
>>>>>>                                     through ntpd, because LWTs
>>>>>>                                     also follow Last Write Wins,
>>>>>>                                     don't they? Another
>>>>>>                                     question: do most
>>>>>>                                     production clusters set up
>>>>>>                                     ntpd? If so, how long does
>>>>>>                                     it take to sync? Any idea?
>>>>>>
>>>>>>                                     @Michael Shuler Are you
>>>>>>                                     referring to something like
>>>>>>                                     TrueTime, as in
>>>>>>                                     <https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf>?
>>>>>>
>>>>>>                                     Actually I never heard of
>>>>>>                                     setting up GPS modules and how
>>>>>>                                     that can be helpful. Let me
>>>>>>                                     research that, but good point.
>>>>>>
>>>>>>                                     On Thu, Feb 9, 2017 at 5:09
>>>>>>                                     PM, Michael Shuler
>>>>>>                                     <mich...@pbandjelly.org
>>>>>>                                     <mailto:mich...@pbandjelly.org>>
>>>>>>                                     wrote:
>>>>>>
>>>>>>                                         If you require the best
>>>>>>                                         precision you can get,
>>>>>>                                         setting up a pair of
>>>>>>                                         stratum 1 ntpd masters in
>>>>>>                                         each data center location
>>>>>>                                         with GPS modules
>>>>>>                                         is not terribly complex.
>>>>>>                                         Low latency and jitter on
>>>>>>                                         servers you manage.
>>>>>>                                         140ms is a long way away
>>>>>>                                         network-wise, and I would
>>>>>>                                         suggest that was a
>>>>>>                                         poor choice of upstream
>>>>>>                                         (probably stratum 2 or 3)
>>>>>>                                         source.
>>>>>>
>>>>>>                                         As Jonathan mentioned,
>>>>>>                                         there's no guarantee from
>>>>>>                                         Cassandra, but if you
>>>>>>                                         need as close as you can
>>>>>>                                         get, you'll probably need
>>>>>>                                         to do it yourself.
>>>>>>
>>>>>>                                         (I run several stratum 2
>>>>>>                                         ntpd servers for
>>>>>>                                         pool.ntp.org
>>>>>>                                         <http://pool.ntp.org/>)
>>>>>>
>>>>>>                                         --
>>>>>>                                         Kind regards,
>>>>>>                                         Michael
>>>>>>
>>>>>>                                         On 02/09/2017 06:47 PM,
>>>>>>                                         Kant Kodali wrote:
>>>>>>                                         > Hi Justin,
>>>>>>                                         >
>>>>>>                                         > There are a bunch of
>>>>>>                                         issues w.r.t.
>>>>>>                                         synchronization of clocks
>>>>>>                                         when we
>>>>>>                                         > used ntpd. Also the time
>>>>>>                                         it took to sync the clocks
>>>>>>                                         was approx 140ms
>>>>>>                                         > (don't quote me on it
>>>>>>                                         though because it is
>>>>>>                                         reported by our devops :)
>>>>>>                                         >
>>>>>>                                         > we have multiple clients
>>>>>>                                         (for example bunch of
>>>>>>                                         micro services are
>>>>>>                                         > reading from Cassandra)
>>>>>>                                         I am not sure how one can
>>>>>>                                         achieve
>>>>>>                                         > Linearizability by
>>>>>>                                         setting timestamps on the
>>>>>>                                         clients ? since there is no
>>>>>>                                         > total ordering across
>>>>>>                                         multiple clients.
>>>>>>                                         >
>>>>>>                                         > Thanks!
>>>>>>                                         >
>>>>>>                                         >
>>>>>>                                         > On Thu, Feb 9, 2017 at
>>>>>>                                         4:16 PM, Justin Cameron
>>>>>>                                         <jus...@instaclustr.com
>>>>>>                                         <mailto:jus...@instaclustr.com>
>>>>>>                                         > <mailto:jus...@instaclustr.com
>>>>>>                                         <mailto:jus...@instaclustr.com>>>
>>>>>>                                         wrote:
>>>>>>                                         >
>>>>>>                                         >     Hi Kant,
>>>>>>                                         >
>>>>>>                                         >     Clock
>>>>>>                                         synchronization is
>>>>>>                                         important - you should
>>>>>>                                         ensure that ntpd is
>>>>>>                                         >     properly configured
>>>>>>                                         on all nodes. If your
>>>>>>                                         particular use case is
>>>>>>                                         >     especially sensitive
>>>>>>                                         to out-of-order mutations
>>>>>>                                         it is possible to set
>>>>>>                                         >     timestamps on the
>>>>>>                                         client side using the
>>>>>>                                         >     drivers.
>>>>>>                                         
>>>>>> https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>                                         >
>>>>>>                                         >     We use our own NTP
>>>>>>                                         cluster to reduce clock
>>>>>>                                         drift as much as
>>>>>>                                         >     possible, but public
>>>>>>                                         NTP servers are good
>>>>>>                                         enough for most
>>>>>>                                         >     uses.
>>>>>>                                         
>>>>>> https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>                                         >
>>>>>>                                         >     Cheers,
>>>>>>                                         >     Justin
>>>>>>                                         >
>>>>>>                                         >     On Thu, 9 Feb 2017
>>>>>>                                         at 16:09 Kant Kodali
>>>>>>                                         <k...@peernova.com
>>>>>>                                         <mailto:k...@peernova.com>
>>>>>>                                         >     <mailto:k...@peernova.com
>>>>>>                                         <mailto:k...@peernova.com>>>
>>>>>>                                         wrote:
>>>>>>                                         >
>>>>>>                                         >         How does
>>>>>>                                         Cassandra achieve
>>>>>>                                         Linearizability with “Last
>>>>>>                                         write
>>>>>>                                         >         wins” (conflict
>>>>>>                                         resolution methods based
>>>>>>                                         on time-of-day clocks) ?
>>>>>>                                         >
>>>>>>                                         >         Relying on
>>>>>>                                         synchronized clocks is
>>>>>>                                         almost certainly
>>>>>>                                         non-linearizable, because
>>>>>>                                         clock timestamps cannot be
>>>>>>                                         guaranteed
>>>>>>                                         >         to be consistent
>>>>>>                                         with actual event ordering
>>>>>>                                         due to clock skew.
>>>>>>                                         >         isn't it?
>>>>>>                                         >
>>>>>>                                         >         Thanks!
>>>>>>                                         >
>>>>>>                                         >     --
>>>>>>                                         >
>>>>>>                                         >     Justin Cameron
>>>>>>                                         >
>>>>>>                                         >     Senior Software
>>>>>>                                         Engineer | Instaclustr
>>>>>>                                         >
>>>>>>                                         >
>>>>>>                                         >
>>>>>>                                         >
>>>>>>                                         >     This email has been
>>>>>>                                         sent on behalf of
>>>>>>                                         Instaclustr Pty Ltd
>>>>>>                                         >     (Australia) and
>>>>>>                                         Instaclustr Inc (USA).
>>>>>>                                         >
>>>>>>                                         >     This email and any
>>>>>>                                         attachments may contain
>>>>>>                                         confidential and legally
>>>>>>                                         >     privileged
>>>>>>                                         information.  If you are
>>>>>>                                         not the intended recipient, do
>>>>>>                                         >     not copy or disclose
>>>>>>                                         its content, but please
>>>>>>                                         reply to this email
>>>>>>                                         >     immediately and
>>>>>>                                         highlight the error to the
>>>>>>                                         sender and then
>>>>>>                                         >     immediately delete
>>>>>>                                         the message.
>>>>>>                                         >
>>>>>>                                         >
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>                         -- 
>>>>>                         Benjamin Roth
>>>>>                         Prokurist
>>>>>
>>>>>                         Jaumo GmbH · www.jaumo.com
>>>>>                         <http://www.jaumo.com>
>>>>>                         Wehrstraße 46 · 73035 Göppingen · Germany
>>>>>                         Phone +49 7161 304880-6
>>>>>                         <tel:+49%207161%203048806> · Fax +49 7161
>>>>>                         304880-1 <tel:+49%207161%203048801>
>>>>>                         AG Ulm · HRB 731058 · Managing Director:
>>>>>                         Jens Kammerer
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>             One thing that always bothered me: Intelligent clients
>>>             and dynamic snitch are designed to attempt to route
>>>             requests to the same node to attempt to take advantage of
>>>             cache pinning etc. You would think under these conditions
>>>             one could naturally elect a "leader" for a "group" of
>>>             keys that could persist for a few hundred milliseconds
>>>             and batch up the round trips for a number of operations.
>>>             Maybe that is what the distinguished coordinator is in
>>>             some regards.
>>
> 
> 
>     My two cents: The current issue is "feature complete" and the author
>     stated it was ready for review 2 years ago. But I can see that as the
>     issue stands it forces some hard choices to be made concerning the
>     migration path and in-depth code changes.
> 
>     Also I think there is some question (in my mind) as to how we ensure
>     some of the subtle contracted/non contracted semantics stay in
>     place. As in they work a "certain way" and how confident is everyone
>     that a "better way" does not end up causing some pain for someone
>     using it currently. I assume this is a common case where a feature
>     request is not being engaged with.
> 
>
