Re: How does cassandra achieve Linearizability?

Kant Kodali Sun, 26 Feb 2017 20:26:19 -0800

Is there a way to apply the commits from this
https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk branch
to the Apache Cassandra 3.10 branch? I thought I could just merge the two
branches, but it looks like there are several trunks, so I am confused about
which trunk I would be merging into.
I want to merge it just to try it on my local machine.


Thanks!

On Wed, Feb 22, 2017 at 8:04 PM, Michael Shuler <mich...@pbandjelly.org>
wrote:

> I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was
> a bulk move when removing the cassandra-3.X branch and the 3.x Jira
> version. There are likely other new feature tickets that should really
> say 4.x.
>
> --
> Kind regards,
> Michael
>
> On 02/22/2017 07:28 PM, Kant Kodali wrote:
> > I hope that patch is reviewed as quickly as possible. We use LWT's
> > heavily and we are getting a throughput of 600 writes/sec and each write
> > is 1KB in our case.
> >
> >
> >
> >
> >
> > On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo <edlinuxg...@gmail.com
> > <mailto:edlinuxg...@gmail.com>> wrote:
> >
> >
> >
> >     On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg <ar...@weisberg.ws
> >     <mailto:ar...@weisberg.ws>> wrote:
> >
> >         Hi,
> >
> >         No it's not going to be in 3.11.x. The earliest release it could
> >         make it into is 4.0.
> >
> >         Ariel
> >
> >         On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
> >>         Hi Ariel,
> >>
> >>         Can we really expect the fix in 3.11.x as the
> >>         ticket https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
> >>
> >>         Thanks,
> >>         kant
> >>
> >>         On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg
> >>         <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
> >>
> >>             Hi,
> >>
> >>             That would work and would help a lot with the dueling
> >>             proposer issue.
> >>
> >>             A lot of the leader election stuff is designed to reduce
> >>             the number of roundtrips and not just address the dueling
> >>             proposer issue. Those will have downtime because it's
> >>             there for correctness. Just adding an affinity for a
> >>             specific proposer is probably a free lunch.
> >>
> >>             I don't think you can group keys because the Paxos
> >>             proposals are per partition which is why we get linear
> >>             scale out for Paxos. I don't believe it's linearizable
> >>             across multiple partitions. You can use the clustering key
> >>             and deterministically pick one of the live replicas for
> >>             that clustering key. Sort the list of replicas by IP, hash
> >>             the clustering key, use the hash as an index into the list
> >>             of replicas.
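
A minimal sketch of the replica-picking idea described above: sort the live
replicas by IP, hash the key, and use the hash as an index into the list. The
function name, the choice of MD5 as the stable hash, and the sample addresses
are assumptions for illustration only, not Cassandra code.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, key):
    """Deterministically pick one live replica to act as proposer for `key`.

    `live_replicas` is a list of IP address strings and `key` is bytes.
    Both the name and the signature are hypothetical.
    """
    if not live_replicas:
        raise ValueError("no live replicas to choose from")
    # Sort numerically by IP so every coordinator derives the same order.
    ordered = sorted(live_replicas, key=lambda ip: int(ipaddress.ip_address(ip)))
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


if __name__ == "__main__":
    replicas = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]
    print(pick_proposer(replicas, b"user:42"))
```

Every coordinator that sees the same set of live replicas picks the same
proposer for a given key. When the live set changes, some keys move to a
different proposer, so this gives an affinity, not a guarantee.
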
> >>
> >>             Batching is of limited usefulness because we only use
> >>             Paxos for CAS I think? So in a batch by definition all but
> >>             one will fail the CAS. This is something where a
> >>             distinguished coordinator could help by failing the rest
> >>             of the contending requests more inexpensively than it
> >>             currently does.
> >>
> >>
> >>             Ariel
> >>
> >>             On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
> >>>
> >>>
> >>>             On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg
> >>>             <ar...@weisberg.ws <mailto:ar...@weisberg.ws>> wrote:
> >>>
> >>>                 Hi,
> >>>
> >>>                 Classic Paxos doesn't have a leader. There are
> >>>                 variants on the original Lamport approach that will
> >>>                 elect a leader (or some other variation like Mencius)
> >>>                 to improve throughput, latency, and performance under
> >>>                 contention. Cassandra implements the approach from
> >>>                 the beginning of "Paxos Made Simple"
> >>>                 (https://goo.gl/SrP0Wb) with no additional
> >>>                 optimizations that I am aware of. There is no
> >>>                 distinguished proposer (leader).
> >>>
> >>>                 That paper does go on to discuss electing a
> >>>                 distinguished proposer, but that was never done for
> >>>                 C*. I believe it's not considered a good fit for C*
> >>>                 philosophically.
> >>>
> >>>                 Ariel
> >>>
> >>>                 On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
> >>>>                 @Ariel Weisberg EPaxos looks very interesting as it
> >>>>                 looks like it doesn't need any designated leader for
> >>>>                 C* but I am assuming the paxos that is implemented
> >>>>                 today for LWT's requires Leader election and If so,
> >>>>                 don't we need to have an odd number of nodes or
> >>>>                 racks or DC's to satisfy N = 2F + 1 constraint to
> >>>>                 tolerate F failures ? I understand it is not needed
> >>>>                 when not using LWT's since Cassandra is a
> >>>>                 master-less system.
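
For reference, the N = 2F + 1 arithmetic mentioned above can be written out as
follows, assuming plain majority quorums. The helper names are made up for
illustration.

```python
def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """Largest F such that n >= 2F + 1, i.e. a majority survives F failures."""
    return (n - 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority(n), tolerated_failures(n))
```

An even N tolerates no more failures than N - 1 does, which is why odd counts
are the usual choice.
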
> >>>>
> >>>>                 On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali
> >>>>                 <k...@peernova.com <mailto:k...@peernova.com>> wrote:
> >>>>
> >>>>                     Thanks Ariel! Yes I knew there are so many
> >>>>                     variations and optimizations of Paxos. I just
> >>>>                     wanted to see if we had any plans on improving
> >>>>                     the existing Paxos implementation and it is
> >>>>                     great to see the work is in progress! I am
> >>>>                     going to follow that ticket and read up on the
> >>>>                     references listed in it.
> >>>>
> >>>>
> >>>>                     On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg
> >>>>                     <ar...@weisberg.ws <mailto:ar...@weisberg.ws>>
> >>>>                     wrote:
> >>>>
> >>>>                         Hi,
> >>>>
> >>>>                         Cassandra's implementation of Paxos doesn't
> >>>>                         implement many optimizations that would
> >>>>                         drastically improve throughput and latency.
> >>>>                         You need consensus, but it doesn't have to
> >>>>                         be exorbitantly expensive and fall over
> >>>>                         under any kind of contention.
> >>>>
> >>>>                         For instance you could implement EPaxos
> >>>>                         (https://issues.apache.org/jira/browse/CASSANDRA-6246),
> >>>>                         batch multiple operations into the same
> >>>>                         Paxos round, have an affinity for a specific
> >>>>                         proposer for a specific partition, implement
> >>>>                         asynchronous commit, use a more efficient
> >>>>                         implementation of the Paxos log, and maybe
> >>>>                         other things.
> >>>>
> >>>>
> >>>>                         Ariel
> >>>>
> >>>>
> >>>>
> >>>>                         On Fri, Feb 10, 2017, at 05:31 AM, Benjamin
> >>>>                         Roth wrote:
> >>>>>                         Hi Kant,
> >>>>>
> >>>>>                         If you read the published papers about
> >>>>>                         Paxos, you will most probably recognize
> >>>>>                         that there is no way to "do it better".
> >>>>>                         This is a conceptual thing due to the
> >>>>>                         nature of distributed systems + the CAP
> >>>>>                         theorem.
> >>>>>                         If you want A+P in the triangle, then C is
> >>>>>                         very expensive. CS is made for A+P mostly
> >>>>>                         with tunable C. In ACID databases this is a
> >>>>>                         completely different thing as they are
> >>>>>                         mostly either not partition tolerant, not
> >>>>>                         highly available or not scalable (in a
> >>>>>                         distributed manner, not speaking of
> >>>>>                         "monolithic super servers").
> >>>>>
> >>>>>                         There is no free lunch ...
> >>>>>
> >>>>>
> >>>>>                         2017-02-10 11:09 GMT+01:00 Kant Kodali
> >>>>>                         <k...@peernova.com>:
> >>>>>
> >>>>>                             "That’s the safety blanket everyone
> >>>>>                             wants but is extremely expensive,
> >>>>>                             especially in Cassandra."
> >>>>>
> >>>>>                             yes LWT's are expensive. Are there any
> >>>>>                             plans to make this better?
> >>>>>
> >>>>>                             On Fri, Feb 10, 2017 at 12:17 AM, Kant
> >>>>>                             Kodali <k...@peernova.com
> >>>>>                             <mailto:k...@peernova.com>> wrote:
> >>>>>
> >>>>>                                 Hi Jon,
> >>>>>
> >>>>>                                 Thanks a lot for your response. I
> >>>>>                                 am well aware that LWW != LWT,
> >>>>>                                 but I was talking more in terms of
> >>>>>                                 LWW with respect to LWTs, which
> >>>>>                                 I believe you answered. So thanks
> >>>>>                                 much!
> >>>>>
> >>>>>
> >>>>>                                 kant
> >>>>>
> >>>>>
> >>>>>                                 On Thu, Feb 9, 2017 at 6:01 PM, Jon
> >>>>>                                 Haddad <jonathan.had...@gmail.com
> >>>>>                                 <mailto:jonathan.had...@gmail.com>>
> >>>>>                                 wrote:
> >>>>>
> >>>>>                                     LWT != Last Write Wins.  They
> >>>>>                                     are totally different.
> >>>>>
> >>>>>                                     LWTs give you (assuming you
> >>>>>                                     also read at SERIAL) “atomic
> >>>>>                                     consistency”, meaning you are
> >>>>>                                     able to perform operations
> >>>>>                                     atomically and in isolation.
> >>>>>                                     That’s the safety blanket
> >>>>>                                     everyone wants but is extremely
> >>>>>                                     expensive, especially in
> >>>>>                                     Cassandra.  The lightweight
> >>>>>                                     part, btw, may be a little
> >>>>>                                     optimistic, especially if a key
> >>>>>                                     is under contention.  With
> >>>>>                                     regard to the “last write” part
> >>>>>                                     you’re asking about - w/ LWT
> >>>>>                                     Cassandra provides the
> >>>>>                                     timestamp and manages it as
> >>>>>                                     part of the ballot, and it
> >>>>>                                     always is increasing.
> >>>>>                                     See org.apache.cassandra.service.ClientState#getTimestampForPaxos.
> >>>>>                                     From the code:
> >>>>>
> >>>>>                                      * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
> >>>>>                                      * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
> >>>>>                                      * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
> >>>>>                                      * accepted. So in practice, which timestamp we use for a given proposal doesn't affect correctness but it does
> >>>>>                                      * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
> >>>>>                                      * new proposal will just get rejected).
> >>>>>
> >>>>>                                     Effectively paxos removes the
> >>>>>                                     ability to use custom
> >>>>>                                     timestamps and addresses clock
> >>>>>                                     variance by rejecting ballots
> >>>>>                                     with timestamps less than what
> >>>>>                                     was last seen.  You can learn
> >>>>>                                     more by reading through the
> >>>>>                                     other comments and code in that
> >>>>>                                     file.
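
A simplified sketch of the rule quoted above: a replica rejects any proposal
whose timestamp is not strictly greater than the last one it saw, and the
proposer picks a timestamp above the last known one so it has a chance of
making progress. The class and function names are hypothetical and this is
not the actual ClientState code; it only illustrates the comparison.

```python
class ReplicaState:
    """Tracks the highest ballot timestamp this replica has seen (hypothetical)."""

    def __init__(self):
        self.last_ballot_micros = 0

    def handle_proposal(self, ballot_micros):
        # Reject anything that is not strictly greater than the last ballot.
        if ballot_micros <= self.last_ballot_micros:
            return False
        self.last_ballot_micros = ballot_micros
        return True


def timestamp_for_paxos(now_micros, last_seen_micros):
    """Pick a ballot timestamp strictly greater than the last one seen.

    A slow local clock is bumped forward instead of producing a ballot
    that would just be rejected.
    """
    return max(now_micros, last_seen_micros + 1)


if __name__ == "__main__":
    replica = ReplicaState()
    print(replica.handle_proposal(100))  # first ballot
    print(replica.handle_proposal(100))  # same timestamp again
    print(replica.handle_proposal(timestamp_for_paxos(90, 100)))  # slow clock
```

Correctness comes from the strict comparison on the replica; the proposer-side
choice only affects how often a proposal gets through.
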
> >>>>>
> >>>>>                                     Last write wins is a free for
> >>>>>                                     all that guarantees you
> >>>>>                                     *nothing* except the timestamp
> >>>>>                                     is used as a tiebreaker.  Here
> >>>>>                                     we acknowledge things like the
> >>>>>                                     speed of light as being a real
> >>>>>                                     problem that isn’t going away
> >>>>>                                     anytime soon.  This problem is
> >>>>>                                     sometimes addressed with event
> >>>>>                                     sourcing rather than mutating
> >>>>>                                     in place.
> >>>>>
> >>>>>                                     Hope this helps.
> >>>>>
> >>>>>
> >>>>>                                     Jon
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>                                     On Feb 9, 2017, at 5:21 PM,
> >>>>>>                                     Kant Kodali <k...@peernova.com> wrote:
> >>>>>>
> >>>>>>                                     @Justin I read this article:
> >>>>>>                                     http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
> >>>>>>                                     It clearly says linearizable
> >>>>>>                                     consistency can be achieved
> >>>>>>                                     with LWTs. So should I assume
> >>>>>>                                     that linearizability in the
> >>>>>>                                     context of the above article
> >>>>>>                                     is possible with LWTs and
> >>>>>>                                     synchronization of clocks
> >>>>>>                                     through ntpd, because LWTs
> >>>>>>                                     also follow Last Write Wins,
> >>>>>>                                     don't they? Another question:
> >>>>>>                                     do most production clusters
> >>>>>>                                     set up ntpd? If so, how long
> >>>>>>                                     does it take to sync? Any idea?
> >>>>>>
> >>>>>>                                     @Michael Shuler Are you
> >>>>>>                                     referring to something like
> >>>>>>                                     TrueTime as in
> >>>>>>                                     https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
> >>>>>>                                     Actually I never heard of
> >>>>>>                                     setting up GPS modules and how
> >>>>>>                                     that can be helpful. Let me
> >>>>>>                                     research on that but good point.
> >>>>>>
> >>>>>>                                     On Thu, Feb 9, 2017 at 5:09
> >>>>>>                                     PM, Michael Shuler
> >>>>>>                                     <mich...@pbandjelly.org> wrote:
> >>>>>>
> >>>>>>                                         If you require the best
> >>>>>>                                         precision you can get,
> >>>>>>                                         setting up a pair of
> >>>>>>                                         stratum 1 ntpd masters in
> >>>>>>                                         each data center location
> >>>>>>                                         with GPS modules
> >>>>>>                                         is not terribly complex.
> >>>>>>                                         Low latency and jitter on
> >>>>>>                                         servers you manage.
> >>>>>>                                         140ms is a long way away
> >>>>>>                                         network-wise, and I would
> >>>>>>                                         suggest that was a
> >>>>>>                                         poor choice of upstream
> >>>>>>                                         (probably stratum 2 or 3)
> >>>>>>                                         source.
> >>>>>>
> >>>>>>                                         As Jonathan mentioned,
> >>>>>>                                         there's no guarantee from
> >>>>>>                                         Cassandra, but if you
> >>>>>>                                         need as close as you can
> >>>>>>                                         get, you'll probably need
> >>>>>>                                         to do it yourself.
> >>>>>>
> >>>>>>                                         (I run several stratum 2
> >>>>>>                                         ntpd servers for
> >>>>>>                                         pool.ntp.org
> >>>>>>                                         <http://pool.ntp.org/>)
> >>>>>>
> >>>>>>                                         --
> >>>>>>                                         Kind regards,
> >>>>>>                                         Michael
> >>>>>>
> >>>>>>                                         On 02/09/2017 06:47 PM,
> >>>>>>                                         Kant Kodali wrote:
> >>>>>>                                         > Hi Justin,
> >>>>>>                                         >
> >>>>>>                                         > There are a bunch of issues w.r.t. synchronization of clocks when we
> >>>>>>                                         > used ntpd. Also the time it took to sync the clocks was approx 140ms
> >>>>>>                                         > (don't quote me on it though because it is reported by our devops :)
> >>>>>>                                         >
> >>>>>>                                         > We have multiple clients (for example, a bunch of microservices are
> >>>>>>                                         > reading from Cassandra). I am not sure how one can achieve
> >>>>>>                                         > linearizability by setting timestamps on the clients, since there is no
> >>>>>>                                         > total ordering across multiple clients.
> >>>>>>                                         >
> >>>>>>                                         > Thanks!
> >>>>>>                                         >
> >>>>>>                                         >
> >>>>>>                                         > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron
> >>>>>>                                         > <jus...@instaclustr.com> wrote:
> >>>>>>                                         >
> >>>>>>                                         >     Hi Kant,
> >>>>>>                                         >
> >>>>>>                                         >     Clock synchronization is important - you should ensure that ntpd is
> >>>>>>                                         >     properly configured on all nodes. If your particular use case is
> >>>>>>                                         >     especially sensitive to out-of-order mutations it is possible to set
> >>>>>>                                         >     timestamps on the client side using the drivers.
> >>>>>>                                         >     https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
> >>>>>>                                         >
> >>>>>>                                         >     We use our own NTP cluster to reduce clock drift as much as
> >>>>>>                                         >     possible, but public NTP servers are good enough for most uses.
> >>>>>>                                         >     https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
> >>>>>>                                         >
> >>>>>>                                         >     Cheers,
> >>>>>>                                         >     Justin
> >>>>>>                                         >
> >>>>>>                                         >     On Thu, 9 Feb 2017 at 16:09 Kant Kodali
> >>>>>>                                         >     <k...@peernova.com> wrote:
> >>>>>>                                         >
> >>>>>>                                         >         How does Cassandra achieve Linearizability with “Last write
> >>>>>>                                         >         wins” (conflict resolution methods based on time-of-day clocks)?
> >>>>>>                                         >
> >>>>>>                                         >         Relying on synchronized clocks is almost certainly
> >>>>>>                                         >         non-linearizable, because clock timestamps cannot be guaranteed
> >>>>>>                                         >         to be consistent with actual event ordering due to clock skew.
> >>>>>>                                         >         Isn't it?
> >>>>>>                                         >
> >>>>>>                                         >         Thanks!
> >>>>>>                                         >
> >>>>>>                                         >     --
> >>>>>>                                         >
> >>>>>>                                         >     Justin Cameron
> >>>>>>                                         >
> >>>>>>                                         >     Senior Software Engineer | Instaclustr
> >>>>>>                                         >
> >>>>>>                                         >
> >>>>>>                                         >
> >>>>>>                                         >
> >>>>>>                                         >
> >>>>>>                                         >
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>                         --
> >>>>>                         Benjamin Roth
> >>>>>                         Prokurist
> >>>>>
> >>>>>                         Jaumo GmbH · www.jaumo.com
> >>>>>                         <http://www.jaumo.com>
> >>>>>                         Wehrstraße 46 · 73035 Göppingen · Germany
> >>>>>                         Phone +49 7161 304880-6 · Fax +49 7161
> >>>>>                         304880-1
> >>>>>                         AG Ulm · HRB 731058 · Managing Director:
> >>>>>                         Jens Kammerer
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>             One thing that always bothered me: Intelligent clients
> >>>             and dynamic snitch are designed to attempt to route
> >>>             requests to the same node to attempt to take advantage of
> >>>             cache pinning etc. You would think under these conditions
> >>>             one could naturally elect a "leader" for a "group" of
> >>>             keys that could persist for a few hundred milliseconds
> >>>             and batch up the round trips for a number of operations.
> >>>             Maybe that is what the distinguished coordinator is in
> >>>             some regards.
> >>
> >
> >
> >     My two cents: The current issue is "feature complete" and the author
> >     stated ready for review 2 years ago. But I can see that as the issue
> >     stands it forces some hard choices to be made concerning the
> >     migration path and in depth code changes.
> >
> >     Also I think there is some question (in my mind) as to how we ensure
> >     some of the subtle contracted/non contracted semantics stay in
> >     place. As in they work a "certain way" and how confident is everyone
> >     that a "better way" does not end up causing some pain for someone
> >     using it currently. I assume this is a common case where a feature
> >     request is not being engaged with.
> >
> >
>
>
