Is there a way to apply the commits from this branch, https://github.com/bdeggleston/cassandra/tree/CASSANDRA-6246-trunk, to the Apache Cassandra 3.10 branch? I thought I could just merge the two branches, but it looks like there are several trunks, so I am not sure which one I would be merging into. I only want to merge it to try it out on my local machine.
Thanks!

On Wed, Feb 22, 2017 at 8:04 PM, Michael Shuler <mich...@pbandjelly.org> wrote:

> I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was a bulk move when removing the cassandra-3.X branch and the 3.x Jira version. There are likely other new feature tickets that should really say 4.x.
>
> --
> Kind regards,
> Michael
>
> On 02/22/2017 07:28 PM, Kant Kodali wrote:
> > I hope that patch is reviewed as quickly as possible. We use LWT's heavily and we are getting a throughput of 600 writes/sec and each write is 1KB in our case.
> >
> > On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >
> > On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >
> > Hi,
> >
> > No it's not going to be in 3.11.x. The earliest release it could make it into is 4.0.
> >
> > Ariel
> >
> > On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
> >> Hi Ariel,
> >>
> >> Can we really expect the fix in 3.11.x as the ticket https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
> >>
> >> Thanks,
> >> kant
> >>
> >> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>
> >> Hi,
> >>
> >> That would work and would help a lot with the dueling proposer issue.
> >>
> >> A lot of the leader election stuff is designed to reduce the number of roundtrips and not just address the dueling proposer issue. Those will have downtime because it's there for correctness. Just adding an affinity for a specific proposer is probably a free lunch.
> >>
> >> I don't think you can group keys because the Paxos proposals are per partition which is why we get linear scale out for Paxos.
> >> I don't believe it's linearizable across multiple partitions. You can use the clustering key and deterministically pick one of the live replicas for that clustering key. Sort the list of replicas by IP, hash the clustering key, use the hash as an index into the list of replicas.
> >>
> >> Batching is of limited usefulness because we only use Paxos for CAS I think? So in a batch by definition all but one will fail the CAS. This is something where a distinguished coordinator could help by failing the rest of the contending requests more inexpensively than it currently does.
> >>
> >> Ariel
> >>
> >> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
> >>>
> >>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Classic Paxos doesn't have a leader. There are variants on the original Lamport approach that will elect a leader (or some other variation like Mencius) to improve throughput, latency, and performance under contention. Cassandra implements the approach from the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware of. There is no distinguished proposer (leader).
> >>>
> >>> That paper does go on to discuss electing a distinguished proposer, but that was never done for C*. I believe it's not considered a good fit for C* philosophically.
> >>>
> >>> Ariel
> >>>
> >>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
> >>>> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't need any designated leader for C* but I am assuming the paxos that is implemented today for LWT's requires Leader election and If so, don't we need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1 constraint to tolerate F failures ? I understand it is not needed when not using LWT's since Cassandra is a master-less system.
> >>>>
> >>>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <k...@peernova.com> wrote:
> >>>>
> >>>> Thanks Ariel! Yes I knew there are so many variations and optimizations of Paxos. I just wanted to see if we had any plans on improving the existing Paxos implementation and it is great to see the work is under progress! I am going to follow that ticket and read up the references pointed in it
> >>>>
> >>>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Cassandra's implementation of Paxos doesn't implement many optimizations that would drastically improve throughput and latency. You need consensus, but it doesn't have to be exorbitantly expensive and fall over under any kind of contention.
> >>>>
> >>>> For instance you could implement EPaxos https://issues.apache.org/jira/browse/CASSANDRA-6246, batch multiple operations into the same Paxos round, have an affinity for a specific proposer for a specific partition, implement asynchronous commit, use a more efficient implementation of the Paxos log, and maybe other things.
> >>>>
> >>>> Ariel
> >>>>
> >>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
> >>>>> Hi Kant,
> >>>>>
> >>>>> If you read the published papers about Paxos, you will most probably recognize that there is no way to "do it better". This is a conceptional thing due to the nature of distributed systems + the CAP theorem.
> >>>>> If you want A+P in the triangle, then C is very expensive. CS is made for A+P mostly with tunable C. In ACID databases this is a completely different thing as they are mostly either not partition tolerant, not highly available or not scalable (in a distributed manner, not speaking of "monolithic super servers").
> >>>>>
> >>>>> There is no free lunch ...
> >>>>>
> >>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:
> >>>>>
> >>>>> "That’s the safety blanket everyone wants but is extremely expensive, especially in Cassandra."
> >>>>>
> >>>>> yes LWT's are expensive. Are there any plans to make this better?
> >>>>>
> >>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com> wrote:
> >>>>>
> >>>>> Hi Jon,
> >>>>>
> >>>>> Thanks a lot for your response. I am well aware that the LWW != LWT but I was talking more in terms of LWW with respective to LWT's which I believe you answered. so thanks much!
> >>>>>
> >>>>> kant
> >>>>>
> >>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <jonathan.had...@gmail.com> wrote:
> >>>>>
> >>>>> LWT != Last Write Wins. They are totally different.
> >>>>>
> >>>>> LWTs give you (assuming you also read at SERIAL) “atomic consistency”, meaning you are able to perform operations atomically and in isolation. That’s the safety blanket everyone wants but is extremely expensive, especially in Cassandra. The lightweight part, btw, may be a little optimistic, especially if a key is under contention. With regard to the “last write” part you’re asking about - w/ LWT Cassandra provides the timestamp and manages it as part of the ballot, and it always is increasing. See org.apache.cassandra.service.ClientState#getTimestampForPaxos. From the code:
> >>>>>
> >>>>> * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
> >>>>> * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
> >>>>> * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
> >>>>> * accepted. So in practice, which timestamp we use for a given proposal doesn't affect correctness but it does
> >>>>> * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
> >>>>> * new proposal will just get rejected).
> >>>>>
> >>>>> Effectively paxos removes the ability to use custom timestamps and addresses clock variance by rejecting ballots with timestamps less than what was last seen. You can learn more by reading through the other comments and code in that file.
> >>>>>
> >>>>> Last write wins is a free for all that guarantees you *nothing* except the timestamp is used as a tiebreaker.
> >>>>> Here we acknowledge things like the speed of light as being a real problem that isn’t going away anytime soon. This problem is sometimes addressed with event sourcing rather than mutating in place.
> >>>>>
> >>>>> Hope this helps.
> >>>>>
> >>>>> Jon
> >>>>>
> >>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
> >>>>>>
> >>>>>> @Justin I read this article <http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0>. And it clearly says Linearizable consistency can be achieved with LWT's. so should I assume the Linearizability in the context of the above article is possible with LWT's and synchronization of clocks through ntpd ? because LWT's also follow Last Write Wins. isn't it? Also another question does most of the production clusters do setup ntpd? If so what is the time it takes to sync? any idea
> >>>>>>
> >>>>>> @Micheal Schuler Are you referring to something like true time as in <https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf>? Actually I never heard of setting up GPS modules and how that can be helpful. Let me research on that but good point.
> >>>>>>
> >>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <mich...@pbandjelly.org> wrote:
> >>>>>>
> >>>>>> If you require the best precision you can get, setting up a pair of stratum 1 ntpd masters in each data center location with a GPS modules is not terribly complex. Low latency and jitter on servers you manage. 140ms is a long way away network-wise, and I would suggest that was a poor choice of upstream (probably stratum 2 or 3) source.
> >>>>>>
> >>>>>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you need as close as you can get, you'll probably need to do it yourself.
> >>>>>>
> >>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org)
> >>>>>>
> >>>>>> --
> >>>>>> Kind regards,
> >>>>>> Michael
> >>>>>>
> >>>>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
> >>>>>> > Hi Justin,
> >>>>>> >
> >>>>>> > There are bunch of issues w.r.t to synchronization of clocks when we used ntpd. Also the time it took to sync the clocks was approx 140ms (don't quote me on it though because it is reported by our devops :)
> >>>>>> >
> >>>>>> > we have multiple clients (for example bunch of micro services are reading from Cassandra) I am not sure how one can achieve Linearizability by setting timestamps on the clients ? since there is no total ordering across multiple clients.
> >>>>>> >
> >>>>>> > Thanks!
> >>>>>> >
> >>>>>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <jus...@instaclustr.com> wrote:
> >>>>>> >
> >>>>>> > Hi Kant,
> >>>>>> >
> >>>>>> > Clock synchronization is important - you should ensure that ntpd is properly configured on all nodes. If your particular use case is especially sensitive to out-of-order mutations it is possible to set timestamps on the client side using the drivers. https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
> >>>>>> >
> >>>>>> > We use our own NTP cluster to reduce clock drift as much as possible, but public NTP servers are good enough for most uses.
> >>>>>> > https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
> >>>>>> >
> >>>>>> > Cheers,
> >>>>>> > Justin
> >>>>>> >
> >>>>>> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com> wrote:
> >>>>>> >
> >>>>>> > How does Cassandra achieve Linearizability with “Last write wins” (conflict resolution methods based on time-of-day clocks) ?
> >>>>>> >
> >>>>>> > Relying on synchronized clocks are almost certainly non-linearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew. isn't it?
> >>>>>> >
> >>>>>> > Thanks!
> >>>>>> >
> >>>>>> > --
> >>>>>> > Justin Cameron
> >>>>>> > Senior Software Engineer | Instaclustr
> >>>>>> >
> >>>>>> > This email has been sent on behalf of Instaclustr Pty Ltd (Australia) and Instaclustr Inc (USA).
> >>>>>> >
> >>>>>> > This email and any attachments may contain confidential and legally privileged information. If you are not the intended recipient, do not copy or disclose its content, but please reply to this email immediately and highlight the error to the sender and then immediately delete the message.
> >>>>>
> >>>>> --
> >>>>> Benjamin Roth
> >>>>> Prokurist
> >>>>>
> >>>>> Jaumo GmbH · www.jaumo.com
> >>>>> Wehrstraße 46 · 73035 Göppingen · Germany
> >>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> >>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
> >>>
> >>> One thing that always bothered me: Intelligent clients and dynamic snitch are designed to attempt to route requests to the same node to attempt to take advantage of cache pinning etc. You would think under these conditions one could naturally elect a "leader" for a "group" of keys that could persist for a few hundred milliseconds and batch up the round trips for a number of operations. Maybe that is what the distinguished coordinator is in some regards.
> >
> > My two cents: The current issue is "feature complete" and the author stated ready for review 2 years ago. But I can see that as the issue stands it forces some hard choices to be made concerning the migration path and in depth code changes.
> >
> > Also I think there is some question (in my mind) as to how we ensure some of the subtle contracted/non contracted semantics stay in place. As in they work a "certain way" and how confident is everyone that a "better way" does not end up causing some pain for someone using it currently. I assume this as a common case where a feature request is not being engaged with.
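
For reference, the proposer-affinity idea Ariel describes in the quoted thread (sort the live replicas by IP, hash the key, use the hash as an index into the list) can be sketched roughly as below. This is only an illustration, not Cassandra code: the function name `pick_proposer`, the use of SHA-256, and the sample addresses are all assumptions made for the example.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, key):
    """Deterministically map a key to one of the live replicas.

    Hypothetical helper for illustration; nothing here is a Cassandra API.
    """
    if not live_replicas:
        raise ValueError("no live replicas to choose from")
    # Sort by numeric IP so every coordinator derives the same ordering
    # (assumes all addresses belong to one family, e.g. all IPv4).
    ordered = sorted(live_replicas, key=ipaddress.ip_address)
    # Hash the key; SHA-256 is an arbitrary choice for this sketch.
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an index into the sorted list.
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


# Made-up replica addresses and key, purely for demonstration.
replicas = ["10.0.0.12", "10.0.0.3", "10.0.0.7"]
chosen = pick_proposer(replicas, b"user:42")
print(chosen)
```

Any two coordinators that see the same set of live replicas would pick the same node for a given key, whatever order they hold the list in. When the live set changes, many keys will map to a different replica, so this gives affinity rather than any correctness guarantee.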