I updated the fix version on CASSANDRA-6246 to 4.x. The 3.11.x edit was a bulk move when removing the cassandra-3.X branch and the 3.x Jira version. There are likely other new feature tickets that should really say 4.x.
--
Kind regards,
Michael

On 02/22/2017 07:28 PM, Kant Kodali wrote:
> I hope that patch is reviewed as quickly as possible. We use LWT's
> heavily and we are getting a throughput of 600 writes/sec and each write
> is 1KB in our case.
>
> On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi,
>
> No it's not going to be in 3.11.x. The earliest release it could
> make it into is 4.0.
>
> Ariel
>
> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the ticket
>> https://issues.apache.org/jira/browse/CASSANDRA-6246
>> says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce the number of
>> roundtrips and not just address the dueling proposer issue. Those will
>> have downtime because it's there for correctness. Just adding an
>> affinity for a specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't
>> believe it's linearizable across multiple partitions. You can use the
>> clustering key and deterministically pick one of the live replicas for
>> that clustering key. Sort the list of replicas by IP, hash the
>> clustering key, use the hash as an index into the list of replicas.
>>
>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS.
>> This is something where a distinguished coordinator could help by
>> failing the rest of the contending requests more inexpensively than it
>> currently does.
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>>
>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> Hi,
>>>
>>> Classic Paxos doesn't have a leader. There are variants on the
>>> original Lamport approach that will elect a leader (or some other
>>> variation like Mencius) to improve throughput, latency, and
>>> performance under contention. Cassandra implements the approach from
>>> the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no
>>> additional optimizations that I am aware of. There is no
>>> distinguished proposer (leader).
>>>
>>> That paper does go on to discuss electing a distinguished proposer,
>>> but that was never done for C*. I believe it's not considered a good
>>> fit for C* philosophically.
>>>
>>> Ariel
>>>
>>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>>> @Ariel Weisberg EPaxos looks very interesting as it looks like it
>>>> doesn't need any designated leader for C* but I am assuming the
>>>> paxos that is implemented today for LWT's requires leader election
>>>> and if so, don't we need to have an odd number of nodes or racks or
>>>> DC's to satisfy the N = 2F + 1 constraint to tolerate F failures? I
>>>> understand it is not needed when not using LWT's since Cassandra is
>>>> a master-less system.
>>>>
>>>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>
>>>> Thanks Ariel! Yes I knew there are so many variations and
>>>> optimizations of Paxos. I just wanted to see if we had any plans on
>>>> improving the existing Paxos implementation and it is great to see
>>>> the work is in progress!
>>>> I am going to follow that ticket and read up on the references
>>>> pointed to in it.
>>>>
>>>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Cassandra's implementation of Paxos doesn't implement many
>>>> optimizations that would drastically improve throughput and latency.
>>>> You need consensus, but it doesn't have to be exorbitantly expensive
>>>> and fall over under any kind of contention.
>>>>
>>>> For instance you could implement EPaxos
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6246,
>>>> batch multiple operations into the same Paxos round, have an
>>>> affinity for a specific proposer for a specific partition, implement
>>>> asynchronous commit, use a more efficient implementation of the
>>>> Paxos log, and maybe other things.
>>>>
>>>> Ariel
>>>>
>>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>>>> Hi Kant,
>>>>>
>>>>> If you read the published papers about Paxos, you will most
>>>>> probably recognize that there is no way to "do it better". This is
>>>>> a conceptual thing due to the nature of distributed systems + the
>>>>> CAP theorem.
>>>>> If you want A+P in the triangle, then C is very expensive. CS is
>>>>> made for A+P mostly with tunable C. In ACID databases this is a
>>>>> completely different thing as they are mostly either not partition
>>>>> tolerant, not highly available or not scalable (in a distributed
>>>>> manner, not speaking of "monolithic super servers").
>>>>>
>>>>> There is no free lunch ...
>>>>>
>>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:
>>>>>
>>>>> "That’s the safety blanket everyone wants but is extremely
>>>>> expensive, especially in Cassandra."
>>>>>
>>>>> yes LWT's are expensive. Are there any plans to make this better?
>>>>>
>>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>
>>>>> Hi Jon,
>>>>>
>>>>> Thanks a lot for your response. I am well aware that LWW != LWT
>>>>> but I was talking more in terms of LWW with respect to LWT's which
>>>>> I believe you answered. so thanks much!
>>>>>
>>>>> kant
>>>>>
>>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <jonathan.had...@gmail.com> wrote:
>>>>>
>>>>> LWT != Last Write Wins. They are totally different.
>>>>>
>>>>> LWTs give you (assuming you also read at SERIAL) “atomic
>>>>> consistency”, meaning you are able to perform operations atomically
>>>>> and in isolation. That’s the safety blanket everyone wants but is
>>>>> extremely expensive, especially in Cassandra. The lightweight part,
>>>>> btw, may be a little optimistic, especially if a key is under
>>>>> contention. With regard to the “last write” part you’re asking
>>>>> about - w/ LWT Cassandra provides the timestamp and manages it as
>>>>> part of the ballot, and it always is increasing. See
>>>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>>>> From the code:
>>>>>
>>>>> * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
>>>>> * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
>>>>> * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
>>>>> * accepted.
>>>>> So in practice, which timestamp we use for a given proposal doesn't affect correctness but it does
>>>>> * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
>>>>> * new proposal will just get rejected).
>>>>>
>>>>> Effectively paxos removes the ability to use custom timestamps and
>>>>> addresses clock variance by rejecting ballots with timestamps less
>>>>> than what was last seen. You can learn more by reading through the
>>>>> other comments and code in that file.
>>>>>
>>>>> Last write wins is a free for all that guarantees you *nothing*
>>>>> except the timestamp is used as a tiebreaker. Here we acknowledge
>>>>> things like the speed of light as being a real problem that isn’t
>>>>> going away anytime soon. This problem is sometimes addressed with
>>>>> event sourcing rather than mutating in place.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Jon
>>>>>
>>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>
>>>>>> @Justin I read this article
>>>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
>>>>>> And it clearly says Linearizable consistency can be achieved with
>>>>>> LWT's. So should I assume the Linearizability in the context of
>>>>>> the above article is possible with LWT's and synchronization of
>>>>>> clocks through ntpd? Because LWT's also follow Last Write Wins,
>>>>>> don't they? Also another question: do most production clusters
>>>>>> set up ntpd? If so what is the time it takes to sync?
>>>>>> Any idea?
>>>>>>
>>>>>> @Michael Shuler Are you referring to something like true time as in
>>>>>> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>>>>>>
>>>>>> Actually I never heard of setting up GPS modules and how that can
>>>>>> be helpful. Let me research that, but good point.
>>>>>>
>>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <mich...@pbandjelly.org> wrote:
>>>>>>
>>>>>> If you require the best precision you can get, setting up a pair
>>>>>> of stratum 1 ntpd masters in each data center location with GPS
>>>>>> modules is not terribly complex. Low latency and jitter on servers
>>>>>> you manage. 140ms is a long way away network-wise, and I would
>>>>>> suggest that was a poor choice of upstream (probably stratum 2 or
>>>>>> 3) source.
>>>>>>
>>>>>> As Jonathan mentioned, there's no guarantee from Cassandra, but if
>>>>>> you need as close as you can get, you'll probably need to do it
>>>>>> yourself.
>>>>>>
>>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org)
>>>>>>
>>>>>> --
>>>>>> Kind regards,
>>>>>> Michael
>>>>>>
>>>>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>> > Hi Justin,
>>>>>> >
>>>>>> > There are a bunch of issues w.r.t. synchronization of clocks
>>>>>> > when we used ntpd.
>>>>>> > Also the time it took to sync the clocks was approx 140ms
>>>>>> > (don't quote me on it though because it is reported by our
>>>>>> > devops :)
>>>>>> >
>>>>>> > We have multiple clients (for example a bunch of microservices
>>>>>> > are reading from Cassandra). I am not sure how one can achieve
>>>>>> > Linearizability by setting timestamps on the clients, since
>>>>>> > there is no total ordering across multiple clients.
>>>>>> >
>>>>>> > Thanks!
>>>>>> >
>>>>>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <jus...@instaclustr.com> wrote:
>>>>>> >
>>>>>> > Hi Kant,
>>>>>> >
>>>>>> > Clock synchronization is important - you should ensure that
>>>>>> > ntpd is properly configured on all nodes. If your particular use
>>>>>> > case is especially sensitive to out-of-order mutations it is
>>>>>> > possible to set timestamps on the client side using the drivers.
>>>>>> > https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>> >
>>>>>> > We use our own NTP cluster to reduce clock drift as much as
>>>>>> > possible, but public NTP servers are good enough for most uses.
>>>>>> > https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>> >
>>>>>> > Cheers,
>>>>>> > Justin
>>>>>> >
>>>>>> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com> wrote:
>>>>>> >
>>>>>> > How does Cassandra achieve Linearizability with “Last write
>>>>>> > wins” (conflict resolution methods based on time-of-day clocks)?
>>>>>> >
>>>>>> > Relying on synchronized clocks is almost certainly
>>>>>> > non-linearizable, because clock timestamps cannot be guaranteed
>>>>>> > to be consistent with actual event ordering due to clock skew,
>>>>>> > isn't it?
>>>>>> >
>>>>>> > Thanks!
>>>>>> >
>>>>>> > --
>>>>>> > Justin Cameron
>>>>>> > Senior Software Engineer | Instaclustr
>>>>>> >
>>>>>> > This email has been sent on behalf of Instaclustr Pty Ltd
>>>>>> > (Australia) and Instaclustr Inc (USA).
>>>>>> >
>>>>>> > This email and any attachments may contain confidential and
>>>>>> > legally privileged information. If you are not the intended
>>>>>> > recipient, do not copy or disclose its content, but please reply
>>>>>> > to this email immediately and highlight the error to the sender
>>>>>> > and then immediately delete the message.
>>>>>
>>>>> --
>>>>> Benjamin Roth
>>>>> Prokurist
>>>>>
>>>>> Jaumo GmbH · www.jaumo.com
>>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>
>>> One thing that always bothered me: Intelligent clients and dynamic
>>> snitch are designed to attempt to route requests to the same node to
>>> attempt to take advantage of cache pinning etc. You would think under
>>> these conditions one could naturally elect a "leader" for a "group"
>>> of keys that could persist for a few hundred milliseconds and batch
>>> up the round trips for a number of operations. Maybe that is what the
>>> distinguished coordinator is in some regards.
>
> My two cents: The current issue is "feature complete" and the author
> stated ready for review 2 years ago. But I can see that as the issue
> stands it forces some hard choices to be made concerning the
> migration path and in depth code changes.
>
> Also I think there is some question (in my mind) as to how we ensure
> some of the subtle contracted/non contracted semantics stay in
> place. As in they work a "certain way" and how confident is everyone
> that a "better way" does not end up causing some pain for someone
> using it currently. I assume this as a common case where a feature
> request is not being engaged with.
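
For reference, here is a minimal sketch of the proposer affinity Ariel describes earlier in the thread: sort the live replicas by IP, hash the clustering key, and use the hash as an index into the list. The function name `pick_proposer`, the sample addresses, and the choice of SHA-256 as the hash are assumptions made for illustration; none of this is Cassandra code.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, clustering_key):
    """Deterministically pick one live replica for a clustering key.

    Hypothetical helper: live_replicas is a list of IP address strings,
    clustering_key is a string.
    """
    if not live_replicas:
        raise ValueError("no live replicas")
    # Sort by numeric IP so every coordinator sees the same ordering,
    # regardless of the order in which it learned about the replicas.
    ordered = sorted(live_replicas, key=lambda ip: int(ipaddress.ip_address(ip)))
    # Use a stable hash; Python's built-in hash() of a str is randomized
    # per process, so it would not agree across coordinators.
    digest = hashlib.sha256(clustering_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


replicas = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]
first = pick_proposer(replicas, "user:42")
# A coordinator that sees the same replicas in a different order agrees.
second = pick_proposer(list(reversed(replicas)), "user:42")
print(first, second)
```

Because every coordinator computes the same ordering and the same hash, requests for one key converge on one replica while the live set is stable. When the live set changes the choice can move, so under this sketch the affinity is only a hint that reduces dueling proposers; the Paxos round itself is still what provides correctness.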