Hi,
No, it's not going to be in 3.11.x. The earliest release it could make it into is 4.0.

Ariel

On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
> Hi Ariel,
>
> Can we really expect the fix in 3.11.x, as the ticket https://issues.apache.org/jira/browse/CASSANDRA-6246[1] says?
>
> Thanks,
> kant
>
> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>> Hi,
>>
>> That would work, and it would help a lot with the dueling-proposer issue.
>>
>> A lot of the leader election schemes are designed to reduce the number of round trips, not just to address the dueling-proposer issue. Those schemes see downtime when the leader fails, because the leader is there for correctness. Just adding an affinity for a specific proposer is probably a free lunch.
>>
>> I don't think you can group keys, because Paxos proposals are per partition; that is why we get linear scale-out for Paxos. I don't believe it's linearizable across multiple partitions. You can use the clustering key and deterministically pick one of the live replicas for that clustering key: sort the list of replicas by IP, hash the clustering key, and use the hash as an index into the list of replicas.
>>
>> Batching is of limited usefulness because we only use Paxos for CAS, I think? So in a batch, by definition, all but one operation will fail the CAS. This is something where a distinguished coordinator could help, by failing the rest of the contending requests more cheaply than it currently does.
>>
>> Ariel
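For what it's worth, a minimal sketch of that proposer-affinity computation could look like the following. The ProposerAffinity class and pickProposer method are hypothetical names, not Cassandra internals, and it assumes the coordinator already knows at least one live replica for the key:

    import java.net.InetAddress;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public final class ProposerAffinity {
        // Deterministically pick a preferred proposer for a key: sort the
        // live replicas by IP so every coordinator computes the same order,
        // hash the key, and use the hash as an index into the sorted list.
        public static InetAddress pickProposer(List<InetAddress> liveReplicas, ByteBuffer key) {
            List<InetAddress> sorted = new ArrayList<>(liveReplicas);
            sorted.sort(Comparator.comparing(InetAddress::getHostAddress));
            // ByteBuffer.hashCode() is content-based; floorMod keeps the
            // index non-negative for negative hash codes.
            int index = Math.floorMod(key.hashCode(), sorted.size());
            return sorted.get(index);
        }
    }

As long as all coordinators agree on the live set, contending clients converge on the same proposer without any election; if that node dies, the live set changes and a new proposer falls out of the same computation.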
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>> Hi,
>>>>
>>>> Classic Paxos doesn't have a leader. There are variants on the original Lamport approach that elect a leader (or some other variation, like Mencius) to improve throughput, latency, and performance under contention. Cassandra implements the approach from the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware of. There is no distinguished proposer (leader).
>>>>
>>>> That paper does go on to discuss electing a distinguished proposer, but that was never done for C*. I believe it's not considered a good fit for C* philosophically.
>>>>
>>>> Ariel
>>>>
>>>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>>>> @Ariel Weisberg EPaxos looks very interesting, since it doesn't seem to need a designated leader for C*. But I am assuming the Paxos implemented today for LWTs requires leader election, and if so, don't we need an odd number of nodes, racks, or DCs to satisfy the N = 2F + 1 constraint to tolerate F failures? I understand it is not needed when not using LWTs, since Cassandra is a masterless system.
>>>>>
>>>>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>> Thanks Ariel! Yes, I knew there are many variations and optimizations of Paxos. I just wanted to see if we had any plans to improve the existing Paxos implementation, and it is great to see the work is in progress! I am going to follow that ticket and read up on the references pointed to in it.
>>>>>>
>>>>>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Cassandra's implementation of Paxos doesn't implement many optimizations that would drastically improve throughput and latency. You need consensus, but it doesn't have to be exorbitantly expensive and fall over under any kind of contention.
>>>>>>>
>>>>>>> For instance, you could implement EPaxos (https://issues.apache.org/jira/browse/CASSANDRA-6246[2]), batch multiple operations into the same Paxos round, have an affinity for a specific proposer for a specific partition, implement asynchronous commit, use a more efficient implementation of the Paxos log, and maybe other things.
>>>>>>>
>>>>>>> Ariel
>>>>>>>
>>>>>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>>>>>>> Hi Kant,
>>>>>>>>
>>>>>>>> If you read the published papers about Paxos, you will most probably recognize that there is no way to "do it better". This is a conceptual consequence of the nature of distributed systems plus the CAP theorem. If you want A and P in the triangle, then C is very expensive. C* is built mostly for A and P, with tunable C. ACID databases are a completely different thing, as they are mostly either not partition tolerant, not highly available, or not scalable (in a distributed manner; not speaking of "monolithic super servers").
>>>>>>>>
>>>>>>>> There is no free lunch ...
>>>>>>>>
>>>>>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:
>>>>>>>>> "That's the safety blanket everyone wants but is extremely expensive, especially in Cassandra."
>>>>>>>>>
>>>>>>>>> Yes, LWTs are expensive. Are there any plans to make this better?
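To make the cost being discussed concrete, this is roughly what an LWT looks like from the 3.x Java driver. The keyspace, table, and values are made up for illustration:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class LwtExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("ks");
            // The IF clause turns this write into a compare-and-set: the
            // coordinator runs a Paxos round for the partition instead of
            // a plain quorum write, which is where the extra cost comes from.
            ResultSet rs = session.execute(
                "UPDATE accounts SET balance = 90 WHERE id = 42 IF balance = 100");
            if (!rs.wasApplied()) {
                // The condition failed; the row's current values are
                // returned so the caller can decide whether to retry.
            }
            cluster.close();
        }
    }

Under contention, all but one of the concurrent rounds for a partition fail the CAS, which is the batching limitation Ariel mentions above.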
>>>>>>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>> Hi Jon,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your response. I am well aware that LWW != LWT, but I was asking about LWW with respect to LWTs, which I believe you answered. So thanks much!
>>>>>>>>>>
>>>>>>>>>> kant
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <jonathan.had...@gmail.com> wrote:
>>>>>>>>>>> LWT != Last Write Wins. They are totally different.
>>>>>>>>>>>
>>>>>>>>>>> LWTs give you (assuming you also read at SERIAL) "atomic consistency", meaning you are able to perform operations atomically and in isolation. That's the safety blanket everyone wants, but it is extremely expensive, especially in Cassandra. The "lightweight" part, btw, may be a little optimistic, especially if a key is under contention. With regard to the "last write" part you're asking about: with LWT, Cassandra provides the timestamp and manages it as part of the ballot, and it is always increasing. See org.apache.cassandra.service.ClientState#getTimestampForPaxos. From the code:
>>>>>>>>>>>
>>>>>>>>>>>  * Returns a timestamp suitable for paxos given the timestamp of the last known commit (or in progress update).
>>>>>>>>>>>  * Paxos ensures that the timestamp it uses for commits respects the serial order of those commits. It does so
>>>>>>>>>>>  * by having each replica reject any proposal whose timestamp is not strictly greater than the last proposal it
>>>>>>>>>>>  * accepted. So in practice, which timestamp we use for a given proposal doesn't affect correctness, but it does
>>>>>>>>>>>  * affect the chance of making progress (if we pick a timestamp lower than what has been proposed before, our
>>>>>>>>>>>  * new proposal will just get rejected).
>>>>>>>>>>>
>>>>>>>>>>> Effectively, paxos removes the ability to use custom timestamps and addresses clock variance by rejecting ballots with timestamps less than what was last seen. You can learn more by reading through the other comments and code in that file.
>>>>>>>>>>>
>>>>>>>>>>> Last write wins is a free-for-all that guarantees you *nothing* except that the timestamp is used as a tiebreaker. Here we acknowledge things like the speed of light as being a real problem that isn't going away anytime soon. This problem is sometimes addressed with event sourcing rather than mutating in place.
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> Jon
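A toy version of the rule that comment describes might look like this. It is a sketch only, not the actual ClientState implementation:

    import java.util.concurrent.atomic.AtomicLong;

    public final class BallotTimestamps {
        private final AtomicLong lastMicros = new AtomicLong();

        // Return a timestamp strictly greater than anything handed out
        // before, even if the wall clock stalls or steps backwards.
        public long nextBallotMicros() {
            while (true) {
                long now = System.currentTimeMillis() * 1000;
                long last = lastMicros.get();
                long next = Math.max(now, last + 1);
                if (lastMicros.compareAndSet(last, next)) {
                    return next;
                }
            }
        }
    }

Correctness doesn't depend on the wall clock here; a lagging clock just makes proposals more likely to be rejected and retried, exactly as the comment says.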
>>>>>>>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>>>> @Justin I read this article: http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0. It clearly says linearizable consistency can be achieved with LWTs. So should I assume that the linearizability described in that article is possible with LWTs plus synchronization of clocks through ntpd? Because LWTs also follow last write wins, don't they? Another question: do most production clusters set up ntpd? If so, how long does it take to sync? Any idea?
>>>>>>>>>>>>
>>>>>>>>>>>> @Michael Shuler Are you referring to something like TrueTime, as in https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf? Actually, I had never heard of setting up GPS modules or how they can help. Let me research that, but good point.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <mich...@pbandjelly.org> wrote:
>>>>>>>>>>>>> If you require the best precision you can get, setting up a pair of stratum 1 ntpd masters in each data center location with GPS modules is not terribly complex. Low latency and jitter on servers you manage. 140ms is a long way away network-wise, and I would suggest that was a poor choice of upstream (probably stratum 2 or 3) source.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you need as close as you can get, you'll probably need to do it yourself.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org[3])
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Kind regards, Michael
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>>>>>>>>>> Hi Justin,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are a bunch of issues w.r.t. synchronization of clocks when we used ntpd. Also, the time it took to sync the clocks was approx 140ms (don't quote me on that, though, because it was reported by our devops :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have multiple clients (for example, a bunch of microservices reading from Cassandra), and I am not sure how one can achieve linearizability by setting timestamps on the clients, since there is no total ordering across multiple clients.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <jus...@instaclustr.com> wrote:
>>>>>>>>>>>>>>> Hi Kant,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Clock synchronization is important - you should ensure that ntpd is properly configured on all nodes. If your particular use case is especially sensitive to out-of-order mutations, it is possible to set timestamps on the client side using the drivers: https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We use our own NTP cluster to reduce clock drift as much as possible, but public NTP servers are good enough for most uses: https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers, Justin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>>>>>>>> How does Cassandra achieve linearizability with "last write wins" (conflict resolution based on time-of-day clocks)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Relying on synchronized clocks is almost certainly non-linearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering, due to clock skew. Isn't it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
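Following Justin's link, client-side timestamps with the 3.1 Java driver look roughly like this. The contact point, keyspace, and query are placeholders, and the API names are recalled from the query_timestamps manual page linked above, so double-check them against your driver version:

    import com.datastax.driver.core.AtomicMonotonicTimestampGenerator;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ClientTimestamps {
        public static void main(String[] args) {
            // Generate strictly increasing timestamps for every statement
            // issued from this process.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
                    .build();
            Session session = cluster.connect("ks");

            // Or pin an explicit timestamp (in microseconds) on one statement.
            SimpleStatement stmt = new SimpleStatement(
                "INSERT INTO events (id, payload) VALUES (1, 'x')");
            stmt.setDefaultTimestamp(System.currentTimeMillis() * 1000);
            session.execute(stmt);
            cluster.close();
        }
    }

Note that, as Kant points out above, this orders writes issued from a single client process; it does not give a total order across independent clients.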
>>> One thing that always bothered me: intelligent clients and the dynamic snitch are designed to route requests to the same node, to take advantage of cache pinning and the like. You would think that under these conditions one could naturally elect a "leader" for a "group" of keys that persists for a few hundred milliseconds and batches up the round trips for a number of operations. Maybe that is what the distinguished coordinator is, in some regards.

>> Links:
>> 1. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
>> 2. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
>> 3. http://pool.ntp.org/