Hi,

Classic Paxos doesn't have a leader. There are variants on the original
Lamport approach that elect a leader (or some other variation, like
Mencius) to improve throughput, latency, and performance under
contention. Cassandra implements the approach from the beginning of
"Paxos Made Simple" (https://goo.gl/SrP0Wb) with no additional
optimizations that I am aware of. There is no distinguished proposer
(leader). That paper does go on to discuss electing a distinguished
proposer, but that was never done for C*. I believe it's not considered
a good fit for C* philosophically.

Ariel

On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
> @Ariel Weisberg EPaxos looks very interesting, as it looks like it
> doesn't need any designated leader for C*. But I am assuming the
> Paxos that is implemented today for LWTs requires leader election,
> and if so, don't we need an odd number of nodes or racks or DCs to
> satisfy the N = 2F + 1 constraint to tolerate F failures? I
> understand it is not needed when not using LWTs, since Cassandra is
> a masterless system.
>
> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <k...@peernova.com> wrote:
>> Thanks Ariel! Yes, I knew there are many variations and
>> optimizations of Paxos. I just wanted to see if we had any plans to
>> improve the existing Paxos implementation, and it is great to see
>> that the work is in progress! I am going to follow that ticket and
>> read up on the references pointed to in it.
>>
>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> Hi,
>>>
>>> Cassandra's implementation of Paxos doesn't implement many
>>> optimizations that would drastically improve throughput and latency.
>>> You need consensus, but it doesn't have to be exorbitantly expensive
>>> and fall over under any kind of contention.
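As an aside on the N = 2F + 1 question quoted above: the constraint comes from majority quorums, not from leader election, and N does not strictly need to be odd; an even N just tolerates no more failures than N - 1 would. The arithmetic is a two-liner. A minimal sketch in plain Python (illustrative only, not Cassandra code):

```python
# Majority-quorum arithmetic for Paxos-style consensus (illustrative only).

def quorum_size(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1

def tolerated_failures(n):
    """Replicas that can fail while a majority remains reachable."""
    return n - quorum_size(n)

for f in range(1, 4):
    n = 2 * f + 1
    q = quorum_size(n)
    # Any two majorities of n replicas share at least one replica,
    # which is what lets a new Paxos round observe prior decisions.
    assert 2 * q - n >= 1
    assert tolerated_failures(n) == f
```

Note that quorum_size(4) == 3 and tolerated_failures(4) == 1, the same as for N = 3, which is why even replica counts buy you nothing for consensus.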
>>>
>>> For instance, you could implement EPaxos
>>> (https://issues.apache.org/jira/browse/CASSANDRA-6246[1]), batch
>>> multiple operations into the same Paxos round, have an affinity for
>>> a specific proposer for a specific partition, implement asynchronous
>>> commit, use a more efficient implementation of the Paxos log, and
>>> maybe other things.
>>>
>>> Ariel
>>>
>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>>> Hi Kant,
>>>>
>>>> If you read the published papers about Paxos, you will most
>>>> probably recognize that there is no way to "do it better". This is
>>>> a conceptual thing due to the nature of distributed systems + the
>>>> CAP theorem. If you want A+P in the triangle, then C is very
>>>> expensive. C* is made for A+P mostly, with tunable C. In ACID
>>>> databases this is a completely different thing, as they are mostly
>>>> either not partition tolerant, not highly available, or not
>>>> scalable (in a distributed manner, not speaking of "monolithic
>>>> super servers").
>>>>
>>>> There is no free lunch ...
>>>>
>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:
>>>>> "That’s the safety blanket everyone wants but is extremely
>>>>> expensive, especially in Cassandra."
>>>>>
>>>>> Yes, LWTs are expensive. Are there any plans to make this better?
>>>>>
>>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>> Hi Jon,
>>>>>>
>>>>>> Thanks a lot for your response. I am well aware that LWW != LWT,
>>>>>> but I was talking more in terms of LWW with respect to LWTs,
>>>>>> which I believe you answered. So thanks much!
>>>>>>
>>>>>> kant
>>>>>>
>>>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad
>>>>>> <jonathan.had...@gmail.com> wrote:
>>>>>>> LWT != last write wins. They are totally different.
>>>>>>>
>>>>>>> LWTs give you (assuming you also read at SERIAL) “atomic
>>>>>>> consistency”, meaning you are able to perform operations
>>>>>>> atomically and in isolation. That’s the safety blanket everyone
>>>>>>> wants but is extremely expensive, especially in Cassandra. The
>>>>>>> "lightweight" part, btw, may be a little optimistic, especially
>>>>>>> if a key is under contention. With regard to the "last write"
>>>>>>> part you’re asking about - with LWT, Cassandra provides the
>>>>>>> timestamp and manages it as part of the ballot, and it is
>>>>>>> always increasing. See
>>>>>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>>>>>> From the code:
>>>>>>>
>>>>>>> * Returns a timestamp suitable for paxos given the timestamp of
>>>>>>> the last known commit (or in progress update).
>>>>>>> * Paxos ensures that the timestamp it uses for commits respects
>>>>>>> the serial order of those commits. It does so
>>>>>>> * by having each replica reject any proposal whose timestamp is
>>>>>>> not strictly greater than the last proposal it
>>>>>>> * accepted. So in practice, which timestamp we use for a given
>>>>>>> proposal doesn't affect correctness but it does
>>>>>>> * affect the chance of making progress (if we pick a timestamp
>>>>>>> lower than what has been proposed before, our
>>>>>>> * new proposal will just get rejected).
>>>>>>>
>>>>>>> Effectively, Paxos removes the ability to use custom timestamps
>>>>>>> and addresses clock variance by rejecting ballots with
>>>>>>> timestamps less than what was last seen. You can learn more by
>>>>>>> reading through the other comments and code in that file.
>>>>>>>
>>>>>>> Last write wins is a free-for-all that guarantees you *nothing*
>>>>>>> except that the timestamp is used as a tiebreaker. Here we
>>>>>>> acknowledge things like the speed of light as being a real
>>>>>>> problem that isn’t going away anytime soon.
>>>>>>> This problem is sometimes addressed with event sourcing rather
>>>>>>> than mutating in place.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>
>>>>>>>> @Justin I read this article:
>>>>>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
>>>>>>>> It clearly says linearizable consistency can be achieved with
>>>>>>>> LWTs. So should I assume that linearizability, in the context
>>>>>>>> of the above article, is possible with LWTs and
>>>>>>>> synchronization of clocks through ntpd? Because LWTs also
>>>>>>>> follow last write wins, isn't it? Also, another question: do
>>>>>>>> most production clusters set up ntpd? If so, how long does it
>>>>>>>> take to sync? Any idea?
>>>>>>>>
>>>>>>>> @Michael Shuler Are you referring to something like TrueTime,
>>>>>>>> as in
>>>>>>>> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>>>>>>>> Actually, I never heard of setting up GPS modules and how that
>>>>>>>> can be helpful. Let me research that, but good point.
>>>>>>>>
>>>>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler
>>>>>>>> <mich...@pbandjelly.org> wrote:
>>>>>>>>> If you require the best precision you can get, setting up a
>>>>>>>>> pair of stratum 1 ntpd masters in each data center location
>>>>>>>>> with GPS modules is not terribly complex. Low latency and
>>>>>>>>> jitter on servers you manage. 140ms is a long way away
>>>>>>>>> network-wise, and I would suggest that was a poor choice of
>>>>>>>>> upstream (probably stratum 2 or 3) source.
>>>>>>>>>
>>>>>>>>> As Jonathan mentioned, there's no guarantee from Cassandra,
>>>>>>>>> but if you need as close as you can get, you'll probably need
>>>>>>>>> to do it yourself.
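To make Jon's contrast concrete, here is a toy sketch in plain Python (not Cassandra's actual code; see ClientState#getTimestampForPaxos for the real logic). Last write wins simply keeps the mutation with the highest timestamp, so a client with a fast clock can silently shadow a later write; the Paxos path instead forces each new ballot timestamp to be strictly greater than the last one seen, so timestamps never move backwards even when the clock does:

```python
# Toy contrast: last-write-wins tiebreak vs. monotonic Paxos-style
# ballot timestamps (illustrative sketch only).

def lww_resolve(a, b):
    """Keep the write with the larger timestamp; ties broken by value.
    Under clock skew, the write that happened later in real time can lose."""
    return a if (a["ts"], a["value"]) >= (b["ts"], b["value"]) else b

def next_paxos_ts(now_micros, last_ts):
    """Pick a timestamp strictly greater than the last one seen,
    mirroring the idea behind getTimestampForPaxos."""
    return max(now_micros, last_ts + 1)

# Writer B is later in real time but has a slow clock.
w_a = {"ts": 2_000, "value": "from A (fast clock)"}
w_b = {"ts": 1_500, "value": "from B (actually later)"}
print(lww_resolve(w_a, w_b)["value"])  # B's later write loses

# Paxos-style timestamps keep increasing even if the clock reads lower.
last = next_paxos_ts(2_000, 0)          # 2_000
print(next_paxos_ts(1_500, last))       # 2_001, despite the clock saying 1_500
```

This is why which timestamp a proposer picks doesn't affect Paxos correctness, only its chance of making progress, whereas under plain last write wins the timestamp is the whole story.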
>>>>>>>>>
>>>>>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org[2])
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kind regards, Michael
>>>>>>>>>
>>>>>>>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>>>>> > Hi Justin,
>>>>>>>>> >
>>>>>>>>> > There are a bunch of issues w.r.t. synchronization of
>>>>>>>>> > clocks when we used ntpd. Also, the time it took to sync
>>>>>>>>> > the clocks was approx 140ms (don't quote me on it, though,
>>>>>>>>> > because it is reported by our devops :)
>>>>>>>>> >
>>>>>>>>> > We have multiple clients (for example, a bunch of
>>>>>>>>> > microservices reading from Cassandra). I am not sure how
>>>>>>>>> > one can achieve linearizability by setting timestamps on
>>>>>>>>> > the clients, since there is no total ordering across
>>>>>>>>> > multiple clients.
>>>>>>>>> >
>>>>>>>>> > Thanks!
>>>>>>>>> >
>>>>>>>>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron
>>>>>>>>> > <jus...@instaclustr.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi Kant,
>>>>>>>>> >
>>>>>>>>> > Clock synchronization is important - you should ensure
>>>>>>>>> > that ntpd is properly configured on all nodes. If your
>>>>>>>>> > particular use case is especially sensitive to
>>>>>>>>> > out-of-order mutations, it is possible to set timestamps
>>>>>>>>> > on the client side using the drivers.
>>>>>>>>> >
>>>>>>>>> > https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>>>> >
>>>>>>>>> > We use our own NTP cluster to reduce clock drift as much
>>>>>>>>> > as possible, but public NTP servers are good enough for
>>>>>>>>> > most uses.
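On the client-side timestamps Justin mentions: the drivers ship monotonic timestamp generators for exactly this reason, so a single client never issues duplicate or decreasing timestamps even if its clock stalls or steps backwards. A minimal sketch of the idea in plain Python (not the drivers' actual code, and deliberately ignoring thread safety, which the real generators handle):

```python
import time

class MonotonicTimestamps:
    """Issues strictly increasing microsecond timestamps even if the
    system clock stalls or steps backwards. A sketch of the idea
    behind the drivers' monotonic timestamp generators."""

    def __init__(self):
        self._last = 0

    def next(self):
        now = int(time.time() * 1_000_000)
        # Never reuse or go below a previously issued timestamp.
        self._last = max(now, self._last + 1)
        return self._last

gen = MonotonicTimestamps()
ts = [gen.next() for _ in range(1000)]
assert all(a < b for a, b in zip(ts, ts[1:]))  # strictly increasing
```

Note this only gives a total order per client process; as pointed out above, it does nothing for ordering across multiple clients, which is where LWT/Paxos comes in.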
>>>>>>>>> >
>>>>>>>>> > https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>>>> >
>>>>>>>>> > Cheers, Justin
>>>>>>>>> >
>>>>>>>>> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>> >
>>>>>>>>> > How does Cassandra achieve linearizability with "last
>>>>>>>>> > write wins" (conflict resolution methods based on
>>>>>>>>> > time-of-day clocks)?
>>>>>>>>> >
>>>>>>>>> > Relying on synchronized clocks is almost certainly
>>>>>>>>> > non-linearizable, because clock timestamps cannot be
>>>>>>>>> > guaranteed to be consistent with actual event ordering
>>>>>>>>> > due to clock skew. Isn't it?
>>>>>>>>> >
>>>>>>>>> > Thanks!
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Justin Cameron
>>>>>>>>> > Senior Software Engineer | Instaclustr
>>>>>>>>> >
>>>>>>>>> > This email has been sent on behalf of Instaclustr Pty Ltd
>>>>>>>>> > (Australia) and Instaclustr Inc (USA).
>>>>>>>>> >
>>>>>>>>> > This email and any attachments may contain confidential
>>>>>>>>> > and legally privileged information. If you are not the
>>>>>>>>> > intended recipient, do not copy or disclose its content,
>>>>>>>>> > but please reply to this email immediately and highlight
>>>>>>>>> > the error to the sender and then immediately delete the
>>>>>>>>> > message.
>>>>
>>>> --
>>>> Benjamin Roth
>>>> Prokurist
>>>>
>>>> Jaumo GmbH · www.jaumo.com
>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>> Phone +49 7161 304880-6[3] · Fax +49 7161 304880-1[4]
>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Links:
1. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
2. http://pool.ntp.org/
3. tel:+49%207161%203048806
4. tel:+49%207161%203048801