Hi,


Classic Paxos doesn't have a leader. There are variants on the
original Lamport approach that will elect a leader (or some other
variation like Mencius) to improve throughput, latency, and
performance under contention. Cassandra implements the approach from
the beginning of "Paxos Made Simple" (https://goo.gl/SrP0Wb) with no
additional optimizations that I am aware of. There is no distinguished
proposer (leader).
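
To make "no distinguished proposer" concrete, here is a rough
single-decree sketch in Python. It is not Cassandra's code: Acceptor,
propose, and the (timestamp, proposer id) ballot shape are names I made
up for illustration, and it ignores networking, retries, and the commit
phase. The point is only that any node can run propose, and that a
replica turns down a ballot that is not newer than what it has seen.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical ballot shape: (timestamp, proposer id), compared as a tuple.
Ballot = Tuple[int, str]


@dataclass
class Acceptor:
    promised: Optional[Ballot] = None         # highest ballot promised so far
    accepted_ballot: Optional[Ballot] = None  # ballot of the accepted value
    accepted_value: Optional[str] = None

    def prepare(self, ballot: Ballot):
        """Phase 1: promise only if the ballot is strictly newer."""
        if self.promised is not None and ballot <= self.promised:
            return False, self.accepted_ballot, self.accepted_value
        self.promised = ballot
        return True, self.accepted_ballot, self.accepted_value

    def accept(self, ballot: Ballot, value: str) -> bool:
        """Phase 2: accept unless a newer ballot has been promised."""
        if self.promised is not None and ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted_ballot = ballot
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], ballot: Ballot, value: str):
    """Run one round from any node. Returns the chosen value, or None."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [(b, v) for ok, b, v in replies if ok]
    if len(promises) < quorum:
        return None  # lost to a newer ballot; caller retries with a higher one
    # If any acceptor already accepted a value, the newest one must be reused.
    prior = [(b, v) for b, v in promises if b is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None
```

Two proposers racing on the same key keep bumping each other's ballots,
which is why contention hurts without a leader or some other
optimization.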


That paper does go on to discuss electing a distinguished proposer, but
that was never done for C*. I believe it's not considered a good fit for
C* philosophically.
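
On the N = 2F + 1 question quoted below: the majority-quorum arithmetic
is the same with or without a leader. A small sketch, assuming n is the
number of acceptors taking part in a round (for an LWT I believe that is
the replica set for the partition, but treat that as my assumption); the
function names are made up.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of n acceptors."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Acceptors that can be down while a majority is still reachable."""
    return (n - 1) // 2


for n in range(3, 7):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
```

So an even count still works. It just buys no more fault tolerance than
the odd count below it, which is why 2F + 1 is the usual sizing.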


Ariel



On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:

> @Ariel Weisberg EPaxos looks very interesting, as it doesn't seem to
> need any designated leader for C*. I am assuming the Paxos that is
> implemented today for LWTs requires leader election. If so, don't we
> need an odd number of nodes, racks, or DCs to satisfy the
> N = 2F + 1 constraint to tolerate F failures? I understand this is
> not needed when not using LWTs, since Cassandra is a masterless
> system.
> 

> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali
> <k...@peernova.com> wrote:
>> Thanks Ariel! Yes, I knew there are many variations and
>> optimizations of Paxos. I just wanted to see if we had any plans for
>> improving the existing Paxos implementation, and it is great to see
>> the work is in progress! I am going to follow that ticket and read
>> up on the references it points to.
>> 

>> 

>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg
>> <ar...@weisberg.ws> wrote:

>>> Hi,

>>> 

>>> Cassandra's implementation of Paxos doesn't implement many
>>> optimizations that would drastically improve throughput and latency.
>>> You need consensus, but it doesn't have to be exorbitantly expensive
>>> and fall over under any kind of contention.
>>> 

>>> For instance you could implement EPaxos
>>> https://issues.apache.org/jira/browse/CASSANDRA-6246[1], batch
>>> multiple operations into the same Paxos round, have an affinity for
>>> a specific proposer for a specific partition, implement asynchronous
>>> commit, use a more efficient implementation of the Paxos log, and
>>> maybe other things.
>>> 

>>> 

>>> Ariel

>>> 

>>> 

>>> 

>>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:

>>>> Hi Kant,

>>>> 

>>>> If you read the published papers about Paxos, you will most
>>>> probably recognize that there is no way to "do it better". This is
>>>> a conceptual limit that follows from the nature of distributed
>>>> systems plus the CAP theorem.
>>>> If you want A+P in the triangle, then C is very expensive.
>>>> Cassandra is made mostly for A+P, with tunable C. ACID databases
>>>> are a completely different thing, as they are mostly either not
>>>> partition tolerant, not highly available, or not scalable (in a
>>>> distributed manner, not speaking of "monolithic super servers").
>>>> 

>>>> There is no free lunch ...

>>>> 

>>>> 

>>>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <k...@peernova.com>:

>>>>> "That’s the safety blanket everyone wants but is extremely
>>>>> expensive, especially in Cassandra."
>>>>> 

>>>>> yes LWT's are expensive. Are there any plans to make this better?
>>>>> 

>>>>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <k...@peernova.com>
>>>>> wrote:
>>>>>> Hi Jon,

>>>>>> 

>>>>>> Thanks a lot for your response. I am well aware that LWW !=
>>>>>> LWT, but I was asking more about LWW with respect to LWTs,
>>>>>> which I believe you answered. So thanks much!
>>>>>> 

>>>>>> 

>>>>>> kant

>>>>>> 

>>>>>> 

>>>>>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad
>>>>>> <jonathan.had...@gmail.com> wrote:
>>>>>>> LWT != Last Write Wins.  They are totally different.  

>>>>>>> 

>>>>>>> LWTs give you (assuming you also read at SERIAL) “atomic
>>>>>>> consistency”, meaning you are able to perform operations
>>>>>>> atomically and in isolation.  That’s the safety blanket everyone
>>>>>>> wants but is extremely expensive, especially in Cassandra.  The
>>>>>>> lightweight part, btw, may be a little optimistic, especially if
>>>>>>> a key is under contention.  With regard to the “last write” part
>>>>>>> you’re asking about - w/ LWT Cassandra provides the timestamp
>>>>>>> and manages it as part of the ballot, and it always is
>>>>>>> increasing.  See
>>>>>>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>>>>>>> From the code:
>>>>>>> 

>>>>>>>  * Returns a timestamp suitable for paxos given the timestamp of
>>>>>>>  * the last known commit (or in progress update).
>>>>>>>  * Paxos ensures that the timestamp it uses for commits respects
>>>>>>>  * the serial order of those commits. It does so by having each
>>>>>>>  * replica reject any proposal whose timestamp is not strictly
>>>>>>>  * greater than the last proposal it accepted. So in practice,
>>>>>>>  * which timestamp we use for a given proposal doesn't affect
>>>>>>>  * correctness but it does affect the chance of making progress
>>>>>>>  * (if we pick a timestamp lower than what has been proposed
>>>>>>>  * before, our new proposal will just get rejected).

>>>>>>> 

>>>>>>> Effectively paxos removes the ability to use custom timestamps
>>>>>>> and addresses clock variance by rejecting ballots with
>>>>>>> timestamps less than what was last seen.  You can learn more by
>>>>>>> reading through the other comments and code in that file.
>>>>>>> 

>>>>>>> Last write wins is a free for all that guarantees you *nothing*
>>>>>>> except the timestamp is used as a tiebreaker.  Here we
>>>>>>> acknowledge things like the speed of light as being a real
>>>>>>> problem that isn’t going away anytime soon.  This problem is
>>>>>>> sometimes addressed with event sourcing rather than mutating in
>>>>>>> place.
>>>>>>> 

>>>>>>> Hope this helps.

>>>>>>> 

>>>>>>> 

>>>>>>> Jon

>>>>>>> 

>>>>>>> 

>>>>>>> 

>>>>>>> 

>>>>>>>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <k...@peernova.com>
>>>>>>>> wrote:
>>>>>>>> 

>>>>>>>> @Justin I read this article
>>>>>>>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
>>>>>>>> It clearly says linearizable consistency can be achieved with
>>>>>>>> LWTs. So should I assume that linearizability, in the context
>>>>>>>> of that article, is possible with LWTs plus clock
>>>>>>>> synchronization through ntpd, because LWTs also follow Last
>>>>>>>> Write Wins? Another question: do most production clusters set
>>>>>>>> up ntpd? If so, how long does it take to sync? Any idea?
>>>>>>>> 

>>>>>>>> @Michael Shuler Are you referring to something like TrueTime
>>>>>>>> as in
>>>>>>>> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>>>>>>>> Actually, I had never heard of setting up GPS modules or how
>>>>>>>> that can help. Let me research that, but good point.
>>>>>>>> 

>>>>>>>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler
>>>>>>>> <mich...@pbandjelly.org> wrote:
>>>>>>>>> If you require the best precision you can get, setting up a
>>>>>>>>> pair of stratum 1 ntpd masters with GPS modules in each data
>>>>>>>>> center location is not terribly complex. You get low latency
>>>>>>>>> and jitter on servers you manage. 140ms is a long way away
>>>>>>>>> network-wise, and I would suggest that was a poor choice of
>>>>>>>>> upstream (probably stratum 2 or 3) source.

>>>>>>>>> 

>>>>>>>>> As Jonathan mentioned, there's no guarantee from Cassandra,
>>>>>>>>> but if you need as close as you can get, you'll probably need
>>>>>>>>> to do it yourself.
>>>>>>>>> 

>>>>>>>>> (I run several stratum 2 ntpd servers for pool.ntp.org[2])

>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>  Kind regards, Michael
>>>>>>>>>
>>>>>>>>>  On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>>>>>>>  > Hi Justin,
>>>>>>>>>  >
>>>>>>>>>  > There are a bunch of issues w.r.t. synchronization of
>>>>>>>>>  > clocks when we used ntpd. Also, the time it took to sync
>>>>>>>>>  > the clocks was approx 140ms (don't quote me on it though,
>>>>>>>>>  > because it was reported by our devops :)
>>>>>>>>>  >
>>>>>>>>>  > We have multiple clients (for example, a bunch of
>>>>>>>>>  > microservices are reading from Cassandra). I am not sure
>>>>>>>>>  > how one can achieve linearizability by setting timestamps
>>>>>>>>>  > on the clients, since there is no total ordering across
>>>>>>>>>  > multiple clients.
>>>>>>>>>  >
>>>>>>>>>  > Thanks!
>>>>>>>>>  >
>>>>>>>>>  >
>>>>>>>>>  > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron
>>>>>>>>>  > <jus...@instaclustr.com> wrote:
>>>>>>>>>  >
>>>>>>>>>  >     Hi Kant,
>>>>>>>>>  >
>>>>>>>>>  >     Clock synchronization is important - you should ensure
>>>>>>>>>  >     that ntpd is properly configured on all nodes. If your
>>>>>>>>>  >     particular use case is especially sensitive to out-of-
>>>>>>>>>  >     order mutations it is possible to set timestamps on the
>>>>>>>>>  >     client side using the drivers.
>>>>>>>>>  >     https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>>>>>>>  >
>>>>>>>>>  >     We use our own NTP cluster to reduce clock drift as
>>>>>>>>>  >     much as possible, but public NTP servers are good
>>>>>>>>>  >     enough for most uses.
>>>>>>>>>  >     https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>>>>>>>  >
>>>>>>>>>  >     Cheers, Justin
>>>>>>>>>  >
>>>>>>>>>  >     On Thu, 9 Feb 2017 at 16:09 Kant Kodali
>>>>>>>>>  >     <k...@peernova.com> wrote:
>>>>>>>>>  >
>>>>>>>>>  >         How does Cassandra achieve Linearizability with
>>>>>>>>>  >         “Last write wins” (conflict resolution methods
>>>>>>>>>  >         based on time-of-day clocks) ?
>>>>>>>>>  >
>>>>>>>>>  >         Relying on synchronized clocks is almost certainly
>>>>>>>>>  >         non-linearizable, because clock timestamps cannot
>>>>>>>>>  >         be guaranteed to be consistent with actual event
>>>>>>>>>  >         ordering due to clock skew, isn't it?
>>>>>>>>>  >
>>>>>>>>>  >         Thanks!
>>>>>>>>>  >
>>>>>>>>>  >     --
>>>>>>>>>  >
>>>>>>>>>  >     Justin Cameron
>>>>>>>>>  >
>>>>>>>>>  >     Senior Software Engineer | Instaclustr
>>>>>>>>>  >
>>>>>>>>>  >
>>>>>>>>>  >
>>>>>>>>>  >

>>>>>>>>> 

>>>>>>>> 

>>>>>>> 

>>>>>> 

>>>>> 

>>>> 

>>>> 

>>>> 

>>>> -- 

>>>> Benjamin Roth

>>>> Prokurist

>>>> 

>>>> Jaumo GmbH · www.jaumo.com

>>>> Wehrstraße 46 · 73035 Göppingen · Germany

>>>> Phone +49 7161 304880-6[3] · Fax +49 7161 304880-1[4]

>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

>>> 




Links:

  1. https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22
  2. http://pool.ntp.org/
  3. tel:+49%207161%203048806
  4. tel:+49%207161%203048801
