Sorry for the confusion. I meant that you could try changing your
"max_slave_ping_timeouts" / "slave_ping_timeout" values and re-enabling snapshots.
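
For example, something like this on the master (the values here are purely
illustrative, not a recommendation):

  mesos-master --slave_ping_timeout=30secs --max_slave_ping_timeouts=10 ...

That would tolerate roughly 30s x 10 = 300s of missed pings before an agent is
marked failed, which should be long enough to ride out a snapshot quiesce.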

-Jojy

> On Nov 12, 2015, at 3:30 PM, tommy xiao <xia...@gmail.com> wrote:
> 
> Hi Jojy
> 
> What do you mean by keeping the "snapshot/backup"? Could you please point me
> to some docs for reference?
> 
> 2015-11-13 1:59 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:
> Hi Jeremy
>  Good to hear that you have a solution. I was curious about the correlation
> between snapshot creation and the timeouts. I wonder if you could change
> "max_slave_ping_timeouts" / "slave_ping_timeout" as Joris suggested and also
> keep the "snapshot/backup".
> 
> thanks
> Jojy
> 
> 
> > On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >
> > Hi Joris, all,
> >
> > We are still at the default values for the timeouts you linked. In the
> > meantime, since the community pushed us to look at things other than
> > evading firewall timeouts, we have disabled snapshots/backups on the VMs,
> > and this has resolved the issue for the past 24 hours on the control group
> > where we disabled them, which is the best behavior we have ever seen. What
> > got us to this point was a very close correlation between snapshot creation
> > and mesos-slave process restarts (within minutes of each other).
> > Apparently, the snapshot creation and the quiesce of the filesystem cause
> > enough disruption to trigger the default timeouts within Mesos.
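> >
> > (For context, with the default master flags of slave_ping_timeout=15secs
> > and max_slave_ping_timeouts=5, an agent only has to miss health checks for
> > about 15 x 5 = 75 seconds before the master treats it as failed, so a
> > quiesce that stalls the VM for longer than that would be enough.)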
> >
> > We are fine with this solution because Mesos has enabled us to have a more 
> > heterogeneous fleet of servers and backups aren't needed on these hosts. 
> > Mesos for the win, there.
> >
> > Thanks to everyone who has contributed to this thread! It was a fun
> > exercise for me in the code, and it was useful to hear feedback from the
> > list on places to look, which eventually pushed me to a solution.
> > -Jeremy
> >
> > From: Joris Van Remoortere <jo...@mesosphere.io>
> > Sent: Wednesday, November 11, 2015 12:56 AM
> > To: user@mesos.apache.org
> > Subject: Re: Mesos and Zookeeper TCP keepalive
> >
> > Hi Jeremy,
> >
> > Can you read the descriptions of these parameters ("slave_ping_timeout",
> > "max_slave_ping_timeouts") on the master, and possibly share your values
> > for these flags?
> >
> > From the agent's re-registration attempt, it seems the master has already
> > treated the agent as "failed", and so will tell it to shut down on any
> > re-registration attempt.
> >
> > I'm curious whether the timeouts in your environment conflict (or leave
> > too narrow a time gap) to allow the agent to re-register after it notices
> > that it needs to re-establish the connection.
> >
> > —
> > Joris Van Remoortere
> > Mesosphere
> >
> > On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> > Hi Tommy, Erik, all,
> >
> > You are correct in your assumption that I'm trying to work around a
> > one-hour session expiry on a firewall. For some more background: our master
> > cluster is in datacenter X, and the slaves in X stay "up" for days and
> > days. The slaves in a different datacenter, Y, connected to that master
> > cluster stay "up" for only a few days before restarting. The master cluster
> > is healthy, with a stable leader for months (no flapping), and the same
> > goes for the ZK leader. There are about 35 slaves in datacenter Y. Maybe
> > the firewall session timer is a red herring, though, because the slave
> > restarts are seemingly random (the slave with the highest uptime is at 6
> > days, but a handful have only a day of uptime).
> >
> > I started debugging this a while ago, and the gist of the logs is here:
> > https://gist.github.com/jolexa/1a80e26a4b017846d083
> > I posted this back in October seeking help, and Benjamin suggested network
> > issues in both directions, which is why I suspected the firewall.
> >
> > Thanks for any hints,
> > Jeremy
> >
> > From: tommy xiao <xia...@gmail.com>
> > Sent: Tuesday, November 10, 2015 3:07 AM
> >
> > To: user@mesos.apache.org
> > Subject: Re: Mesos and Zookeeper TCP keepalive
> >
> > Same here, same question as Erik: could you please provide more background
> > info? Thanks.
> >
> > 2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:
> > It would really help if you (Jeremy) explained the *actual* problem you
> > are facing. I'm *guessing* that a firewall is timing out the sessions
> > because there isn't activity on them for whatever the firewall's timeout
> > is? That timeout seems likely to be unreasonably short, given that mesos
> > has constant activity between the master and the
> > slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
> >
> > - Erik
> >
> > On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
> > Hi Jeremy
> >  It's great that you are making progress, but I doubt this is what you
> > intend to achieve, since network failures are a valid state in distributed
> > systems. If you think there is a special case you are trying to solve, I
> > suggest proposing a design document for review.
> >   For the ZK client code, I would suggest asking the zookeeper mailing list.
> >
> > thanks
> > -Jojy
> >
> >> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >>
> >> Alright, great: I'm making some progress.
> >>
> >> I did a simple copy/paste modification and recompiled Mesos. The keepalive
> >> timer is now set on the slave-to-master connection, so this is an
> >> improvement for me. I haven't tested the other direction yet; see
> >> https://gist.github.com/jolexa/ee9e152aa7045c558e02
> >> After some real-world testing, I'd like to file an enhancement request for
> >> this, since it seems like an improvement for other people as well.
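> >>
> >> For anyone following along, the change is conceptually just a
> >> setsockopt() call on the connected socket. A minimal sketch (an
> >> illustration, not the exact patch from the gist):
> >>
> >>   #include <sys/socket.h>  /* setsockopt, SOL_SOCKET, SO_KEEPALIVE */
> >>
> >>   /* fd is an already-connected TCP socket. */
> >>   int enable = 1;
> >>   if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) < 0) {
> >>     perror("setsockopt(SO_KEEPALIVE)");  /* non-fatal: the socket still works */
> >>   }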
> >>
> >> I'm having a harder time figuring out the ZK client code. I started by
> >> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c, but either a)
> >> my change wasn't correct, or b) I'm modifying the wrong file, since I just
> >> assumed Mesos uses the C client. Is this the correct place?
> >>
> >> Thanks much,
> >> Jeremy
> >>
> >>
> >> From: Jojy Varghese <j...@mesosphere.io>
> >> Sent: Monday, November 9, 2015 2:09 PM
> >> To: user@mesos.apache.org
> >> Subject: Re: Mesos and Zookeeper TCP keepalive
> >>
> >> Hi Jeremy
> >>  The "network" code is at "3rdparty/libprocess/include/process/network.hpp"
> >> and "3rdparty/libprocess/src/poll_socket.hpp/cpp".
> >>
> >> thanks
> >> jojy
> >>
> >>
> >>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Jojy, that is correct: more specifically, a keepalive timer from slave to
> >>> master and from slave to zookeeper. Can you send a link to the portion of
> >>> the code that builds the socket/connection? Is there any reason not to
> >>> set the SO_KEEPALIVE option, in your opinion?
> >>>
> >>> hasodent, I'm not looking for keepalive between zk quorum members, like 
> >>> the ZOOKEEPER JIRA is referencing.
> >>>
> >>> Thanks,
> >>> Jeremy
> >>>
> >>>
> >>> From: Jojy Varghese <j...@mesosphere.io>
> >>> Sent: Sunday, November 8, 2015 8:37 PM
> >>> To: user@mesos.apache.org
> >>> Subject: Re: Mesos and Zookeeper TCP keepalive
> >>>
> >>> Hi Jeremy
> >>>   Are you trying to establish a keepalive timer between the mesos master
> >>> and a mesos slave? If so, I don't believe it's possible today, as the
> >>> SO_KEEPALIVE option is not set on an accepting socket.
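> >>>
> >>> Keepalive is per-socket, so the accept side would need its own call right
> >>> after accept(). A hypothetical sketch of the shape of such a change (not
> >>> current libprocess code):
> >>>
> >>>   /* listen_fd is a listening TCP socket; needs <sys/socket.h>. */
> >>>   int client_fd = accept(listen_fd, NULL, NULL);
> >>>   if (client_fd >= 0) {
> >>>     int enable = 1;
> >>>     setsockopt(client_fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable));
> >>>   }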
> >>>
> >>> -Jojy
> >>>
> >>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
> >>>>
> >>>> I think the keepalive option should be set in Zookeeper, not in Mesos.
> >>>> See this related issue in Zookeeper:
> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
> >>>>
> >>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >>>> Hello all,
> >>>>
> >>>> We have been fighting some network/session disconnection issues between
> >>>> datacenters, and I'm curious whether there is any way to enable TCP
> >>>> keepalive on the zookeeper/mesos sockets. If there were, the sysctl TCP
> >>>> kernel settings would then apply. I believe keepalive has to be enabled
> >>>> by the software that opens the connection. (That is my understanding,
> >>>> anyway.)
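> >>>>
> >>>> For reference, these are the kernel settings that would take effect once
> >>>> the application sets SO_KEEPALIVE (the values shown are the usual Linux
> >>>> defaults):
> >>>>
> >>>>   net.ipv4.tcp_keepalive_time = 7200   # idle seconds before the first probe
> >>>>   net.ipv4.tcp_keepalive_intvl = 75    # seconds between probes
> >>>>   net.ipv4.tcp_keepalive_probes = 9    # unanswered probes before the connection drops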
> >>>>
> >>>> Here is what I see via netstat --timers -tn:
> >>>> tcp        0      0 172.18.1.1:55842      10.10.1.1:2181       ESTABLISHED off (0.00/0/0)
> >>>> tcp        0      0 172.18.1.1:49702      10.10.1.1:5050       ESTABLISHED off (0.00/0/0)
> >>>>
> >>>>
> >>>> Here 172.x is the mesos-slave network and 10.x is the mesos-master
> >>>> network. The "off" keyword means that keepalives are not being sent.
> >>>>
> >>>> I've trawled through JIRA, git, etc., and cannot easily determine
> >>>> whether this is expected behavior or should be an enhancement request.
> >>>> Any ideas?
> >>>>
> >>>> Thanks much!
> >>>> -Jeremy
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Best Regards,
> >>>> Haosdent Huang
> >
> >
> >
> >
> >
> > --
> > Deshi Xiao
> > Twitter: xds2000
> > E-mail: xiaods(AT)gmail.com
> 
> 
> 
> 
> -- 
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
