Hi Jeremy,

Good to hear that you have a solution. I was curious about the correlation between snapshot creation and timeouts. I wonder whether you could change "max_slave_ping_timeouts" / "slave_ping_timeout" as Joris suggested and keep the snapshot/backup as well.
thanks
Jojy

> On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>
> Hi Joris, all,
>
> We are still at the default timeout values for those that you linked. In the
> meantime, since the community pushed us to look at other things besides
> evading firewall timeouts, we have disabled snapshot/backups on the VMs and
> this has resolved the issue for the past 24 hours on the control group that
> we have disabled, which has been the best behavior that we have ever seen.
> There was a very close correlation between snapshot creation and mesos-slave
> process restart (within minutes) that got us to this point. Apparently, the
> snapshot creation and quiesce of the filesystem cause enough disruption to
> trigger the default timeouts within mesos.
>
> We are fine with this solution because Mesos has enabled us to have a more
> heterogeneous fleet of servers and backups aren't needed on these hosts.
> Mesos for the win, there.
>
> Thanks to everyone that has contributed on this thread! It was a fun exercise
> for me, in the code. It was also useful to hear feedback from the list on
> places to look, eventually pushing me to a solution.
> -Jeremy
>
> From: Joris Van Remoortere <jo...@mesosphere.io>
> Sent: Wednesday, November 11, 2015 12:56 AM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Hi Jeremy,
>
> Can you read the description of these parameters on the master, and possibly
> share your values for these flags?
>
>
> It seems from the re-registration attempt on the agent, that the master has
> already treated the agent as "failed", and so will tell it to shut down on
> any re-registration attempt.
>
> I'm curious if there is a conflict (or too narrow of a time gap) of timeouts
> in your environment to allow re-registration by the agent after the agent
> notices it needs to re-establish the connection.
>
> --
> Joris Van Remoortere
> Mesosphere
>
> On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> Hi Tommy, Erik, all,
>
> You are correct in your assumption that I'm trying to solve for a one hour
> session expire time on a firewall. For some more background info, our master
> cluster is in datacenter X, the slaves in X will stay "up" for days and days.
> The slaves in a different datacenter, Y, connected to that master cluster
> will stay "up" for a few days and then restart. The master cluster is
> healthy, with a stable leader for months (no flapping), same for the ZK
> "leader". There are about 35 slaves in datacenter Y. Maybe the firewall
> session timer is a red herring because the slave restart is seemingly random
> (the slave with the highest uptime is 6 days, but a handful only have uptime
> of a day)
>
> I started debugging this a while ago, and the gist of the logs is here:
> https://gist.github.com/jolexa/1a80e26a4b017846d083 I posted this back in
> October seeking help and Benjamin suggested network issues in both
> directions, so I thought firewall.
>
> Thanks for any hints,
> Jeremy
>
> From: tommy xiao <xia...@gmail.com>
> Sent: Tuesday, November 10, 2015 3:07 AM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Same here, same question as Erik. Could you please provide more background
> info? Thanks.
>
> 2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:
> It would really help if you (Jeremy) explained the *actual* problem you are
> facing. I'm *guessing* that it's a firewall timing out the sessions because
> there isn't activity on them for whatever the timeout of the firewall is?
> It seems likely to be unreasonably short, given that mesos has constant
> activity between master and
> slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>
> - Erik
>
> On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
> Hi Jeremy
> It's great that you are making progress but I doubt if this is what you
> intend to achieve since network failures are a valid state in distributed
> systems. If you think there is a special case you are trying to solve, I
> suggest proposing a design document for review.
> For ZK client code, I would suggest asking the zookeeper mailing list.
>
> thanks
> -Jojy
>
>> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>
>> Alright, great, I'm making some progress,
>>
>> I did a simple copy/paste modification and recompiled mesos. The keepalive
>> timer is set from slave to master so this is an improvement for me. I didn't
>> test the other direction yet -
>> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file an
>> enhancement request for this since it seems like an improvement for other
>> people as well, after some real world testing
>>
>> I'm having a harder time figuring out the zk client code. I started by
>> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c but either a) my
>> change wasn't correct or b) I'm modifying the wrong file, since I just
>> assumed it uses the C client. Is this the correct place?
>>
>> Thanks much,
>> Jeremy
>>
>>
>> From: Jojy Varghese <j...@mesosphere.io>
>> Sent: Monday, November 9, 2015 2:09 PM
>> To: user@mesos.apache.org
>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>
>> Hi Jeremy
>> The "network" code is at "3rdparty/libprocess/include/process/network.hpp",
>> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>>
>> thanks
>> jojy
>>
>>
>>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>>
>>> Hi all,
>>>
>>> Jojy, That is correct, but more specifically a keepalive timer from slave
>>> to master and slave to zookeeper. Can you send a link to the portion of the
>>> code that builds the socket/connection?
>>> Is there any reason to not set the
>>> SO_KEEPALIVE option in your opinion?
>>>
>>> haosdent, I'm not looking for keepalive between zk quorum members, like the
>>> ZOOKEEPER JIRA is referencing.
>>>
>>> Thanks,
>>> Jeremy
>>>
>>>
>>> From: Jojy Varghese <j...@mesosphere.io>
>>> Sent: Sunday, November 8, 2015 8:37 PM
>>> To: user@mesos.apache.org
>>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>>
>>> Hi Jeremy
>>> Are you trying to establish a keepalive timer between mesos master and
>>> mesos slave? If so, I don't believe it's possible today as the SO_KEEPALIVE
>>> option is not set on an accepting socket.
>>>
>>> -Jojy
>>>
>>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>>>>
>>>> I think the keepalive option should be set in Zookeeper, not in Mesos. See
>>>> this related issue in Zookeeper.
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>>>
>>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com>
>>>> wrote:
>>>> Hello all,
>>>>
>>>> We have been fighting some network/session disconnection issues between
>>>> datacenters and I'm curious if there is any way to enable tcp keepalive on
>>>> the zookeeper/mesos sockets? If there was a way, then the sysctl tcp
>>>> kernel settings would be used. I believe keepalive has to be enabled by
>>>> the software which is opening the connection. (That is my understanding
>>>> anyway)
>>>>
>>>> Here is what I see via netstat --timers -tn:
>>>> tcp 0 0 172.18.1.1:55842 10.10.1.1:2181 ESTABLISHED off (0.00/0/0)
>>>> tcp 0 0 172.18.1.1:49702 10.10.1.1:5050 ESTABLISHED off (0.00/0/0)
>>>>
>>>>
>>>> Where 172 is the mesos-slave network and 10 is the mesos-master network.
>>>> The "off" keyword means that keepalives are not being sent.
>>>>
>>>> I've trolled through JIRA, git, etc and cannot easily determine if this is
>>>> expected behavior or should be an enhancement request. Any ideas?
>>>>
>>>> Thanks much!
>>>> -Jeremy
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
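
For readers following the SO_KEEPALIVE part of this thread: the sketch below shows, in Python rather than the Mesos/libprocess C++ code, what "enabled by the software which is opening the connection" looks like at the socket level. It is not the patch from Jeremy's gist and not Mesos code. The helper names (enable_keepalive, keepalive_enabled) and the timer values are assumptions chosen for illustration; the 600 second idle time was picked only because it is shorter than the one hour firewall session timeout mentioned above. Without the per-socket overrides, Linux falls back to the sysctl values (net.ipv4.tcp_keepalive_time is typically 7200 seconds).

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keepalive for one socket (hypothetical helper).

    idle_s, interval_s and probes are illustrative values, not Mesos
    or ZooKeeper defaults.
    """
    # This is the per-socket switch the application has to set; the
    # kernel does not send keepalive probes unless it is enabled.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timer overrides are platform specific (available on
    # Linux). If they are missing, the system-wide settings apply.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("before:", keepalive_enabled(s))
    enable_keepalive(s)
    print("after:", keepalive_enabled(s))
    s.close()
```

On a connection opened this way, netstat --timers should report a keepalive timer instead of "off". As Jojy points out in the thread, keepalive only masks idle-session timeouts; it does not remove the need to handle real network failures.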