Hi Jojy, what do you mean by keeping the “snapshot/backup”? Could you please give some docs to reference?
2015-11-13 1:59 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:

> Hi Jeremy
>   Good to hear that you have a solution. I was curious about the correlation
> between snapshot creation and the timeouts. I wonder whether you could change
> “max_slave_ping_timeouts” / "slave_ping_timeout" as Joris suggested and
> keep the “snapshot/backup” as well.
>
> thanks
> Jojy
>
> > On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >
> > Hi Joris, all,
> >
> > We are still at the default timeout values for the flags you linked. In the
> > meantime, since the community pushed us to look at other things besides
> > evading firewall timeouts, we have disabled snapshots/backups on the VMs,
> > and this has resolved the issue for the past 24 hours on the control group
> > where we disabled them, which is the best behavior we have ever seen. A
> > very close correlation between snapshot creation and mesos-slave process
> > restarts (within minutes) is what got us to this point. Apparently,
> > snapshot creation and the quiescing of the filesystem cause enough
> > disruption to trigger the default timeouts within Mesos.
> >
> > We are fine with this solution because Mesos has enabled us to run a more
> > heterogeneous fleet of servers, and backups aren't needed on these hosts.
> > Mesos for the win, there.
> >
> > Thanks to everyone who contributed on this thread! It was a fun exercise
> > for me in the code, and it was useful to hear feedback from the list on
> > places to look, which eventually pushed me to a solution.
> > -Jeremy
> >
> > From: Joris Van Remoortere <jo...@mesosphere.io>
> > Sent: Wednesday, November 11, 2015 12:56 AM
> > To: user@mesos.apache.org
> > Subject: Re: Mesos and Zookeeper TCP keepalive
> >
> > Hi Jeremy,
> >
> > Can you read the description of these parameters on the master, and
> > possibly share your values for these flags?
> >
> > It seems, from the re-registration attempt on the agent, that the master
> > has already treated the agent as "failed", and so will tell it to shut
> > down on any re-registration attempt.
> >
> > I'm curious whether there is a conflict (or too narrow a time gap) between
> > the timeouts in your environment to allow re-registration by the agent
> > after the agent notices it needs to re-establish the connection.
> >
> > --
> > Joris Van Remoortere
> > Mesosphere
> >
> > On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> > Hi Tommy, Erik, all,
> >
> > You are correct in your assumption that I'm trying to solve for a one-hour
> > session expiry time on a firewall. For some more background info: our
> > master cluster is in datacenter X, and the slaves in X will stay "up" for
> > days and days. The slaves in a different datacenter, Y, connected to that
> > master cluster, stay "up" for only a few days before restarting. The
> > master cluster is healthy, with a stable leader for months (no flapping),
> > and the same is true for the ZK "leader". There are about 35 slaves in
> > datacenter Y. Maybe the firewall session timer is a red herring, because
> > the slave restarts are seemingly random (the slave with the highest uptime
> > is at 6 days, but a handful have an uptime of only a day).
> >
> > I started debugging this a while ago, and the gist of the logs is here:
> > https://gist.github.com/jolexa/1a80e26a4b017846d083 I posted this back in
> > October seeking help, and Benjamin suggested network issues in both
> > directions, so I thought firewall.
> >
> > Thanks for any hints,
> > Jeremy
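For context on the two master flags discussed above: "slave_ping_timeout" is how long the master waits for each ping response from an agent, and "max_slave_ping_timeouts" is how many misses it tolerates before treating the agent as failed, so their product is the window an agent has to reconnect. With the defaults of that era (15secs and 5), that window is 75 seconds. A hedged sketch of raising them on the master command line; the --zk address reuses the 10.10.1.1:2181 endpoint from the netstat output later in the thread, and the values are illustrative, not recommendations:

    mesos-master --zk=zk://10.10.1.1:2181/mesos \
                 --slave_ping_timeout=60secs \
                 --max_slave_ping_timeouts=10
    # Failure window becomes 60s * 10 = 600s instead of 15s * 5 = 75s.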
> > From: tommy xiao <xia...@gmail.com>
> > Sent: Tuesday, November 10, 2015 3:07 AM
> > To: user@mesos.apache.org
> > Subject: Re: Mesos and Zookeeper TCP keepalive
> >
> > Same here, same question as Erik. Could you please provide more
> > background info? Thanks.
> >
> > 2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:
> > It would really help if you (Jeremy) explained the *actual* problem you
> > are facing. I'm *guessing* that it's a firewall timing out the sessions
> > because there isn't activity on them within whatever the firewall's
> > timeout period is? That timeout seems likely to be unreasonably short,
> > given that mesos has constant activity between master and
> > slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
> >
> > - Erik
> >
> > On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
> > Hi Jeremy
> >   It's great that you are making progress, but I doubt this is what you
> > intend to achieve, since network failures are a valid state in distributed
> > systems. If you think there is a special case you are trying to solve, I
> > suggest proposing a design document for review.
> >   For the ZK client code, I would suggest asking the zookeeper mailing list.
> >
> > thanks
> > -Jojy
> >
> >> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >>
> >> Alright, great, I'm making some progress.
> >>
> >> I did a simple copy/paste modification and recompiled Mesos. The
> >> keepalive timer is now set on the slave-to-master connection, so this is
> >> an improvement for me. I haven't tested the other direction yet -
> >> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file an
> >> enhancement request for this, since it seems like an improvement for
> >> other people as well, after some real-world testing.
> >>
> >> I'm having a harder time figuring out the ZK client code. I started by
> >> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c, but either a)
> >> my change wasn't correct or b) I'm modifying the wrong file, since I just
> >> assumed Mesos uses the C client. Is this the correct place?
> >>
> >> Thanks much,
> >> Jeremy
> >>
> >> From: Jojy Varghese <j...@mesosphere.io>
> >> Sent: Monday, November 9, 2015 2:09 PM
> >> To: user@mesos.apache.org
> >> Subject: Re: Mesos and Zookeeper TCP keepalive
> >>
> >> Hi Jeremy
> >>   The “network” code is at "3rdparty/libprocess/include/process/network.hpp"
> >> and "3rdparty/libprocess/src/poll_socket.hpp/cpp".
> >>
> >> thanks
> >> jojy
> >>
> >>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Jojy, that is correct, but more specifically a keepalive timer from
> >>> slave to master and from slave to zookeeper. Can you send a link to the
> >>> portion of the code that builds the socket/connection? Is there any
> >>> reason, in your opinion, not to set the SO_KEEPALIVE option?
> >>>
> >>> haosdent, I'm not looking for keepalive between ZK quorum members,
> >>> which is what the ZOOKEEPER JIRA is referencing.
> >>>
> >>> Thanks,
> >>> Jeremy
> >>>
> >>> From: Jojy Varghese <j...@mesosphere.io>
> >>> Sent: Sunday, November 8, 2015 8:37 PM
> >>> To: user@mesos.apache.org
> >>> Subject: Re: Mesos and Zookeeper TCP keepalive
> >>>
> >>> Hi Jeremy
> >>>   Are you trying to establish a keepalive timer between the mesos master
> >>> and mesos slave? If so, I don't believe it's possible today, as the
> >>> SO_KEEPALIVE option is not set on an accepting socket.
> >>>
> >>> -Jojy
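Since the thread keeps circling back to where SO_KEEPALIVE would have to be set, here is a minimal sketch of the kind of change Jeremy describes making in libprocess. It is not the contents of his gist; the helper name and the idea of invoking it wherever poll_socket.cpp creates or accepts a socket are assumptions for illustration:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    // Hypothetical helper: arm TCP keepalive on an existing socket fd.
    // Once SO_KEEPALIVE is set, the kernel's net.ipv4.tcp_keepalive_*
    // sysctls control the probe schedule, matching Jeremy's note that
    // "the sysctl tcp kernel settings would be used".
    int enableKeepalive(int fd)
    {
      int on = 1;
      if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return -1; // errno describes the failure.
      }
    #ifdef TCP_KEEPIDLE
      // Optional Linux-specific override of the idle time before the first
      // probe; without it the system-wide default (often 7200s) applies.
      int idle = 600; // Seconds; illustrative value only.
      if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0) {
        return -1;
      }
    #endif
      return 0;
    }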
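As for the ZK side: build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c is indeed the C client's source, so patching it there is plausible. An alternative that avoids patching the client exists with the single-threaded C client API, where the caller drives I/O through zookeeper_interest() and can see the socket fd after each (re)connect. Whether the ZK binding Mesos links against is driven this way is an assumption here, so treat this purely as a sketch:

    #include <zookeeper/zookeeper.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    // Sketch: with the single-threaded ZooKeeper C client, the event loop
    // calls zookeeper_interest() to learn the current socket fd; that is a
    // convenient hook for arming SO_KEEPALIVE after every (re)connect.
    void setKeepaliveFromInterest(zhandle_t* zh)
    {
      int fd = -1;
      int interest = 0;
      struct timeval tv;

      if (zookeeper_interest(zh, &fd, &interest, &tv) == ZOK && fd >= 0) {
        int on = 1;
        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
      }
    }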
> >>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
> >>>>
> >>>> I think the keepalive option should be set in Zookeeper, not in Mesos.
> >>>> See this related issue in Zookeeper:
> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
> >>>>
> >>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >>>> Hello all,
> >>>>
> >>>> We have been fighting some network/session disconnection issues
> >>>> between datacenters, and I'm curious whether there is any way to enable
> >>>> TCP keepalive on the zookeeper/mesos sockets. If there were, the sysctl
> >>>> tcp kernel settings would be used. I believe keepalive has to be
> >>>> enabled by the software that opens the connection. (That is my
> >>>> understanding, anyway.)
> >>>>
> >>>> Here is what I see via netstat --timers -tn:
> >>>> tcp   0   0 172.18.1.1:55842   10.10.1.1:2181   ESTABLISHED off (0.00/0/0)
> >>>> tcp   0   0 172.18.1.1:49702   10.10.1.1:5050   ESTABLISHED off (0.00/0/0)
> >>>>
> >>>> 172 is the mesos-slave network and 10 is the mesos-master network. The
> >>>> "off" keyword means that keepalives are not being sent.
> >>>>
> >>>> I've trawled through JIRA, git, etc. and cannot easily determine
> >>>> whether this is expected behavior or should be an enhancement request.
> >>>> Any ideas?
> >>>>
> >>>> Thanks much!
> >>>> -Jeremy
> >>>>
> >>>> --
> >>>> Best Regards,
> >>>> Haosdent Huang
> >
> > --
> > Deshi Xiao
> > Twitter: xds2000
> > E-mail: xiaods(AT)gmail.com

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
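One footnote on Jeremy's netstat output above: "off" in the timer column means no keepalive timer is armed on that socket. Once a process sets SO_KEEPALIVE, the probe schedule comes from the kernel-wide sysctls he alludes to. The usual Linux defaults are shown below (verify on your own hosts with sysctl):

    net.ipv4.tcp_keepalive_time = 7200    # idle seconds before the first probe
    net.ipv4.tcp_keepalive_intvl = 75     # seconds between unanswered probes
    net.ipv4.tcp_keepalive_probes = 9     # unanswered probes before the kernel drops the connection

With keepalive armed, the same netstat --timers -tn invocation would show a timer such as "keepalive (7187.66/0/0)" in place of "off (0.00/0/0)".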