Jojy, thanks for the clarification! Cool!

2015-11-13 9:00 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:
> Sorry for the confusion. I meant that you could maybe change your
> “max_slave_ping_timeouts” / “slave_ping_timeout” values and re-enable
> snapshots.
>
> -Jojy
>
> On Nov 12, 2015, at 3:30 PM, tommy xiao <xia...@gmail.com> wrote:
>
> Hi Jojy,
>
> What do you mean by keeping the “snapshot/backup”? Could you please point me
> to some docs for reference?
>
> 2015-11-13 1:59 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:
>
>> Hi Jeremy,
>> Good to hear that you have a solution. I was curious about the correlation
>> between snapshot creation and timeouts. I wonder if you could change
>> “max_slave_ping_timeouts” / “slave_ping_timeout” as Joris suggested and
>> keep the “snapshot/backup” as well.
>>
>> thanks
>> Jojy
>>
>> > On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >
>> > Hi Joris, all,
>> >
>> > We are still at the default timeout values for those that you linked.
>> > In the meantime, since the community pushed us to look at other things
>> > besides evading firewall timeouts, we have disabled snapshots/backups on
>> > the VMs, and this has resolved the issue for the past 24 hours on the
>> > control group where we disabled them, which is the best behavior we have
>> > ever seen. There was a very close correlation between snapshot creation
>> > and mesos-slave process restarts (within minutes) that got us to this
>> > point. Apparently, snapshot creation and the quiesce of the filesystem
>> > cause enough disruption to trigger the default timeouts within Mesos.
>> >
>> > We are fine with this solution because Mesos has enabled us to have a
>> > more heterogeneous fleet of servers, and backups aren't needed on these
>> > hosts. Mesos for the win, there.
>> >
>> > Thanks to everyone who has contributed to this thread! It was a fun
>> > exercise for me, in the code. It was also useful to hear feedback from
>> > the list on places to look, eventually pushing me to a solution.
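[Editor's note: for readers following along, the two master flags discussed above control how long the master tolerates missed pings before declaring an agent lost. A sketch of the relationship, using the default values from the Mesos master `--help` output of that era (verify against your version):

```shell
# Agent-failure window = slave_ping_timeout * max_slave_ping_timeouts.
# With the defaults below, the master allows 15s * 5 = 75s of silence
# before marking an agent failed, so a VM snapshot/quiesce that stalls
# the guest longer than that triggers exactly the restarts described.
mesos-master \
  --zk=zk://10.10.1.1:2181/mesos \
  --quorum=1 \
  --slave_ping_timeout=15secs \
  --max_slave_ping_timeouts=5
```

Raising either flag widens the window at the cost of slower detection of genuinely failed agents.]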
>> > -Jeremy
>> >
>> > From: Joris Van Remoortere <jo...@mesosphere.io>
>> > Sent: Wednesday, November 11, 2015 12:56 AM
>> > To: user@mesos.apache.org
>> > Subject: Re: Mesos and Zookeeper TCP keepalive
>> >
>> > Hi Jeremy,
>> >
>> > Can you read the description of these parameters on the master, and
>> > possibly share your values for these flags?
>> >
>> > It seems from the re-registration attempt on the agent that the master
>> > has already treated the agent as "failed", and so will tell it to shut
>> > down on any re-registration attempt.
>> >
>> > I'm curious whether there is a conflict (or too narrow a time gap)
>> > between the timeouts in your environment to allow re-registration by the
>> > agent after the agent notices it needs to re-establish the connection.
>> >
>> > --
>> > Joris Van Remoortere
>> > Mesosphere
>> >
>> > On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> > Hi Tommy, Erik, all,
>> >
>> > You are correct in your assumption that I'm trying to work around a
>> > one-hour session expiry on a firewall. For some more background info:
>> > our master cluster is in datacenter X, and the slaves in X stay "up" for
>> > days and days. The slaves in a different datacenter, Y, connected to
>> > that master cluster stay "up" for only a few days and then restart. The
>> > master cluster is healthy, with a stable leader for months (no
>> > flapping), and the same is true for the ZK leader. There are about 35
>> > slaves in datacenter Y. Maybe the firewall session timer is a red
>> > herring, because the slave restarts are seemingly random (the slave with
>> > the highest uptime is at 6 days, but a handful only have an uptime of a
>> > day).
>> >
>> > I started debugging this a while ago, and the gist of the logs is here:
>> > https://gist.github.com/jolexa/1a80e26a4b017846d083
>> > I posted this back in October seeking help, and Benjamin suggested
>> > network issues in both directions, so I suspected the firewall.
>> >
>> > Thanks for any hints,
>> > Jeremy
>> >
>> > From: tommy xiao <xia...@gmail.com>
>> > Sent: Tuesday, November 10, 2015 3:07 AM
>> > To: user@mesos.apache.org
>> > Subject: Re: Mesos and Zookeeper TCP keepalive
>> >
>> > Same here, same question as Erik. Could you please provide more
>> > background info? Thanks.
>> >
>> > 2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:
>> > It would really help if you (Jeremy) explained the *actual* problem you
>> > are facing. I'm *guessing* that it's a firewall timing out the sessions
>> > because there isn't activity on them for whatever the timeout of the
>> > firewall is? It seems likely to be unreasonably short, given that Mesos
>> > has constant activity between master and
>> > slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>> >
>> > - Erik
>> >
>> > On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
>> > Hi Jeremy,
>> > It's great that you are making progress, but I doubt this is what you
>> > intend to achieve, since network failures are a valid state in
>> > distributed systems. If you think there is a special case you are trying
>> > to solve, I suggest proposing a design document for review.
>> > For the ZK client code, I would suggest asking the ZooKeeper mailing list.
>> >
>> > thanks
>> > -Jojy
>> >
>> >> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >>
>> >> Alright, great, I'm making some progress.
>> >>
>> >> I did a simple copy/paste modification and recompiled Mesos. The
>> >> keepalive timer is now set from slave to master, so this is an
>> >> improvement for me. I haven't tested the other direction yet:
>> >> https://gist.github.com/jolexa/ee9e152aa7045c558e02
>> >> I'd like to file an enhancement request for this, since it seems like
>> >> an improvement for other people as well, after some real-world testing.
>> >>
>> >> I'm having a harder time figuring out the ZK client code.
>> >> I started by modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c,
>> >> but either a) my change wasn't correct, or b) I'm modifying the wrong
>> >> file, since I just assumed Mesos uses the C client. Is this the correct
>> >> place?
>> >>
>> >> Thanks much,
>> >> Jeremy
>> >>
>> >> From: Jojy Varghese <j...@mesosphere.io>
>> >> Sent: Monday, November 9, 2015 2:09 PM
>> >> To: user@mesos.apache.org
>> >> Subject: Re: Mesos and Zookeeper TCP keepalive
>> >>
>> >> Hi Jeremy,
>> >> The “network” code is at
>> >> "3rdparty/libprocess/include/process/network.hpp" and
>> >> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>> >>
>> >> thanks
>> >> jojy
>> >>
>> >>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> Jojy, that is correct, but more specifically a keepalive timer from
>> >>> slave to master and from slave to ZooKeeper. Can you send a link to
>> >>> the portion of the code that builds the socket/connection? Is there
>> >>> any reason not to set the SO_KEEPALIVE option, in your opinion?
>> >>>
>> >>> haosdent, I'm not looking for keepalive between ZK quorum members,
>> >>> which is what the ZOOKEEPER JIRA is referencing.
>> >>>
>> >>> Thanks,
>> >>> Jeremy
>> >>>
>> >>> From: Jojy Varghese <j...@mesosphere.io>
>> >>> Sent: Sunday, November 8, 2015 8:37 PM
>> >>> To: user@mesos.apache.org
>> >>> Subject: Re: Mesos and Zookeeper TCP keepalive
>> >>>
>> >>> Hi Jeremy,
>> >>> Are you trying to establish a keepalive timer between the Mesos master
>> >>> and a Mesos slave? If so, I don't believe it's possible today, as the
>> >>> SO_KEEPALIVE option is not set on the accepting socket.
>> >>>
>> >>> -Jojy
>> >>>
>> >>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>> >>>>
>> >>>> I think the keepalive option should be set in ZooKeeper, not in
>> >>>> Mesos. See this related issue in ZooKeeper:
>> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>> >>>>
>> >>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >>>> Hello all,
>> >>>>
>> >>>> We have been fighting some network/session disconnection issues
>> >>>> between datacenters, and I'm curious whether there is any way to
>> >>>> enable TCP keepalive on the ZooKeeper/Mesos sockets. If there were,
>> >>>> the sysctl TCP kernel settings would take effect. I believe keepalive
>> >>>> has to be enabled by the software that opens the connection (that is
>> >>>> my understanding, anyway).
>> >>>>
>> >>>> Here is what I see via netstat --timers -tn:
>> >>>>
>> >>>> tcp 0 0 172.18.1.1:55842 10.10.1.1:2181 ESTABLISHED off (0.00/0/0)
>> >>>> tcp 0 0 172.18.1.1:49702 10.10.1.1:5050 ESTABLISHED off (0.00/0/0)
>> >>>>
>> >>>> Here, 172.18.1.1 is on the mesos-slave network and 10.10.1.1 is on
>> >>>> the mesos-master network. The "off" keyword means that keepalives are
>> >>>> not being sent.
>> >>>>
>> >>>> I've trawled through JIRA, git, etc. and cannot easily determine
>> >>>> whether this is expected behavior or should be an enhancement
>> >>>> request. Any ideas?
>> >>>>
>> >>>> Thanks much!
>> >>>> -Jeremy
>> >>>>
>> >>>> --
>> >>>> Best Regards,
>> >>>> Haosdent Huang

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com