Jojy, thanks for the clarification! Cool!

2015-11-13 9:00 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:
> Sorry for the confusion. I meant that you could maybe change your
> “max_slave_ping_timeouts” / “slave_ping_timeout” values and re-enable
> snapshots.
>
> -Jojy
>
> On Nov 12, 2015, at 3:30 PM, tommy xiao <xia...@gmail.com> wrote:
>
> Hi Jojy,
>
> What do you mean by keeping the “snapshot/backup”? Could you please point me
> to some docs for reference?
>
> 2015-11-13 1:59 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:
>
>> Hi Jeremy,
>> Good to hear that you have a solution. I was curious about the correlation
>> between snapshot creation and timeouts. I wonder if you could change
>> “max_slave_ping_timeouts” / “slave_ping_timeout” as Joris suggested and
>> keep the “snapshot/backup” as well.
>>
>> thanks
>> Jojy
>>
>> > On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >
>> > Hi Joris, all,
>> >
>> > We are still at the default timeout values for those that you linked.
>> > In the meantime, since the community pushed us to look at other things
>> > besides evading firewall timeouts, we have disabled snapshots/backups on
>> > the VMs, and this has resolved the issue for the past 24 hours on the
>> > control group where we disabled them, which is the best behavior we have
>> > ever seen. There was a very close correlation between snapshot creation
>> > and mesos-slave process restarts (within minutes) that got us to this
>> > point. Apparently, snapshot creation and the quiesce of the filesystem
>> > cause enough disruption to trigger the default timeouts within Mesos.
>> >
>> > We are fine with this solution because Mesos has enabled us to have a
>> > more heterogeneous fleet of servers, and backups aren't needed on these
>> > hosts. Mesos for the win, there.
>> >
>> > Thanks to everyone who has contributed to this thread! It was a fun
>> > exercise for me, in the code. It was also useful to hear feedback from
>> > the list on places to look, eventually pushing me to a solution.
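[Editor's note: for readers following along, the two master flags discussed above control how long the master tolerates missed pings before declaring an agent lost. A sketch of the relationship, using the default values from the Mesos master `--help` output of that era (verify against your version):

```shell
# Agent-failure window = slave_ping_timeout * max_slave_ping_timeouts.
# With the defaults below, the master allows 15s * 5 = 75s of silence
# before marking an agent failed, so a VM snapshot/quiesce that stalls
# the guest longer than that triggers exactly the restarts described.
mesos-master \
  --zk=zk://10.10.1.1:2181/mesos \
  --quorum=1 \
  --slave_ping_timeout=15secs \
  --max_slave_ping_timeouts=5
```

Raising either flag widens the window at the cost of slower detection of genuinely failed agents.]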
>> > -Jeremy
>> >
>> > From: Joris Van Remoortere <jo...@mesosphere.io>
>> > Sent: Wednesday, November 11, 2015 12:56 AM
>> > To: user@mesos.apache.org
>> > Subject: Re: Mesos and Zookeeper TCP keepalive
>> >
>> > Hi Jeremy,
>> >
>> > Can you read the description of these parameters on the master, and
>> > possibly share your values for these flags?
>> >
>> > It seems from the re-registration attempt on the agent that the master
>> > has already treated the agent as "failed", and so will tell it to shut
>> > down on any re-registration attempt.
>> >
>> > I'm curious whether there is a conflict (or too narrow a time gap)
>> > between the timeouts in your environment to allow re-registration by the
>> > agent after the agent notices it needs to re-establish the connection.
>> >
>> > --
>> > Joris Van Remoortere
>> > Mesosphere
>> >
>> > On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> > Hi Tommy, Erik, all,
>> >
>> > You are correct in your assumption that I'm trying to work around a
>> > one-hour session expiry on a firewall. For some more background info:
>> > our master cluster is in datacenter X, and the slaves in X stay "up" for
>> > days and days. The slaves in a different datacenter, Y, connected to
>> > that master cluster stay "up" for only a few days and then restart. The
>> > master cluster is healthy, with a stable leader for months (no
>> > flapping), and the same is true for the ZK leader. There are about 35
>> > slaves in datacenter Y. Maybe the firewall session timer is a red
>> > herring, because the slave restarts are seemingly random (the slave with
>> > the highest uptime is at 6 days, but a handful only have an uptime of a
>> > day).
>> >
>> > I started debugging this a while ago, and the gist of the logs is here:
>> > https://gist.github.com/jolexa/1a80e26a4b017846d083
>> > I posted this back in October seeking help, and Benjamin suggested
>> > network issues in both directions, so I suspected the firewall.
>> >
>> > Thanks for any hints,
>> > Jeremy
>> >
>> > From: tommy xiao <xia...@gmail.com>
>> > Sent: Tuesday, November 10, 2015 3:07 AM
>> > To: user@mesos.apache.org
>> > Subject: Re: Mesos and Zookeeper TCP keepalive
>> >
>> > Same here, same question as Erik. Could you please provide more
>> > background info? Thanks.
>> >
>> > 2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:
>> > It would really help if you (Jeremy) explained the *actual* problem you
>> > are facing. I'm *guessing* that it's a firewall timing out the sessions
>> > because there isn't activity on them for whatever the timeout of the
>> > firewall is? It seems likely to be unreasonably short, given that Mesos
>> > has constant activity between master and
>> > slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>> >
>> > - Erik
>> >
>> > On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
>> > Hi Jeremy,
>> > It's great that you are making progress, but I doubt this is what you
>> > intend to achieve, since network failures are a valid state in
>> > distributed systems. If you think there is a special case you are trying
>> > to solve, I suggest proposing a design document for review.
>> > For the ZK client code, I would suggest asking the ZooKeeper mailing list.
>> >
>> > thanks
>> > -Jojy
>> >
>> >> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >>
>> >> Alright, great, I'm making some progress.
>> >>
>> >> I did a simple copy/paste modification and recompiled Mesos. The
>> >> keepalive timer is now set from slave to master, so this is an
>> >> improvement for me. I haven't tested the other direction yet:
>> >> https://gist.github.com/jolexa/ee9e152aa7045c558e02
>> >> I'd like to file an enhancement request for this, since it seems like
>> >> an improvement for other people as well, after some real-world testing.
>> >>
>> >> I'm having a harder time figuring out the ZK client code.
>> >> I started by modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c,
>> >> but either a) my change wasn't correct, or b) I'm modifying the wrong
>> >> file, since I just assumed Mesos uses the C client. Is this the correct
>> >> place?
>> >>
>> >> Thanks much,
>> >> Jeremy
>> >>
>> >> From: Jojy Varghese <j...@mesosphere.io>
>> >> Sent: Monday, November 9, 2015 2:09 PM
>> >> To: user@mesos.apache.org
>> >> Subject: Re: Mesos and Zookeeper TCP keepalive
>> >>
>> >> Hi Jeremy,
>> >> The “network” code is at
>> >> "3rdparty/libprocess/include/process/network.hpp" and
>> >> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>> >>
>> >> thanks
>> >> jojy
>> >>
>> >>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> Jojy, that is correct, but more specifically a keepalive timer from
>> >>> slave to master and from slave to ZooKeeper. Can you send a link to
>> >>> the portion of the code that builds the socket/connection? Is there
>> >>> any reason not to set the SO_KEEPALIVE option, in your opinion?
>> >>>
>> >>> haosdent, I'm not looking for keepalive between ZK quorum members,
>> >>> which is what the ZOOKEEPER JIRA is referencing.
>> >>>
>> >>> Thanks,
>> >>> Jeremy
>> >>>
>> >>> From: Jojy Varghese <j...@mesosphere.io>
>> >>> Sent: Sunday, November 8, 2015 8:37 PM
>> >>> To: user@mesos.apache.org
>> >>> Subject: Re: Mesos and Zookeeper TCP keepalive
>> >>>
>> >>> Hi Jeremy,
>> >>> Are you trying to establish a keepalive timer between the Mesos master
>> >>> and a Mesos slave? If so, I don't believe it's possible today, as the
>> >>> SO_KEEPALIVE option is not set on the accepting socket.
>> >>>
>> >>> -Jojy
>> >>>
>> >>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>> >>>>
>> >>>> I think the keepalive option should be set in ZooKeeper, not in
>> >>>> Mesos. See this related issue in ZooKeeper:
>> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>> >>>>
>> >>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> >>>> Hello all,
>> >>>>
>> >>>> We have been fighting some network/session disconnection issues
>> >>>> between datacenters, and I'm curious whether there is any way to
>> >>>> enable TCP keepalive on the ZooKeeper/Mesos sockets. If there were,
>> >>>> the sysctl TCP kernel settings would take effect. I believe keepalive
>> >>>> has to be enabled by the software that opens the connection (that is
>> >>>> my understanding, anyway).
>> >>>>
>> >>>> Here is what I see via netstat --timers -tn:
>> >>>>
>> >>>> tcp 0 0 172.18.1.1:55842 10.10.1.1:2181 ESTABLISHED off (0.00/0/0)
>> >>>> tcp 0 0 172.18.1.1:49702 10.10.1.1:5050 ESTABLISHED off (0.00/0/0)
>> >>>>
>> >>>> Here, 172.18.1.1 is on the mesos-slave network and 10.10.1.1 is on
>> >>>> the mesos-master network. The "off" keyword means that keepalives are
>> >>>> not being sent.
>> >>>>
>> >>>> I've trawled through JIRA, git, etc. and cannot easily determine
>> >>>> whether this is expected behavior or should be an enhancement
>> >>>> request. Any ideas?
>> >>>>
>> >>>> Thanks much!
>> >>>> -Jeremy
>> >>>>
>> >>>> --
>> >>>> Best Regards,
>> >>>> Haosdent Huang

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com