Jojy,

I will eventually be able to try adjusting those options, but not at this moment, 
as this is a busy time for us.


Thanks again for all the help!

-Jeremy


________________________________
From: tommy xiao <xia...@gmail.com>
Sent: Thursday, November 12, 2015 10:42 PM
To: user@mesos.apache.org
Subject: Re: Mesos and Zookeeper TCP keepalive

Jojy, thanks for the clarification! Cool!

2015-11-13 9:00 GMT+08:00 Jojy Varghese 
<j...@mesosphere.io>:
Sorry for the confusion. I meant that you could perhaps change your 
"max_slave_ping_timeouts" / "slave_ping_timeout" values and re-enable snapshots.
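
For example (purely illustrative values, not a recommendation), widening that 
window on the master with something like

    mesos-master --slave_ping_timeout=30secs --max_slave_ping_timeouts=10

might be enough to ride out the snapshot.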

-Jojy

On Nov 12, 2015, at 3:30 PM, tommy xiao 
<xia...@gmail.com> wrote:

Hi Jojy

What do you mean by keeping the "snapshot/backup"? Could you please point me to 
some docs for reference?

2015-11-13 1:59 GMT+08:00 Jojy Varghese 
<j...@mesosphere.io>:
Hi Jeremy
 Good to hear that you have a solution. I was curious about the correlation 
between snapshot creation and timeouts. I'm wondering whether you can change 
"max_slave_ping_timeouts" / "slave_ping_timeout" as Joris suggested and keep 
the "snapshot/backup" as well.

thanks
Jojy


> On Nov 11, 2015, at 6:04 PM, Jeremy Olexa 
> <jol...@spscommerce.com> wrote:
>
> Hi Joris, all,
>
> We are still at the default timeout values for the flags you linked. In the 
> meantime, since the community pushed us to look at other things besides evading 
> firewall timeouts, we have disabled snapshots/backups on the VMs, and the issue 
> has not recurred in the past 24 hours on that control group, which is the best 
> behavior we have ever seen. There was a very close correlation between snapshot 
> creation and mesos-slave process restarts (within minutes) that got us to this 
> point. Apparently the snapshot creation and filesystem quiesce cause enough 
> disruption to trigger the default timeouts within Mesos.
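>
> (With the defaults, if I have them right, that is 5 missed pings at 15 seconds 
> each, so roughly 75 seconds of unresponsiveness before the master gives up on 
> the agent; a long quiesce could plausibly exceed that.)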
>
> We are fine with this solution because Mesos has enabled us to have a more 
> heterogeneous fleet of servers and backups aren't needed on these hosts. 
> Mesos for the win, there.
>
> Thanks to everyone who has contributed on this thread! It was a fun exercise 
> for me in the code. It was also useful to hear feedback from the list on places 
> to look, which eventually pushed me to a solution.
> -Jeremy
>
> From: Joris Van Remoortere <jo...@mesosphere.io>
> Sent: Wednesday, November 11, 2015 12:56 AM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Hi Jeremy,
>
> Can you read the description of these parameters on the master, and possibly 
> share your values for these flags?
>
>
> It seems, from the re-registration attempt on the agent, that the master has 
> already treated the agent as "failed" and so will tell it to shut down on any 
> re-registration attempt.
>
> I'm curious whether the timeouts in your environment conflict (or leave too 
> narrow a gap) to allow the agent to re-register after it notices it needs to 
> re-establish the connection.
>
> —
> Joris Van Remoortere
> Mesosphere
>
> On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa 
> <jol...@spscommerce.com> wrote:
> Hi Tommy, Erik, all,
>
> You are correct in your assumption that I'm trying to work around a one-hour 
> session expiry on a firewall. For some more background: our master cluster is 
> in datacenter X, and the slaves in X will stay "up" for days and days. The 
> slaves in a different datacenter, Y, connected to that master cluster only stay 
> "up" for a few days before restarting. The master cluster is healthy, with a 
> stable leader for months (no flapping), and the same goes for the ZK "leader". 
> There are about 35 slaves in datacenter Y. Maybe the firewall session timer is 
> a red herring, because the slave restarts seem random (the slave with the 
> highest uptime is 6 days, but a handful only have an uptime of a day).
>
> I started debugging this a while ago, and the gist of the logs is here: 
> https://gist.github.com/jolexa/1a80e26a4b017846d083
> I posted this back in October seeking help, and Benjamin suggested network 
> issues in both directions, which is why I suspected the firewall.
>
> Thanks for any hints,
> Jeremy
>
> From: tommy xiao <xia...@gmail.com>
> Sent: Tuesday, November 10, 2015 3:07 AM
>
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Same here; I have the same question as Erik. Could you please provide some more 
> background info? Thanks.
>
> 2015-11-10 15:56 GMT+08:00 Erik Weathers 
> <eweath...@groupon.com>:
> It would really help if you (Jeremy) explained the *actual* problem you are 
> facing. I'm *guessing* that it's a firewall timing out the sessions because 
> there isn't activity on them for longer than the firewall's timeout? That 
> timeout seems likely to be unreasonably short, given that Mesos has constant 
> activity between master and 
> slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>
> - Erik
>
> On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese 
> <j...@mesosphere.io> wrote:
> Hi Jeremy
>  It's great that you are making progress, but I doubt this is what you intend 
> to achieve, since network failures are a valid state in distributed systems. If 
> you think there is a special case you are trying to solve, I suggest proposing 
> a design document for review.
>   For the ZK client code, I would suggest asking on the ZooKeeper mailing list.
>
> thanks
> -Jojy
>
>> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa 
>> <jol...@spscommerce.com> wrote:
>>
>> Alright, great, I'm making some progress,
>>
>> I did a simple copy/paste modification and recompiled Mesos. The keepalive 
>> timer is now set on the slave-to-master connection, so this is an improvement 
>> for me. I haven't tested the other direction yet - 
>> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file an 
>> enhancement request for this, since it seems like an improvement for other 
>> people as well, after some real-world testing.
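>>
>> The change is essentially the standard call to arm keepalive on the socket 
>> once it has been created, roughly along these lines ('s' here is just a 
>> placeholder for the socket descriptor):
>>
>>     // Requires <sys/socket.h>; 's' is the already-created socket descriptor.
>>     int on = 1;
>>     if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
>>       // Best-effort: log the failure rather than failing the connection.
>>       perror("setsockopt(SO_KEEPALIVE)");
>>     }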
>>
>> I'm having a harder time figuring out the ZK client code. I started by 
>> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c, but either a) my 
>> change wasn't correct or b) I'm modifying the wrong file, since I just assumed 
>> Mesos uses the C client. Is this the correct place?
>>
>> Thanks much,
>> Jeremy
>>
>>
>> From: Jojy Varghese <j...@mesosphere.io>
>> Sent: Monday, November 9, 2015 2:09 PM
>> To: user@mesos.apache.org
>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>
>> Hi Jeremy
>>  The "network" code is at "3rdparty/libprocess/include/process/network.hpp" 
>> and "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>>
>> thanks
>> jojy
>>
>>
>>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa 
>>> <jol...@spscommerce.com> wrote:
>>>
>>> Hi all,
>>>
>>> Jojy, that is correct; more specifically, a keepalive timer from slave to 
>>> master and from slave to ZooKeeper. Can you send a link to the portion of the 
>>> code that builds the socket/connection? Is there any reason not to set the 
>>> SO_KEEPALIVE option, in your opinion?
>>>
>>> haosdent, I'm not looking for keepalive between ZK quorum members, which is 
>>> what that ZOOKEEPER JIRA is referencing.
>>>
>>> Thanks,
>>> Jeremy
>>>
>>>
>>> From: Jojy Varghese <j...@mesosphere.io>
>>> Sent: Sunday, November 8, 2015 8:37 PM
>>> To: user@mesos.apache.org
>>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>>
>>> Hi Jeremy
>>>   Are you trying to establish a keepalive timer between the Mesos master and 
>>> the Mesos slave? If so, I don't believe it's possible today, as the 
>>> SO_KEEPALIVE option is not set on the accepting socket.
>>>
>>> -Jojy
>>>
>>>> On Nov 8, 2015, at 8:43 AM, haosdent 
>>>> <haosd...@gmail.com> wrote:
>>>>
>>>> I think the keepalive option should be set in ZooKeeper, not in Mesos. See 
>>>> this related ZooKeeper issue: 
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>>>
>>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa 
>>>> <jol...@spscommerce.com> wrote:
>>>> Hello all,
>>>>
>>>> We have been fighting some network/session disconnection issues between 
>>>> datacenters, and I'm curious whether there is any way to enable TCP keepalive 
>>>> on the ZooKeeper/Mesos sockets. If there were a way, then the sysctl TCP 
>>>> kernel settings would be used. I believe keepalive has to be enabled by the 
>>>> software that opens the connection. (That is my understanding, anyway.)
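>>>>
>>>> (By "sysctl TCP kernel settings" I mean the usual knobs: 
>>>> net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and 
>>>> net.ipv4.tcp_keepalive_probes. As far as I know, those only take effect on a 
>>>> connection once SO_KEEPALIVE is set on its socket.)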
>>>>
>>>> Here is what I see via netstat --timers -tn:
>>>> tcp        0      0 172.18.1.1:55842      10.10.1.1:2181      ESTABLISHED off (0.00/0/0)
>>>> tcp        0      0 172.18.1.1:49702      10.10.1.1:5050      ESTABLISHED off (0.00/0/0)
>>>>
>>>>
>>>> Where 172 is the mesos-slave network and 10 is the mesos-master network. 
>>>> The "off" keyword means that keepalives are not being sent.
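>>>>
>>>> (If keepalive were enabled on those sockets, I'd expect that timer column to 
>>>> show something like "keepalive (7200.00/0/0)" instead of "off".)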
>>>>
>>>> I've trawled through JIRA, git, etc., and cannot easily determine whether 
>>>> this is expected behavior or should be an enhancement request. Any ideas?
>>>>
>>>> Thanks much!
>>>> -Jeremy
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>
>
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com




--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com




--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
