It would really help if you (Jeremy) explained the *actual* problem you are
facing.  I'm *guessing* that a firewall is timing out the sessions
because they see no activity for the length of the firewall's idle
timeout?  If so, that timeout seems unreasonably short, given that mesos
has constant activity between master and
slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.

- Erik

On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:

> Hi Jeremy
>  It's great that you are making progress but I doubt if this is what you
> intend to achieve since network failures are a valid state in distributed
> systems. If you think there is a special case you are trying to solve, I
> suggest proposing a design document for review.
>   For ZK client code, I would suggest asking the zookeeper mailing list.
>
> thanks
> -Jojy
>
> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>
> Alright, great, I'm making some progress.
>
> I did a simple copy/paste modification and recompiled mesos. The keepalive
> timer is set from slave to master so this is an improvement for me. I
> didn't test the other direction yet -
> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file an
> enhancement request for this since it seems like an improvement for other
> people as well, after some real-world testing.
>
> I'm having a harder time figuring out the zk client code. I started by
> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c but either a) my
> change wasn't correct or b) I'm modifying the wrong file, since I just
> assumed the C client is the one in use. Is this the correct place?
>
> Thanks much,
> Jeremy
>
>
> ------------------------------
> *From:* Jojy Varghese <j...@mesosphere.io>
> *Sent:* Monday, November 9, 2015 2:09 PM
> *To:* user@mesos.apache.org
> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>
> Hi Jeremy
>  The "network" code is at
> "3rdparty/libprocess/include/process/network.hpp",
> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>
> thanks
> jojy
>
>
> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>
> Hi all,
>
> Jojy, that is correct, but more specifically a keepalive timer from slave
> to master and slave to zookeeper. Can you send a link to the portion of the
> code that builds the socket/connection? Is there any reason to not set the
> SO_KEEPALIVE option in your opinion?
>
> haosdent, I'm not looking for keepalive between zk quorum members, which is
> what the ZOOKEEPER JIRA is referencing.
>
> Thanks,
> Jeremy
>
>
> ------------------------------
> *From:* Jojy Varghese <j...@mesosphere.io>
> *Sent:* Sunday, November 8, 2015 8:37 PM
> *To:* user@mesos.apache.org
> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>
> Hi Jeremy
>   Are you trying to establish a keepalive timer between mesos master and
> mesos slave? If so, I don’t believe it’s possible today, as the SO_KEEPALIVE
> option is not set on an accepting socket.
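
For illustration only, here is a minimal Python sketch of what setting the
option on the accepting side could look like. It is not libprocess code:
`accept_with_keepalive` is a hypothetical helper name, and the sketch just
sets SO_KEEPALIVE explicitly on each accepted connection with the standard
socket API instead of relying on any inheritance from the listening socket.

```python
import socket


def accept_with_keepalive(listener):
    """Accept one connection and enable SO_KEEPALIVE on it.

    Hypothetical helper for illustration; not part of Mesos or libprocess.
    """
    conn, peer = listener.accept()
    # Ask the kernel to send keepalive probes on this accepted connection.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return conn, peer
```

After such a call the connection should show a keepalive timer in
`netstat --timers -tn` on the accepting host rather than "off".
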
>
> -Jojy
>
> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>
> I think the keepalive option should be set in ZooKeeper, not in Mesos. See
> this related issue in ZooKeeper.
> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>
> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com>
> wrote:
>
>> Hello all,
>>
>> We have been fighting some network/session disconnection issues between
>> datacenters and I'm curious if there is any way to enable TCP keepalive on
>> the zookeeper/mesos sockets? If there were a way, then the sysctl TCP
>> keepalive kernel settings would be used. I believe keepalive has to be enabled
>> by the software which is opening the connection. (That is my understanding
>> anyway.)
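
As a rough sketch of that understanding, assuming Linux: the application
opts in per socket with SO_KEEPALIVE, and the kernel then applies the
net.ipv4.tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes
sysctls unless the application overrides them on that socket. The helper
names below (`enable_keepalive`, `keepalive_enabled`) are made up for this
example and are not Mesos or ZooKeeper APIs.

```python
import socket


def enable_keepalive(sock, idle=None, interval=None, count=None):
    """Turn on SO_KEEPALIVE; optionally override the kernel defaults.

    With idle/interval/count left as None, the sysctl values apply.
    The TCP_KEEP* constants are platform specific, hence the hasattr guard.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    )
    for name, value in overrides:
        if value is not None and hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on the socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

Calling `enable_keepalive(sock)` with no overrides is the case described
above, where only the sysctl settings decide the timing.
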
>>
>> Here is what I see via netstat --timers -tn:
>> tcp        0      0 172.18.1.1:55842      10.10.1.1:2181       ESTABLISHED off (0.00/0/0)
>> tcp        0      0 172.18.1.1:49702      10.10.1.1:5050       ESTABLISHED off (0.00/0/0)
>>
>>
>> Where 172 is the mesos-slave network and 10 is the mesos-master network.
>> The "off" keyword means that keepalives are not being sent.
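
If it helps to check this across many hosts, the same reading of the timer
column can be scripted. This is only a sketch: the regular expression
assumes the row layout shown in the netstat output above, and
`connections_without_timer` is a hypothetical name.

```python
import re

# One row of `netstat --timers -tn`:
# local address, remote address, state, timer name, (timer values).
ROW = re.compile(
    r"(\S+:\d+)\s+(\S+:\d+)\s+(\w+)\s+(\w+)\s+\(([\d.]+)/(\d+)/(\d+)\)"
)


def connections_without_timer(netstat_output):
    """Return (local, remote) pairs for ESTABLISHED rows whose timer is "off"."""
    # Collapse all whitespace so rows wrapped across lines still match.
    flat = " ".join(netstat_output.split())
    return [
        (local, remote)
        for local, remote, state, timer, *_ in ROW.findall(flat)
        if state == "ESTABLISHED" and timer == "off"
    ]
```

Feeding it the captured netstat text returns the address pairs that have no
keepalive timer armed.
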
>>
>> I've trawled through JIRA, git, etc. and cannot easily determine if this
>> is expected behavior or should be an enhancement request. Any ideas?
>>
>> Thanks much!
>> -Jeremy
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
