Same here, I have the same question as Erik. Could you please provide more background information? Thanks.
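For reference while reading the quoted thread: keepalive is a per-socket option that the application opening the connection has to turn on, and once it is on, the kernel's sysctl keepalive timers apply unless the socket overrides them. A minimal Python sketch of that idea follows. It is not Mesos or ZooKeeper code; the function name `enable_keepalive` and its default values are made up for illustration, and the per-socket overrides are only applied where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=None, interval=None, count=None):
    """Enable TCP keepalive on an existing socket (hypothetical helper).

    With only SO_KEEPALIVE set, the kernel-wide defaults are used
    (on Linux: net.ipv4.tcp_keepalive_time, _intvl and _probes).
    The optional arguments override those defaults for this socket only,
    on platforms that expose the corresponding options.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket overrides; skipped when the platform lacks them.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # seconds of idle time before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    )
    for name, value in overrides:
        if value is not None and hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)

    # Report whether the option is now on.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("keepalive enabled:", enable_keepalive(s, idle=60, interval=10, count=5))
    s.close()
```

On a connection with the option enabled, `netstat --timers -tn` should show a `keepalive` timer instead of `off`.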

2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:

> It would really help if you (Jeremy) explained the *actual* problem you
> are facing. I'm *guessing* that it's a firewall timing out the sessions
> because there isn't activity on them for whatever the timeout of the
> firewall is? It seems likely to be unreasonably short, given that mesos
> has constant activity between master and
> slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
>
> - Erik
>
> On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
>
>> Hi Jeremy
>> It's great that you are making progress but I doubt if this is what you
>> intend to achieve since network failures are a valid state in distributed
>> systems. If you think there is a special case you are trying to solve, I
>> suggest proposing a design document for review.
>> For ZK client code, I would suggest asking the zookeeper mailing list.
>>
>> thanks
>> -Jojy
>>
>> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>
>> Alright, great, I'm making some progress,
>>
>> I did a simple copy/paste modification and recompiled mesos. The
>> keepalive timer is set from slave to master so this is an improvement for
>> me. I didn't test the other direction yet -
>> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file
>> an enhancement request for this since it seems like an improvement for
>> other people as well, after some real world testing
>>
>> I'm having a harder time figuring out the zk client code. I started by
>> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c but either a) my
>> change wasn't correct or b) I'm modifying the wrong file, since I
>> just assumed using the c client. Is this the correct place?
>>
>> Thanks much,
>> Jeremy
>>
>>
>> ------------------------------
>> *From:* Jojy Varghese <j...@mesosphere.io>
>> *Sent:* Monday, November 9, 2015 2:09 PM
>> *To:* user@mesos.apache.org
>> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>>
>> Hi Jeremy
>> The "network" code is at
>> "3rdparty/libprocess/include/process/network.hpp" ,
>> "3rdparty/libprocess/src/poll_socket.hpp/cpp".
>>
>> thanks
>> jojy
>>
>>
>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>
>> Hi all,
>>
>> Jojy, That is correct, but more specifically a keepalive timer from slave
>> to master and slave to zookeeper. Can you send a link to the portion of the
>> code that builds the socket/connection? Is there any reason to not set the
>> SO_KEEPALIVE option in your opinion?
>>
>> haosdent, I'm not looking for keepalive between zk quorum members, like
>> the ZOOKEEPER JIRA is referencing.
>>
>> Thanks,
>> Jeremy
>>
>>
>> ------------------------------
>> *From:* Jojy Varghese <j...@mesosphere.io>
>> *Sent:* Sunday, November 8, 2015 8:37 PM
>> *To:* user@mesos.apache.org
>> *Subject:* Re: Mesos and Zookeeper TCP keepalive
>>
>> Hi Jeremy
>> Are you trying to establish a keepalive timer between mesos master and
>> mesos slave? If so, I don't believe it's possible today as SO_KEEPALIVE
>> option is not set on an accepting socket.
>>
>> -Jojy
>>
>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>>
>> I think keepalive option should be set in Zookeeper, not in Mesos. See
>> this related issue in Zookeeper.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>
>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> We have been fighting some network/session disconnection issues between
>>> datacenters and I'm curious if there is any way to enable tcp keepalive on
>>> the zookeeper/mesos sockets? If there was a way, then the sysctl tcp
>>> kernel settings would be used. I believe keepalive has to be enabled by the
>>> software which is opening the connection. (That is my understanding anyway)
>>>
>>> Here is what I see via netstat --timers -tn:
>>> tcp 0 0 172.18.1.1:55842 10.10.1.1:2181 ESTABLISHED off (0.00/0/0)
>>> tcp 0 0 172.18.1.1:49702 10.10.1.1:5050 ESTABLISHED off (0.00/0/0)
>>>
>>>
>>> Where 172 is the mesos-slave network and 10 is the mesos-master network.
>>> The "off" keyword means that keepalives are not being sent.
>>>
>>> I've trolled through JIRA, git, etc and cannot easily determine if this
>>> is expected behavior or should be an enhancement request. Any ideas?
>>>
>>> Thanks much!
>>> -Jeremy
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>>
>>
>

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com