[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987023#comment-13987023 ]
Joshua McKenzie commented on CASSANDRA-3569:
--------------------------------------------

nio isn't interested in letting us get the integer file descriptor for the
underlying sockets without cracking open some internals and using reflection
to rip out private member variables. I don't think this is the right way to
go - it'll be messy (reflecting out 2 private members from SocketAdapter
down), Java platform-dependent, and brittle.

As for using sysctl or modifying the registry (Windows) on Cassandra start -
that isn't the least surprising thing we could do, as there would be side
effects on other processes running on these machines. Do we have a precedent
at this time for changing global system configuration settings on startup of
the daemon or during rpm install?

Maybe we could add an optional parameter in the yaml for
tcp_keepalive_interval and set it only if users opt in. That still doesn't
seem to address our need for a default state with improved behavior, though.

> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Peter Schuller
>            Assignee: Joshua McKenzie
>             Fix For: 2.1 rc1
>
>         Attachments: 3569-2.0.txt
>
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting
> there waiting forever. In my opinion the correct fix to that problem is to
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a
> communication method, the TCP transport, that we know is used for
> long-running processes that you don't want to be incorrectly killed for no
> good reason, and we are using a failure detector tuned to detecting when not
> to send real-time sensitive requests to nodes in order to actively kill a
> working connection.
> So, rather than add complexity with protocol-based ping/pongs and such, I
> propose that we simply use TCP keep alive for streaming connections and
> instruct operators of production clusters to tweak
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever equivalent
> on their OS).
> I can submit the patch. Awaiting opinions.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
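
For reference, the part of the keep-alive proposal that needs no file descriptor or reflection can be sketched roughly as below. This is only an illustration, not the attached patch: the class name KeepAliveSketch and the helper enableKeepAlive are hypothetical. It uses the standard SO_KEEPALIVE socket option, which only switches keep-alive on; the probe timing still comes from the OS-wide settings discussed above (e.g. net.ipv4.tcp_keepalive_{probes,intvl} on Linux).

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

public class KeepAliveSketch {
    // Hypothetical helper: enables SO_KEEPALIVE on the channel and returns
    // the value read back from the socket. It does not change how often
    // probes are sent; that remains a system-level setting.
    public static boolean enableKeepAlive(SocketChannel channel) throws IOException {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        return channel.getOption(StandardSocketOptions.SO_KEEPALIVE);
    }

    public static void main(String[] args) throws IOException {
        // An unconnected channel is enough to show the option being set;
        // a streaming connection would set it before or after connecting.
        try (SocketChannel channel = SocketChannel.open()) {
            System.out.println("SO_KEEPALIVE enabled: " + enableKeepAlive(channel));
        }
    }
}
```

Because the option only toggles keep-alive, a sketch like this does not by itself shorten the default keep-alive period mentioned in the description.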