[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163320#comment-13163320 ]

Peter Schuller commented on CASSANDRA-3569:
-------------------------------------------

{quote}
socket-based failure detection is prone to false negatives, e.g., I remember 
talking to the Twitter guys in the context of CASSANDRA-3005 about connections 
that appeared to be alive but made no progress
{quote}

This is expected behavior with TCP connections. Unless you use TCP keep-alive, 
set a timeout on individual I/O operations, or set a socket timeout, TCP 
connections *will* hang forever under certain circumstances (the most typical 
being a stateful firewall in between dropping its state, or the host you're 
talking to suddenly panicking). It is expected behavior, not a weird 
unexplained bug, so it shouldn't be taken as an indication that TCP is broken.
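
To make that concrete, here is a minimal sketch against the plain java.net API 
(not the actual streaming code; the class and method names are just for 
illustration) of what enabling the flag amounts to:

{code:java}
import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveSketch
{
    public static Socket openStreamingSocket(String host, int port) throws Exception
    {
        Socket socket = new Socket();
        // Without keep-alive, a read() on this socket can block forever if a
        // stateful firewall in between drops its state or the remote host panics:
        // no FIN/RST ever arrives, so the kernel keeps the connection in
        // ESTABLISHED indefinitely. That is expected TCP behavior.
        socket.setKeepAlive(true);
        // With SO_KEEPALIVE set, the kernel probes an idle connection (after
        // net.ipv4.tcp_keepalive_time seconds on Linux) and tears it down once
        // the probes go unanswered, so a blocked read() fails with an IOException
        // instead of hanging forever.
        socket.connect(new InetSocketAddress(host, port));
        return socket;
    }
}
{code}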

{quote}
socket-based failure detection is also prone to false positives, in the sense 
that a transient network failure shouldn't insta-fail a streaming operation. 
Trying to strike the right balance between retrying "enough, but not too much" 
would basically be reinventing the FD IMO.
{quote}

If anything it's the other way around. With a 5-minute keep-alive/timeout 
trigger, we'd survive LOTS longer than the failure detector. It's just a matter 
of using the appropriate settings. Normally I would just say "set a socket 
timeout" and we'd be done, but the problem with that is that it *will* cause 
false positives whenever one end legitimately blocks for an extended period of 
time, unless we actively ping/pong.
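
For comparison, this is roughly what a plain socket timeout looks like (again 
just a sketch against java.net, not Cassandra code; the timeout value is 
illustrative). Note that it fires on any read that blocks past the timeout, 
whether or not the peer is actually gone:

{code:java}
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutSketch
{
    public static int readWithTimeout(Socket socket, byte[] buf) throws Exception
    {
        // Bound each individual blocking read; illustrative value only.
        socket.setSoTimeout(5 * 60 * 1000); // 5 minutes
        InputStream in = socket.getInputStream();
        try
        {
            return in.read(buf);
        }
        catch (SocketTimeoutException e)
        {
            // This fires whenever the peer simply has nothing to send for five
            // minutes (e.g. it is busy preparing data to stream), not only when
            // it is dead -- hence the false positives unless both ends ping/pong.
            throw e;
        }
    }
}
{code}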

Keep-alive, on the other hand, does not necessitate code changes other than 
setting the flag, and is a very basic feature provided by the OS. I agree it's 
bad that the default settings mean you may need to tweak the OS, but even at 
the defaults, it's not like the connection sits there forever. A couple of 
hours doesn't seem terribly bad to me, especially not compared to the cost of 
incorrectly slaying a perfectly working streaming repair (the OP's problem on 
the mailing list, where he then ran out of space and was in an even bigger 
pickle, is a good example). Mucked-up repairs can be a huge issue on large 
production clusters with lots of data; a TCP connection being stuck for a 
couple of hours seems minor in comparison.
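
For reference, the OS tweak in question is just the standard Linux keep-alive 
sysctls; the values below are illustrative, not a recommendation:

{noformat}
# Linux defaults are 7200 / 75 / 9, i.e. roughly two hours of idleness before
# a dead peer is detected. Example of tightening them (illustrative values):
net.ipv4.tcp_keepalive_time = 60      # idle seconds before the first probe
net.ipv4.tcp_keepalive_intvl = 10     # seconds between unanswered probes
net.ipv4.tcp_keepalive_probes = 9     # unanswered probes before teardown
{noformat}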

{quote}
I also note that while I agree with "just fix it at the OS level" in principle, 
we already have a higher bar than average for sysadmin kung-fu. Other things 
being equal, I'd like to work as well as possible even in the face of an OS 
running with default tuning, i.e., almost every cluster in the wild.
{quote}

IMO the default keep-alive behavior, with a tear-down after roughly a few 
hours, seems much better than dropping the connection whenever the FD has a 
hiccup. Especially given that the FD will tend to hiccup in exactly the 
circumstances where you are *extra* in need of streaming not breaking 
(although I suppose this behavior is limited to the anti-entropy service right 
now, so at least it doesn't cause havoc with e.g. bootstrap). I.e., I see this 
as a change to decrease the potential for foot-shooting and help the admin, 
not the other way around.

                
> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting 
> there waiting forever. In my opinion the correct fix to that problem is to 
> use TCP keep-alive. Unfortunately the TCP keep-alive period is insanely high 
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a 
> communication method, the TCP transport, that we know is used for 
> long-running processes you don't want to be incorrectly killed for no good 
> reason, and we are using a failure detector tuned to detecting when not to 
> send real-time-sensitive requests to nodes in order to actively kill a 
> working connection.
> So, rather than add complexity with protocol-based ping/pongs and such, I 
> propose that we simply use TCP keep-alive for streaming connections and 
> instruct operators of production clusters to tweak 
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever the 
> equivalent is on their OS).
> I can submit the patch. Awaiting opinions.
