[ https://issues.apache.org/jira/browse/CASSANDRA-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210854#comment-13210854 ]
Peter Schuller commented on CASSANDRA-3910:
-------------------------------------------

So a phi of 8-9 or so should result in a down after something like 10+ seconds (off the top of my head; but point is, "several seconds"), assuming gossip delays are dominating the heartbeat propagation (as opposed to networking issues, in which case it would take longer).

Questions:

* If you just have these two nodes sit idle w/o traffic, do you see hosts being kicked into down state spontaneously?
** If yes, something is either buggy or your network conditions are *extremely* poor. I presume the answer to this is "no", but I wanted to ask to be sure.
* Under traffic conditions when you are observing this flapping, how much data are you pushing between these two nodes? Are you throwing traffic "as fast as possible" (an un-throttled benchmark client which isn't bottlenecking) or at some pre-set pace? What is the actual bandwidth, and how does it relate to the expected throughput of a TCP connection between the two nodes?

I am mostly trying to confirm what's going on. It sounds to me like you're likely shoving more down that TCP pipe than you can reliably sustain on average, and in the event of a hiccup on the TCP connection, you're pushing enough traffic that gets queued that the delay in gossip is just due to the time it takes to catch up with the queued requests.

It strikes me that "invalid" downs due to this would be most effectively solved by having gossip messages be prioritized when enqueued on the TCP connection (or even just put on a separate connection, but that would be more work patch-wise). If they are always prioritized, you wouldn't see delays in gossip messages other than due to networking conditions so bad that not even that tiny bit of information is making it through. (This only makes sense, though, if you don't expect the failure detector to help with congestion.) But since you're also trying to use the FD to avoid queueing up messages, it doesn't actually solve *your* problem.
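To make the "phi of 8-9 means roughly 10+ seconds" estimate concrete, here is a minimal standalone sketch of the phi calculation under an exponential inter-arrival model, where phi = -log10 P(silence > t). This is an illustrative approximation, not Cassandra's actual FailureDetector code; the ~1 s mean gossip interval used below is an assumption.

```python
import math

def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
    """Phi under an exponential inter-arrival assumption:
    P(gap > t) = exp(-t / mean), so phi = -log10 P = t / (mean * ln 10)."""
    return time_since_last_heartbeat / (mean_interval * math.log(10))

def time_to_convict(threshold: float, mean_interval: float) -> float:
    """Seconds of heartbeat silence after which phi crosses the threshold."""
    return threshold * mean_interval * math.log(10)
```

With an assumed ~1 s mean heartbeat interval, time_to_convict(8.0, 1.0) comes out to about 18 s of silence, which is in the same ballpark as the "10+ seconds / several seconds" estimate above; under real gossip delays and networking hiccups the effective time can differ.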
And if you didn't care about that, you could just up the phi conviction threshold even more until you don't see flapping. That's assuming the overall average bandwidth is high enough to sustain your traffic pattern.

> make phi_convict_threshold Float
> --------------------------------
>
>                 Key: CASSANDRA-3910
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3910
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.7
>            Reporter: Radim Kolar
>
> I would like phi_convict_threshold to be a floating point number instead of an integer. A value of 8 is too low for me and a value of 9 is too high. By converting it to floating point, it can be fine-tuned more precisely.

--
This message is automatically generated by JIRA.
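The gap between integer thresholds that the reporter describes can be quantified. Under the same illustrative exponential model (an assumption; a ~1 s mean gossip interval is also assumed), each whole step of phi adds about ln(10) ≈ 2.3 s of required silence, and a fractional value such as 8.5 splits that gap:

```python
import math

def conviction_delay(threshold: float, mean_interval_s: float = 1.0) -> float:
    # Seconds of heartbeat silence before phi exceeds the threshold,
    # under an exponential inter-arrival model (illustrative assumption).
    return threshold * mean_interval_s * math.log(10)

for t in (8.0, 8.5, 9.0):
    print(f"phi_convict_threshold={t}: ~{conviction_delay(t):.1f}s of silence")
# phi_convict_threshold=8.0: ~18.4s of silence
# phi_convict_threshold=8.5: ~19.6s of silence
# phi_convict_threshold=9.0: ~20.7s of silence
```

So allowing a float lets an operator pick a conviction delay anywhere in the roughly 2.3 s interval between the two integer settings, rather than jumping from one end to the other.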