[ https://issues.apache.org/jira/browse/CASSANDRA-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210854#comment-13210854 ]

Peter Schuller commented on CASSANDRA-3910:
-------------------------------------------

So a phi of 8-9 or so should result in a node being marked down after 
something like 10+ seconds (off the top of my head, but the point is 
"several seconds"), assuming gossip delays are dominating the heartbeat 
propagation (as opposed to networking issues, in which case it would take 
longer).
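To make the "phi grows with silence" relationship concrete, here is a minimal sketch of a phi accrual estimator, assuming an exponential distribution of heartbeat inter-arrival times. This mirrors the general idea behind the failure detector, not Cassandra's exact implementation; the class name, window size, and millisecond units are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a phi accrual failure detector. Under an exponential
// model, P(no heartbeat within elapsed) = exp(-elapsed/mean), so
// phi = -log10(P) = elapsed / (mean * ln 10): phi grows linearly
// with the time since the last heartbeat.
public class PhiEstimator {
    private final Deque<Long> intervals = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMillis = -1;

    public PhiEstimator(int windowSize) {
        this.windowSize = windowSize;
    }

    // Record a heartbeat arrival, keeping a sliding window of
    // inter-arrival intervals.
    public void report(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > windowSize)
                intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // Current suspicion level for the peer.
    public double phi(long nowMillis) {
        if (intervals.isEmpty())
            return 0.0;
        double mean = intervals.stream()
                               .mapToLong(Long::longValue)
                               .average().orElse(1.0);
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (mean * Math.log(10));
    }
}
```

With 1-second heartbeats, phi reaches 8 only after roughly 8 * ln(10) ≈ 18 seconds of silence in this model; the exact wall-clock time in Cassandra depends on how heartbeats actually propagate through gossip.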

Questions:

* If you just have these two nodes sit idle w/o traffic, do you see hosts being 
kicked into down state spontaneously?
** If yes, something is either buggy or your network conditions are *extremely* 
poor. I presume the answer to this is "no", but I wanted to ask to be sure.

* Under traffic conditions when you are observing this flapping, how much data 
are you pushing between these two nodes? Are you throwing traffic "as fast as 
possible" (un-throttled benchmark client which isn't bottlenecking) or at some 
pre-set pace? What is the actual bandwidth and how does it relate to the 
expected throughput on a TCP connection between the two nodes?

I am mostly trying to confirm what's going on. It sounds to me like you're 
likely shoving more down that TCP pipe than you can reliably sustain on 
average, and in the event of a hiccup on the TCP connection, enough traffic 
gets queued that the delay in gossip is simply the time it takes to drain 
the backlog of requests.

It strikes me that "invalid" downs due to this would be most effectively solved 
by having Gossip messages be prioritized when enqueued on the TCP connection 
(or even just put on a separate connection, though that would be a bigger 
patch). If they are always prioritized, you wouldn't see delays in gossip 
messages other than due to networking conditions so bad that not even that tiny 
bit of information is making it through. (This only makes sense, though, if you 
don't expect the failure detector to help with congestion.)
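The prioritization idea above can be sketched with a shared outbound queue where gossip messages always drain ahead of queued data. This is a minimal illustration, not Cassandra's actual outbound connection code; the message kinds and fields are assumptions.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Sketch: a shared outbound queue in which GOSSIP messages jump
// ahead of already-queued DATA messages, so gossip latency is not
// dominated by a request backlog.
public class OutboundQueue {
    public enum Kind { GOSSIP, DATA }

    public static final class Message implements Comparable<Message> {
        final Kind kind;
        final long seq;          // FIFO tiebreaker within a kind
        final byte[] payload;

        Message(Kind kind, long seq, byte[] payload) {
            this.kind = kind;
            this.seq = seq;
            this.payload = payload;
        }

        @Override
        public int compareTo(Message other) {
            // GOSSIP (ordinal 0) sorts before DATA (ordinal 1);
            // within a kind, preserve enqueue order.
            int byKind = Integer.compare(kind.ordinal(), other.kind.ordinal());
            return byKind != 0 ? byKind : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Message> queue = new PriorityBlockingQueue<>();
    private long nextSeq = 0;

    public synchronized void enqueue(Kind kind, byte[] payload) {
        queue.put(new Message(kind, nextSeq++, payload));
    }

    // Writer thread takes the highest-priority pending message.
    public Message next() throws InterruptedException {
        return queue.take();
    }
}
```

Even with many data messages backed up, a newly enqueued gossip message is the next one written, so heartbeat information keeps flowing as long as the connection moves at all.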

But, since you're also trying to use the FD to avoid queueing up messages, it 
doesn't actually solve *your* problem. And if you didn't care about that, you 
could just up the phi conviction threshold even more until you don't see 
flapping. That's assuming the overall average bandwidth is high enough to 
sustain your traffic pattern.
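For reference, the threshold in question is set in cassandra.yaml; with the change this ticket asks for, fractional values between the "too low" and "too high" integers become possible (the 8.5 below is purely illustrative):

```yaml
# cassandra.yaml (excerpt): raise to tolerate longer gossip delays
# before convicting a node. A float allows tuning between 8 and 9.
phi_convict_threshold: 8.5
```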

> make phi_convict_threshold Float
> --------------------------------
>
>                 Key: CASSANDRA-3910
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3910
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.7
>            Reporter: Radim Kolar
>
> I would like phi_convict_threshold to be a floating point number instead of 
> an integer. A value of 8 is too low for me and 9 is too high. Converting it 
> to floating point would allow finer tuning.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
