[ https://issues.apache.org/jira/browse/CASSANDRA-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947565#comment-14947565 ]

Jason Brown commented on CASSANDRA-10244:
-----------------------------------------

Neither my proposal, nor any algorithm for that matter, can solve the worst 
case behavior - send a request and wait for it to time out - as it is 
impossible to know in advance when a peer will become unavailable, when a 
packet gets lost, and so on. However, I believe my proposal will perform 
better in the average case, as we use our own local knowledge of availability 
(based upon responses to previous requests) to judge whether we send the 
request to a peer (or select another peer, or hint) - rather than relying 
solely on the transitive availability that the heartbeat provides. While the 
paper doesn't quite get the details correct about why the gossip messages 
(which carry the heartbeat) aren't propagating as expected, it does correctly 
identify that the absence of *continuous and steady* propagation of the 
heartbeats causes the FailureDetector to get unhappy and mark the peer UP, 
then DOWN, then UP, then DOWN, and so on.
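To make that concrete, the bookkeeping could be as simple as the following 
(a minimal sketch - the class, method names, and ratio-based threshold are 
hypothetical, not actual Cassandra code):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: track the outcomes of our own requests to each peer,
// and derive reachability from direct observation rather than from
// gossiped heartbeats.
public class LocalAvailabilityTracker
{
    private static final class PeerStats
    {
        final AtomicLong successes = new AtomicLong();
        final AtomicLong timeouts = new AtomicLong();
    }

    private final Map<String, PeerStats> stats = new ConcurrentHashMap<>();
    private final double maxTimeoutRatio;

    public LocalAvailabilityTracker(double maxTimeoutRatio)
    {
        this.maxTimeoutRatio = maxTimeoutRatio;
    }

    public void recordSuccess(String peer)
    {
        statsFor(peer).successes.incrementAndGet();
    }

    public void recordTimeout(String peer)
    {
        statsFor(peer).timeouts.incrementAndGet();
    }

    // A peer is considered reachable while the observed timeout ratio stays
    // below the threshold; a caller would select another replica, or hint,
    // instead of sending to a peer judged unreachable.
    public boolean isReachable(String peer)
    {
        PeerStats s = stats.get(peer);
        if (s == null)
            return true; // no local evidence yet, so assume UP
        long ok = s.successes.get();
        long missed = s.timeouts.get();
        long total = ok + missed;
        return total == 0 || (double) missed / total < maxTimeoutRatio;
    }

    private PeerStats statsFor(String peer)
    {
        return stats.computeIfAbsent(peer, p -> new PeerStats());
    }
}
{code}

In practice the counters would need to decay or be windowed so a recovered 
peer can be marked UP again, but the shape of the decision is the point: 
reachability is judged from what this node has actually observed.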


> Replace heartbeats with locally recorded metrics for failure detection
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-10244
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10244
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>
> In the current implementation, the primary purpose of sending gossip messages 
> is for delivering the updated heartbeat values of each node in a cluster. The 
> other data that is passed in gossip (node metadata such as status, dc, rack, 
> tokens, and so on) changes very infrequently, such that the 
> eventual delivery of that data is entirely reasonable. Heartbeats, however, 
> are quite different. Continuous and consistently timed delivery of 
> updated heartbeats is critical for the stability of a cluster. It is through 
> the receipt of the updated heartbeat that a node determines the reachability 
> (UP/DOWN status) of all peers in the cluster. The current implementation of 
> FailureDetector measures the time differences between the heartbeat updates 
> received about a peer (Note: I said about a peer, not from the peer directly, 
> as those values are disseminated via gossip). Without consistently timed 
> delivery of those updates, the FD, via its use of the PHI-accrual 
> algorithm (sketched below, after this description), will mark the peer as 
> DOWN (unreachable). The two nodes could be 
> sending all other traffic without problem, but if the heartbeats are not 
> propagated correctly, each of the nodes will mark the other as DOWN, which is 
> clearly suboptimal for cluster health. Further, heartbeat updates are the only 
> mechanism we use to determine reachability (UP/DOWN) of a peer; dynamic 
> snitch measurements, for example, are not included in the determination. 
> To illustrate this, in the current implementation, assume a cluster of nodes: 
> A, B, and C. A partition starts between nodes A and C (no communication 
> succeeds), but both nodes can communicate with B. As B will get the updated 
> heartbeats from both A and C, it will, via gossip, send those over to the 
> other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, due 
> to the partition between them, all communication between A and C will fail, 
> yet neither node will mark the other as down because each is receiving, 
> transitively via B, the updated heartbeat about the other. While it's true 
> that the other node is alive, only having transitive knowledge about a peer, 
> and allowing that to be the sole determinant of UP/DOWN reachability status, 
> is not sufficient for a correct and efficiently operating cluster. 
> This transitive availability is suboptimal, and I propose we drop the 
> heartbeat concept altogether. Instead, the dynamic snitch should become more 
> intelligent, and its measurements ultimately become the input for 
> determining the reachability status of each peer (as fed into a revamped FD). 
> As we already capture latencies in the dynamic snitch, we can reasonably 
> extend it to include timeouts/missed responses, and make that the basis for 
> the UP/DOWN decision. Thus we will have more accurate and relevant peer 
> statuses that are tailored to the local node.
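For reference, the PHI-accrual check the description refers to amounts to 
roughly the following (a minimal sketch assuming exponentially distributed 
heartbeat inter-arrival times; the class is illustrative, not the actual 
org.apache.cassandra.gms.FailureDetector):

{code:java}
// Sketch of a phi-accrual style suspicion level computed over heartbeat
// arrival intervals. Under an exponential inter-arrival assumption,
// P(no heartbeat for t ms) = exp(-t / mean), and so
// phi = -log10(P) = t / (mean * ln 10).
public class PhiSketch
{
    private final double[] intervals; // recent inter-arrival times, in ms
    private int count;                // how many samples we hold
    private int next;                 // next slot in the ring buffer
    private long lastArrival = -1;

    public PhiSketch(int windowSize)
    {
        intervals = new double[windowSize];
    }

    // Called whenever an updated heartbeat about the peer arrives.
    public void report(long nowMillis)
    {
        if (lastArrival >= 0)
        {
            intervals[next] = nowMillis - lastArrival;
            next = (next + 1) % intervals.length;
            count = Math.min(count + 1, intervals.length);
        }
        lastArrival = nowMillis;
    }

    // Higher phi means more suspicion; the peer is convicted (marked DOWN)
    // once phi crosses a configured threshold.
    public double phi(long nowMillis)
    {
        if (count == 0)
            return 0.0;
        double sum = 0;
        for (int i = 0; i < count; i++)
            sum += intervals[i];
        double mean = sum / count;
        return (nowMillis - lastArrival) / (mean * Math.log(10.0));
    }
}
{code}

The crux of the complaint is visible here: phi is computed purely from the 
arrival times of heartbeat updates, so irregular gossip propagation alone - 
with all other traffic perfectly healthy - is enough to push phi over the 
conviction threshold and flap the peer between UP and DOWN.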



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
