[jira] [Commented] (CASSANDRA-3273) FailureDetector can take a very long time to mark a host down

Brandon Williams (Commented) (JIRA) Thu, 29 Sep 2011 11:14:09 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117490#comment-13117490
 ]


Brandon Williams commented on CASSANDRA-3273:
---------------------------------------------

bq. I thought only gossip heartbeats generate interval measurements, is that 
incorrect?

Heartbeats and generation changes.  I take back what I said, though: it's not 
the versioning reconnection, and it doesn't cause the FD to take a long time 
to mark a host down.

It is, however, possible to receive two intervals in a short amount of time, 
just due to timer skew between the two hosts, but it can only happen once since 
after that they will be in sync from the FD's perspective.

In the pathological case, the net effect is that the FD marks a host down if 
the host goes silent for 4-5s when the FD has received only the initial 
(500ms) interval followed by the short (1ms) one.
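
As a rough check of the 4-5s figure, here is a minimal Python sketch. It 
assumes the FD derives phi from an exponential arrival model, 
phi = t / (mean * ln 10), and convicts at a threshold of 8. Both are 
assumptions made for illustration, not something stated in this comment, and 
the function names are hypothetical.

```python
import math

# Assumed conviction threshold; not stated in this comment.
PHI_CONVICT_THRESHOLD = 8.0


def mean(intervals_ms):
    """Arithmetic mean of the recorded inter-arrival intervals."""
    return sum(intervals_ms) / len(intervals_ms)


def phi(silence_ms, intervals_ms):
    """phi under an assumed exponential model: -log10(exp(-t / mean))."""
    return silence_ms / (mean(intervals_ms) * math.log(10))


def silence_to_convict_ms(intervals_ms, threshold=PHI_CONVICT_THRESHOLD):
    """Silence (ms) after which phi reaches the threshold."""
    return threshold * mean(intervals_ms) * math.log(10)


# Pathological window from the comment: one 500ms interval, then one 1ms
# interval caused by timer skew. The mean is 250.5ms.
skewed = [500.0, 1.0]
print(round(silence_to_convict_ms(skewed)))  # 4614
```

Under these assumptions the host would be convicted after roughly 4.6s of 
silence, which is consistent with the 4-5s estimate.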


                
> FailureDetector can take a very long time to mark a host down
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-3273
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3273
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>
> There are two ways to trigger this:
> * Bring a node up very briefly in a mixed-version cluster and then terminate 
> it
> * Bring a node up, terminate it for a very long time, then bring it back up 
> and take it down again
> In the first case, the versioning logic requires a reconnect, which can 
> happen very quickly, so a very short interval arrival time is recorded. 
> This can easily be solved by rejecting any intervals within a reasonable 
> bound, for instance the gossiper interval.
> The second case is harder to solve: an extremely large interval is 
> recorded, namely the time the node was left dead the first time.  This 
> throws off the mean of the intervals, so marking the node down the second 
> time takes much longer than it should.
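
Both cases in the description can be illustrated with a small Python sketch. 
The ArrivalWindow class below is a hypothetical stand-in, not Cassandra's 
actual code. The 500ms lower bound, the window size, the conviction threshold 
of 8, and the exponential phi model are all assumptions chosen for 
illustration.

```python
import math
from collections import deque


class ArrivalWindow:
    """Hypothetical sliding window of inter-arrival intervals (ms)."""

    def __init__(self, size=1000, min_interval_ms=0.0):
        self.intervals = deque(maxlen=size)
        self.min_interval_ms = min_interval_ms  # assumed lower bound
        self.last_ms = None

    def add(self, now_ms):
        """Record an arrival; intervals below the bound are not stored."""
        if self.last_ms is not None:
            interval = now_ms - self.last_ms
            if interval >= self.min_interval_ms:
                self.intervals.append(interval)
        # The arrival time is always updated, even if the interval is dropped.
        self.last_ms = now_ms

    def mean(self):
        return sum(self.intervals) / len(self.intervals)

    def convict_after_ms(self, threshold=8.0):
        """Silence needed to reach the threshold, assuming
        phi = t / (mean * ln 10)."""
        return threshold * self.mean() * math.log(10)


# First case: a 1ms interval sneaks in between normal 1000ms arrivals.
unbounded = ArrivalWindow()
bounded = ArrivalWindow(min_interval_ms=500.0)
for t in (0, 1000, 1001, 2001):
    unbounded.add(t)
    bounded.add(t)
print(list(unbounded.intervals))  # [1000, 1, 1000]
print(list(bounded.intervals))    # [1000, 1000]

# Second case: nine normal intervals, then the node stays dead for an hour
# before coming back. The single huge interval dominates the mean.
w = ArrivalWindow()
now = 0
w.add(now)
for _ in range(9):
    now += 1000
    w.add(now)
now += 3_600_000  # one hour of downtime recorded as one interval
w.add(now)
print(w.mean())              # 360900.0
print(w.convict_after_ms())  # well over an hour of silence
```

The sketch only reproduces the two symptoms. It does not propose a fix for 
the second case, which the description leaves open.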

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
