[ 
https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217235#comment-13217235
 ] 

Brandon Williams commented on CASSANDRA-3294:
---------------------------------------------

bq. How about we assign probability "to be alive" to each of the nodes in the 
ring

This sounds like reinventing the existing failure detector to me.
                
> a node whose TCP connection is not up should be considered down for the 
> purpose of reads and writes
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3294
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3294
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> Cassandra fails to handle the most simple of cases intelligently - a process 
> gets killed and the TCP connection dies. I cannot see a good reason to wait 
> for a bunch of RPC timeouts and thousands of hung requests to realize that we 
> shouldn't be sending messages to a node when the only possible means of 
> communication is confirmed down. This is why one has to "disablegossip and 
> wait for a while" to restart a node on a busy cluster (especially without 
> CASSANDRA-2540 but that only helps under certain circumstances).
> A more generalized approach whereby one e.g. weights in the number of 
> currently outstanding RPC requests to a node, would likely take care of this 
> case as well. But until such a thing exists and works well, it seems prudent 
> to have the very common and controlled form of "failure" be handled better.
> Are there difficulties I'm not seeing?
> I can see that one may want to distinguish between considering something 
> "really down" (and e.g. fail a repair because it's down) from what I'm 
> talking about, so maybe there are different concepts (say one is "currently 
> unreachable" rather than "down") being conflated. But in the specific case of 
> sending reads/writes to a node we *know* we cannot talk to, it seems 
> unnecessarily detrimental.
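
To make the distinction in the description concrete, here is a minimal sketch of the idea, not Cassandra code. All names (Endpoint, fd_alive, tcp_connected, outstanding_rpcs, usable_for_requests, pick_replicas) are hypothetical and chosen only for illustration. It assumes each node carries two separate signals: the failure detector's verdict ("really down") and whether the TCP connection is currently up ("currently unreachable"). Reads and writes skip a node if either signal says no, and the remaining nodes are ordered by outstanding requests, as in the more generalized approach mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical per-node state; none of these names come from Cassandra."""
    name: str
    fd_alive: bool = True        # verdict of the failure detector ("really down" or not)
    tcp_connected: bool = True   # whether the outbound TCP connection is currently up
    outstanding_rpcs: int = 0    # requests sent to the node but not yet answered


def usable_for_requests(ep):
    # A node we know we cannot talk to is skipped for reads and writes,
    # even if the failure detector has not yet convicted it.
    return ep.fd_alive and ep.tcp_connected


def pick_replicas(endpoints, count):
    # Drop unreachable nodes, then prefer nodes with fewer outstanding requests.
    candidates = [ep for ep in endpoints if usable_for_requests(ep)]
    candidates.sort(key=lambda ep: ep.outstanding_rpcs)
    return candidates[:count]
```

With this separation, something like repair could still consult only fd_alive, while the read/write path uses the stricter usable_for_requests check.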


        
