[ https://issues.apache.org/jira/browse/CASSANDRA-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217235#comment-13217235 ]
Brandon Williams commented on CASSANDRA-3294:
---------------------------------------------

bq. How about we assign probability "to be alive" to each of the nodes in the ring

This sounds like reinventing the existing failure detector to me.

> a node whose TCP connection is not up should be considered down for the purpose of reads and writes
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3294
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3294
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>
> Cassandra fails to handle the most simple of cases intelligently - a process
> gets killed and the TCP connection dies. I cannot see a good reason to wait
> for a bunch of RPC timeouts and thousands of hung requests to realize that we
> shouldn't be sending messages to a node when the only possible means of
> communication is confirmed down. This is why one has to "disablegossip and
> wait for a while" to restart a node on a busy cluster (especially without
> CASSANDRA-2540, but that only helps under certain circumstances).
>
> A more generalized approach, whereby one e.g. weights in the number of
> currently outstanding RPC requests to a node, would likely take care of this
> case as well. But until such a thing exists and works well, it seems prudent
> to have the very common and controlled form of "failure" be handled better.
> Are there difficulties I'm not seeing?
>
> I can see that one may want to distinguish between considering something
> "really down" (and e.g. fail a repair because it's down) from what I'm
> talking about, so maybe there are different concepts (say one is "currently
> unreachable" rather than "down") being conflated. But in the specific case of
> sending reads/writes to a node we *know* we cannot talk to, it seems
> unnecessarily detrimental.

--
This message is automatically generated by JIRA.