a node whose TCP connection is not up should be considered down for the purpose 
of reads and writes
---------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-3294
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3294
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller


Cassandra fails to handle the simplest of cases intelligently: a process
gets killed and the TCP connection dies. I cannot see a good reason to wait for
a bunch of RPC timeouts and thousands of hung requests before realizing that we
shouldn't be sending messages to a node when the only possible means of
communication is confirmed down. This is why one has to "disablegossip and wait
for a while" to restart a node on a busy cluster (especially without
CASSANDRA-2540, but that only helps under certain circumstances).
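
To make the proposal concrete, here is a minimal sketch of filtering replicas
by connection state before sending reads or writes. It is not Cassandra's
actual messaging code: the class ConnectionTracker, its methods, and the use of
plain strings for endpoints are all hypothetical, and it assumes the transport
layer can report when a connection closes or is re-established.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names exist in Cassandra.
final class ConnectionTracker {
    // Endpoints whose TCP connection is currently known to be closed.
    private final Set<String> disconnected = ConcurrentHashMap.newKeySet();

    // Assumed to be called by the transport layer when a socket closes
    // or a connection attempt fails.
    void onDisconnect(String endpoint) {
        disconnected.add(endpoint);
    }

    // Assumed to be called when a connection is (re)established.
    void onConnect(String endpoint) {
        disconnected.remove(endpoint);
    }

    boolean isReachable(String endpoint) {
        return !disconnected.contains(endpoint);
    }

    // Returns the replicas that remain candidates for a read or write,
    // preserving the caller's ordering.
    List<String> reachableReplicas(List<String> replicas) {
        List<String> result = new ArrayList<>();
        for (String replica : replicas) {
            if (isReachable(replica)) {
                result.add(replica);
            }
        }
        return result;
    }
}
```

A coordinator would call reachableReplicas() on its replica list before
dispatching, so a killed process stops receiving requests as soon as its
socket closes rather than after a round of RPC timeouts.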

A more generalized approach, whereby one e.g. factors in the number of
currently outstanding RPC requests to a node, would likely take care of this
case as well. But until such a thing exists and works well, it seems prudent to
have the very common and controlled form of "failure" be handled better.
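
The generalized approach mentioned above could look roughly like the following
sketch, which counts in-flight requests per endpoint and prefers the replica
with the fewest. The class OutstandingRequests and its method names are
hypothetical, and the selection rule (pick the minimum, ties go to the earlier
replica) is only one possible weighting, chosen here for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names exist in Cassandra.
final class OutstandingRequests {
    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    // Call when a request is sent to the endpoint.
    void requestSent(String endpoint) {
        inFlight.computeIfAbsent(endpoint, k -> new AtomicInteger()).incrementAndGet();
    }

    // Call when a response arrives or the request times out.
    // The counter never drops below zero.
    void requestCompleted(String endpoint) {
        AtomicInteger count = inFlight.get(endpoint);
        if (count != null) {
            count.updateAndGet(v -> Math.max(0, v - 1));
        }
    }

    int outstanding(String endpoint) {
        AtomicInteger count = inFlight.get(endpoint);
        return count == null ? 0 : count.get();
    }

    // Returns the replica with the fewest in-flight requests; ties go to
    // the replica that appears first. Returns null for an empty list.
    String leastLoaded(List<String> replicas) {
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (String replica : replicas) {
            int count = outstanding(replica);
            if (count < bestCount) {
                best = replica;
                bestCount = count;
            }
        }
        return best;
    }
}
```

A node that has stopped responding accumulates outstanding requests until they
time out, so it naturally falls to the back of the preference order. This
reacts more slowly than acting on the closed connection directly, which is the
reason for handling the simple case explicitly in the meantime.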

Are there difficulties I'm not seeing?

I can see that one may want to distinguish considering something "really
down" (and e.g. failing a repair because it's down) from what I'm talking
about, so maybe different concepts (say, "currently unreachable" versus
"down") are being conflated. But in the specific case of reads/writes,
continuing to send them to a node we *know* we cannot talk to seems
unnecessarily detrimental.
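
One way to keep the two concepts apart is a three-valued status, sketched
below. The enum NodeStatus and both predicates are hypothetical, and the
mapping (only UP receives reads/writes, only DOWN fails a repair) is an
assumption drawn from the distinction described above, not existing Cassandra
behavior.

```java
// Hypothetical sketch; not Cassandra's failure detector API.
enum NodeStatus {
    UP,           // connected and considered alive
    UNREACHABLE,  // TCP connection is down, but the node is not declared dead
    DOWN;         // declared dead, e.g. by the failure detector

    // Reads and writes go only to nodes we can actually talk to.
    boolean eligibleForRequests() {
        return this == UP;
    }

    // Assumption: only a "really down" node should fail a repair.
    boolean shouldFailRepair() {
        return this == DOWN;
    }
}
```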



        
