[ https://issues.apache.org/jira/browse/CASSANDRA-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640394#comment-13640394 ]
Jonathan Ellis commented on CASSANDRA-5509:
-------------------------------------------

Up until 0.6, this is how hinted handoff worked. The problem was that one node going down tended to morph into a cascading failure storm, since we were still maintaining the same number of written copies with fewer machines to do it with. We now have more robust code that rejects writes with UnavailableException (UAE) if hints are getting backed up, so this is not necessarily a -1, but it's something to be cautious about.

> Decouple Consistency & Durability
> ---------------------------------
>
>                 Key: CASSANDRA-5509
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5509
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Rick Branson
>
> Right now in Cassandra, consistency and durability are intertwined in a way that is unnecessary. In environments where nodes have unreliable local storage, the consistency level of writes must be increased to N+1 to ensure that N host failure(s) don't cause data loss, even when weaker consistency is acceptable. The probability of data loss is also heavily influenced by entropy. For example, if the client chooses a replica as the write coordinator for a CL.ONE write, the risk of losing that data increases substantially. During a node outage, the chance of data loss is elevated for a relatively long time: the entire length of the node outage plus the recovery time.
>
> The required increase in consistency level has real impact: it creates the potential for availability issues during routine maintenance, since an unlucky node failure can cause writes to start failing. It's also generally considered a best practice that each datacenter has at least 3 replicas of data, even if quorums for consistency are not required, as that is the only way to ensure strong durability in the face of transient inter-DC failures.
>
> I found a relevant paper that provides some theoretical grounding while researching: http://www.cs.ubc.ca/~geoffrey/papers/durability-sigops04.pdf
>
> I'd like to propose that in the event of a down replica, the coordinator attempts to achieve RF by distributing "remote hints" to RF - liveReplicaCount non-replica nodes. If the coordinator itself is a non-replica, it would be an acceptable choice for a remote hint as well. This would achieve RF-level durability without the availability penalty of increasing the consistency level. It would also allow decreasing the (global) RF, since RF durability goals could still be achieved during transient inter-DC failures, requiring just RF nodes in each DC instead of RF replicas in each DC. Even better would be if the selection of remote hint nodes respected the replication strategy and could achieve cross-rack / cross-DC durability.
>
> While ConsistencyLevel is a pretty overloaded concept at this point, and I think it would be great to add a DurabilityLevel to each write, I understand that this is likely not pragmatic. Therefore, considering that the CL.TWO and CL.THREE options were added largely for durability reasons, I propose that they be repurposed to support durability goals and remote hinting: they would require 1 replica ACK and CL-1 (replica|hint) ACKs. It also might be desirable to extend the "ANY" option to require multiple hint ACKs, such as CL.ANY_TWO or CL.ANY_THREE, which would support combining very high durability with very high availability.
> All CLs would benefit, as remote hinting vastly tightens the window of elevated data-loss risk during a node outage: from nodeOutageDuration + recoveryDuration down to the time it takes the coordinator to distribute remote hints (see the sketches below).
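To make the proposed coordinator behavior concrete, here is a minimal, self-contained Java sketch, assuming the semantics described in the issue: the coordinator tops back up to RF by choosing RF - liveReplicaCount non-replica nodes (itself included, if it is not a replica) as remote-hint holders, and the repurposed CL.TWO / CL.THREE succeed on 1 replica ACK plus CL-1 ACKs from replicas or hint holders. None of the class or method names correspond to Cassandra internals; this is an illustration of the proposal, not an implementation.

{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only; not Cassandra's actual write path. All names are
 * hypothetical. It models the proposal above: pick remote-hint targets to reach
 * RF copies, and count replica/hint ACKs for the repurposed CL.TWO / CL.THREE.
 */
public class RemoteHintSketch {

    /** Pick RF - liveReplicas.size() non-replica nodes to hold remote hints. */
    static List<String> chooseRemoteHintTargets(int replicationFactor,
                                                List<String> liveReplicas,
                                                List<String> liveNonReplicas,
                                                String coordinator) {
        int needed = replicationFactor - liveReplicas.size();
        List<String> targets = new ArrayList<>();
        // Per the proposal, a non-replica coordinator is an acceptable hint holder.
        if (needed > 0 && !liveReplicas.contains(coordinator)) {
            targets.add(coordinator);
        }
        for (String node : liveNonReplicas) {
            if (targets.size() >= needed) break;
            if (!targets.contains(node)) targets.add(node);
        }
        return targets;
    }

    /** Repurposed CL.TWO / CL.THREE: 1 replica ACK plus CL-1 replica-or-hint ACKs. */
    static boolean writeSucceeds(int cl, int replicaAcks, int hintAcks) {
        return replicaAcks >= 1 && (replicaAcks + hintAcks) >= cl;
    }

    public static void main(String[] args) {
        // RF=3, one replica down, coordinator is not a replica.
        List<String> liveReplicas = List.of("replicaA", "replicaB");
        List<String> liveNonReplicas = List.of("nodeX", "nodeY");
        System.out.println(chooseRemoteHintTargets(3, liveReplicas, liveNonReplicas, "coordinator"));
        // CL.THREE with 2 replica ACKs and 1 remote-hint ACK.
        System.out.println(writeSucceeds(3, 2, 1));
    }
}
{code}

Run standalone, the main method prints [coordinator] as the single remote-hint target needed to reach RF=3, and true for a CL.THREE write acknowledged by 2 replicas and 1 hint holder.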
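The data-loss-window claim in the last paragraph can also be put in rough numbers. A back-of-the-envelope sketch with purely hypothetical durations (a two-hour outage plus thirty minutes of recovery, versus a few seconds of hint distribution):

{code:java}
public class LossWindowSketch {
    public static void main(String[] args) {
        // Hypothetical durations, in seconds; real values depend on the outage and cluster.
        long nodeOutageDuration = 2 * 3600;   // replica is down for two hours
        long recoveryDuration   = 30 * 60;    // hinted handoff / repair after it returns
        long hintDistribution   = 5;          // coordinator writes RF - liveReplicaCount remote hints

        long windowToday      = nodeOutageDuration + recoveryDuration;  // 9000 s of elevated risk
        long windowWithRemote = hintDistribution;                       // 5 s of elevated risk

        System.out.println("elevated-risk window today:             " + windowToday + " s");
        System.out.println("elevated-risk window with remote hints: " + windowWithRemote + " s");
    }
}
{code}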