Rick Branson created CASSANDRA-5509:
---------------------------------------

             Summary: Decouple Consistency & Durability
                 Key: CASSANDRA-5509
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5509
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Rick Branson


Right now in Cassandra, consistency and durability are intertwined in a way 
that is unnecessary. In environments where nodes have unreliable local storage, 
the consistency level of writes must be increased to N+1 to ensure that N host 
failure(s) don't cause data loss, even if it's acceptable that consistency is 
weaker. The probability of data loss is also heavily influenced by entropy. An 
example is if the client chooses a replica as the write coordinator for a 
CL.ONE write, the risk of losing that data increases substantially. During a 
node outage, the chance of data loss is elevated for a relatively long time: 
the entire length of the node outage + recovery time. The required increase in 
consistency level has real impact: it creates the potential for availability 
issues during routine maintenance as an unlucky node failure can cause writes 
to start failing. It's also generally considered a best practice that each 
datacenter has at least 3 replicas of data, even if quorums for consistency are 
not required, as it's the only way to ensure strong durability in the face of 
transient inter-DC failures.

I found a relevant paper that provides some theoretical grounding while 
researching: http://www.cs.ubc.ca/~geoffrey/papers/durability-sigops04.pdf

I'd like to propose that in the event of a down replica, the coordinator 
attempts to achieve RF by distributing "remote hints" to RF-liveReplicaCount 
non-replica nodes. If the coordinator itself is a non-replica, it would be an 
acceptable choice for a remote hint as well. This would achieve RF level 
durability without the availability penalty of increasing consistency. This 
would also allow decreasing the (global) RF, as RF durability goals could still 
be achieved during transient inter-DC failures, requiring just RF nodes in each 
DC, instead of RF replicas in each DC. Even better would be if the selection of 
remote hint nodes respected the replication strategy and was able to achieve 
the cross-rack / cross-DC durability.

While ConsistencyLevel is a pretty overloaded concept at this point, and I 
think it'd be great to add a DurabilityLevel to each write, I understand that 
this is likely not pragmatic. Therefore, considering that the CL.TWO and 
CL.THREE options were added largely for durability reasons, I propose that they 
be repurposed to support durability goals and remote hinting. They would 
require 1 replica ACK and CL-1 (replica|hint) ACKs. It also might be desirable 
to extend the "ANY" option to require multiple hint ACKs, such as CL.ANY_TWO or 
CL.ANY_THREE, which would support combined very high durability and very high 
availability. All CLs will benefit as remote hinting vastly tightens the window 
of elevated data loss chance during a node outage from nodeOutageDuration + 
recoveryDuration to the time it takes for the coordinator to distribute remote 
hints.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to