[ https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215271#comment-16215271 ]

Benedict edited comment on CASSANDRA-13442 at 10/24/17 9:35 AM:
----------------------------------------------------------------

This is a cool feature idea, so I hope you don’t mind my 2c.  I’m generally +1 
the idea, if done carefully.  Honestly, it could outright replace the standard 
approach for multi-DC clusters.

h4. Mental Model

It might help to think in terms of “2+1” instead of “3-1”:

* 3-1 => delete one RF when “safe” to do so
* 2+1 => transiently boost RF
 
This seems an easier mental model, and it suggests the following rule, which 
fairly straightforwardly specifies the behaviour:

*Any R/W operation on N (out of RF) replicas, containing at least one full 
replica serving a non-digest request, can be safely boosted to N+1 (out of 
RF+1) by the inclusion of another transient replica.*
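The rule above can be sketched as a small participant-selection function. This is an illustrative model only, not Cassandra code; the `Replica` type and `boost` function are hypothetical names:

```python
# Hypothetical sketch of the "N -> N+1" boost rule: an operation on N
# replicas may add one transient replica, but only if at least one full
# replica is already participating with a data (non-digest) request.
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    host: str
    full: bool  # True for a full replica, False for a transient one

def boost(participants, transient_pool):
    """Return the participant list, boosted from N (out of RF) to N+1
    (out of RF+1) by one transient replica when the rule permits it."""
    if not any(r.full for r in participants):
        # The precondition of the rule: no boost without a full replica.
        raise ValueError("every operation needs at least one full replica")
    if not transient_pool:
        return list(participants)
    return list(participants) + [transient_pool[0]]
```

The point of the precondition is that the transient ring is only ever consulted in addition to, never instead of, a full replica.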

This means that every operation must have a participating full replica before 
we can consider a transient replica.  This suggests we probably shouldn’t treat 
the transient replicas as part of the normal ring, instead having a 
transient-ring we can opt into consulting.

If we follow this change of approach through to implementation, including a 
separate ring, we can probably deliver this more safely and incrementally, 
rather than in one giant patchset.

I think this also consistently handles the various perceived complexities 
around handling CL.ONE etc.

h4. Implementation Details

I would suggest it may be easiest to achieve using sstables.  A very simple and 
safe approach to start could require Cheap Quorums to minimise transient 
replica sstables, and unconditionally stream their entirety to the owning 
replicas before immediately dropping them.  There are some obvious simple 
advances on this that could be followed up with, and we could take our time 
with a non-cheap quorum approach (that doesn’t immediately seem as trivial).
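The “stream everything, then drop” idea can be modelled in a few lines. A toy sketch, under the assumption that a transient replica only accumulates sstables while a full replica is unavailable; `TransientReplica` and its methods are illustrative names, not Cassandra APIs:

```python
# Toy model of the simple Cheap-Quorum-only approach: the transient
# replica holds writes as sstables, and on repair it unconditionally
# streams its entire contents to every owning full replica, then
# immediately drops its local copies.
class TransientReplica:
    def __init__(self):
        self.sstables = []  # minimal: only writes taken while a full replica was down

    def accept_write(self, mutation):
        self.sstables.append(mutation)

    def repair(self, full_replica_stores):
        """Stream everything to the owning replicas, then drop it."""
        for store in full_replica_stores:
            store.extend(self.sstables)
        streamed = len(self.sstables)
        self.sstables.clear()  # nothing is retained after repair
        return streamed
```

The appeal of this baseline is that there is no reconciliation logic to get wrong: the full replicas simply absorb the transient data wholesale, and a smarter (e.g. per-originating-node) scheme can be layered on later.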

h4. MVs

bq. Transient replicas based on incremental repair aren't going to work with 2i 
and MV because there is no efficient way to anti-compact them since they aren't 
stored in token order.

It’s possible I’m missing an implied fundamental difference between the hint 
and repair approach, but it seems to me that transient replicas cannot in any 
scenario have their own paired MV, since they cannot generate MV deltas.  So, 
all we can do is ensure eventual delivery of updates to a full replica, which 
will then update its paired MV.
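To illustrate why a transient replica cannot generate MV deltas: producing a view update requires reading the *previous* base-table row (to retract the stale view entry), and only a full replica is guaranteed to hold it. A hypothetical sketch, not Cassandra's actual view-update code:

```python
# Sketch of MV delta generation: the old base row must be read before
# the write, so the replica computing the delta needs complete local
# data for the key. A transient replica cannot guarantee this.
def mv_delta(base_table, key, new_row):
    old_row = base_table.get(key)  # requires a complete local copy
    base_table[key] = new_row
    delta = []
    if old_row is not None and old_row != new_row:
        # Retract the view entry derived from the stale base row.
        delta.append(("tombstone", old_row))
    delta.append(("insert", new_row))
    return delta
```

Since a transient replica may be missing `old_row` entirely, the best it can do is forward the raw base-table mutation to a full replica, which then performs this read-before-write itself.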

The real problem seems that we cannot easily support any RF boost to 2i or MV 
reads, in any implementation.  At least, not without routing the update through 
at least one full replica before reaching the transient replica.

This doesn’t seem to prevent their use, just that the RF remains at (e.g.) 2, 
and cannot be boosted.  It otherwise remains exactly the same.  If, for 
instance, you were running with 3+2, your MV would function perfectly normally; 
it would just have less resilience to failure than your base table.

*nit*: I think we seem to be using the term “strong consistency” incorrectly a 
bunch here - at least, it doesn’t seem to me to be referring to LWT/CAS.

*edit*: clarity


was (Author: benedict):
This is a cool feature idea, so I hope you don’t mind my 2c.  I’m generally +1 
the idea, if done carefully.  Honestly, it makes so much sense that it should 
probably outright replace the standard approach for multi-DC clusters.

It might cause less consternation if we rebranded 3-1 as 2+1.  In reality, the 
“transient” replica is anything but a replica, it’s more of a temporary write 
log that must be consulted on each read/write that cannot communicate with 
sufficient actual replicas.

Instead of viewing this as “deleting one RF when safe to do so” - which 
quite reasonably gives people the heebie-jeebies - we could approach it as an 
availability optimisation to (e.g.) RF=2.  Modelled this way, we should be able 
to make the changes incrementally, optimising each path while maintaining a 
lower-bound behaviour of RF=2 at CL.ALL.

This was also a more intuitive mental model for me, but in doing so I seem to 
have come up against some implied inconsistencies with things that have been 
said:

bq. With cheap quorums read at ONE and write at ALL works as you would expect. 
What won't work as you would expect is read at ONE and write at something less. 

I’m confused by this; how does reading at ONE differ in these scenarios?  It 
seems that with or without “cheap” quorum, writing at N-1 or less and reading 
at 1 provides no guarantees.

bq. Transient replicas based on incremental repair aren't going to work with 2i 
and MV because there is no efficient way to anti-compact them since they aren't 
stored in token order.

Is this true?  The transient nodes cannot anyway have their own paired MV, 
since they cannot generate MV deltas.  So all that seems necessary is that, on 
repair of the base table, we generate updates for our paired view - as we must 
already if I’m not mistaken…?  

The real problem seems that, given the above, we cannot easily support - in 
either proposed implementation - the "availability optimisations" to reads to a 
2i or MV.  At least, not without routing the update through at least one full 
replica before reaching the transient replica.

If (one day) we endorse my ancient proposal to require the coordinator role be 
fulfilled by an owning replica, for all non-batch queries, we might be able to 
support this.  But I think documenting that the availability optimisation is 
only applied to the base table is an acceptable behaviour for a first release.

bq. RF=3 with 3 full replicas seeking strong consistency with quorum

I’m also a little confused by the use of “strong consistency” - which _seems_ 
to implicitly mean “quorum read/write” - which isn’t strongly consistent…  if 
we mean CAS/LWT in these instances, it would be helpful to clarify

bq. Since for CL.ONE queries you would need to only use one of the replicas 
with all the data on it.

bq. Yes, I think there are going to be multiple places where this gets more 
complicated than it looks at first.

This is actually fairly consistent, if viewed through a 2+1 lens: *Every* 
operation must involve at least one full replica.  But any collection of N full 
local replicas serving at least one non-digest request can be safely boosted to 
N+1 by the inclusion of a data request to a local transient replica.

To add some opinions to the implementation debate, it seems to me:

* Less work is probably needed for an sstable based approach.  If initially we 
only support Cheap Quorum, we could unconditionally stream sstables from 
transient replicas to all full local replicas, before immediately dropping them
** No risk of any weird algorithmic problems with this approach, and we could 
take our time engineering a safe and robust _optional_ full quorum if we decide 
it's useful
** This could then quickly be optimised to e.g. separate by originating node so 
that reads and repairs only need to consult the data not sent by themselves
* As far as the token ring is concerned, the N+M proposal implies not treating 
the +M nodes as full members of the ring - we should have a 
secondary/transient/shadow ring, to guarantee we don’t accidentally treat any 
transient nodes as full replicas

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13442
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>            Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second values is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
