[ 
https://issues.apache.org/jira/browse/CASSANDRA-13442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198763#comment-16198763
 ] 

Ariel Weisberg commented on CASSANDRA-13442:
--------------------------------------------

bq. 1) It is not working with ONE or LOCAL_ONE. Of course transient replication 
is an opt-in feature, but it means users should be super-careful about issuing 
queries at ONE/LOCAL_ONE against keyspaces that have transient replication 
enabled. Considering that ONE/LOCAL_ONE is the default consistency level for 
drivers and the Spark connector, maybe we should throw an exception whenever a 
query at those consistency levels is issued against transiently replicated 
keyspaces?
With just transient replication, ONE and LOCAL_ONE continue to work correctly, 
although anything token aware will need to be updated to get correct token 
aware behavior. Coordinators will always route ONE and LOCAL_ONE to a full 
replica. Thanks for pointing this out; I had missed the impact on token aware 
routing.
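To make the routing rule concrete, here is a minimal sketch (in Python, with a 
made-up Replica class, not Cassandra's actual API) of a coordinator picking a 
replica for a ONE/LOCAL_ONE read: transient replicas are skipped because they 
may have dropped repaired data.

```python
# Hypothetical sketch of the routing rule above: at ONE/LOCAL_ONE the
# coordinator must read from a full replica, never a transient one.
from dataclasses import dataclass

@dataclass
class Replica:
    address: str
    full: bool  # False => transient replica (may have dropped repaired data)

def pick_replica_for_one(replicas):
    """Return the first full replica; transient replicas are skipped
    because they may not hold a covering copy of the repaired data."""
    for r in replicas:
        if r.full:
            return r
    raise RuntimeError("no full replica available for CL=ONE")

replicas = [Replica("10.0.0.1", full=False),  # transient
            Replica("10.0.0.2", full=True),
            Replica("10.0.0.3", full=True)]
print(pick_replica_for_one(replicas).address)  # → 10.0.0.2
```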

With cheap quorums, reading at ONE and writing at ALL works as you would 
expect. What won't work as you would expect is reading at ONE and writing at 
anything less. We will need to recognize that caveat and do something about it: 
either documentation, errors, or a change in functionality.
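The caveat can be modeled with a toy check (assumed semantics for 
illustration, not Cassandra code): a read at ONE from a single full replica is 
only guaranteed to see a write if every replica acknowledged it, because in 
the worst case the replicas that never acked are exactly the full ones.

```python
# Illustrative model of the read-at-ONE caveat described above.
def one_read_is_safe(write_acks: int, rf: int) -> bool:
    """write_acks: replicas required to ack the write
    (e.g. ALL => rf, QUORUM => rf // 2 + 1, ONE => 1)."""
    # Safe only if no replica can be missing the write.
    return rf - write_acks == 0

rf = 3  # e.g. a "3-1" keyspace: three replicas, one of them transient
print(one_read_is_safe(rf, rf))           # write at ALL    → True
print(one_read_is_safe(rf // 2 + 1, rf))  # write at QUORUM → False
```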

bq. 2) Consistency level and repair have been 2 distinct and orthogonal 
notions so far. With transient replication they are strongly tied. Indeed 
transient replication relies heavily on incremental repair. Of course it is a 
detail of impl; Ariel Weisberg has mentioned replicated hints as another impl 
alternative, but in that case we're making transient replication dependent on 
the hints impl. Same story
Yes, you have to have some means with which to implement transient replication 
with some kind of efficiency.

bq. Saying 10-20x is really misleading. No one is actually going to see a 10 - 
20x improvement in disk usage. Even a reduction of 1/3 would be optimistic I'm 
sure.
It's certainly use case specific. It really depends on your outage lengths, 
your host replacement SLA, and the rate at which you rewrite your data set. If 
most of your data is at rest it's easily 100x. If you overwrite your data set 
every 24 hours, have a node failure, a 24-hour host replacement SLA, and 16 
vnodes, then in the worst case you will only have 1/48 additional data for 24 
hours at RF=3. Larger scale failures, like the loss of an entire rack, might 
be worse; I need to think about it more.
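A quick reproduction of that worst-case figure, assuming the extra data per 
node works out to 1 / (vnodes * RF) — a formula inferred here only because it 
matches the 1/48 quoted above for 16 vnodes at RF=3:

```python
# Reproduce the quoted worst-case overhead under the assumed formula
# extra = 1 / (vnodes * RF); the failed node's transiently held ranges
# are spread across many nodes, so each node retains only a small slice.
from fractions import Fraction

def worst_case_extra(vnodes: int, rf: int) -> Fraction:
    return Fraction(1, vnodes * rf)

print(worst_case_extra(16, 3))  # → 1/48
```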

There is nothing magical about the results you will get from transient 
replication. If the transient replicas can't drop the data or spread it out 
across multiple nodes on failure then you won't benefit.

bq. Let's not pretend people running vnodes can actually run repairs.
Can you elaborate? I'm not an expert on the challenges of running repairs with 
vnodes other than the sheer number of them. Is this something that gets better 
with the new allocation algorithm and using fewer vnodes? IOW, if running 16 
vnodes were practical, would repair still not be viable?

Issues with repair are a reason for having alternatives like hint-based 
transient replicas. The issue with those is that they don't work for heavy 
overwrite workloads that can fill a disk in 24 hours.

> Support a means of strongly consistent highly available replication with 
> tunable storage requirements
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13442
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13442
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction, Coordination, Distributed Metadata, Local 
> Write-Read Paths
>            Reporter: Ariel Weisberg
>
> Replication factors like RF=2 can't provide strong consistency and 
> availability because if a single node is lost it's impossible to reach a 
> quorum of replicas. Stepping up to RF=3 will allow you to lose a node and 
> still achieve quorum for reads and writes, but requires committing additional 
> storage.
> The requirement of a quorum for writes/reads doesn't seem to be something 
> that can be relaxed without additional constraints on queries, but it seems 
> like it should be possible to relax the requirement that 3 full copies of the 
> entire data set are kept. What is actually required is a covering data set 
> for the range and we should be able to achieve a covering data set and high 
> availability without having three full copies. 
> After a repair we know that some subset of the data set is fully replicated. 
> At that point we don't have to read from a quorum of nodes for the repaired 
> data. It is sufficient to read from a single node for the repaired data and a 
> quorum of nodes for the unrepaired data.
> One way to exploit this would be to have N replicas, say the last N replicas 
> (where N varies with RF) in the preference list, delete all repaired data 
> after a repair completes. Subsequent quorum reads will be able to retrieve 
> the repaired data from any of the two full replicas and the unrepaired data 
> from a quorum read of any replica including the "transient" replicas.
> Configuration for something like this in NTS might be something similar to { 
> DC1="3-1", DC2="3-2" } where the first value is the replication factor used 
> for consistency and the second value is the number of transient replicas. If 
> you specify { DC1=3, DC2=3 } then the number of transient replicas defaults 
> to 0 and you get the same behavior you have today.
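The proposed "3-1" notation from the issue description could be parsed as 
sketched below (a hypothetical helper for the format only, not NTS code; the 
bare-integer fallback gives today's zero-transient behavior):

```python
# Parse the proposed NTS replication option "<rf>-<transient>"; a bare
# integer (e.g. "3") means zero transient replicas, i.e. current behavior.
def parse_replication(option: str) -> tuple[int, int]:
    if "-" in option:
        rf, transient = option.split("-")
        return int(rf), int(transient)
    return int(option), 0

print(parse_replication("3-1"))  # → (3, 1): 2 full replicas + 1 transient
print(parse_replication("3"))    # → (3, 0): same as today
```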



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
