[ 
https://issues.apache.org/jira/browse/CASSANDRA-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis reassigned CASSANDRA-833:
----------------------------------------

    Assignee: Sylvain Lebresne

Consider the case of CL=1, RF=3 to replicas A, B, C. We begin bootstrapping 
node D, and write a row K to the range being moved from C to D.

If the cluster is heavily loaded, it's possible that we write one copy to C, 
all the other writes get dropped, and once bootstrap completes we lose the row, 
because C no longer owns that range and none of A, B, or D has a copy. Or if we 
write one copy to D and cancel bootstrap, we again lose the row.

As said above, we want to satisfy CL for both the pre- and post-bootstrap nodes 
(in case bootstrap aborts).  This requires treating the old/new range owner as 
a unit: both D *and* C need to accept the write for it to count towards CL. So 
rather than considering {A, B, C, D} we should consider {A, B, (C, D)}.
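
To make the (C, D) grouping concrete, here is a minimal Python sketch of the 
counting rule. It is an illustration only, not Cassandra code: the function 
name write_satisfies_cl and the UNITS layout are hypothetical.

```python
def write_satisfies_cl(acked, units, required):
    """Return True if enough units acknowledged the write.

    A unit counts toward CL only when *every* node in it has acked,
    so the old/new range owner pair (C, D) counts once, and only if
    both accepted the write.
    """
    satisfied = sum(1 for unit in units if all(node in acked for node in unit))
    return satisfied >= required


# Hypothetical layout from the example: A and B stand alone, while
# C (old owner) and D (bootstrapping node) are treated as one unit.
UNITS = [("A",), ("B",), ("C", "D")]

if __name__ == "__main__":
    for acked in ({"C"}, {"D"}, {"C", "D"}, {"A"}):
        print(sorted(acked), write_satisfies_cl(acked, UNITS, required=1))
```

With CL=1, an ack from C alone or D alone does not count, while an ack from 
both, or from A or B, does.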

This is a lot of complexity to introduce. A simplification that preserves 
correctness is to continue treating nodes independently but require *one more 
node* than normal CL. So CL=1 would actually require 2 nodes; CL=Q would 
require 3 (for RF=3), and so forth.  (Note that Q(3) + 1 is the same as Q(4), 
which is what the existing code computes; that is one reason I chose a CL=1 
example to start with, since those are *not* the same even for the simple case 
of RF=3.)
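
The arithmetic behind the "one more node" rule can be sketched as follows. 
This assumes Q(n) means the usual majority, n // 2 + 1, and a single 
bootstrapping node; the helper names are hypothetical and are not taken from 
the Cassandra source.

```python
def quorum(replicas):
    """Majority of `replicas` nodes: Q(n) = n // 2 + 1 (assumed definition)."""
    return replicas // 2 + 1


def required_acks(base, pending=1):
    """Proposed rule: normal CL requirement plus one ack per pending node."""
    return base + pending


if __name__ == "__main__":
    rf = 3
    # CL=QUORUM: Q(3) + 1 happens to equal Q(4), so both approaches agree.
    print("CL=Q :", required_acks(quorum(rf)), "vs Q(RF+1) =", quorum(rf + 1))
    # CL=1: the requirement is 1 regardless of replica count, so only the
    # explicit "+1" raises it to 2.
    print("CL=1 :", required_acks(1), "vs unadjusted =", 1)
```

Under that definition of Q, the coincidence Q(RF) + 1 = Q(RF + 1) holds for 
odd RF such as 3 but not for even RF (Q(4) + 1 = 4, while Q(5) = 3).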

This would mean we may fail a few writes unnecessarily (a write to A or B is 
actually sufficient to satisfy CL=1, but this scheme would time that out) but 
never allow a write to succeed that would leave CL unsatisfied post-bootstrap 
(or if bootstrap is cancelled).
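
The trade-off can be checked by brute force for the four-node example. The 
sketch below enumerates every possible set of acknowledging nodes and compares 
the simplified rule against the "satisfy CL for both the pre- and 
post-bootstrap replica sets" condition. All names are hypothetical, and the 
model covers only this A, B, C, D scenario.

```python
from itertools import combinations

NODES = ["A", "B", "C", "D"]
OLD_REPLICAS = {"A", "B", "C"}  # owners if bootstrap is cancelled
NEW_REPLICAS = {"A", "B", "D"}  # owners once bootstrap completes


def safe(acked, cl):
    """CL is met for both the pre- and post-bootstrap replica sets."""
    return len(acked & OLD_REPLICAS) >= cl and len(acked & NEW_REPLICAS) >= cl


def accepted(acked, cl):
    """Simplified rule: count nodes independently, require one extra ack."""
    return len(acked) >= cl + 1


def all_ack_sets():
    """Yield every subset of NODES that could have acked the write."""
    for size in range(len(NODES) + 1):
        for combo in combinations(NODES, size):
            yield set(combo)


def check(cl):
    """Return (unsafe accepts, needless failures) for the given CL."""
    unsafe = [s for s in all_ack_sets() if accepted(s, cl) and not safe(s, cl)]
    needless = [s for s in all_ack_sets() if safe(s, cl) and not accepted(s, cl)]
    return unsafe, needless


if __name__ == "__main__":
    for cl in (1, 2, 3):
        unsafe, needless = check(cl)
        print(f"CL={cl}: unsafe accepts={unsafe}, needless failures={needless}")
```

In this model the simplified rule never accepts an unsafe write; the only 
writes it rejects needlessly are those acked by A alone or B alone at CL=1, 
and by exactly A and B at CL=2.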

> fix consistencylevel during bootstrap
> -------------------------------------
>
>                 Key: CASSANDRA-833
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-833
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5
>            Reporter: Jonathan Ellis
>            Assignee: Sylvain Lebresne
>             Fix For: 0.8.1
>
>
> As originally designed, bootstrap nodes should *always* get *all* writes 
> under any consistencylevel, so when bootstrap finishes the operator can run 
> cleanup on the old nodes without fear of losing data.
> But if a bootstrap operation fails or is aborted, that means all writes will 
> fail until the ex-bootstrapping node is decommissioned.  So starting in 
> CASSANDRA-722, we just ignore dead nodes in consistencylevel calculations.
> But this breaks the original design.  CASSANDRA-822 adds a partial fix for 
> this (just adding bootstrap targets into the RF targets and hinting 
> normally), but this is still broken under certain conditions.  The real fix 
> is to consider consistencylevel for two sets of nodes:
>   1. the RF targets as currently existing (no pending ranges)
>   2. the RF targets as they will exist after all movement ops are done
> If we satisfy CL for both sets then we will always be in good shape.
> I'm not sure if we can easily calculate 2. from the current TokenMetadata, 
> though.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
