It's important to point out the difference between Read Repair, in the context 
of the read_repair_chance setting, and Consistent Reads in the context of the 
CL setting. 

If RR is active on a request, it means the request is sent to ALL UP nodes for 
the key, and the RR process is ASYNC to the request. If all of the nodes 
involved in the request return to the coordinator before rpc_timeout, 
ReadCallback.maybeResolveForRepair() will put a repair task into the 
READ_REPAIR stage. This task compares the values and, IF there is a 
DigestMismatch, starts a Row Repair read that reads the data from all 
nodes and MAY result in differences being detected and fixed. 
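
As a rough sketch of that async path (illustrative pseudocode, not the real 
source; everything except ReadCallback.maybeResolveForRepair(), rpc_timeout 
and the READ_REPAIR stage is an assumed name):

    // Illustrative pseudocode for the async Read Repair path. Method
    // and field names other than maybeResolveForRepair() and
    // Stage.READ_REPAIR are assumptions, not the actual Cassandra code.
    void maybeResolveForRepair()
    {
        // Only fires if every replica contacted for the read answered
        // the coordinator before rpc_timeout.
        if (responseCount == contactedReplicaCount)
        {
            StageManager.getStage(Stage.READ_REPAIR).execute(new Runnable()
            {
                public void run()
                {
                    // Compare the digests from all replicas; on a
                    // DigestMismatch start a Row Repair read that reads
                    // full data from every node and repairs differences.
                    if (!digestsMatch(responses))
                        rowRepairRead(allUpReplicas);
                }
            });
        }
    }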

All of this is outside of the processing of your read request. It is separate 
from the stuff below.

Inside the user read request, when ReadCallback.get() is called and CL nodes 
have responded, the responses are compared. If a DigestMismatch happens, a 
Row Repair read is started and the result of this read is returned to the 
user. This Row Repair read MAY detect differences; if it does, it resolves the 
super set, sends the delta to the replicas, and returns the super set value to 
be returned to the client. 
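
And the blocking path, again as illustrative pseudocode with assumed names:

    // Illustrative pseudocode for the consistent read path in
    // ReadCallback.get(); all names besides get() are assumptions.
    Row get()
    {
        // Block until the CL number of nodes have responded.
        waitForResponses(blockFor);

        if (digestsMatch(responses))
            return dataResponse.row();

        // DigestMismatch: run a Row Repair read against the same nodes,
        // resolve the super set of all returned versions, send the
        // delta to out of date replicas, and return the super set.
        RowRepairResult repaired = rowRepairRead(contactedReplicas);
        return repaired.supersetRow();
    }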

> I'm still missing, how read repairs behave. Just extending your example for
> the following case: 
The example does not use Read Repair; it is handled by Consistent Reads. 

The purpose of RR is to reduce the probability that a read in the future using 
any of the replicas will result in a Digest Mismatch. "Any of the replicas" 
means ones that were not necessary for this specific read request. 

> 2. You do a write operation (W1) with quorum of val=2
> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
If the write has not completed, then it is not yet a successful write at the 
specified CL, as it could still fail.

Therefore the R + W > N Strong Consistency guarantee does not apply at this 
exact point in time. A read to the cluster at this exact point in time using 
QUORUM may return val2 or val1. Again, the operation W1 has not completed; if 
read R' starts and completes while W1 is processing, it may or may not return 
the result of W1.
 
> In this case, for read R1, the value val2 does not have a quorum. Would read
> R1 return val2 or val4 ? 

If val4 is in the memtable on the node before the second read, the result will 
be val4. 
Writes that happen between the initial read and the second read after a Digest 
Mismatch are included in the read result.

The way I think about consistency is "what value do reads see if writes stop":

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful 
reads are guaranteed to see the last write (see the arithmetic sketch below). 
* If you are using a low CL and/or had a failed write at QUORUM, then R + W < 
N. All successful reads will *eventually* see the last value written, and they 
are guaranteed to return the value of a previous write or no value. Eventually 
background Read Repair, Hinted Handoff or nodetool repair will repair the 
inconsistency. 
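
A minimal sketch of that arithmetic, assuming RF = 3 and the usual quorum 
definition of (N / 2) + 1:

    // Minimal illustration of the R + W > N rule. Assumes RF = 3 and
    // the standard quorum size (N / 2) + 1; this is a toy example, not
    // Cassandra code.
    public class QuorumMath
    {
        static int quorum(int n) { return n / 2 + 1; }

        public static void main(String[] args)
        {
            int n = 3;            // replication factor
            int r = quorum(n);    // QUORUM read  -> 2 replicas
            int w = quorum(n);    // QUORUM write -> 2 replicas
            // 2 + 2 > 3: every QUORUM read overlaps every successful
            // QUORUM write on at least one replica, so it must see the
            // last write.
            System.out.println("strongly consistent: " + (r + w > n));
        }
    }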

Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 4:39 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

>> Thanks Zhang. But, this again seems a little strange thing to do, since
>> one
>> (say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
>> read failure while there are still enough number of replicas (R1 and R3)
>> live to satisfy a read.
> 
> 
> He means in the case where all 3 nodes are live… if a node is down,
> naturally it redirects to the other node and still succeeds because it
> found 2 nodes even with one node down (feel free to test this live though
> !!!!!)
> 
>> 
>> Thanks for the example Dean. This definitely clears things up when you
>> have
>> an overlap between the read and the write, and one comes after the other.
>> I'm still missing, how read repairs behave. Just extending your example
>> for
>> the following case:
>> 
>> 1. node1 = val1 node2 = val1 node3 = val1
>> 
>> 2. You do a write operation (W1) with quorum of val=2
>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
>> 
>> 3. Now with a read (R1) from node1 and node2, a read repair will be
>> initiated that needs to write val2 on node 1.
>> node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not
>> complete
>> yet)
>> 
>> 4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
>> now arrives at node 1 but sees a newer value val4.
>> node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete,
>> read
>> repair val2 not complete)
>> 
>> In this case, for read R1, the value val2 does not have a quorum. Would
>> read
>> R1 return val2 or val4 ?
> 
>> 
> At this point as Manu suggests, you need to look at the code but most
> likely what happens is they lock that row, receive the write in memory (i.e.
> not losing it) and return to client, caching it so as soon as read-repair
> is over, it will write that next value.  I.e. your client would receive
> val2 and val4 would be the value in the database right after you received
> val2.  I.e. when a client interacts with Cassandra and you have tons of
> writes to a row, val1, val2, val3, val4 in a short time period, just like
> a normal database, your client may get one of those 4 values depending on
> where the read gets inserted in the order of the writes… same as a normal
> RDBMS.  The only thing you don't have is the atomic nature with other rows.
> 
> NOTICE: they would not have to cache val4 very long, and if a newer write
> came in, they would just replace it with that newer val and cache that one
> instead so it would not be a queue… but this is all just a guess… read the
> code if you really want to know.
> 
>> 
>> 
>> Zhang, Manu wrote
>>> And we don't send read request to all of the three replicas (R1, R2, R3)
>>> if CL=QUORUM; just 2 of them depending on proximity
>> 
>> 
>> 
>> 
>> 
> 
