It's important to point out the difference between Read Repair, in the context of the read_repair_chance setting, and Consistent Reads, in the context of the CL setting.
If RR is active on a request, the request is sent to ALL UP nodes for the key and the RR process is ASYNC to the request. If all of the nodes involved in the request return to the coordinator before rpc_timeout, ReadCallback.maybeResolveForRepair() will put a repair task into the READ_REPAIR stage. This will compare the values and, IF there is a DigestMismatch, it will start a Row Repair read that reads the data from all nodes and MAY result in differences being detected and fixed. All of this is outside the processing of your read request; it is separate from the Consistent Reads process below.

Inside the user read request, when ReadCallback.get() is called and CL nodes have responded, the responses are compared. If a DigestMismatch happens, a Row Repair read is started, and the result of this read is returned to the user. This Row Repair read MAY detect differences; if it does, it resolves the super set, sends the delta to the replicas, and returns the super set value to the client.

> I'm still missing, how read repairs behave. Just extending your example for
> the following case:

The example does not use Read Repair; it is handled by Consistent Reads. The purpose of RR is to reduce the probability that a read in the future using any of the replicas will result in a Digest Mismatch. "Any of the replicas" means ones that were not necessary for this specific read request.

> 2. You do a write operation (W1) with quorum of val=2
> node1 = val1 node2 = val2 node3 = val1 (write val2 is not complete yet)

If the write has not completed then it is not a successful write at the specified CL, as it could still fail. Therefore the R + W > N Strong Consistency guarantee does not apply at this exact point in time. A read to the cluster at this exact point in time using QUORUM may return val2 or val1. Again, the operation W1 has not completed; if read R' starts and completes while W1 is processing, it may or may not return the result of W1.
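The Consistent Reads path described above — compare CL responses, and on a DigestMismatch do a full read, resolve the newest version, and send the delta back to stale replicas — can be sketched as a toy model. All names here are hypothetical illustrations, not Cassandra's actual classes (the real logic lives in ReadCallback and the repair machinery), and "digest" is stood in for by the value itself:

```python
# Toy model of Consistent Reads at a given CL. Each replica holds a
# (value, timestamp) pair; the newest timestamp wins on conflict.

def consistent_read(replicas, cl):
    """Read from the first `cl` replicas; on a digest mismatch, resolve
    the newest version, push it to the stale contacted replicas (the
    'delta'), and return the resolved value to the client."""
    contacted = replicas[:cl]                  # CL nodes respond
    digests = {val for val, ts in contacted}   # stand-in for row digests
    if len(digests) > 1:                       # DigestMismatch
        winner = max(contacted, key=lambda vt: vt[1])  # newest wins
        for i in range(cl):                    # send delta to stale replicas
            if replicas[i] != winner:
                replicas[i] = winner
        return winner[0]                       # super set value -> client
    return contacted[0][0]

# Example: W1 wrote (val2, ts=2) to node2 only; a QUORUM read then
# contacts node1 and node2 and repairs node1 as a side effect.
replicas = [("val1", 1), ("val2", 2), ("val1", 1)]  # node1, node2, node3
print(consistent_read(replicas, cl=2))  # -> val2; node1 now holds val2
```

Note that node3 is untouched: it was not needed for this read, which is exactly the gap that background Read Repair, Hinted Handoff, or nodetool repair closes later.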
> In this case, for read R1, the value val2 does not have a quorum. Would read
> R1 return val2 or val4 ?

If val4 is in the memtable on the node before the second read, the result will be val4. Writes that happen between the initial read and the second read after a Digest Mismatch are included in the read result.

The way I think about consistency is "what value do reads see if writes stop":

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful reads are guaranteed to see the last write.
* If you are using a low CL and/or had a failed write at QUORUM, then R + W > N no longer holds. All successful reads will *eventually* see the last value written, and they are guaranteed to return the value of a previous write or no value. Eventually background Read Repair, Hinted Handoff or nodetool repair will repair the inconsistency.

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 4:39 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

>> Thanks Zhang. But, this again seems a little strange thing to do, since
>> one (say R2) of the 2 close replicas (say R1, R2) might be down, resulting in a
>> read failure while there are still enough number of replicas (R1 and R3)
>> live to satisfy a read.
>
> He means in the case where all 3 nodes are live… if a node is down,
> naturally it redirects to the other node and still succeeds because it
> found 2 nodes even with one node down (feel free to test this live though
> !!!!!)
>
>> Thanks for the example Dean. This definitely clears things up when you
>> have an overlap between the read and the write, and one comes after the other.
>> I'm still missing, how read repairs behave. Just extending your example
>> for the following case:
>>
>> 1. node1 = val1 node2 = val1 node3 = val1
>>
>> 2. You do a write operation (W1) with quorum of val=2
>> node1 = val1 node2 = val2 node3 = val1 (write val2 is not complete yet)
>>
>> 3.
>> Now with a read (R1) from node1 and node2, a read repair will be
>> initiated that needs to write val2 on node 1.
>> node1 = val1; node2 = val2; node3 = val1 (read repair val2 is not
>> complete yet)
>>
>> 4. Say, in the meanwhile node 1 receives a write val4; read repair for R1
>> now arrives at node 1 but sees a newer value val4.
>> node1 = val4; node2 = val2; node3 = val1 (write val4 is not complete,
>> read repair val2 not complete)
>>
>> In this case, for read R1, the value val2 does not have a quorum. Would
>> read R1 return val2 or val4 ?
>
> At this point, as Manu suggests, you need to look at the code, but most
> likely what happens is they lock that row, receive the write in memory (i.e.
> not losing it) and return to the client, caching it so that as soon as read
> repair is over, it will write that next value. I.e. your client would receive
> val2, and val4 would be the value in the database right after you received
> val2. I.e. when a client interacts with Cassandra and you have tons of
> writes to a row, val1, val2, val3, val4 in a short time period, just like
> a normal database, your client may get one of those 4 values depending on
> where the read gets inserted in the order of the writes… same as a normal
> RDBMS. The only thing you don't have is the atomic nature with other rows.
>
> NOTICE: they would not have to cache val4 very long, and if a newer write
> came in, they would just replace it with that newer val and cache that one
> instead, so it would not be a queue… but this is all just a guess… read the
> code if you really want to know.
>
>> Zhang, Manu wrote
>>> And we don't send read request to all of the three replicas (R1, R2, R3)
>>> if CL=QUORUM; just 2 of them depending on proximity
>>
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583372.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
>> Nabble.com.
>
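The R + W > N rule from Aaron's bullet points is easy to check mechanically: a read is strongly consistent exactly when every read quorum must intersect every write quorum. A minimal sketch (N=3 is illustrative; the function name is hypothetical):

```python
# Which (write CL, read CL) pairs give strong consistency on N replicas?
# Strong consistency requires the read set to overlap the write set,
# i.e. R + W > N.

def is_strongly_consistent(w, r, n=3):
    """True when any R-replica read must intersect any W-replica write."""
    return r + w > n

levels = {"ONE": 1, "QUORUM": 2, "ALL": 3}
for wname, w in levels.items():
    for rname, r in levels.items():
        kind = "strong" if is_strongly_consistent(w, r) else "eventual"
        print(f"W={wname:6} R={rname:6} -> {kind}")
```

For example, QUORUM writes with QUORUM reads give 2 + 2 > 3 (strong), while ONE writes with QUORUM reads give 1 + 2 = 3 (eventual only), matching the failed-QUORUM-write scenario discussed above.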