[ https://issues.apache.org/jira/browse/CASSANDRA-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925260#comment-15925260 ]
Jason Brown commented on CASSANDRA-10726:
-----------------------------------------

[~xiaolong...@gmail.com] Thanks for the comments; they add more context to the patch. Here's my initial round of comments on your patch:

- general nit: code style is incorrect wrt braces
- rename {{ConsistencyLevel#isMajority}} to {{ConsistencyLevel#isQuorum}}, as it's more in line with our general naming conventions
- {{AbstractReadExecutor#getReadRepairExtra}}
-- the function needs a javadoc comment
-- rename {{#readRepairExtra}} to ... ??? "spareReadRepairNode". I would like to make this an {{Optional<InetAddress>}}, as that makes it clearer that the value is not required (rather than piggybacking on "null" to carry that semantic meaning).
-- {{readRepairExtra = allSortedReplicas.get(blockForReplicas.size());}} - this seems unsafe, as it assumes the two {{List}}s are ordered the same.
- {{DataResolver}}
-- {{#reppairExtra}} is misspelled. Further, make it consistent with whatever we name it in {{AbstractReadExecutor}}.
-- {{#close}} - don't add the exception to the {{Tracing.trace()}} call, and it's debatable whether you need it on the {{logger.debug}} line, as well.
-- {{repairResponseRequestMap}} and {{#responseCntSnapshot}} - I don't think you actually need these. You only care about the number of replicas that have responded, so you can just call {{responses.size()}} at the point where you actually care (in {{#waitRepairToFinishWithPossibleRetry()}}).
-- {{#distinctHostNum()}} - I'm not sure why you need this, as we should only send the message to distinct hosts. I think you can just use {{results.size()}} instead.
-- {{#waitRepairToFinishWithPossibleRetry}}
--- wrt the block starting at line 224: you iterate over all the timed-out entries, but you need to cover the full merged set and send that to the "repairExtra" node. You don't know what data that replica has or needs, thus you need to send the full merged set.
I'm not sure off the top of my head where/how to grab the merged rows, but I'm sure you can figure it out ;)
--- Furthermore, once you have the merged set, you do not need to send multiple messages to the target node; you can send just one. Thus you can simplify the block starting at line 224 like so:
{code}
Tracing.trace("retry read-repair-mutation to {}", repairExtra);
PartitionUpdate update = // get merged result from some location ...
MessageOut messageOut = new Mutation(update).createMessage(MessagingService.Verb.READ_REPAIR);
AsyncOneResponse response = MessagingService.instance().sendRR(messageOut, repairExtra);
response.get(waitTimeNanos, TimeUnit.NANOSECONDS); // TimeoutException will be thrown ....
{code}
The biggest thing this patch needs is testing. You might be able to unit test this one (in fact, [~xiaolong...@gmail.com] and I spoke offline about some ideas for how to do it), but it will take some time (worthwhile, in my opinion). A dtest will probably be required as well, even though that will get tricky - byteman will probably be necessary to help you out.

> Read repair inserts should not be blocking
> ------------------------------------------
>
> Key: CASSANDRA-10726
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10726
> Project: Cassandra
> Issue Type: Improvement
> Components: Coordination
> Reporter: Richard Low
> Assignee: Xiaolong Jiang
> Fix For: 3.0.x
>
>
> Today, if there's a digest mismatch in a foreground read repair, the insert
> to update out-of-date replicas is blocking. This means that, if it fails, the
> read fails with a timeout. If a node is dropping writes (maybe it is
> overloaded, or the mutation stage is backed up for some other reason), all
> reads to a replica set could fail. Further, replicas dropping writes get more
> out of sync, so they will require more read repair.
> The comment on the code for why the writes are blocking is:
> {code}
> // wait for the repair writes to be acknowledged, to minimize impact on any replica that's
> // behind on writes in case the out-of-sync row is read multiple times in quick succession
> {code}
> but the bad side effect is that reads time out. Either the writes should not
> be blocking, or we should return success for the read even if the write times
> out.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
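The "writes should not be blocking" alternative the ticket asks for could look roughly like the following. This is a minimal, hypothetical sketch, not Cassandra's actual internals: {{sendRepairMutation}} stands in for {{MessagingService#sendRR}}, {{CompletableFuture}} stands in for {{AsyncOneResponse}}, and plain {{String}} replica names stand in for {{InetAddress}}. The {{Optional}} parameter illustrates the "spareReadRepairNode" naming/typing suggestion from the review above.

{code:java}
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a fire-and-forget read-repair write: the read path triggers the
// mutation but never blocks on its acknowledgement, so a slow or dropping
// replica cannot fail the read. All names here are hypothetical stand-ins.
public class NonBlockingRepairSketch
{
    static final AtomicInteger acked = new AtomicInteger();

    // Stand-in for sending a READ_REPAIR mutation; the future completes
    // when the replica acknowledges the write.
    static CompletableFuture<Void> sendRepairMutation(String replica)
    {
        return CompletableFuture.runAsync(acked::incrementAndGet);
    }

    // Send the repair write to the spare node, if one was selected, without
    // blocking the read. A failure or timeout is swallowed (it would be
    // logged in practice) instead of being surfaced as a read timeout.
    static CompletableFuture<Void> repairAsync(Optional<String> spareReadRepairNode)
    {
        return spareReadRepairNode
               .map(node -> sendRepairMutation(node)
                            .exceptionally(err -> null)) // the read already returned
               .orElse(CompletableFuture.completedFuture(null));
    }
}
{code}

The read path would call {{repairAsync(...)}} and return its result to the client immediately; only background bookkeeping (metrics, logging) hangs off the returned future.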