[ https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791756#comment-17791756 ]
Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:20 PM:
------------------------------------------------------------------------------
Sure. Say we are reading at LOCAL_QUORUM with a replication factor of 3. Let's walk through the input parameters of the following constructor:
{code:java}
public BlockingPartitionRepair(DecoratedKey key, Map<Replica, Mutation> repairs, ReplicaPlan.ForWrite repairPlan, Predicate<InetAddressAndPort> shouldBlockOn)
{code}
* {{repairPlan}}: This could contain two local nodes _L1, L2_ and one remote node _R1_, because the _Selector_ [1] may pick a remote node, as [~curlylrt] mentioned above. So the {{repairPlan}} parameter holds {{L1, L2, R1}}.
* {{repairs}}: This contains _L1_ and _R1_, because those two replicas might be stale and require an update. _L2_ is not included as it already has the latest data.
* At the following line, {{pendingRepairs}} therefore holds two nodes, {{L1, R1}}:
{code:java}
this.pendingRepairs = new ConcurrentHashMap<>(repairs);
{code}
* {{adjustedBlockFor}} is set to 3 at the following line:
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}
* Let's evaluate the following condition for _L1_, _L2_, and _R1_:
{code:java}
if (!repairs.containsKey(participant) && shouldBlockOn.test(participant.endpoint()))
{code}
** _L1_: _false_, as it is a repair participant (present in {{repairs}})
** _L2_: _true_, as it is not a participant and is local // {{adjustedBlockFor}} 3 --> 2
** _R1_: _false_, as it is a repair participant
* The following lines set {{blockFor}} and the latch count to 2:
{code:java}
this.blockFor = adjustedBlockFor;
...
latch = newCountDownLatch(Math.max(blockFor, 0));
{code}
* But when we receive the response from _R1_, line 122 [2] {color:#de350b}excludes{color} it because _R1_ is remote; as a result, the latch is never decremented for that response and the read blocks until timeout:
{code:java}
void ack(InetAddressAndPort from)
{
    if (shouldBlockOn.test(from))
    {
        pendingRepairs.remove(repairPlan.lookup(from));
        latch.decrement();
    }
}
{code}
[1] [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L411]
[2] [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L122C9-L122C9]

> local consistencies may get timeout if blocking read repair is sending the
> read repair mutation to other DC
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19120
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Runtian Liu
>            Priority: Normal
>         Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two-DC cluster setup: when a new node is being added to DC1, a blocking read repair triggered by a LOCAL_QUORUM read in DC1 must send the read repair mutation to an extra node (1)(2). The selector for read repair may pick *any* node that has not been contacted before (3) instead of preferring DC1 nodes. If a node from DC2 is selected, the read will time out every time because of the bug described below:
> When we initialize the latch (4) for blocking read repair, the shouldBlockOn predicate only returns true for local nodes (5), and blockFor is reduced if a local node does not require repair (6). blockFor then equals the number of read repair mutations sent out. But when the coordinator receives responses from the target nodes, the latch is only counted down for nodes in the same DC (7). The latch waits until timeout and the read request times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a node, if you have some way to trigger blocking read repair (for example by adding load with the stress tool). With LOCAL_QUORUM consistency and a constant read-after-write load in the DC where the node is being added, you will see read timeouts from time to time because of the bug described above.
>
> I think that when read repair selects the extra node to repair, it should prefer local nodes over nodes from the other DC. We also need to fix the latch accounting so that even if we send a mutation to nodes in the other DC, the read does not time out.
> (1) [https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3) [https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4) [https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5) [https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6) [https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7) [https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
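The latch accounting described in the comment above can be sketched as a standalone simulation. This is a minimal model, not the real Cassandra classes: the class name, the string node labels ({{L1}}, {{L2}}, {{R1}}), and the "starts with L means local" predicate are all hypothetical stand-ins for {{BlockingPartitionRepair}}, the replica plan, and the real {{shouldBlockOn}} check.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of the blockFor/latch accounting in BlockingPartitionRepair.
// Node names are placeholders: L* = local DC, R* = remote DC.
public class LatchAccountingDemo
{
    // Stand-in for shouldBlockOn: true only for nodes in the local DC.
    static final Predicate<String> SHOULD_BLOCK_ON = node -> node.startsWith("L");

    // Returns the latch count left over after all acks, following the walkthrough.
    static int remainingLatch()
    {
        List<String> repairPlanContacts = List.of("L1", "L2", "R1"); // two local + one remote
        Set<String> repairs = Set.of("L1", "R1");                    // stale replicas that get a mutation

        int adjustedBlockFor = 3; // models repairPlan.writeQuorum() in this scenario
        for (String participant : repairPlanContacts)
        {
            // Up-to-date local replicas are removed from the count (L2: 3 --> 2).
            if (!repairs.contains(participant) && SHOULD_BLOCK_ON.test(participant))
                adjustedBlockFor--;
        }

        int latch = Math.max(adjustedBlockFor, 0); // latch initialised to 2

        // Both repaired replicas ack, but ack() ignores the remote node R1.
        for (String from : List.of("L1", "R1"))
        {
            if (SHOULD_BLOCK_ON.test(from))
                latch--;
        }
        return latch; // stuck at 1: the latch never reaches zero, so the read blocks until timeout
    }

    public static void main(String[] args)
    {
        System.out.println("remaining latch = " + remainingLatch());
    }
}
```

The simulation ends with one outstanding count on the latch even though every repaired replica responded, which is exactly the mismatch the bug report describes: the latch is sized for the mutations sent, but only local acks are allowed to count it down.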