[ https://issues.apache.org/jira/browse/CASSANDRA-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495254#comment-16495254 ]
Christian Spriegel edited comment on CASSANDRA-14480 at 5/30/18 2:55 PM:
-------------------------------------------------------------------------

I did some more testing and tried the following change in StorageProxy.SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch():
{code:java}
repairHandler = new ReadCallback(resolver,
                                 ConsistencyLevel.ALL,
                                 consistency.blockFor(keyspace), // was: executor.getContactedReplicas().size()
                                 command,
                                 keyspace,
                                 executor.handler.endpoints);
{code}
This fixed the issue in my test scenario. But it causes the read-repair to repair only 2 out of my 3 replicas, even in cases where all 3 replicas are available.

I could imagine an alternative solution where maybeAwaitFullDataRead() would wait for 3 replicas, but in case of a ReadTimeoutException (RTE) it could check whether 2 responded and treat that as a successful read.


was (Author: christianmovi):
I did some more testing and tried the following change in StorageProxy.SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch():
{code:java}
repairHandler = new ReadCallback(resolver,
                                 ConsistencyLevel.ALL,
                                 consistency.blockFor(keyspace), // was: executor.getContactedReplicas().size()
                                 command,
                                 keyspace,
                                 executor.handler.endpoints);
{code}
This fixed the issue in my test scenario. But it causes the read-repair to repair only 2 out of my 3 replicas, even in cases where all 3 replicas are available.

> Digest mismatch requires all replicas to be responsive
> ------------------------------------------------------
>
>                 Key: CASSANDRA-14480
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14480
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Christian Spriegel
>            Priority: Major
>         Attachments: Reader.java, Writer.java, schema_14480.cql
>
>
> I ran across a scenario where a digest mismatch causes a read-repair that
> requires all up nodes to be able to respond.
> If one of these nodes is not
> responding, then the read-repair is reported to the client as a
> ReadTimeoutException.
>
> My expectation would be that a CL=QUORUM read will always succeed as long as 2 nodes
> are responding. Unfortunately, the third node being "up" in the ring but
> unable to respond does lead to an RTE.
>
>
> I came up with a scenario that reproduces the issue:
> # set up a 3-node cluster using ccm
> # increase the phi_convict_threshold to 16, so that nodes are permanently
> reported as up
> # create the attached schema
> # run the attached reader and writer (which only connect to node1 and node2). This should
> already produce digest mismatches
> # do a "ccm node3 pause"
> # The reader will report a read timeout with consistency QUORUM (2 responses
> were required but only 1 replica responded). Within the
> DigestMismatchException catch-block it can be seen that the repairHandler is
> waiting for 3 responses, even though the exception says that 2 responses are
> required.
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
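
The alternative mentioned in the comment above (wait for all contacted replicas, but on timeout accept the read if enough replicas for the consistency level have answered) could look roughly like the sketch below. This is only an illustration under assumptions: LenientRepairWait, onResponse() and awaitMillis() are hypothetical names made up for this example and are not part of Cassandra's ReadCallback or StorageProxy. The class models nothing more than the counting logic: "contacted" stands for the number of replicas the repair read was sent to (3 in the scenario), and "blockFor" for the number of responses the consistency level requires (2 for QUORUM).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for a read-repair callback. It is NOT Cassandra's
 * ReadCallback; it only models "wait for everyone, settle for blockFor".
 */
final class LenientRepairWait {
    private final int blockFor;                 // responses the consistency level needs
    private final AtomicInteger received = new AtomicInteger();
    private final CountDownLatch allResponded;  // opens once every contacted replica answered

    LenientRepairWait(int contacted, int blockFor) {
        if (blockFor < 1 || blockFor > contacted) {
            throw new IllegalArgumentException("need 1 <= blockFor <= contacted");
        }
        this.blockFor = blockFor;
        this.allResponded = new CountDownLatch(contacted);
    }

    /** Called once for each replica response that arrives. */
    void onResponse() {
        received.incrementAndGet();
        allResponded.countDown();
    }

    /**
     * Waits up to timeoutMillis for all contacted replicas. If the wait times
     * out (or is interrupted), the read still counts as successful when at
     * least blockFor replicas have responded by then.
     */
    boolean awaitMillis(long timeoutMillis) {
        try {
            if (allResponded.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true; // every contacted replica answered; full repair is possible
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return received.get() >= blockFor;      // fall back to the consistency level
    }
}
```

With 3 contacted replicas and blockFor = 2, two calls to onResponse() followed by awaitMillis(...) would return true after the timeout, while a single response would return false, which corresponds to the case that should still surface as a read timeout. Whether the repair mutations should then be sent only to the replicas that responded is a separate question that this sketch does not address.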