[ https://issues.apache.org/jira/browse/CASSANDRA-17411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555089#comment-17555089 ]
Andres de la Peña edited comment on CASSANDRA-17411 at 6/16/22 1:43 PM: ------------------------------------------------------------------------ Looks good to me, +1. The new dtest reproduces the issue in the most clear way. Maybe we can combine the new filter and the one right before into a single one to save us a copy of the collection of replicas: {code:java} replicas = replicas.filter(replica -> FailureDetector.instance.isAlive(replica.endpoint()) && StorageService.instance.isRpcReady(replica.endpoint())); {code} However I doubt it would be worth the loss of readability. The test failure of {{HostReplacementTest}} on 4.0 seems to be CASSANDRA-16953, and the failure of {{RepairErrorsTest}} on 4.1 looks like the same timeout on {{{}Cluster#close(){}}}. Both can be reproduced without the patch, so CI looks clean to me. was (Author: adelapena): Looks good to me, +1. The new dtest reproduces the issue in the most clear way. Maybe we can combine this filter and the one right before into a single one to save us a copy of the collection of replicas: {code:java} replicas = replicas.filter(replica -> FailureDetector.instance.isAlive(replica.endpoint()) && StorageService.instance.isRpcReady(replica.endpoint())); {code} However I doubt it would be worth the loss of readability. The test failure of {{HostReplacementTest}} on 4.0 seems to be CASSANDRA-16953, and the failure of {{RepairErrorsTest}} on 4.1 looks like the same timeout on {{{}Cluster#close(){}}}. Both can be reproduced without the patch, so CI looks clean to me. > Network partition causes write ONE timeouts when using counters in Cassandra 4 > ------------------------------------------------------------------------------ > > Key: CASSANDRA-17411 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17411 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination > Reporter: Pere Balaguer > Assignee: Brandon Williams > Priority: Normal > Fix For: 4.0.x, 4.1-beta, 4.x > > Attachments: app.py > > > h5. Affected versions: > * 4.x > h5. Observed behavior: > When executing CL=ONE writes on a table with a counter column, if one of the > nodes is network partitioned from the others, clients keep sending requests > to it. > Even though this may be a "driver" problem, I've been able to reproduce it > with both java and python datastax drivers using their latest available > versions and given the behavior only changes depending on the Cassandra > version, well, here I am. > h5. Expected behavior: > In Cassandra 3 after all inflight requests fail (expected), no new requests > are sent to the partitioned node. The expectation is that Cassandra 4 behaves > the same way. > h5. How to reproduce: > {noformat} > # Create a cluster with the desired version, will go with 4.x for this example > ccm create bug-report -v 4.0.3 > ccm populate -n 2:2:2 > ccm start > # Create schemas and so on > CQL=$(cat <<END > CONSISTENCY ALL; > DROP KEYSPACE IF EXISTS demo; > CREATE KEYSPACE demo WITH REPLICATION = {'class': 'NetworkTopologyStrategy', > 'dc1': 2, 'dc2': 2, 'dc3': 2}; > CREATE TABLE demo.demo (pk uuid PRIMARY KEY, count counter) WITH compaction = > {'class': 'LeveledCompactionStrategy'}; > END > ) > ccm node1 cqlsh --verbose --exec="${CQL}" > # Launch the attached app.py > # requires cassandra-driver > python3 app.py "127.0.0.1" "9042" > # Wait a bit for the app to settle, proceed to next step once you see 3 > messages in stdout like: > # 2022-03-01 15:41:51,557 - target-dc2 - __main__ - INFO - Got 0/572 > (0.000000) timeouts/total_rqs in the last 1 minute > # Partition one node with iptables > iptables -A INPUT -p tcp --destination 127.0.0.1 --destination-port 7000 -j > DROP; iptables -A INPUT -p tcp --destination 127.0.0.1 --destination-port > 9042 -j DROP > {noformat} > Some time after executing the iptables command in cassandra-3 the output > should be similar to: > {noformat} > 2022-03-01 15:41:51,557 - target-dc2 - __main__ - INFO - Got 0/572 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:41:51,576 - target-dc3 - __main__ - INFO - Got 0/572 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:41:58,032 - target-dc1 - __main__ - INFO - Got 6/252 (2.380952) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:42:51,560 - target-dc2 - __main__ - INFO - Got 0/570 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:42:51,620 - target-dc3 - __main__ - INFO - Got 0/570 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:42:58,101 - target-dc1 - __main__ - INFO - Got 2/354 (0.564972) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:43:51,602 - target-dc2 - __main__ - INFO - Got 0/571 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:43:51,672 - target-dc3 - __main__ - INFO - Got 0/571 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:43:58,153 - target-dc1 - __main__ - INFO - Got 0/572 (0.000000) > timeouts/total_rqs in the last 1 minute > {noformat} > as the timeouts/rqs shows, in about 2 minutes the partitioned node stops > receiving traffic > while as in cassandra-4 > {noformat} > 2022-03-01 15:49:39,068 - target-dc3 - __main__ - INFO - Got 0/566 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:49:39,107 - target-dc2 - __main__ - INFO - Got 0/566 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:49:41,206 - target-dc1 - __main__ - INFO - Got 2/444 (0.450450) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:50:39,095 - target-dc3 - __main__ - INFO - Got 0/569 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:50:39,148 - target-dc2 - __main__ - INFO - Got 0/569 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:50:42,589 - target-dc1 - __main__ - INFO - Got 7/13 (53.846154) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:51:39,125 - target-dc3 - __main__ - INFO - Got 0/567 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:51:39,159 - target-dc2 - __main__ - INFO - Got 0/567 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:51:44,902 - target-dc1 - __main__ - INFO - Got 12/22 > (54.545455) timeouts/total_rqs in the last 1 minute > 2022-03-01 15:52:39,188 - target-dc2 - __main__ - INFO - Got 0/567 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:52:39,202 - target-dc3 - __main__ - INFO - Got 0/568 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:52:46,994 - target-dc1 - __main__ - INFO - Got 12/20 > (60.000000) timeouts/total_rqs in the last 1 minute > 2022-03-01 15:53:39,202 - target-dc2 - __main__ - INFO - Got 0/568 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:53:39,238 - target-dc3 - __main__ - INFO - Got 0/568 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:53:48,765 - target-dc1 - __main__ - INFO - Got 12/17 > (70.588235) timeouts/total_rqs in the last 1 minute > 2022-03-01 15:54:39,203 - target-dc2 - __main__ - INFO - Got 0/568 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:54:39,291 - target-dc3 - __main__ - INFO - Got 0/568 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:54:51,291 - target-dc1 - __main__ - INFO - Got 12/24 > (50.000000) timeouts/total_rqs in the last 1 minute > 2022-03-01 15:55:39,258 - target-dc2 - __main__ - INFO - Got 0/570 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:55:39,327 - target-dc3 - __main__ - INFO - Got 0/570 (0.000000) > timeouts/total_rqs in the last 1 minute > 2022-03-01 15:55:53,365 - target-dc1 - __main__ - INFO - Got 12/20 > (60.000000) timeouts/total_rqs in the last 1 minute > {noformat} > as the timeouts/rqs shows, the partitioned node keeps receving traffic until > the iptables rules are removed -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org