[ https://issues.apache.org/jira/browse/CASSANDRA-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516806#comment-17516806 ]
Alex Petrov edited comment on CASSANDRA-17140 at 4/4/22 2:42 PM:
-----------------------------------------------------------------
bq. Unfortunately it is not the same amount of work on both sides.

I think it may be less work to write a new test that does more than to keep fixing this one. I'm not suggesting everyone do this; I'm saying I think it's just a more productive thing to do. In other words, the next dtest I fix, I'll port right away. I realise my wording did not reflect that precisely. Sorry for not being specific.

bq. What did I miss?

I've mentioned above that this error should be transient. Nodes switch to the new hash version asynchronously: there is no single moment when all nodes agree on the version; instead, every node [flips behaviour|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L636] at its own time. Once a node switches, it starts returning the new hash. The first query with the old hash (for example, immediately after a bounce) will fail with a digest mismatch. However, while failing, it will also [prepare both hashes|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L680-L683], so the subsequent query will "just" succeed. I think we've also explained this [while working on the patch|https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#diff-3ce9f6c4626262d6a4be26ecf3a57ef11fdb55537448def0c736115417487115R438].

So I can see two ways out of this:

1.
*Avoid mismatches.* Because [this case|https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#diff-3ce9f6c4626262d6a4be26ecf3a57ef11fdb55537448def0c736115417487115R487] is the only one where we won't change the hash, I was suggesting removing {{USE}} (this would mean that during prepare and after upgrade {{qualifiedWithKeyspace}} will be null, and we'll return {{qualifiedWithoutKeyspace}} in both cases) and adding the keyspace to the executed statements.

2. *Retry on mismatches.* Simply add a retry that tolerates failures during a diverged gossip state and retries the failed query.

In other words, I believe there's no bug in the code; we did validate this behaviour and made sure it's transitory. It's just that at this particular moment I cannot allocate time to fix this dtest, for which I'm sorry and hope you'll understand. Hope you find this explanation satisfactory.

> Broken test_rolling_upgrade - upgrade_tests.upgrade_through_versions_test.TestUpgrade_indev_3_0_x_To_indev_4_0_x
> ----------------------------------------------------------------------------------------------------------------
>
>             Key: CASSANDRA-17140
>             URL: https://issues.apache.org/jira/browse/CASSANDRA-17140
>         Project: Cassandra
>      Issue Type: Bug
>      Components: CI
>        Reporter: Yifan Cai
>        Assignee: Berenguer Blasi
>        Priority: Normal
>         Fix For: 4.0.x
>
> The tests "test_rolling_upgrade" fail with the below error.
> [https://app.circleci.com/pipelines/github/yifan-c/cassandra/279/workflows/6340cd42-0b27-42c2-8418-9f8b56c57bea/jobs/1990]
>
> I am able to always reproduce it by running the test locally too.
>
> {{$ pytest --execute-upgrade-tests-only --upgrade-target-version-only --upgrade-version-selection all --cassandra-version=4.0 upgrade_tests/upgrade_through_versions_test.py::TestUpgrade_indev_3_11_x_To_indev_4_0_x::test_rolling_upgrade}}
>
> {code:python}
> self = <upgrade_tests.upgrade_through_versions_test.TestUpgrade_indev_3_0_x_To_indev_4_0_x object at 0x7ffba4242fd0>
> subprocs = [<Process(Process-1, stopped[SIGKILL] daemon)>, <Process(Process-2, stopped[1] daemon)>]
>
>     def _check_on_subprocs(self, subprocs):
>         """
>         Check on given subprocesses.
>
>         If any are not alive, we'll go ahead and terminate any remaining
>         alive subprocesses since this test is going to fail.
>         """
>         subproc_statuses = [s.is_alive() for s in subprocs]
>         if not all(subproc_statuses):
>             message = "A subprocess has terminated early. Subprocess statuses: "
>             for s in subprocs:
>                 message += "{name} (is_alive: {aliveness}), ".format(name=s.name, aliveness=s.is_alive())
>             message += "attempting to terminate remaining subprocesses now."
>             self._terminate_subprocs()
>
>             raise RuntimeError(message)
> E   RuntimeError: A subprocess has terminated early. Subprocess statuses: Process-1 (is_alive: True), Process-2 (is_alive: False), attempting to terminate remaining subprocesses now.{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
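Editor's note: to illustrate why the two sides can disagree at all, here is a minimal sketch of how a prepared-statement ID can diverge depending on whether the keyspace is folded into the digest. This is NOT Cassandra's actual implementation (the real scheme lives in {{QueryProcessor.computeId}}); the function name {{prepared_id}} and the exact payload layout are illustrative assumptions only.

```python
import hashlib

def prepared_id(query, keyspace=None):
    # Illustrative only: derive a statement ID from an MD5 digest of the
    # query text, optionally prefixed with the keyspace. A node that hashes
    # with the keyspace and a node that hashes without it will compute
    # different IDs for the same statement -- hence the digest mismatch
    # during a rolling upgrade.
    payload = (keyspace or "") + query
    return hashlib.md5(payload.encode()).hexdigest()

# The same query yields two different IDs depending on keyspace handling:
old_style = prepared_id("SELECT * FROM t WHERE k = ?")
new_style = prepared_id("SELECT * FROM t WHERE k = ?", keyspace="ks")
```

Under this sketch, a client holding {{old_style}} will not match a server that now computes {{new_style}}, which is exactly the transient window the comment describes.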
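Editor's note: the "retry on mismatches" option from the comment above could be sketched in the dtest as a small wrapper. The names {{DigestMismatch}}, {{execute_with_retry}}, and the {{session.execute}} shape are hypothetical stand-ins, not the actual driver API; the point is only that one failed attempt re-prepares both hashes server-side, so a single retry should succeed.

```python
import time

class DigestMismatch(Exception):
    """Illustrative stand-in for the error raised when a prepared-statement
    hash computed before the upgrade no longer matches the server's hash."""

def execute_with_retry(session, statement, retries=3, delay=0.1):
    """Retry a query that may fail once while nodes disagree on the
    prepared-statement hash version. Per the comment above, the failing
    attempt also prepares both hashes server-side, so a retry succeeds."""
    for attempt in range(retries):
        try:
            return session.execute(statement)
        except DigestMismatch:
            if attempt == retries - 1:
                raise  # exhausted retries: the mismatch was not transient
            time.sleep(delay)  # brief pause before retrying the same query
```

A dtest could wrap its post-bounce queries in {{execute_with_retry}} rather than asserting that the very first request succeeds.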