[ https://issues.apache.org/jira/browse/CASSANDRA-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369166#comment-17369166 ]
Jon Meredith commented on CASSANDRA-16759:
------------------------------------------

I've been investigating a few of the test failures. They seem to be caused by the node not waiting to receive an up-to-date schema: it starts bootstrap with the default schema, which contains no non-system keyspaces, so it does not stream anything.

In 4.0, MigrationCoordinator is responsible for waiting until all schema has been received, and it is told about schema versions by the StorageService.onChange listener. It only processes the ApplicationState.SCHEMA entries if the endpoint exists in TokenMetadata. Endpoints are added to TokenMetadata when StorageService.onJoin handles the STATUS or STATUS_WITH_PORT application states.

The EnumMap.values() that onJoin iterates over seems to return the application states in the order they are defined in the enum, so if STATUS is present, it comes first and all is good. If STATUS is not present, as when a 4.0 cluster thinks there are no nodes with a version lower than 4.0 and gossip filters it out, then only the items in ApplicationState after STATUS_WITH_PORT (currently only SSTABLE_VERSIONS) will be processed by onChange. It then takes a subsequent gossip of each earlier ApplicationState to apply the other states, which is making the tests racy.

This is all very fiddly and I'm not 100% sure that's the exact sequence, but there is definitely a change in behavior when nodes switch to not having STATUS any more.

I've pushed up a minimal change to onJoin to make it behave [on a branch|https://github.com/jonmeredith/cassandra/pull/new/marcuse/16759-fix-status-with-port], with [CircleCI here|https://app.circleci.com/pipelines/github/jonmeredith/cassandra?branch=marcuse%2F16759-fix-status-with-port].

A possibly cleaner alternative would be to sort with a custom key comparator, but I wasn't sure about its performance during gossip storms.
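The ordering problem described above can be sketched in isolation. This is only an illustration, not the change on the branch: the State enum is a hypothetical, heavily trimmed stand-in for ApplicationState (only the relative order of the constants matters), the Node class is a made-up stand-in for TokenMetadata plus onChange, and the sketch assumes that every non-status state is dropped until the endpoint is known.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class OnJoinOrderingSketch {
    // Hypothetical, trimmed stand-in for ApplicationState. Only the relative
    // declaration order is meant to mirror the situation in the comment.
    enum State { STATUS, SCHEMA, STATUS_WITH_PORT, SSTABLE_VERSIONS }

    // Made-up stand-in for TokenMetadata membership plus the onChange listener.
    static final class Node {
        boolean inTokenMetadata = false;
        final List<State> applied = new ArrayList<>();

        void onChange(State state) {
            if (state == State.STATUS || state == State.STATUS_WITH_PORT) {
                inTokenMetadata = true;   // endpoint becomes known
                applied.add(state);
            } else if (inTokenMetadata) {
                applied.add(state);       // processed normally
            }
            // Assumption: anything else is silently dropped until a later gossip round.
        }
    }

    // Iterates in EnumMap order, which is the enum declaration order.
    static List<State> joinInEnumOrder(Map<State, String> states) {
        Node node = new Node();
        for (State s : states.keySet())
            node.onChange(s);
        return node.applied;
    }

    // One possible ordering fix: handle the status states before everything else.
    static List<State> joinStatusFirst(Map<State, String> states) {
        Node node = new Node();
        if (states.containsKey(State.STATUS))
            node.onChange(State.STATUS);
        if (states.containsKey(State.STATUS_WITH_PORT))
            node.onChange(State.STATUS_WITH_PORT);
        for (State s : states.keySet())
            if (s != State.STATUS && s != State.STATUS_WITH_PORT)
                node.onChange(s);
        return node.applied;
    }

    public static void main(String[] args) {
        // Gossip state with STATUS filtered out, as in an all-4.0 cluster.
        Map<State, String> states = new EnumMap<>(State.class);
        states.put(State.SSTABLE_VERSIONS, "big-nb");
        states.put(State.SCHEMA, "some-schema-version");
        states.put(State.STATUS_WITH_PORT, "NORMAL");

        System.out.println(joinInEnumOrder(states)); // [STATUS_WITH_PORT, SSTABLE_VERSIONS]
        System.out.println(joinStatusFirst(states)); // [STATUS_WITH_PORT, SCHEMA, SSTABLE_VERSIONS]
    }
}
```

In the first call SCHEMA is visited before STATUS_WITH_PORT and is lost, regardless of the order in which the entries were put into the map. In the second call it is applied because the endpoint is already known by the time it is visited.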
> Avoid memoizing the wrong min cluster version during upgrades
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-16759
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16759
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>            Priority: Normal
>             Fix For: 4.0-rc2
>
>
> CASSANDRA-16525 avoids trying to calculate the cluster min version if gossiper is not enabled.
> This makes us memoize the wrong version for up to a minute causing us to send 4.0-messages to 3.0 nodes, for example in [ColumnFilter|https://github.com/apache/cassandra/blob/05beda90a9206db165a3997a736ecb06f8dc695e/src/java/org/apache/cassandra/db/filter/ColumnFilter.java#L210]
> This was discovered by python upgrade dtests, [here|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/993/workflows/2afef6f0-1356-41f6-93dc-5385ac19dca1/jobs/5977/tests#failed-test-0] after reverting CASSANDRA-15899 in CASSANDRA-16735