[ 
https://issues.apache.org/jira/browse/CASSANDRA-16759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369166#comment-17369166
 ] 

Jon Meredith commented on CASSANDRA-16759:
------------------------------------------

I've been investigating a few of the test failures. They seem to be caused by 
the node not waiting to receive an up-to-date schema before starting bootstrap. 
It bootstraps with the default schema, which contains no non-system keyspaces, 
so it does not do any streaming.

In 4.0, MigrationCoordinator is responsible for waiting until all schema has 
been received, and it is told about schema versions by the 
StorageService.onChange listener. It only processes the ApplicationState.SCHEMA 
entries if the endpoint exists in TokenMetadata.

Endpoints are added to TokenMetadata when StorageService.onJoin handles the 
STATUS or STATUS_WITH_PORT application states.

The EnumMap.values() that onJoin iterates over seems to return the application 
states in the order they are defined in the enum, so if STATUS is present, it 
comes first and all is good.

If STATUS is not present, as happens when a 4.0 cluster thinks there are no 
nodes with a version lower than 4.0 and gossip filters it out, then only the 
items in ApplicationState after STATUS_WITH_PORT (currently only 
SSTABLE_VERSIONS) will be processed by onChange. It then takes a subsequent 
gossip of that ApplicationState to apply the other states, which makes the 
tests racy.
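
To make the ordering issue concrete, here is a small standalone sketch. It is 
not Cassandra code: the State enum is a hypothetical, heavily trimmed stand-in 
for ApplicationState, and applied() is an assumed simplification of the 
onJoin/onChange interaction in which a state is only applied once the endpoint 
has been registered by STATUS or STATUS_WITH_PORT.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class GossipOrderSketch {
    // Hypothetical stand-in for ApplicationState; only the relative order matters.
    enum State { STATUS, SCHEMA, STATUS_WITH_PORT, SSTABLE_VERSIONS }

    // Assumed simplification: a state is applied only after the endpoint is known,
    // and the endpoint becomes known when STATUS or STATUS_WITH_PORT is handled.
    static List<State> applied(Map<State, String> states) {
        boolean endpointKnown = false;
        List<State> result = new ArrayList<>();
        // EnumMap iterates its keys in enum declaration order, not insertion order.
        for (State s : states.keySet()) {
            if (s == State.STATUS || s == State.STATUS_WITH_PORT)
                endpointKnown = true;
            if (endpointKnown)
                result.add(s);
        }
        return result;
    }

    // Builds a sample gossip state, inserting in reverse order on purpose.
    static Map<State, String> sample(boolean includeStatus) {
        Map<State, String> states = new EnumMap<>(State.class);
        states.put(State.SSTABLE_VERSIONS, "placeholder");
        states.put(State.STATUS_WITH_PORT, "placeholder");
        states.put(State.SCHEMA, "placeholder");
        if (includeStatus)
            states.put(State.STATUS, "placeholder");
        return states;
    }

    public static void main(String[] args) {
        System.out.println("with STATUS:    " + applied(sample(true)));
        System.out.println("without STATUS: " + applied(sample(false)));
    }
}
```

In this model, when STATUS is present every state is applied. When it is 
filtered out, SCHEMA is visited before STATUS_WITH_PORT and is skipped, which 
mirrors the behavior described above.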

This is all very fiddly and I'm not 100% sure that's the exact sequence, but 
there is definitely a change in behavior for when nodes switch to not having 
STATUS any more.

I've pushed a minimal change to onJoin to make it behave [on a 
branch|https://github.com/jonmeredith/cassandra/pull/new/marcuse/16759-fix-status-with-port],
with [CircleCI results 
here|https://app.circleci.com/pipelines/github/jonmeredith/cassandra?branch=marcuse%2F16759-fix-status-with-port].

A possibly cleaner alternative would be to sort with a custom key comparator, 
but I wasn't sure about performance during gossip storms.
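
For illustration only, a comparator-based ordering might look roughly like the 
sketch below. This is not the change on the branch: the State enum is again a 
hypothetical trimmed stand-in for ApplicationState, and STATUS_FIRST and 
processingOrder are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class StatusFirstSketch {
    // Hypothetical stand-in for ApplicationState; only the relative order matters.
    enum State { STATUS, SCHEMA, STATUS_WITH_PORT, SSTABLE_VERSIONS }

    // Status states sort ahead of everything else (false < true for Boolean);
    // ties fall back to enum declaration order.
    static final Comparator<State> STATUS_FIRST =
        Comparator.comparing((State s) -> s != State.STATUS && s != State.STATUS_WITH_PORT)
                  .thenComparing(Comparator.<State>naturalOrder());

    // Returns the keys in the order they would be processed.
    static List<State> processingOrder(Map<State, String> states) {
        List<State> keys = new ArrayList<>(states.keySet());
        keys.sort(STATUS_FIRST);
        return keys;
    }

    public static void main(String[] args) {
        Map<State, String> states = new EnumMap<>(State.class);
        states.put(State.SCHEMA, "placeholder");
        states.put(State.STATUS_WITH_PORT, "placeholder");
        states.put(State.SSTABLE_VERSIONS, "placeholder");
        System.out.println(processingOrder(states));
    }
}
```

The sketch allocates and sorts a list on every call, which is the kind of 
per-message overhead that could matter during a gossip storm.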

> Avoid memoizing the wrong min cluster version during upgrades
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-16759
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16759
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>            Priority: Normal
>             Fix For: 4.0-rc2
>
>
> CASSANDRA-16525 avoids trying to calculate the cluster min version if 
> gossiper is not enabled.
> This makes us memoize the wrong version for up to a minute causing us to send 
> 4.0-messages to 3.0 nodes, for example in 
> [ColumnFilter|https://github.com/apache/cassandra/blob/05beda90a9206db165a3997a736ecb06f8dc695e/src/java/org/apache/cassandra/db/filter/ColumnFilter.java#L210]
> This was discovered by python upgrade dtests, 
> [here|https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/993/workflows/2afef6f0-1356-41f6-93dc-5385ac19dca1/jobs/5977/tests#failed-test-0]
>  after reverting CASSANDRA-15899 in CASSANDRA-16735



