[ 
https://issues.apache.org/jira/browse/CASSANDRA-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784225#comment-17784225
 ] 

Paulo Motta commented on CASSANDRA-18968:
-----------------------------------------

bq. Whole "waiting for gossip to settle" machinery is ... not ideal. Yes, it 
works in most of the situations but there are edge cases when it does not, e.g. 
when there are large clusters, it may happen that it may evaluate that gossip 
is "settled" falsely because it took so much time to detect any changes that it 
was thinking it is settled.

I'm aware waitToSettle is not reliable. Nevertheless, I think a "best-effort" 
skip of this check when 3.X nodes are detected in gossip is valuable. It will 
mostly work as long as gossip with at least one node succeeded, since the 
starting node will then have the latest known versions of the other nodes. 

If the gossip information is absent and there are 3.X nodes present in the 
cluster, it's not a big deal: the check will simply run and the timeout 
warning above will be emitted unnecessarily.

We just don't want to skip this check when *all nodes are upgraded to 4.x*, 
but I don't think this would happen if waitToSettle fails.
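
To make the intended behavior concrete, here is a minimal sketch of the 
decision only. It is not the actual patch: the class name, the method names 
and the plain map of peer address to release version are all hypothetical 
stand-ins for whatever gossip state the real check would consult. A null 
version stands for "no gossip information for this peer".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Cassandra code: every name here is an assumption.
public class ConnectivityCheckSkipSketch {

    /** Returns the major version, or -1 when it is unknown or unparseable. */
    static int majorVersion(String releaseVersion) {
        if (releaseVersion == null || releaseVersion.isEmpty()) {
            return -1;
        }
        int dot = releaseVersion.indexOf('.');
        String major = dot < 0 ? releaseVersion : releaseVersion.substring(0, dot);
        try {
            return Integer.parseInt(major);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /**
     * Best-effort decision: skip only when at least one peer is positively
     * known to be older than 4.x. Peers with unknown versions never cause a
     * skip, so missing gossip information only means the check still runs.
     */
    static boolean shouldSkipConnectivityCheck(Map<String, String> peerVersions) {
        for (String version : peerVersions.values()) {
            int major = majorVersion(version);
            if (major >= 0 && major < 4) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> mixed = new HashMap<>();
        mixed.put("10.0.0.1", "3.11.16");
        mixed.put("10.0.0.2", "4.0.11");

        Map<String, String> noGossip = new HashMap<>();
        noGossip.put("10.0.0.1", null); // version not yet learned via gossip

        System.out.println("mixed cluster, skip check: " + shouldSkipConnectivityCheck(mixed));
        System.out.println("no gossip info, skip check: " + shouldSkipConnectivityCheck(noGossip));
    }
}
```

With this shape, a cluster where every known version is 4.x or later, or 
where no versions are known at all, never skips the check, which is the one 
case we want to protect.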

> StartupClusterConnectivityChecker fails on upgrade from 3.X
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-18968
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18968
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Startup and Shutdown
>            Reporter: Paulo Motta
>            Assignee: Isaac Reath
>            Priority: Normal
>              Labels: lhf
>             Fix For: 4.0.x, 4.1.x
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Starting up a new 4.X node on a 3.x cluster throws the following warning:
> {noformat}
> WARN  [main] 2023-10-27 15:58:22,234 
> StartupClusterConnectivityChecker.java:183 - Timed out after 10002 
> milliseconds, was waiting for remaining peers to connect: {dc1=[X.Y.Z.W, 
> A.B.C.D]}
> {noformat}
> I think this is because the PING messages used by the startup check are not 
> available on 3.X.
> To provide a smoother upgrade experience we should probably disable this 
> check on mixed-version clusters, or skip peers on versions < 4.x when doing 
> the connectivity check.



