Hi all,
I’m looking for some guidance on a Cassandra 5.0.x startup issue we’re
seeing and wanted to ask the user list if this behavior is expected or
already known.
We’re running a homogeneous 5.0.4 (also tested with 5.0.6) cluster with a
relatively large number of keyspaces, tables, and SAI indexes. On initial
cluster creation and provisioning of multiple keyspaces, everything
operates as expected. However, after stopping the cluster and restarting
all nodes, only the first node comes up successfully. Subsequent nodes fail
during startup with an assertion in the gossip thread while serializing the
SAI index status metadata.
ERROR [GossipStage:1] 2025-12-22 17:20:10,365 JVMStabilityInspector.java:70 - Exception in thread Thread[GossipStage:1,5,GossipStage]
java.lang.RuntimeException: java.lang.AssertionError
    at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:108)
    at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
    at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
    at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.AssertionError: null
    at org.apache.cassandra.db.TypeSizes.sizeof(TypeSizes.java:44)
    at org.apache.cassandra.gms.VersionedValue$VersionedValueSerializer.serializedSize(VersionedValue.java:381)
    at org.apache.cassandra.gms.VersionedValue$VersionedValueSerializer.serializedSize(VersionedValue.java:359)
    at org.apache.cassandra.gms.EndpointStateSerializer.serializedSize(EndpointState.java:344)
    at org.apache.cassandra.gms.EndpointStateSerializer.serializedSize(EndpointState.java:300)
    at org.apache.cassandra.gms.GossipDigestAckSerializer.serializedSize(GossipDigestAck.java:96)
    at org.apache.cassandra.gms.GossipDigestAckSerializer.serializedSize(GossipDigestAck.java:61)
    at org.apache.cassandra.net.Message$Serializer.payloadSize(Message.java:1088)
    at org.apache.cassandra.net.Message.payloadSize(Message.java:1131)
    at org.apache.cassandra.net.Message$Serializer.serializedSize(Message.java:769)
This looks like the same issue reported in this DBA Stack Exchange post
<https://dba.stackexchange.com/questions/343389/schema-changes-on-5-0-result-in-gossip-failures-o-a-c-db-db-typesizes-sizeof>
(CASSANDRA-20058 <https://issues.apache.org/jira/browse/CASSANDRA-20058>),
for which a fix was made.
However, the fix described in that post and ticket, which shipped in
Cassandra 5.0.3, appears to be incomplete. From what I can tell, it only
takes effect once the gossip state of the cluster has converged, but the
error seems to occur before that happens. At the point
of the error, the minimum cluster version appears to be treated as unknown,
which causes Cassandra to fall back to the legacy (pre-5.0.3) index-status
serialization format. In our case, that legacy representation becomes large
enough to trigger the assertion, preventing the node from joining. Because
the node never joins, gossip never converges, and the newer 5.0.3+
compressed format is never enabled.
This effectively leaves the cluster stuck in a startup loop where only the
first node can come up.
As a sanity check, I locally modified the version-gating logic in
*IndexStatusManager.java* for the index-status serialization to always use
the newer compact format during startup, and with that change the cluster
started successfully.
    private static boolean shouldWriteLegacyStatusFormat(CassandraVersion minVersion)
    {
        // Original:
        // return minVersion == null ||
        //        (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
        return false;
    }
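
To show the gate in isolation, here is a minimal standalone sketch. It is not Cassandra code: the Version and GateSketch classes are hypothetical stand-ins for CassandraVersion and IndexStatusManager, and only the boolean expression in originalGate is taken from the commented-out line above. The nullTolerantGate variant is my own assumption about a narrower change. I have not tested it in a real cluster, and it may be unsafe during a rolling upgrade from a pre-5.0.3 release, where an unknown minimum version could legitimately hide older nodes.

```java
// Standalone sketch. Version and GateSketch are hypothetical stand-ins,
// not the real CassandraVersion / IndexStatusManager classes.
final class Version
{
    final int major;
    final int minor;
    final int patch;

    Version(int major, int minor, int patch)
    {
        this.major = major;
        this.minor = minor;
        this.patch = patch;
    }
}

final class GateSketch
{
    // Same boolean expression as the original gate quoted above:
    // an unknown (null) minimum version selects the legacy format.
    static boolean originalGate(Version minVersion)
    {
        return minVersion == null
               || (minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3);
    }

    // Hypothetical variant (assumption, untested in a cluster): only a
    // known pre-5.0.3 minimum version selects the legacy format.
    static boolean nullTolerantGate(Version minVersion)
    {
        return minVersion != null
               && minVersion.major == 5 && minVersion.minor == 0 && minVersion.patch < 3;
    }

    public static void main(String[] args)
    {
        Version unknown = null;
        Version v502 = new Version(5, 0, 2);
        Version v504 = new Version(5, 0, 4);

        System.out.println("unknown: " + originalGate(unknown) + " / " + nullTolerantGate(unknown));
        System.out.println("5.0.2:   " + originalGate(v502) + " / " + nullTolerantGate(v502));
        System.out.println("5.0.4:   " + originalGate(v504) + " / " + nullTolerantGate(v504));
    }
}
```

The first line of output is the case I think we are hitting: with the minimum version unknown, the original expression selects the legacy format even though every node is on 5.0.4.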
This makes me suspect the issue is related to bootstrap ordering or version
detection rather than data corruption or configuration.
I posted a more detailed write-up
<https://dba.stackexchange.com/questions/349488/cassandra-5-0-4-startup-deadlock-gossip-uses-pre-5-0-3-encoding-due-to-version>
(with stack traces and code references) on DBA Stack Exchange a few weeks
ago but haven’t received any feedback yet, so I wanted to ask here:
- Is this startup/version-gating behavior expected in 5.0.x?
- Is this a known limitation or bug?
- Is there a recommended way to bootstrap or restart clusters in this
  state?
Any insight would be appreciated. Happy to provide logs or additional
details if helpful.
Thanks,
Nicholas