This deserves a JIRA
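
For anyone trying to make sense of the stack trace below: the exception is thrown while a node serializes a read response, at the point where the set of columns actually present in a row is checked against the "superset" of columns recorded in the serialization header. With the schema disagreement of a mixed-version cluster, a node apparently ends up with a header superset that is missing columns the row really carries. Here is a minimal, purely illustrative Java sketch of the shape of that failing check (names and logic are mine, not Cassandra's actual implementation):

    import java.util.List;

    public class SubsetCheckSketch {
        // Hypothetical stand-in for Columns$Serializer.encodeBitmap: each row
        // column maps to a bit position within the superset taken from the
        // serialization header. A column missing from the superset cannot be
        // encoded, so the check fails.
        static long encodeBitmap(List<String> rowColumns, List<String> superset) {
            long bitmap = 0;
            for (String column : rowColumns) {
                int index = superset.indexOf(column);
                if (index < 0) {
                    // Mirrors the message in the log below.
                    throw new IllegalStateException(
                            rowColumns + " is not a subset of " + superset);
                }
                bitmap |= 1L << index;
            }
            return bitmap;
        }

        public static void main(String[] args) {
            // The column sets from the reported error.
            List<String> superset = List.of("channel_data_id");
            List<String> row = List.of("channel_data_id", "control_system_type",
                    "server_id", "decimation_levels");
            encodeBitmap(row, superset); // throws IllegalStateException
        }
    }

Whoever files the ticket: the interesting question is why the header's column superset can lag behind the row's actual columns during a 3.11 -> 4.1 rolling upgrade even though the table schema never changed.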


On Tue, Dec 12, 2023 at 8:30 AM Sebastian Marsching <sebast...@marsching.com>
wrote:

> Hi,
>
> while upgrading our production cluster from C* 3.11.14 to 4.1.3, we
> experienced an issue where some SELECT queries failed because supposedly
> no replica was available. The system logs on the C* nodes were full of
> messages like the following one:
>
> ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - Exception in thread Thread[ReadStage-1,5,SharedPool]
> java.lang.IllegalStateException: [channel_data_id, control_system_type, server_id, decimation_levels] is not a subset of [channel_data_id]
>         at org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
>         at org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
>         at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
>         at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>         at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
>         at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
>         at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>         at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
>         at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
>         at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>         at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>         at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.base/java.lang.Thread.run(Thread.java:829)
>
> This problem occurred only while the cluster had a mix of 3.11.14 and
> 4.1.3 nodes. As soon as the last node was upgraded, the problem disappeared
> immediately, so I suspect it was somehow caused by the unavoidable schema
> inconsistency during the upgrade.
>
> I just wanted to give everyone who hasn’t upgraded yet a heads-up, so that
> they are aware this problem might exist. Interestingly, not all queries
> against the affected table seemed to trigger the problem. As far as I am
> aware, no schema changes have ever been made to that table, so I am pretty
> certain the schema inconsistencies were purely related to the upgrade
> process.
>
> We hadn’t noticed this problem when testing the upgrade on our test
> cluster, because there we completed the upgrade first and only then ran the
> test workload. So, if you are worried you might be affected as well, you
> might want to run your workload on the test cluster while it is still
> running mixed versions.
>
> I did not investigate the cause further because simply completing the
> upgrade process seemed like the quickest option to get the cluster fully
> operational again.
>
> Cheers,
> Sebastian
>
>
