If an upgrade involves changing the schema, wouldn't backwards compatibility be out of the question?
On Tue, Dec 12, 2023 at 10:36 AM Jeff Jirsa <jji...@gmail.com> wrote:

> This deserves a JIRA
>
> On Tue, Dec 12, 2023 at 8:30 AM Sebastian Marsching <sebast...@marsching.com> wrote:
>
>> Hi,
>>
>> while upgrading our production cluster from C* 3.11.14 to 4.1.3, we
>> experienced the issue that some SELECT queries failed due to supposedly
>> no replica being available. The system logs on the C* nodes were full
>> of messages like the following one:
>>
>> ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - Exception in thread Thread[ReadStage-1,5,SharedPool]
>> java.lang.IllegalStateException: [channel_data_id, control_system_type, server_id, decimation_levels] is not a subset of [channel_data_id]
>>     at org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
>>     at org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
>>     at org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
>>     at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
>>     at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
>>     at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
>>     at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
>>     at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
>>     at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
>>     at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
>>     at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>>     at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
>>     at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
>>     at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>>     at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
>>     at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
>>     at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>>     at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>>     at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>>     at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>>     at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
>>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>     at java.base/java.lang.Thread.run(Thread.java:829)
>>
>> This problem only persisted while the cluster had a mix of 3.11.14 and
>> 4.1.3 nodes. As soon as the last node was upgraded, the problem
>> disappeared immediately, so I suspect that it was somehow caused by the
>> unavoidable schema inconsistency during the upgrade.
>>
>> I just wanted to give everyone who hasn't upgraded yet a heads-up, so
>> that they are aware that this problem might exist. Interestingly, it
>> seems that not all queries against the affected table triggered the
>> error. As far as I am aware, no schema changes have ever been made to
>> the affected table, so I am pretty certain that the schema
>> inconsistencies were purely related to the upgrade process.
>>
>> We hadn't noticed this problem when testing the upgrade on our test
>> cluster because there we completed the upgrade before running the test
>> workload. So, if you are worried you might be affected by this problem
>> as well, you might want to run your workload on the test cluster while
>> it is still running mixed versions.
>>
>> I did not investigate the cause further because simply completing the
>> upgrade process seemed like the quickest option to get the cluster
>> fully operational again.
>>
>> Cheers,
>> Sebastian
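For readers puzzling over the exception itself: the failing frame, Columns$Serializer.encodeBitmap, serializes a row's column set as a bitmap relative to a superset of columns that both ends of the connection are expected to agree on. If one node's view of the table schema is stale (as can happen during the mixed-version window), the row may contain columns the superset does not know, and the encoding fails. The following is an illustrative Python sketch of that failure mode, not Cassandra's actual implementation; the absent-bit convention and the function shape are assumptions made for the example.

```python
def encode_bitmap(row_columns, superset):
    """Encode row_columns as a bitmap of positions within superset.

    Sketch of the idea behind Columns$Serializer.encodeBitmap (the
    bit convention here is an assumption): each set bit marks a
    superset column that is absent from the row. A column present in
    the row but missing from the superset cannot be represented at
    all, which mirrors the "is not a subset of" error from the logs.
    """
    extra = [c for c in row_columns if c not in superset]
    if extra:
        raise ValueError(f"{sorted(row_columns)} is not a subset of {sorted(superset)}")
    bitmap = 0
    for i, col in enumerate(superset):
        if col not in row_columns:
            bitmap |= 1 << i
    return bitmap

# Consistent schema view: both sides know all four columns, so a row
# with any subset of them encodes fine.
superset = ["channel_data_id", "control_system_type", "server_id",
            "decimation_levels"]
print(encode_bitmap({"channel_data_id", "server_id"}, superset))  # prints 10

# Stale view that only knows one column (as during the mixed-version
# window): the same kind of row cannot be encoded.
try:
    encode_bitmap({"channel_data_id", "control_system_type"},
                  ["channel_data_id"])
except ValueError as e:
    print("failed:", e)
```

This also suggests why only some queries failed: rows whose columns happened to fall within the stale superset would still encode without error.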