If an upgrade involves changing the schema, I think backwards
compatibility would be out of the question?

On Tue, Dec 12, 2023 at 10:36 AM Jeff Jirsa <jji...@gmail.com> wrote:

> This deserves a JIRA
>
>
>
> On Tue, Dec 12, 2023 at 8:30 AM Sebastian Marsching <
> sebast...@marsching.com> wrote:
>
>> Hi,
>>
>> while upgrading our production cluster from C* 3.11.14 to 4.1.3, we
>> experienced an issue where some SELECT queries failed because supposedly no
>> replica was available. The system logs on the C* nodes were full of
>> messages like the following one:
>>
>> ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - Exception in thread Thread[ReadStage-1,5,SharedPool]
>> java.lang.IllegalStateException: [channel_data_id, control_system_type, server_id, decimation_levels] is not a subset of [channel_data_id]
>>         at org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
>>         at org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
>>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
>>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
>>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
>>         at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
>>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
>>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
>>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
>>         at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
>>         at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>>         at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
>>         at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
>>         at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>>         at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
>>         at org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
>>         at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
>>         at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
>>         at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
>>         at org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
>>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>         at java.base/java.lang.Thread.run(Thread.java:829)
>>
>> This problem occurred only while the cluster had a mix of 3.11.14 and
>> 4.1.3 nodes. As soon as the last node was upgraded, the problem disappeared
>> immediately, so I suspect that it was somehow caused by the unavoidable
>> schema inconsistency during the upgrade.
>>
>> I just wanted to give everyone who hasn’t upgraded yet a heads-up, so
>> that they are aware that this problem might exist. Interestingly, not all
>> queries involving the affected table seemed to be affected by this problem.
>> As far as I am aware, no schema changes have ever been made to the affected
>> table, so I am fairly certain that the schema inconsistencies were purely
>> related to the upgrade process.
>>
>> We hadn’t noticed this problem when testing the upgrade on our test
>> cluster, because there we completed the upgrade first and only then ran the
>> test workload. So, if you are worried that you might be affected by this
>> problem as well, you may want to run your workload on the test cluster
>> while it has mixed versions.
>>
>> I did not investigate the cause further because simply completing the
>> upgrade process seemed like the quickest option to get the cluster fully
>> operational again.
>>
>> Cheers,
>> Sebastian
>>
>>
