I also wonder if this could be related to
https://issues.apache.org/jira/browse/CASSANDRA-21260 as well? It is
produced in a similar way, a schema disagreement happens, then, later on,
resulting SSTables have bad metadata in them. If the FastBuilder is reused
without being cleaned up, I could see how this bad SSTable metadata could
end up being created.

On Wed, Apr 8, 2026 at 5:57 PM Andrés Beck-Ruiz <[email protected]>
wrote:

> Hi all,
>
> I’d like to discuss my investigation into CASSANDRA-21216
> <https://issues.apache.org/jira/browse/CASSANDRA-21216>, a bug that my
> team encountered during schema modification on a very wide table (~4200
> columns) in a 4.1 Cassandra cluster that was also serving reads and writes.
>
> *Events required to reproduce:*
>
> We have found that this bug triggers in the following scenario:
>
> - There is a schema disagreement between a coordinator node and a replica
> node when a new column is added:
>       - The coordinator node is performing a read request for the newly
> added column
>       - The replica node does not have the new column yet
> - The size of the internode READ_REQ is greater than ~ 65 KB
> - The thread that fails to deserialize the READ_REQ is later reused by the
> MutationStage
> - The internode READ_REQ contains more than 31 columns (populates
> savedBuffer and savedNextKey during deserialization)
> - There are no more than 31 rows in-memory for a partition being mutated
> (Row BTree has one leaf node)
> - No more than 31 rows are being added in the mutation (PartitionUpdate
> BTree has one leaf node)
>
> *Why this bug surfaces: *
>
> This bug stems from the schema disagreement described above while adding
> new columns to a table, which causes internode message deserialization to
> fail. When a replica node does not find a column included in the READ_REQ
> message, it throws this exception
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L489>,
> causing the original request to fail. As a result, the BTree FastBuilder
> build function
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L493>
> is never reached. On the happy path, the build function calls
> completeBuild
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L2420>,
> which consumes savedBuffer and savedNextKey. These FastBuilder objects are
> used to create multi-level BTrees and are populated when a tree under
> construction exceeds 31 elements.
>
> When an exception is thrown, the build method is not called. As a result,
> these fields retain their stale ColumnMetadata values. Although the
> FastBuilder does call a reset
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3325>
> function when it closes the object, the function does not clear out the
> savedBuffer or savedNextKey, unlike the AbstractUpdater reset function
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3371-L3372>.
> It is important that these objects are cleared out after use given that
> they are shared locally per thread
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L2313>.
>
>
> When Cassandra deserializes large messages, deserialization is dispatched
> to an SEPWorker thread pool
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/net/InboundMessageHandler.java#L200>instead
> of being handled in the Netty event loop. For our case, we observed that
> internode READ_REQ messages after expanding SELECT * queries on the 4200
> column table were passing the 65 KB threshold. If the same thread that
> handled a failed message deserialization then handles a mutation request,
> it might use the same FastBuilder with stale ColumnMetadata objects to merge
> the existing BTree of Rows in a partition with the partition update
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/partitions/AtomicBTreePartition.java#L148>.
>
>
> If the existing BTree of Rows and partition update are leaf nodes, the
> BTree update function uses the FastBuilder instead of the Updater
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L391-L393>
> to rebuild the tree. The FastBuilder then empties out the ColumnMetadata
> objects in the savedBuffer and savedNextKey into the BTree that contains
> Row objects. The ClassCastException surfaces when we read or write from the
> partition in memory with the corrupted BTree.
>
> I found that the same bug could resurface in trunk, given that the
> deserialization of large messages is still handled on the SEPWorker thread
> pool, and the FastBuilder reset function has not been updated to clear
> savedBuffer and savedNextKey. However, I’m not sure whether TCM will reduce
> or eliminate the possibility of schema mismatches across a cluster.
>
> More information about how we verified this theory, as well as the full
> stack traces, can be found on the bug ticket.
>
> *Proposed fixes: *
>
> A possible fix to this issue that I’ve added here
> <https://github.com/andresbeckruiz/cassandra/commit/14dbac67bee3917ce71cd18dc48ef19f5f0cf649>
> and verified builds successfully on Cassandra 4.1 is to clear the
> savedBuffer and savedNextKey in the reset function, as is done in the
> AbstractUpdater; I also noticed that AbstractUpdater does not set
> savedNextKey to null, so this could be worth adding as well.
>
> Another possible fix could be to handle both small and large messages in
> the Netty event loop. If the FastBuilder objects are not cleared out, this
> will guarantee that the FastBuilder is not reused by SEPWorker threads,
> which could potentially corrupt Row BTrees during a MutationStage. However,
> this could possibly block the event loop thread.
>
> Given that this patch affects critical paths in Cassandra, I would
> appreciate any feedback on any potential side-effects I have not
> considered. If the approach to fixing this bug seems reasonable, I can also
> add a DTest to mimic this edge case.
>
> Best,
> Andrés
>

Reply via email to