I wonder whether this could also be related to https://issues.apache.org/jira/browse/CASSANDRA-21260? It is produced in a similar way: a schema disagreement happens and, later on, the resulting SSTables contain bad metadata. If the FastBuilder is reused without being cleaned up, I could see how this bad SSTable metadata could end up being created.

On Wed, Apr 8, 2026 at 5:57 PM Andrés Beck-Ruiz <[email protected]> wrote:
> Hi all,
>
> I’d like to discuss my investigation into CASSANDRA-21216
> <https://issues.apache.org/jira/browse/CASSANDRA-21216>, a bug that my
> team encountered during schema modification on a very wide table (~4200
> columns) in a 4.1 Cassandra cluster that was also serving reads and writes.
>
> *Events required to reproduce:*
>
> We have found that this bug triggers in the following scenario:
>
> - There is a schema disagreement between a coordinator node and a replica
>   node when a new column is added:
>    - The coordinator node is performing a read request for the newly
>      added column
>    - The replica node does not have the new column yet
> - The size of the internode READ_REQ is greater than ~65 KB
> - The thread that fails to deserialize the READ_REQ is later reused by the
>   MutationStage
> - The internode READ_REQ contains more than 31 columns (populates
>   savedBuffer and savedNextKey during deserialization)
> - There are no more than 31 rows in-memory for a partition being mutated
>   (Row BTree has one leaf node)
> - No more than 31 rows are being added in the mutation (PartitionUpdate
>   BTree has one leaf node)
>
> *Why this bug surfaces:*
>
> This bug stems from the schema disagreement described above while adding
> new columns to a table, which causes internode message deserialization to
> fail. When a replica node does not find a column included in the READ_REQ
> message, it throws this exception
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L489>,
> causing the original request to fail. As a result, the BTree FastBuilder
> build function
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L493>
> is never reached.
> On the happy path, the build function calls
> completeBuild
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L2420>,
> which consumes savedBuffer and savedNextKey. These FastBuilder objects are
> used to create multi-level BTrees and are populated when a tree under
> construction exceeds 31 elements.
>
> When an exception is thrown, the build method is not called. As a result,
> these fields retain their stale ColumnMetadata values. Although the
> FastBuilder does call a reset
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3325>
> function when it closes the object, the function does not clear out the
> savedBuffer or savedNextKey, unlike the AbstractUpdater reset function
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3371-L3372>.
> It is important that these objects are cleared out after use given that
> they are shared locally per thread
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L2313>.
>
> When Cassandra deserializes large messages, deserialization is dispatched
> to an SEPWorker thread pool
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/net/InboundMessageHandler.java#L200>
> instead of being handled in the Netty event loop. For our case, we observed that
> internode READ_REQ messages after expanding SELECT * queries on the 4200
> column table were passing the 65 KB threshold. If the same thread that
> handled a failed message deserialization then handles a mutation request,
> it might use the same FastBuilder with stale ColumnMetadata objects to merge
> the existing BTree of Rows in a partition with the partition update
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/partitions/AtomicBTreePartition.java#L148>.
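
To check my understanding of the mechanism described above, here is a minimal sketch of how a per-thread builder can leak state when build is skipped. Nothing in it is Cassandra code: StaleBuilderSketch, PooledBuilder, failedRead and smallMutation are hypothetical names, strings stand in for ColumnMetadata, Integers stand in for Rows, and the only detail taken from the thread is the 31-element leaf limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a per-thread pooled builder. This is NOT Cassandra's
// FastBuilder; it only imitates "overflow state that close() forgets to clear".
final class StaleBuilderSketch {
    static final int LEAF_CAPACITY = 31; // leaf limit quoted in the thread

    static final class PooledBuilder implements AutoCloseable {
        private static final ThreadLocal<PooledBuilder> POOL =
            ThreadLocal.withInitial(PooledBuilder::new);

        private final List<Object> leaf = new ArrayList<>();  // cleared by close()
        private final List<Object> saved = new ArrayList<>(); // drained only by build()

        static PooledBuilder borrow() { return POOL.get(); }

        void add(Object element) {
            // Once the leaf is full, spill it into the saved overflow state.
            if (leaf.size() == LEAF_CAPACITY) {
                saved.addAll(leaf);
                leaf.clear();
            }
            leaf.add(element);
        }

        Object[] build() {
            List<Object> out = new ArrayList<>(saved);
            out.addAll(leaf);
            saved.clear();
            leaf.clear();
            return out.toArray();
        }

        // Incomplete reset: the overflow state survives for the next borrower.
        @Override public void close() { leaf.clear(); }
    }

    // A read whose deserialization fails before build() is reached.
    static void failedRead(int columns) {
        try (PooledBuilder b = PooledBuilder.borrow()) {
            for (int i = 0; i < columns; i++) b.add("column-" + i);
            throw new IllegalStateException("unknown column");
        } catch (IllegalStateException expected) {
            // The request fails, but the thread and its builder live on.
        }
    }

    // A later, small mutation handled by the same thread.
    static Object[] smallMutation(int rows) {
        try (PooledBuilder b = PooledBuilder.borrow()) {
            for (int i = 0; i < rows; i++) b.add(Integer.valueOf(i));
            return b.build();
        }
    }

    public static void main(String[] args) {
        failedRead(40);
        Object[] tree = smallMutation(2);
        System.out.println(tree.length); // 33: 31 stale "columns" + 2 "rows"
        try {
            Integer row = (Integer) tree[0]; // a reader expecting rows
            System.out.println(row);
        } catch (ClassCastException e) {
            System.out.println("stale column where a row was expected");
        }
    }
}
```

With fewer than 32 elements in the failed read, nothing spills into the saved state and the later mutation comes out clean, which would match the "more than 31 columns" precondition in the list of events.
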
>
> If the existing BTree of Rows and partition update are leaf nodes, the
> BTree update function uses the FastBuilder instead of the Updater
> <https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L391-L393>
> to rebuild the tree. The FastBuilder then empties out the ColumnMetadata
> objects in the savedBuffer and savedNextKey into the BTree that contains
> Row objects. The ClassCastException surfaces when we read or write from the
> partition in memory with the corrupted BTree.
>
> I found that the same bug could resurface in trunk, given that the
> deserialization of large messages is still handled on the SEPWorker thread
> pool, and the FastBuilder reset function has not been updated to clear
> savedBuffer and savedNextKey. However, I’m not sure whether TCM will reduce
> or eliminate the possibility of schema mismatches across a cluster.
>
> More information about how we verified this theory, as well as the full
> stack traces, can be found on the bug ticket.
>
> *Proposed fixes:*
>
> A possible fix to this issue that I’ve added here
> <https://github.com/andresbeckruiz/cassandra/commit/14dbac67bee3917ce71cd18dc48ef19f5f0cf649>
> and verified builds successfully on Cassandra 4.1 is to clear the
> savedBuffer and savedNextKey in the reset function, as is done in the
> AbstractUpdater; I also noticed that AbstractUpdater does not set
> savedNextKey to null, so this could be worth adding as well.
>
> Another possible fix could be to handle both small and large messages in
> the Netty event loop. If the FastBuilder objects are not cleared out, this
> will guarantee that the FastBuilder is not reused by SEPWorker threads,
> which could potentially corrupt Row BTrees during a MutationStage. However,
> this could possibly block the event loop thread.
>
> Given that this patch affects critical paths in Cassandra, I would
> appreciate any feedback on any potential side-effects I have not
> considered.
> If the approach to fixing this bug seems reasonable, I can also
> add a DTest to mimic this edge case.
>
> Best,
> Andrés
>
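
For what it's worth, here is a minimal sketch of the first proposed fix as I read it: a reset that clears everything carried between uses, so that a skipped build cannot leak into the next caller. This is not the code from the linked commit. ResetFixSketch, Builder and isClean are hypothetical, and only the names savedBuffer and savedNextKey come from the thread; their layout and meaning here are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical miniature of a reusable builder. Field names savedBuffer and
// savedNextKey are borrowed from the thread; their layout here is an assumption.
final class ResetFixSketch {
    static final int LEAF = 31; // leaf limit quoted in the thread

    static final class Builder {
        private final Object[] buffer = new Object[LEAF];
        private int count;
        private Object[] savedBuffer = new Object[0]; // overflow from full leaves
        private Object savedNextKey;                  // key that triggered the last overflow

        void add(Object key) {
            if (count == LEAF) {
                // Leaf is full: append it to the saved overflow state.
                int old = savedBuffer.length;
                savedBuffer = Arrays.copyOf(savedBuffer, old + count);
                System.arraycopy(buffer, 0, savedBuffer, old, count);
                savedNextKey = key;
                count = 0;
            }
            buffer[count++] = key;
        }

        Object[] build() {
            Object[] out = Arrays.copyOf(savedBuffer, savedBuffer.length + count);
            System.arraycopy(buffer, 0, out, savedBuffer.length, count);
            reset();
            return out;
        }

        // Clears the leaf AND the saved overflow state.
        void reset() {
            Arrays.fill(buffer, null);
            count = 0;
            savedBuffer = new Object[0];
            savedNextKey = null;
        }

        boolean isClean() {
            return count == 0 && savedBuffer.length == 0 && savedNextKey == null;
        }
    }

    public static void main(String[] args) {
        Builder b = new Builder();
        try {
            for (int i = 0; i < 40; i++) b.add("column-" + i);
            throw new IllegalStateException("unknown column"); // build() never runs
        } catch (IllegalStateException e) {
            // The request fails; the builder is about to be reused.
        } finally {
            b.reset();
        }
        b.add(1);
        b.add(2);
        Object[] rows = b.build();
        System.out.println(rows.length + " elements, clean=" + b.isClean());
        // prints "2 elements, clean=true"
    }
}
```

The point of routing both the success path and the failure path through the same reset is that the builder's cleanliness no longer depends on build having been reached.
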
