Hi all,
I’d like to discuss my investigation into CASSANDRA-21216
<https://issues.apache.org/jira/browse/CASSANDRA-21216>, a bug that my team
encountered during schema modification on a very wide table (~4200 columns)
in a 4.1 Cassandra cluster that was also serving reads and writes.
*Events required to reproduce:*
We have found that this bug triggers in the following scenario:
- There is a schema disagreement between a coordinator node and a replica
node when a new column is added:
- The coordinator node is performing a read request for the newly
added column
- The replica node does not have the new column yet
- The size of the internode READ_REQ is greater than ~ 65 KB
- The thread that fails to deserialize the READ_REQ is later reused by the
MutationStage
- The internode READ_REQ contains more than 31 columns (populates
savedBuffer and savedNextKey during deserialization)
- There are no more than 31 rows in-memory for a partition being mutated
(Row BTree has one leaf node)
- No more than 31 rows are being added in the mutation (PartitionUpdate
BTree has one leaf node)
*Why this bug surfaces: *
This bug stems from the schema disagreement described above while adding
new columns to a table, which causes internode message deserialization to
fail. When a replica node does not find a column included in the READ_REQ
message, it throws this exception
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L489>,
causing the original request to fail. As a result, the BTree FastBuilder
build function
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/Columns.java#L493>
is never reached. On the happy path, the build function calls completeBuild
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L2420>,
which consumes savedBuffer and savedNextKey. These FastBuilder objects are
used to create multi-level BTrees and are populated when a tree under
construction exceeds 31 elements.
When an exception is thrown, the build method is not called. As a result,
these fields retain their stale ColumnMetadata values. Although the
FastBuilder does call a reset
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3325>
function when it closes the object, the function does not clear out the
savedBuffer or savedNextKey, unlike the AbstractUpdater reset function
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L3371-L3372>.
It is important that these objects are cleared out after use given that
they are shared locally per thread
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L2313>.
When Cassandra deserializes large messages, deserialization is dispatched
to an SEPWorker thread pool
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/net/InboundMessageHandler.java#L200>instead
of being handled in the Netty event loop. For our case, we observed that
internode READ_REQ messages after expanding SELECT * queries on the 4200
column table were passing the 65 KB threshold. If the same thread that
handled a failed message deserialization then handles a mutation request,
it might use the same FastBuilder with stale ColumnMetadata objects to merge
the existing BTree of Rows in a partition with the partition update
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/partitions/AtomicBTreePartition.java#L148>.
If the existing BTree of Rows and partition update are leaf nodes, the
BTree update function uses the FastBuilder instead of the Updater
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/utils/btree/BTree.java#L391-L393>
to rebuild the tree. The FastBuilder then empties out the ColumnMetadata
objects in the savedBuffer and savedNextKey into the BTree that contains
Row objects. The ClassCastException surfaces when we read or write from the
partition in memory with the corrupted BTree.
I found that the same bug could resurface in trunk, given that the
deserialization of large messages is still handled on the SEPWorker thread
pool, and the FastBuilder reset function has not been updated to clear
savedBuffer and savedNextKey. However, I’m not sure whether TCM will reduce
or eliminate the possibility of schema mismatches across a cluster.
More information about how we verified this theory, as well as the full
stack traces, can be found on the bug ticket.
*Proposed fixes: *
A possible fix to this issue that I’ve added here
<https://github.com/andresbeckruiz/cassandra/commit/14dbac67bee3917ce71cd18dc48ef19f5f0cf649>
and verified builds successfully on Cassandra 4.1 is to clear the
savedBuffer and savedNextKey in the reset function, as is done in the
AbstractUpdater; I also noticed that AbstractUpdater does not set
savedNextKey to null, so this could be worth adding as well.
Another possible fix could be to handle both small and large messages in
the Netty event loop. If the FastBuilder objects are not cleared out, this
will guarantee that the FastBuilder is not reused by SEPWorker threads,
which could potentially corrupt Row BTrees during a MutationStage. However,
this could possibly block the event loop thread.
Given that this patch affects critical paths in Cassandra, I would
appreciate any feedback on any potential side-effects I have not
considered. If the approach to fixing this bug seems reasonable, I can also
add a DTest to mimic this edge case.
Best,
Andrés