Hi Andrés, Isaac,

Thank you for the detailed write-up, Andrés. Your investigation into the
FastBuilder.reset() bug was the starting point for our own analysis, which
led us to identify an additional impact beyond the ClassCastException.

Isaac — yes, we believe CASSANDRA-21260 and CASSANDRA-21216 are directly
related. CASSANDRA-21260 was filed by our team to track the SSTable header
contamination we've been seeing. Based on Andrés' findings about the stale
savedBuffer/savedNextKey in FastBuilder.reset(), we investigated whether
the same bug could explain our corrupted SSTable headers — and we believe
it does.

What we observed (CASSANDRA-21260)

We have been seeing corrupted SSTable headers where an SSTable for one
table contains column metadata belonging to a completely different table.
When we deserialize the on-disk SerializationHeader.Component and compare
it against the table's TableMetadata, we find column names that are not
part of the table's schema — they belong to another table in the same
keyspace. In one case, a table with ~2000 columns had 29 foreign columns
from a ~150-column table embedded in its SSTable header.

These corrupted SSTables are otherwise structurally valid — they are
accepted into the live set and only detected by explicit header validation
we added. The foreign columns do not correspond to dropped columns or any
prior schema version of the affected table. As noted in CASSANDRA-21260,
once a corrupted SSTable exists, compaction merges headers blindly, so the
contamination propagates to new SSTables indefinitely.
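
To make the check concrete, here is a minimal sketch of the comparison described above. It is not our actual validation code: the class and method names (HeaderCheck, foreignColumns) are hypothetical, and it works on plain column-name strings rather than on SerializationHeader.Component and TableMetadata.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Cassandra.
final class HeaderCheck {
    /**
     * Returns the header column names that are neither in the table's
     * current schema nor among its dropped columns, in header order.
     */
    static Set<String> foreignColumns(List<String> headerColumns,
                                      Set<String> schemaColumns,
                                      Set<String> droppedColumns) {
        Set<String> foreign = new LinkedHashSet<>(headerColumns);
        foreign.removeAll(schemaColumns);   // columns the table really has
        foreign.removeAll(droppedColumns);  // columns it legitimately used to have
        return foreign;
    }

    public static void main(String[] args) {
        Set<String> schema = Set.of("id", "name", "value");
        Set<String> dropped = Set.of("old_col");
        List<String> header = List.of("id", "name", "old_col", "other_table_col");
        System.out.println(foreignColumns(header, schema, dropped)); // prints [other_table_col]
    }
}
```

A non-empty result is what we treat as a contaminated header.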

How the FastBuilder bug (CASSANDRA-21216) causes this

Building on Andrés' analysis of the FastBuilder state leakage, we traced a
path from the stale savedBuffer/savedNextKey all the way to on-disk SSTable
header contamination:

1. A schema disagreement (e.g. during column addition) causes an internode
READ_REQ deserialization to fail on a replica.
Columns.Serializer.deserialize() uses a thread-local pooled FastBuilder,
and if the table has more than 31 columns, the overflow populates
savedBuffer and savedNextKey before the exception. Since reset() does not
clear these fields, the FastBuilder is returned to the pool with stale
ColumnMetadata from the source table.

2. When a deletion-only mutation (partition delete or range tombstone) for
a different table is later deserialized on the same thread,
Columns.Serializer.deserialize() acquires the poisoned FastBuilder. The
stale ColumnMetadata from the source table are drained into the victim
table's Columns via propagateOverflow(). Because the mutation contains only
a deletion — no rows, no static row — no per-row column-subset
deserialization occurs, so the contaminated Columns survives without error.
(Mutations with actual row data would fail due to subset encoding
mismatches, which is why only deletion-only mutations propagate the
contamination silently.)

3. When the contaminated PartitionUpdate is applied to the memtable,
ColumnsCollector.update() records the foreign ColumnMetadata. At flush,
BigTableWriter.openFinal() writes the SSTable using the in-memory
SerializationHeader directly, bypassing toHeader() validation. The result
is an on-disk SSTable whose header contains columns from the wrong table.
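
The leak in the first two steps can be illustrated with a toy model. This is a deliberately simplified sketch, not Cassandra's FastBuilder: ToyBuilder and its overflow handling are assumptions made for illustration, savedNextKey is not modelled, and only the 31-entry threshold and the "reset() leaves the overflow behind" behaviour are taken from this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only; NOT Cassandra's BTree.FastBuilder.
final class FastBuilderLeakSketch {

    static final class ToyBuilder {
        static final int LEAF_CAPACITY = 31;  // threshold mentioned in the thread
        private final List<String> leaf = new ArrayList<>();
        private final List<String> savedBuffer = new ArrayList<>(); // overflow storage
        private final boolean clearOverflowOnReset;                 // true = proposed fix

        ToyBuilder(boolean clearOverflowOnReset) {
            this.clearOverflowOnReset = clearOverflowOnReset;
        }

        void add(String column) {
            if (leaf.size() < LEAF_CAPACITY) leaf.add(column);
            else savedBuffer.add(column);     // anything past the leaf overflows
        }

        // A successful build drains both the overflow and the leaf.
        List<String> build() {
            List<String> out = new ArrayList<>(savedBuffer);
            out.addAll(leaf);
            leaf.clear();
            savedBuffer.clear();
            return out;
        }

        // Without the fix, reset() forgets to clear the overflow.
        void reset() {
            leaf.clear();
            if (clearOverflowOnReset) savedBuffer.clear();
        }
    }

    // Stand-in for a deserialize call that borrows a pooled builder.
    static List<String> deserialize(ToyBuilder pooled, String table,
                                    int columnCount, boolean failAfterColumns) {
        try {
            for (int i = 0; i < columnCount; i++) pooled.add(table + ".c" + i);
            if (failAfterColumns) throw new IllegalStateException("simulated schema disagreement");
            return pooled.build();
        } finally {
            pooled.reset();                   // builder goes back to the pool
        }
    }

    // A failed 40-column read on the source table, then a 3-column victim.
    static List<String> victimColumns(boolean clearOverflowOnReset) {
        ToyBuilder pooled = new ToyBuilder(clearOverflowOnReset);
        try {
            deserialize(pooled, "source", 40, true);
        } catch (IllegalStateException expected) {
            // the failure itself is reported; the stale state is what matters
        }
        return deserialize(pooled, "victim", 3, false);
    }

    public static void main(String[] args) {
        System.out.println(victimColumns(false).size()); // prints 12 (9 stale + 3 own)
        System.out.println(victimColumns(true).size());  // prints 3
    }
}
```

In the toy, the victim ends up with nine source-table columns unless reset() clears the overflow, which is the shape of the proposed fix.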

This also affects small messages on the Netty event loop

Andrés, your investigation focused on wide tables where messages exceed the
~64KB large-message threshold and are deserialized on SEPWorker threads. We
found that the same contamination also occurs with small messages
deserialized on the Netty event loop.

For messages under 64KB, processSmallMessage() deserializes the payload
inline on the event loop thread, which has its own
TinyThreadLocalPool<FastBuilder>. Since Netty binds each channel to a
single EventLoop, messages from the same peer are handled by the same
thread — making thread reuse virtually guaranteed rather than probabilistic.
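
A small sketch of why single-thread affinity makes reuse deterministic. It uses a plain java.util.concurrent single-thread executor and a ThreadLocal as stand-ins for the Netty EventLoop and TinyThreadLocalPool. Both substitutions are assumptions for illustration, and the class name is hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: a single-thread executor stands in for one event loop.
final class EventLoopReuseSketch {
    // Stand-in for a per-thread pooled builder.
    private static final ThreadLocal<StringBuilder> POOLED =
            ThreadLocal.withInitial(StringBuilder::new);

    // Runs two tasks on the same "event loop" and reports whether both
    // saw the same thread-local instance.
    static boolean sameInstanceAcrossTasks() {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        try {
            Callable<StringBuilder> borrow = POOLED::get;
            StringBuilder first = loop.submit(borrow).get();
            StringBuilder second = loop.submit(borrow).get();
            return first == second;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            loop.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sameInstanceAcrossTasks()); // prints true
    }
}
```

Every task submitted to the same single-threaded loop sees the same per-thread object, so any state left behind by one message is visible to the next.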

This lowers the trigger threshold significantly: the source table only
needs more than 31 columns (for FastBuilder overflow) rather than the ~4200
needed to exceed the large-message threshold. In our case, a 150-column
table was the contamination source. The 29 foreign columns we observed are
consistent with the 31 + 1 items retained in savedBuffer/savedNextKey,
minus a few consumed as internal BTree node keys during build().

Summary

We strongly support the proposed fix to clear savedBuffer and savedNextKey
in FastBuilder.reset(). Beyond the ClassCastException that Andrés
identified, the same bug can cause the silent SSTable header contamination
tracked in CASSANDRA-21260. We have written JVM dtests reproducing both the
large-message and small-message contamination paths and are happy to share
them.

Best regards,
Runtian
