I would like to propose the following solution to CASSANDRA-17047,
CASSANDRA-18589, CASSANDRA-18591 etc. (the rationale and line of thought
included below) for pre-TCM Cassandra versions, and I'd love to hear your
feedback.

Definitions:
There are three states of a node’s knowledge wrt a particular column:

C - a column exists (there’s a column in TableMetadata::columns, no column
in TableMetadata::droppedColumns)
D - a column has been dropped (there’s no column in TableMetadata::columns,
and a column in TableMetadata::droppedColumns)
A - a column has been dropped and then subsequently re-added (there’s a
column *both* in TableMetadata::columns and TableMetadata::droppedColumns)

For convenience, later in the document, I will refer to column creation,
column drop, and column re-addition simply as C, D, or A events,
respectively.

Column lifecycle is always:
CDADAD…

A post-drop write is a write written with the “old” schema (where the
column was not yet dropped), but with a timestamp greater than the
respective column drop time.
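
To make the definitions concrete, here is a minimal, self-contained sketch of
the C/D/A classification and the post-drop-write predicate. The types are
simplified stand-ins (plain names and timestamps), not the real
org.apache.cassandra.schema classes:

```
import java.util.Map;
import java.util.Set;

public class ColumnKnowledge {

    enum State { C, D, A }

    /**
     * @param currentColumns column names present in the local schema
     *                       (the role of TableMetadata::columns)
     * @param droppedColumns column name -> drop timestamp
     *                       (the role of TableMetadata::droppedColumns)
     */
    static State classify(String column, Set<String> currentColumns,
                          Map<String, Long> droppedColumns) {
        boolean present = currentColumns.contains(column);
        boolean dropped = droppedColumns.containsKey(column);
        if (present && dropped) return State.A;  // dropped, then re-added
        if (!present && dropped) return State.D; // dropped
        if (present) return State.C;             // exists, never dropped
        throw new IllegalArgumentException("column unknown to this node: " + column);
    }

    // A post-drop write: written with the "old" schema, but carrying a
    // timestamp greater than the recorded drop time of the column.
    static boolean isPostDropWrite(long writeTimestamp, long dropTimestamp) {
        return writeTimestamp > dropTimestamp;
    }

    public static void main(String[] args) {
        Set<String> columns = Set.of("a", "c_readded");
        Map<String, Long> dropped = Map.of("b", 1000L, "c_readded", 2000L);
        System.out.println(classify("a", columns, dropped));          // C
        System.out.println(classify("b", columns, dropped));          // D
        System.out.println(classify("c_readded", columns, dropped));  // A
        System.out.println(isPostDropWrite(2500L, dropped.get("b"))); // true
    }
}
```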

Proposal:
The proposal consists of three postulates:

1. If either the coordinator or replica considers the column to be dropped
- it is OK for them to ignore the column data *regardless* of timestamp
Motivation:
Coordinator D should be allowed to return data as if the column was
removed; if the client wants an up-to-date view of data written after
subsequent A it should use an up-to-date coordinator (i.e. if you write to
coordinator A, and subsequently read from a coordinator with the previous D
state you may not be able to read your new write).
Replica D couldn’t have accepted any writes that happened after other nodes
performed subsequent A because of the check in
UpdateStatement::prepareInternal (i.e. only writes to columns visible in
the node’s schema are allowed). Thus, ignoring data does not affect writes
confirmed after the A event.
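
As a rough illustration of the check referred to above (a simplified stand-in,
not the actual UpdateStatement::prepareInternal; plain column names are used
for brevity): a write is only prepared against columns visible in the node’s
current schema, so a replica in state D cannot have accepted writes issued
after a subsequent re-add performed elsewhere.

```
import java.util.List;
import java.util.Set;

public class WriteValidation {

    // Reject updates that reference columns not present in the local schema.
    static void checkUpdatedColumns(List<String> updatedColumns, Set<String> localSchemaColumns) {
        for (String column : updatedColumns) {
            if (!localSchemaColumns.contains(column))
                throw new IllegalArgumentException("Undefined column name " + column);
        }
    }

    public static void main(String[] args) {
        // "v2" has been dropped locally (state D), so it is not in the schema.
        Set<String> localSchema = Set.of("pk", "v1");
        checkUpdatedColumns(List.of("pk", "v1"), localSchema); // accepted
        try {
            checkUpdatedColumns(List.of("pk", "v2"), localSchema); // rejected
        } catch (IllegalArgumentException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```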

2. If neither the coordinator nor the replica considers the column to be
dropped - accept write resurrection
Motivation:
Without TCM there’s no race-free way to tell whether a write is a new,
confirmed write that should be returned, or an old, post-drop write.
Ignoring the written data may lead to data loss.

3. Read-repair should be disabled when there is schema inconsistency
between nodes
Motivation:
Eagerly ignoring written data by a replica in the D state can lead to a
read-repair storm. This is fundamentally caused by a difference of opinion
between nodes with respect to the existence of a particular write. A limitation
that read repairs can only be sent to replicas which agree on the current
version of the schema is simple and temporary, so it looks acceptable.
To mitigate the window for race conditions, it could be beneficial to
return a "schema disagreement detected" flag with the read from a replica.
Replicas already have the capacity to identify this condition based on the
ColumnFilter, which is sent by the coordinator alongside the read request.
This detection does not necessitate checking the EndpointState and so forth.
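
A minimal sketch of that detection, under the assumption that we can reduce
both the received ColumnFilter and the local TableMetadata to plain sets of
column names (the real objects are richer; schemaDisagreementDetected is a
hypothetical helper, not existing code):

```
import java.util.Set;

public class SchemaDisagreementCheck {

    // The replica compares the columns the coordinator asked for with its own
    // view of the table; any difference means the response should carry a
    // "schema disagreement detected" flag and be excluded from read repair.
    static boolean schemaDisagreementDetected(Set<String> fetchedByCoordinator,
                                              Set<String> liveOnReplica) {
        return !fetchedByCoordinator.equals(liveOnReplica);
    }

    public static void main(String[] args) {
        // The coordinator still fetches "v2"; the replica has already dropped it.
        Set<String> coordinatorView = Set.of("pk", "v1", "v2");
        Set<String> replicaView = Set.of("pk", "v1");
        System.out.println(schemaDisagreementDetected(coordinatorView, replicaView)); // true
    }
}
```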

Full, in-depth spectrum of possible scenarios with some of my loose
deliberations. Each entry below gives coordinator knowledge / replica
knowledge, then the current state (in particular in the presence of post-drop
writes) and the expected behaviour (in users’ eyes):

D / C
Current state: IllegalStateException (CASSANDRA-17047, CASSANDRA-18532)
Expected behaviour: ??? Ignore dropped column data?

D / D
Current state: ColumnFilter `fetches` all columns, including the dropped one;
NPE/AssertionError for post-drop writes (CASSANDRA-18591, CASSANDRA-18589)
Expected behaviour: Ignore dropped column data

C / D
Current state: CASSANDRA-18591, CASSANDRA-18589
Expected behaviour: ??? The replica knows more, but the coordinator is OK with
returning the column. The replica cannot guarantee that it will return the
data anyway (it might have already been removed). Currently, the read path
does not support returning data for dropped columns. The easiest solution
would be to ignore the column as in the D / D case; ignoring the column may
lead to a read-repair storm.

A / A
Current state: post-drop writes are resurrected
Expected behaviour: ??? The user would like the writes to be dropped. We could
do that by e.g. comparing the write timestamp to the timestamp of column
re-addition. The problem is then that in the presence of clock skew between
nodes we might ignore reading a write that was confirmed (and e.g. written at
QUORUM). This sounds worse than resurrection.

A / D
Current state: currently on the replica it is not possible to distinguish this
case from C / D. Also, there is no way to tell which node is “right” - whether
the column was dropped and has been re-added, or re-added and then dropped
again.
Expected behaviour: ??? Again, the easiest is to ignore data for the dropped
column (IIUC new writes couldn’t have happened on the replica yet, because of
the check in UpdateStatement::prepareInternal - so we will not remove a new,
confirmed write). Or perhaps we should fail the read until schema convergence…

D / A
Current state: see above and the D / C case
Expected behaviour: see above and the D / C case

A / C
Current state: writes may be resurrected
Expected behaviour: ??? I guess we can live with this… We really shouldn’t do
another schema change until all nodes converge on the previous one. According
to Jacek Lewandowski the driver should ensure that we do indeed wait for
schema convergence before making another schema change.

C / A
Current state: post-drop writes may be resurrected
Expected behaviour: see above.

C / C

Uninteresting case
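
As a side note to the A / A entry above, here is a small sketch of the
timestamp-comparison idea and of why clock skew makes it unattractive
(keepCell is an illustrative helper, not existing code):

```
public class ReAddFiltering {

    // Keep a cell only if it was written at or after the column was re-added.
    static boolean keepCell(long writeTimestamp, long reAddTimestamp) {
        return writeTimestamp >= reAddTimestamp;
    }

    public static void main(String[] args) {
        long reAddTime = 10_000L;

        // A post-drop write from before the re-add: correctly discarded.
        System.out.println(keepCell(9_500L, reAddTime)); // false

        // A confirmed write issued after the re-add, but timestamped by a node
        // whose clock lags by one second: incorrectly discarded, which looks
        // like data loss to the user.
        System.out.println(keepCell(reAddTime - 1_000L, reAddTime)); // false
    }
}
```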


Questions & answers that I came up with while pondering the problem:
1. Being a D replica, can we ignore post-drop-write data for columns that
are locally believed to be dropped?
In the DD case it is OK.
In the CD case the column has been dropped. Ignoring the data is consistent
with the user’s desire to drop it.
This in turn is consistent with Cassandra not having failed writes, only
“writes for which there is no confirmation of a successful write” - if the
user expressed the desire to write, we are OK with a sporadic appearance of
writes that had “failed”.
In the AD case we don’t know if the column is dropped or not. And we cannot
currently distinguish between AD and CD. However, the data we ignore cannot
be a legitimate write that we’d incorrectly remove, as we don’t accept such
writes in the first place.

2. Being a D coordinator, can we ignore post-drop-write data?
I think the answer is yes. I believe that a limitation “when you read using
a coordinator that has outdated schema you may get results valid for that
outdated schema” is a fair, reasonable and understandable one.
That said, I don’t know how the above fits with the overall driver
philosophy etc.

3. Can we return post-drop-write data for columns that are locally
believed to be dropped (and that the coordinator believes to be alive) -
the CD/AD cases?
The current read path doesn’t support that. There are dragons there.
May lead to write resurrection.
The DD case could get “special” handling to ignore data from the dropped
column.
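
A rough sketch of what that “special” handling could look like, next to the
current drop-time-based skip (simplified stand-ins, not the real
DeserializationHelper logic):

```
public class DroppedColumnSkip {

    // Roughly the current behaviour: only cells older than the drop time are
    // skipped, so post-drop writes leak into the read path.
    static boolean skipCurrent(long writeTimestamp, long dropTimestamp) {
        return writeTimestamp <= dropTimestamp;
    }

    // The proposed behaviour for a node in state D: skip the column's data
    // unconditionally, regardless of timestamp.
    static boolean skipProposed(boolean locallyDropped) {
        return locallyDropped;
    }

    public static void main(String[] args) {
        long dropTime = 5_000L;
        System.out.println(skipCurrent(6_000L, dropTime)); // false: post-drop write not skipped
        System.out.println(skipProposed(true));            // true: ignored regardless of timestamp
    }
}
```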

4. Can we (as a replica) fail reads during schema inconsistency?
We already do, only in an uncontrolled way. It is kind of best-effort, as
such reads need not always fail, so maybe it is OK as is?
Can we alleviate the pain with speculative retry - so that replicas that
agree on schema with the coordinator take over?
Maybe we should add a way to distinguish CD from AD and then fail reads for
AD/DA cases only?
Then we’d be able to make small improvements for CD (e.g. ignore the write)
and DC (e.g. use de-optimized filter).
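
And a toy illustration of the “de-optimized filter” idea for the DC case: an
exact per-column check instead of the fetch-everything fast path (hypothetical
Filter interface, not the real ColumnFilter API):

```
import java.util.Set;

public class FilterSketch {

    interface Filter {
        boolean fetches(String column);
    }

    // Fast path: claim to fetch everything; cheap, but lets dropped-column data through.
    static Filter fetchAllFilter() {
        return column -> true;
    }

    // Exact (de-optimized) path: only admit columns explicitly in the fetched set.
    static Filter exactFilter(Set<String> fetchedColumns) {
        return fetchedColumns::contains;
    }

    public static void main(String[] args) {
        Set<String> fetched = Set.of("pk", "v1"); // "v_dropped" is not fetched
        System.out.println(fetchAllFilter().fetches("v_dropped"));     // true: slips through
        System.out.println(exactFilter(fetched).fetches("v_dropped")); // false: skipped
    }
}
```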

On Mon, Jun 26, 2023 at 6:39 PM Jakub Zytka <jakub.zy...@datastax.com>
wrote:

> Hello,
>
> I want to discuss the possibilities of fixing CASSANDRA-18591 and
> CASSANDRA-18589 (~exceptions during reads), considering that TCM will
> become a reality soon.
> While both issues can be hit even on a single-node cluster, I think it's
> important for the solution to be at least TCM-friendly and easily
> extendable in the future.
>
> "Writes that occurred post-column drop" are at the core of both issues.
> For the purpose of this thread, I define such "post-drop-writes" using the
> DeserializationHelper::isDropped and
> DeserializationHelper::isDroppedComplexDeletion functions. Namely,
> post-drop-writes are writes with a timestamp exceeding the corresponding
> column drop time.
>
> The problem manifestation:
> Trying to SELECT rows with such writes (for instance, SELECT * ...) ends
> with AssertionFailure or NPE.
> Furthermore, the problem is persistent: it is not remedied by compaction
> and does not require schema disagreement to occur.
>
> Bug details:
>
> Post-drop-writes are created in two ways:
> 1. There's a race condition between two nodes, where a write is processed
> by a node that is unaware of the schema change, thus acquiring a larger
> timestamp.
> 2. Another race condition can occur within a node, where writes only
> verify the column's existence. However, there's no synchronization between
> column drops and writes. Hence, even with single-node clusters, we can
> still encounter post-drop-writes.
>
> I assume such writes are unavoidable now (and possibly also with TCM; at
> least in its first incarnation).
>
> These post-drop-writes cause issues during reads, as different parts of
> the read path treat such writes differently.
> Importantly, these discrepancies occur without any schema disagreement or
> similar race conditions:
>
> 1. ColumnFilter
> ColumnFilter answers, among other things, the question of whether a
> specific column should be fetched. In the presence of dropped columns, the
> answer to this question
> differs depending on how it is asked. Specifically, there is a discrepancy
> between ColumnFilter::fetches() and ColumnFilter::fetchedColumns().
> fetches(someDroppedColumn) may return true, but fetchedColumns() does not
> contain such a column.
> While surprising, counter-intuitive, and error-prone, it's not immediately
> problematic as far as correctness is concerned:
> Currently, it is `ColumnFilter::fetches` that is being used to tell
> whether data for a particular column should be skipped or not during
> deserialization.
> The reason that `ColumnFilter::fetches` may return true for every
> invocation, regardless of the actual column, is performance.
>
> So, for the time being, let's assume that a ColumnFilter is allowed to let
> in data of dropped columns and that such a case should be properly documented
> etc. etc.
>
> 2. UnfilteredSerializer::read[Complex|Simple]Column + DeserializationHelper
>
> DeserializationHelper::isDropped and
> DeserializationHelper::isDroppedComplexDeletion help UnfilteredSerializer
> skip data that was written *before* column drop time.
> This implies that writes to dropped columns are expected at this point (or
> does it?). At least, that explains why the pre-drop-writes do not cause
> issues, even though ColumnFilter lets them in.
>
> 3. Row::Merger::ColumnDataReducer
>
> ColumnDataReducer has an optimization that if the schema does not contain
> complex columns, the complexBuilder is not constructed (==null).
> However, since we can read a write to a dropped complex column, we
> eventually hit an NPE in getReduced when we attempt to use the complexBuilder.
> While this NPE (CASSANDRA-18589) can be fixed easily, it suggests that we
> may already be expecting no post-drop-writes at this point in the code.
>
> 4. Unfiltered::serializeRowBody
>
> serializeRowBody explicitly asserts that the SerializationHelper is aware
> of a column we're about to serialize.
> This is not true for writes to dropped columns, causing CASSANDRA-18591.
>
> There might be other places in the read path that I haven't identified,
> which also make different assumptions about post-drop-writes.
>
>
> So, how do we resolve these issues?
> The most straightforward solution, which I do not discount, is to
> introduce and document a system limitation:
> Abstain from dropping columns or making schema changes during writes.
>
> But if we wish to tackle this issue programmatically, we need to consider
> various scenarios during read operations:
>
> When performing read:
> 1. the coordinator and replica share the same schema
> 2. the coordinator is aware of the dropped column but not the replica
> 3. the replica is aware of the dropped column, but not the coordinator
> 4. the coordinator is aware of the dropped column and its subsequent
> re-addition (both `columns` and `droppedColumns` in `TableMetadata` contain
> a column with the same name),
> 5. the coordinator is aware of the dropped column and its subsequent
> column re-addition, but the replica only knows about column drop
> 6. the replica is aware of the dropped column and its subsequent
> re-addition, but the coordinator is not
> 7. The replica is aware of the dropped column and its subsequent
> re-addition, but the coordinator is only aware of the drop
>
> It is unclear to me what should happen in each case, especially
> considering that we want to (as I assume) avoid resurrecting writes.
> I've got several ideas, but first, I'd like to confirm that my thinking
> about the problem resonates with you. Does it?
>
> Finally, let's consider the role of Transactional Cluster Metadata (TCM).
> With TCM, distinguishing and handling some of the aforementioned scenarios
> may become easier or even feasible. However, my familiarity with TCM
> internals and future plans is limited. Therefore, I would greatly
> appreciate any insights into the relationship between TCM and scenarios
> such as the ones I've described above.
>
> --
> Jakub Zytka
> e. jakub.zy...@datastax.com
> w. www.datastax.com
>
>

-- 
Jakub Zytka
e. jakub.zy...@datastax.com
w. www.datastax.com
