I would like to propose the following solution to CASSANDRA-17047, CASSANDRA-18589, CASSANDRA-18591 etc. (the rationale and line of thought included below) for pre-TCM Cassandra versions, and I'd love to hear your feedback.
Definitions:

There are three states of knowledge of a node with respect to a particular column:

- C - the column exists (there is a column in TableMetadata::columns and none in TableMetadata::droppedColumns)
- D - the column has been dropped (there is no column in TableMetadata::columns, and there is one in TableMetadata::droppedColumns)
- A - the column has been dropped and then subsequently re-added (there is a column *both* in TableMetadata::columns and in TableMetadata::droppedColumns)

For convenience, later in the document I refer to column creation, column drop, and column re-addition as C, D, and A events, respectively. The column lifecycle is always CDADAD…

A post-drop write is a write written with the "old" schema (in which the column was not yet dropped), but with a timestamp greater than the respective column drop time.

Proposal:

The proposal consists of three postulates:

1. If either the coordinator or the replica considers the column to be dropped, it is OK for them to ignore the column data *regardless* of timestamp.

Motivation: A coordinator in state D should be allowed to return data as if the column was removed; if the client wants an up-to-date view of data written after a subsequent A, it should use an up-to-date coordinator (i.e. if you write through a coordinator in state A and subsequently read through a coordinator still in the previous D state, you may not be able to read your new write). A replica in state D could not have accepted any writes that happened after other nodes performed the subsequent A, because of the check in UpdateStatement::prepareInternal (only writes to columns visible in the node's schema are allowed). Thus, ignoring the data does not affect writes confirmed after the A event.

2. If neither the coordinator nor the replica considers the column to be dropped, accept write resurrection.

Motivation: Without TCM there is no race-free way to tell whether a write is a new, confirmed write that should be returned, or an old, post-drop write. Ignoring the written data may lead to data loss.
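The C/D/A classification above follows mechanically from the two TableMetadata collections. As a minimal illustration, here is a Python model of that Java-side state (function and parameter names are mine, for illustration only, not actual Cassandra code):

```python
def column_state(name, columns, dropped_columns):
    """Classify a node's knowledge of a column.

    columns         -- column names currently in TableMetadata::columns
    dropped_columns -- column names in TableMetadata::droppedColumns
    """
    present = name in columns
    dropped = name in dropped_columns
    if present and not dropped:
        return "C"   # column exists and was never dropped
    if not present and dropped:
        return "D"   # column has been dropped
    if present and dropped:
        return "A"   # column was dropped and subsequently re-added
    return None      # the node has never seen this column

# Lifecycle is always C -> D -> A -> D -> A -> ...
assert column_state("v", columns={"v"}, dropped_columns=set()) == "C"
assert column_state("v", columns=set(), dropped_columns={"v"}) == "D"
assert column_state("v", columns={"v"}, dropped_columns={"v"}) == "A"
```

Note that A is exactly the ambiguous state: the column is live in the schema, yet an entry for it remains in droppedColumns.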
3. Read-repair should be disabled when there is schema inconsistency between nodes.

Motivation: Eagerly ignoring written data on a replica in state D can lead to a read-repair storm. This is fundamentally caused by a difference of opinion between nodes with respect to the existence of a particular write. A limitation that read repairs can only be sent to replicas which agree on the current version of the schema is simple and temporary, so it looks acceptable. To narrow the window for race conditions, it could be beneficial to return a "schema disagreement detected" flag with the read response from a replica. Replicas can already detect this condition based on the ColumnFilter, which the coordinator sends alongside the read request; the detection does not require checking the EndpointState and so forth.

The full, in-depth spectrum of possible scenarios, with some of my loose deliberations. Each case below lists the coordinator's and the replica's knowledge, then the current state (in particular in the presence of post-drop writes), then the expected behaviour (in users' eyes):

- D C: Currently an IllegalStateException (CASSANDRA-17047, CASSANDRA-18532). Expected: ??? Ignore the dropped column's data?

- D D: ColumnFilter `fetches` all columns, including the dropped one; NPE/AssertionError for post-drop writes (CASSANDRA-18591, CASSANDRA-18589). Expected: ignore the dropped column's data.

- C D: CASSANDRA-18591, CASSANDRA-18589. Expected: ??? The replica knows more, but the coordinator is OK with returning the column. The replica cannot guarantee that it will return the data anyway (it might have already been removed). Currently, the read path does not support returning data for dropped columns, so the easiest solution would be to ignore the column, as in the D D case. However, ignoring the column may lead to a read-repair storm.

- A A: Post-drop writes are resurrected. Expected: ??? The user would like the writes to be dropped. We could do that by e.g. comparing the write timestamp to the timestamp of the column re-addition. The problem then is that in the presence of clock skew between nodes we might ignore a write that was confirmed (and e.g. written at QUORUM). That sounds worse than resurrection.

- A D: Currently, on the replica it is not possible to distinguish this case from C D. Also, there is no way to tell which node is "right": whether the column was dropped and has been re-added, or re-added and then dropped again. Expected: ??? Again, the easiest is to ignore data for the dropped column (IIUC, new writes could not have reached the replica yet, because of the check in UpdateStatement::prepareInternal, so we would not remove a new, confirmed write). Or perhaps we should fail the read until schema convergence.

- D A: See above and the D C case.

- A C: Writes may be resurrected. Expected: ??? I guess we can live with this… We really should not make another schema change until all nodes have converged on the previous one. According to Jacek Lewandowski, the driver should ensure that we do indeed wait for schema convergence before making another schema change.

- C A: Post-drop writes may be resurrected. See above.

- C C: Uninteresting case.

Questions & answers that I came up with while pondering the problem:

1. Being a D replica, can we ignore post-drop write data for columns that are locally believed to be dropped?

In the D D case it is OK. In the C D case the column has been dropped, and ignoring the data is consistent with the user's desire to drop it. This in turn is consistent with Cassandra not having failed writes, only "writes for which there is no confirmation of a successful write": if the user expressed the desire to write, we are OK with the sporadic appearance of writes that had "failed". In the A D case we do not know whether the column is dropped or not, and we currently cannot distinguish A D from C D. However, the data we ignore cannot be a legitimate write that we would incorrectly remove, since such writes are not admitted in the first place.
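To make the timestamp reasoning in the A A row and in answer 1 concrete, here is a small Python model (illustrative only; `drop_time` and `readd_time` stand for the hypothetical per-column drop and re-addition timestamps, not actual Cassandra fields):

```python
def is_post_drop_write(write_ts, drop_time):
    # A post-drop write carries a timestamp greater than the column drop time,
    # even though it was issued under the pre-drop schema.
    return write_ts > drop_time

def resurrected_in_aa(write_ts, drop_time, readd_time):
    # Possible A A mitigation: keep only writes newer than the re-addition.
    # Under clock skew, a confirmed (e.g. QUORUM) write with
    # write_ts <= readd_time would then be wrongly ignored, which is why the
    # proposal accepts resurrection instead.
    return is_post_drop_write(write_ts, drop_time) and write_ts <= readd_time

assert is_post_drop_write(write_ts=150, drop_time=100)
assert not is_post_drop_write(write_ts=90, drop_time=100)
# A write between the drop (t=100) and the re-addition (t=200) would be
# resurrected under the A A mitigation:
assert resurrected_in_aa(write_ts=150, drop_time=100, readd_time=200)
```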
2. Being a D coordinator, can we ignore post-drop write data?

I think the answer is yes. I believe that the limitation "when you read through a coordinator that has an outdated schema, you may get results valid for that outdated schema" is a fair, reasonable and understandable one. That said, I do not know how the above fits with the overall driver philosophy etc.

3. Can we return post-drop write data for columns that are locally believed to be dropped while the coordinator believes them to be alive (the C D / A D cases)?

The current read path does not support that; there are dragons there, and it may lead to write resurrection. The D D case could get "special" handling that ignores data from the dropped column.

4. Can we (as a replica) fail reads during schema inconsistency?

We already do, only in an uncontrolled way. It is kind of best-effort, as such reads need not always fail, so maybe it is OK as is? Can we alleviate the pain with speculative retry, so that replicas that agree with the coordinator on the schema take over? Maybe we should add a way to distinguish C D from A D and then fail reads only in the A D / D A cases? Then we would be able to make small improvements for C D (e.g. ignore the write) and D C (e.g. use a de-optimized filter).

On Mon, Jun 26, 2023 at 6:39 PM Jakub Zytka <jakub.zy...@datastax.com> wrote:

> Hello,
>
> I want to discuss the possibilities of fixing CASSANDRA-18591 and
> CASSANDRA-18589 (~exceptions during reads), considering that TCM will
> become a reality soon.
> While both issues can be hit even on a single-node cluster, I think it's
> important for the solution to be at least TCM-friendly and easily
> extendable in the future.
>
> "Writes that occurred post-column drop" are at the core of both issues.
> For the purpose of this thread, I define such "post-drop-writes" using the
> DeserializationHelper::isDropped and
> DeserializationHelper::isDroppedComplexDeletion functions. Namely,
> post-drop-writes are writes with a timestamp exceeding the corresponding
> column drop time.
>
> The problem manifestation:
> Trying to SELECT rows with such writes (for instance, SELECT * ...) ends
> with an AssertionError or an NPE.
> Furthermore, the problem is persistent: it is not remedied by compaction
> and does not require schema disagreement to occur.
>
> Bug details:
>
> Post-drop-writes are created in two ways:
> 1. There's a race condition between two nodes, where a write is processed
> by a node that is unaware of the schema change, thus acquiring a larger
> timestamp.
> 2. Another race condition can occur within a node, where writes only
> verify the column's existence. However, there's no synchronization between
> column drops and writes. Hence, even with single-node clusters, we can
> still encounter post-drop-writes.
>
> I assume such writes are unavoidable now (and possibly also with TCM, at
> least in its first incarnation).
>
> These post-drop-writes cause issues during reads, as different parts of
> the read path treat such writes differently.
> Importantly, these discrepancies occur without any schema disagreement or
> similar race conditions:
>
> 1. ColumnFilter
> ColumnFilter answers, among other things, the question whether a specific
> column should be fetched. In the presence of dropped columns, the answer
> differs depending on how the question is asked. Specifically, there is a
> discrepancy between ColumnFilter::fetches() and
> ColumnFilter::fetchedColumns(): fetches(someDroppedColumn) may return
> true, even though fetchedColumns() does not contain such a column.
> While surprising, counter-intuitive, and error-prone, this is not
> immediately problematic as far as correctness is concerned:
> currently, it is `ColumnFilter::fetches` that is used to tell whether data
> for a particular column should be skipped during deserialization.
> The reason that `ColumnFilter::fetches` may return true for each
> invocation, regardless of the actual column, is performance.
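[Editorial note: the fetches()/fetchedColumns() discrepancy described in point 1 above can be modelled in a few lines. This is a Python sketch of a hypothetical "fetch all" filter, not the real ColumnFilter API:]

```python
class FetchAllColumnFilter:
    """Models a ColumnFilter in 'fetch all' mode: fetches() short-circuits
    to True for performance, while fetched_columns() reflects the current
    schema and therefore excludes dropped columns."""

    def __init__(self, schema_columns):
        self._columns = set(schema_columns)

    def fetches(self, column):
        # Fast path: answer True without inspecting the column at all.
        return True

    def fetched_columns(self):
        return set(self._columns)

f = FetchAllColumnFilter(schema_columns={"pk", "v1"})
dropped = "v2"
assert f.fetches(dropped)                  # lets dropped-column data in...
assert dropped not in f.fetched_columns()  # ...yet does not list the column
```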
>
> So, for the time being, let's assume that a ColumnFilter is allowed to let
> in data of dropped columns, and that such a case should be properly
> documented etc.
>
> 2. UnfilteredSerializer::read[Complex|Simple]Column + DeserializationHelper
>
> DeserializationHelper::isDropped and
> DeserializationHelper::isDroppedComplexDeletion help UnfilteredSerializer
> skip data that was written *before* the column drop time.
> This implies that writes to dropped columns are expected at this point (or
> does it?). At least, it explains why pre-drop-writes do not cause issues,
> even though ColumnFilter lets them in.
>
> 3. Row::Merger::ColumnDataReducer
>
> ColumnDataReducer has an optimization: if the schema does not contain
> complex columns, the complexBuilder is not constructed (== null).
> However, since we can read a write to a dropped complex column, we
> eventually hit an NPE in getReduced when we attempt to use the
> complexBuilder.
> While this NPE (CASSANDRA-18589) can be fixed easily, it suggests that we
> may already be expecting no post-drop-writes at this point in the code.
>
> 4. Unfiltered::serializeRowBody
>
> serializeRowBody explicitly asserts that the SerializationHelper is aware
> of the column we're about to serialize.
> This is not true for writes to dropped columns, causing CASSANDRA-18591.
>
> There might be other places in the read path that I haven't identified
> which also make different assumptions about post-drop-writes.
>
> So, how do we resolve these issues?
> The most straightforward solution, which I do not discount, is to
> introduce and document a system limitation:
> abstain from dropping columns or making schema changes during writes.
>
> But if we wish to tackle this issue programmatically, we need to consider
> various scenarios during read operations.
>
> When performing a read:
> 1. the coordinator and the replica share the same schema
> 2. the coordinator is aware of the dropped column, but not the replica
> 3.
the replica is aware of the dropped column, but not the coordinator
> 4. the coordinator is aware of the dropped column and its subsequent
> re-addition (both `columns` and `droppedColumns` in `TableMetadata`
> contain a column with the same name)
> 5. the coordinator is aware of the dropped column and its subsequent
> re-addition, but the replica only knows about the column drop
> 6. the replica is aware of the dropped column and its subsequent
> re-addition, but the coordinator is not
> 7. the replica is aware of the dropped column and its subsequent
> re-addition, but the coordinator is only aware of the drop
>
> It is unclear to me what should happen in each case, especially
> considering that we want to (as I assume) avoid resurrecting writes.
> I've got several ideas, but first, I'd like to confirm that my thinking
> about the problem resonates with you. Does it?
>
> Finally, let's consider the role of Transactional Cluster Metadata (TCM).
> With TCM, distinguishing and handling some of the aforementioned scenarios
> may become easier or even feasible. However, my familiarity with TCM
> internals and future plans is limited. Therefore, I would greatly
> appreciate any insights into the relationship between TCM and scenarios
> such as the ones I've described above.
>
> --
> Jakub Zytka
> e. jakub.zy...@datastax.com
> w. www.datastax.com

--
Jakub Zytka
e. jakub.zy...@datastax.com
w. www.datastax.com