[jira] [Comment Edited] (CASSANDRA-18656) Ensure SSTable streaming transactions do not commit before building attached secondary indexes

Caleb Rackliffe (Jira) Thu, 13 Jul 2023 18:34:42 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741215#comment-17741215
 ]


Caleb Rackliffe edited comment on CASSANDRA-18656 at 7/14/23 1:32 AM:
----------------------------------------------------------------------

One way we might address this is by making sure streamed SSTables and the 
indexes attached to them both fall within the scope of the {{STREAM}} 
transaction.

The current SSTable streaming process is roughly:

1.) stream an SSTable
2.) commit the streaming transaction w/ the SSTable
3.) add the SSTable to the column family
4.) via listener notification as part of 3, index the new SSTable in a blocking 
fashion

2 and 3 are in this order, because if 3 came before 2, the new SSTable could 
participate in reads, the node could die before the transaction committed, and 
the SSTable would be gone after restart. The problem is that if the node dies 
while 4 is in progress, the node will come back up thinking that the streaming 
operation was wholly successful, and allow startup to complete. The index in 
question will be rebuilt, but that rebuild will not block startup, and the 
index will be unusable while that happens.
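
To make the failure window concrete, here is a minimal toy model of the current ordering. It is only a sketch: {{CurrentOrderingSketch}}, {{NodeState}}, {{receive()}} and {{servesUnindexedDataAfterRestart()}} are hypothetical names invented for illustration, not Cassandra APIs, and the model assumes each step either completes fully or not at all.

```java
/** Toy model of the current ordering; every name here is hypothetical, not a Cassandra API. */
final class CurrentOrderingSketch {
    /** What the node would find on disk after a restart, in this simplified model. */
    static final class NodeState {
        boolean transactionCommitted; // step 2 completed
        boolean sstableLive;          // step 3 completed
        boolean indexBuilt;           // step 4 completed
    }

    /**
     * Runs steps 1-4 in today's order. The node "dies" while crashDuringStep
     * is in progress, so that step and all later ones never complete.
     * Pass 0 for a run without a crash.
     */
    static NodeState receive(int crashDuringStep) {
        NodeState state = new NodeState();
        // Step 1: the SSTable has been streamed (assumed to have succeeded).
        if (crashDuringStep == 2) return state;
        state.transactionCommitted = true; // step 2: commit the streaming transaction
        if (crashDuringStep == 3) return state;
        state.sstableLive = true;          // step 3: add the SSTable to the column family
        if (crashDuringStep == 4) return state;
        state.indexBuilt = true;           // step 4: blocking index build via listener
        return state;
    }

    /** A committed transaction keeps the SSTable across restart, indexed or not. */
    static boolean servesUnindexedDataAfterRestart(NodeState state) {
        return state.transactionCommitted && !state.indexBuilt;
    }

    public static void main(String[] args) {
        NodeState state = receive(4);
        System.out.println("committed=" + state.transactionCommitted
                           + ", indexBuilt=" + state.indexBuilt);
    }
}
```

In this model, a crash during step 4 is the only case where the transaction is committed but the index is not built, which corresponds to the window described above.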

I propose that we move 4 between 1 and 2. This way, either the SSTable and 
related indexes are ready to query when we commit the transaction, or the 
transaction is simply considered failed on restart (i.e. on restart, it would 
be as if the streaming had never occurred). Doing this should also make the 
system of marking the index unbuilt and then built again irrelevant across 
restarts, although I'm not entirely sure that would roll back any of the 
complexity of CASSANDRA-10130 and CASSANDRA-13725. {{SecondaryIndexManager}} 
currently handles {{SSTableAddedNotification}} for more than just streaming, 
and we would have to take care to leave those cases intact (SSTable import, 
etc.), although they may suffer from similar problems.
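
As a rough illustration of the reordering, the sketch below models the proposed sequence and what a restart would observe after a crash in each step. All names ({{ProposedOrderingSketch}}, {{Step}}, {{Outcome}}, {{afterRestart()}}) are hypothetical and not Cassandra APIs. It also assumes that index components built before the commit become durable together with the SSTable as part of the same transaction, and that the commit itself is atomic.

```java
/** Toy model of the proposed ordering; every name here is hypothetical, not a Cassandra API. */
final class ProposedOrderingSketch {
    enum Step { STREAM, BUILD_INDEX, COMMIT, ADD_TO_TABLE }

    enum Outcome { STREAM_DISCARDED, SSTABLE_AND_INDEX_PRESENT }

    // Proposed order: the index build (old step 4) moves between old steps 1 and 2.
    static final Step[] PROPOSED = {
        Step.STREAM, Step.BUILD_INDEX, Step.COMMIT, Step.ADD_TO_TABLE
    };

    /**
     * Returns what the node finds on restart if it dies while crashDuring is
     * in progress. Pass null for a run without a crash.
     */
    static Outcome afterRestart(Step crashDuring) {
        boolean committed = false;
        for (Step step : PROPOSED) {
            if (step == crashDuring) break; // the interrupted step never completes
            if (step == Step.COMMIT) committed = true;
        }
        // Assumption: an uncommitted transaction is rolled back on restart,
        // taking the streamed SSTable and any partial index with it.
        return committed ? Outcome.SSTABLE_AND_INDEX_PRESENT : Outcome.STREAM_DISCARDED;
    }

    public static void main(String[] args) {
        for (Step step : Step.values()) {
            System.out.println("crash during " + step + " -> " + afterRestart(step));
        }
    }
}
```

In this model there is no crash point that leaves a committed SSTable without its index: a crash during the index build or the commit discards the stream, and a crash after the commit finds both already in place.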

EDIT: This might not be a viable solution for legacy 2i...see below...


> Ensure SSTable streaming transactions do not commit before building attached 
> secondary indexes
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18656
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18656
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Streaming, Feature/2i Index, Feature/SAI, 
> Local/Startup and Shutdown
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.x
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Back in 2015, we identified in CASSANDRA-10130 a case where failures in 2i 
> builds after SSTable streaming could leave indexes in a partially built 
> state, even after a restart, requiring manual operator intervention. There, 
> and in CASSANDRA-13725, we made an attempt to remedy this situation, ensuring 
> that indexes would at least be rebuilt on restart after this kind of failure. 
> However, there are some difficulties the solution there does not address.
> Let's look at a simple example...
> Suppose an SSTable has been streamed to a node, and that node arrives in 
> {{CassandraStreamReceiver#finished()}}. We'll call {{finishTransaction()}} to 
> make the presence of the new SSTables durable, and then we'll call 
> {{ColumnFamilyStore#addSSTables()}}, which adds them to the {{Tracker}}, 
> making them available for reads. We then notify listeners about the new 
> SSTable, among them the {{SecondaryIndexManager}}, which will do a blocking 
> index build for the new SSTable. Conceptually, at this point, we already have 
> a problem (if a transient one), as there are live SSTables that have not been 
> indexed.
> What if the 2i build fails, though? Let's assume it fails because of a 
> disorderly (or orderly!) node shutdown. Some index implementations (SASI, 
> SAI) might be able to rebuild incrementally, but the legacy 2i has no way of 
> doing this right now. A full index rebuild on a large table could take a very 
> long time (days, weeks, etc.) and is ultimately not a viable way to proceed. 
> Let's say we were able to build incrementally though, and we had an SAI index 
> that did exactly this on node restart. We would still have a gap in 
> availability, because on startup, {{ColumnFamilyStore}} (see constructor) 
> does not block on its calls to {{SecondaryIndexManager#addIndex()}}, which, 
> via {{createIndex()}}, kick off the index building process.
> Of course, SAI implements a notion of "queryability" that would quickly take 
> the node out of rotation for queries across the cluster. Once its 
> initialization task runs on restart, the indexes in question would 
> immediately be marked non-queryable. SAI builds incrementally, and might be 
> able to block startup to do so in this case. Legacy 2i cannot reasonably do 
> this though.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
