[ https://issues.apache.org/jira/browse/CASSANDRA-18656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741215#comment-17741215 ]
Caleb Rackliffe edited comment on CASSANDRA-18656 at 7/14/23 1:32 AM:
----------------------------------------------------------------------

One way we might address this is by making sure streamed SSTables and the indexes attached to them both fall within the scope of the {{STREAM}} transaction.

The current SSTable streaming process is roughly:

1.) stream an SSTable
2.) commit the streaming transaction w/ the SSTable
3.) add the SSTable to the column family
4.) via listener notification as part of 3, index the new SSTable in a blocking fashion

2 and 3 are in this order because, if 3 came before 2, the new SSTable could participate in reads, the node could die before the transaction committed, and the SSTable would be gone after restart. The problem is that if the node dies while 4 is in progress, the node will come back up thinking that the streaming operation was wholly successful, and allow startup to complete. The index in question will be rebuilt, but that rebuild will not block startup, and the index will be unusable while that happens.

I propose that we move 4 between 1 and 2. This way, either the SSTable and its related indexes are ready to query when we commit the transaction, or the transaction is simply considered failed on restart. (i.e. On restart, it would just be as if the streaming had never occurred.) Doing this should make the system of marking the index unbuilt and then built again irrelevant across restart as well, although I'm not entirely sure that would roll back any of the complexity of CASSANDRA-10130 and CASSANDRA-13725.

{{SecondaryIndexManager}} currently handles {{SSTableAddedNotification}} for more than just streaming, and we would have to take care that we leave those cases intact (SSTable import, etc.), although they may suffer from similar problems.

EDIT: This might not be a viable solution for legacy 2i...see below...

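To make the difference between the two orderings concrete, here is a minimal toy model of the four steps above. It is only a sketch: every class, enum, and method name in it is hypothetical and none of it is Cassandra code. It assumes, as described above, that a transaction which never committed is discarded on restart together with anything built for it.

```java
import java.util.List;

/** Toy model of the streaming steps; all names are hypothetical, not Cassandra APIs. */
public class StreamOrderingSketch {
    public enum Step { STREAM, COMMIT, ADD_TO_TRACKER, BUILD_INDEX }

    /** What the node finds on disk after it restarts. */
    public enum Outcome { ROLLED_BACK, COMMITTED_WITH_INDEX, COMMITTED_WITHOUT_INDEX }

    /** Steps 1, 2, 3, 4 as they run today. */
    public static List<Step> currentOrder() {
        return List.of(Step.STREAM, Step.COMMIT, Step.ADD_TO_TRACKER, Step.BUILD_INDEX);
    }

    /** The proposal: move step 4 (index build) between steps 1 and 2. */
    public static List<Step> proposedOrder() {
        return List.of(Step.STREAM, Step.BUILD_INDEX, Step.COMMIT, Step.ADD_TO_TRACKER);
    }

    /** Runs the steps in order, crashes when crashDuring is reached, then "restarts". */
    public static Outcome restartAfterCrash(List<Step> order, Step crashDuring) {
        boolean committed = false;
        boolean indexBuilt = false;
        for (Step step : order) {
            if (step == crashDuring) {
                break; // the node dies: this step and all later ones never complete
            }
            if (step == Step.COMMIT) committed = true;
            if (step == Step.BUILD_INDEX) indexBuilt = true;
        }
        // Assumption: an uncommitted transaction is discarded wholesale on restart.
        if (!committed) return Outcome.ROLLED_BACK;
        return indexBuilt ? Outcome.COMMITTED_WITH_INDEX : Outcome.COMMITTED_WITHOUT_INDEX;
    }

    public static void main(String[] args) {
        // Crash while the blocking index build is in progress, under each ordering.
        System.out.println("current:  " + restartAfterCrash(currentOrder(), Step.BUILD_INDEX));
        System.out.println("proposed: " + restartAfterCrash(proposedOrder(), Step.BUILD_INDEX));
    }
}
```

In this model, a crash during the index build leaves the current ordering with a committed SSTable and no complete index, while the proposed ordering rolls back as if the streaming had never occurred.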
was (Author: maedhroz):
One way we can address this is by making sure streamed SSTables and the indexes attached to them both fall within the scope of the {{STREAM}} transaction.

The current SSTable streaming process is roughly:

1.) stream an SSTable
2.) commit the streaming transaction w/ the SSTable
3.) add the SSTable to the column family
4.) via listener notification as part of 3, index the new SSTable in a blocking fashion

2 and 3 are in this order because, if 3 came before 2, the new SSTable could participate in reads, the node could die before the transaction committed, and the SSTable would be gone after restart. The problem is that if the node dies while 4 is in progress, the node will come back up thinking that the streaming operation was wholly successful, and allow startup to complete. The index in question will be rebuilt, but that rebuild will not block startup, and the index will be unusable while that happens.

I propose that we move 4 between 1 and 2. This way, either the SSTable and its related indexes are ready to query when we commit the transaction, or the transaction is simply considered failed on restart. (i.e. On restart, it would just be as if the streaming had never occurred.) Doing this should make the system of marking the index unbuilt and then built again irrelevant across restart as well, although I'm not entirely sure that would roll back any of the complexity of CASSANDRA-10130 and CASSANDRA-13725.

{{SecondaryIndexManager}} currently handles {{SSTableAddedNotification}} for more than just streaming, and we would have to take care that we leave those cases intact (SSTable import, etc.), although they may suffer from similar problems.

> Ensure SSTable streaming transactions do not commit before building attached secondary indexes
> ----------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18656
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18656
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Streaming, Feature/2i Index, Feature/SAI, Local/Startup and Shutdown
> Reporter: Caleb Rackliffe
> Assignee: Caleb Rackliffe
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.x
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Back in 2015, we identified in CASSANDRA-10130 a case where failures in 2i
> builds after SSTable streaming could leave indexes in a partially built
> state, even after a restart, requiring manual operator intervention. There,
> and in CASSANDRA-13725, we made an attempt to remedy this situation, ensuring
> that indexes would at least be rebuilt on restart after this kind of failure.
> However, there are some difficulties the solution there does not address.
> Let's look at a simple example...
> Suppose an SSTable has been streamed to a node, and that node arrives in
> {{CassandraStreamReceiver#finished()}}. We'll call {{finishTransaction()}} to
> make the presence of the new SSTables durable, and then we'll call
> {{ColumnFamilyStore#addSSTables()}}, which adds the new SSTables to the
> {{Tracker}}, making them available for reads. We then notify listeners about
> the new SSTable, among them the {{SecondaryIndexManager}}, which will do a
> blocking index build for the new SSTable. Conceptually, at this point, we
> already have a problem (if a transient one), as there are live SSTables that
> have not been indexed.
> What if the 2i build fails, though? Let's assume it fails because of a
> disorderly (or orderly!) node shutdown. Some index implementations (SASI,
> SAI) might be able to rebuild incrementally, but the legacy 2i has no way of
> doing this right now.
> A full index rebuild on a large table could take a very long time (days,
> weeks, etc.) and is ultimately not a viable way to proceed.
> Let's say we were able to build incrementally though, and we had an SAI index
> that did exactly this on node restart. We would still have a gap in
> availability, because on startup, {{ColumnFamilyStore}} (see constructor)
> does not block on its calls to {{SecondaryIndexManager#addIndex()}}, which,
> via {{createIndex()}}, actuate the index building process.
> Of course, SAI implements a notion of "queryability" that would quickly take
> the node out of rotation for queries across the cluster. Once its
> initialization task runs on restart, the indexes in question would
> immediately be marked non-queryable. SAI builds incrementally, and might be
> able to block startup to do so in this case. Legacy 2i cannot reasonably do
> this, though.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org