I would recommend adding something to C* to be able to flip the repaired state on all SSTables quickly (with default OSS you can turn nodes off one at a time and use sstablerepairedset). It's a life saver to be able to revert back to non-IR if a migration goes south. The same can be used to quickly switch into IR SSTables, with more caveats. Probably worth a Jira to add a faster solution.
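For the interim, the stock-tooling path can be scripted per node. Below is a dry-run sketch (it prints the commands rather than running them); the data file path, service-manager commands, and exact tool flags are assumptions to verify against your Cassandra version — sstablerepairedset must run while the node is stopped.

```python
# Dry-run sketch of reverting one node from IR: drain, stop, mark every
# SSTable unrepaired with sstablerepairedset, restart. Paths and service
# commands are placeholders; adapt per environment.

def rollback_commands(sstables):
    """Build the per-node command sequence to mark SSTables unrepaired.

    sstablerepairedset rewrites the repairedAt metadata field and must be
    run while Cassandra is down, hence the drain/stop/start bracket.
    """
    cmds = [
        ["nodetool", "drain"],               # flush and stop accepting traffic
        ["systemctl", "stop", "cassandra"],  # tool requires the node offline
    ]
    for path in sstables:
        cmds.append(
            ["sstablerepairedset", "--really-set", "--is-unrepaired", path]
        )
    cmds.append(["systemctl", "start", "cassandra"])
    return cmds

if __name__ == "__main__":
    # Hypothetical SSTable data file; print instead of executing.
    for cmd in rollback_commands(
        ["/var/lib/cassandra/data/ks1/tbl-abc/nb-1-big-Data.db"]
    ):
        print(" ".join(cmd))
```

The inverse direction (`--is-repaired`) is the one with "more caveats" mentioned above: mistakenly marking inconsistent data as repaired is exactly what sets up over-streaming later.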
On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys <kza...@gmail.com> wrote:

> Hi folks,
>
> One last question regarding incremental repair.
>
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g. during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size, as the
> repaired dataset will not get compacted with the unrepaired dataset.
> Similar to Sebastian, we have nodes where the disk usage is multiple
> TiBs, so significant growth can be quite dangerous in our case. Would
> the only safe choice be to mark all SSTables as unrepaired before
> stopping regular incremental repair?
>
> Thanks,
> Kristijonas
>
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <user@cassandra.apache.org> wrote:
>
>> The over-streaming is only problematic for the repaired SSTables, but
>> it can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because although an
>> incremental repair will only compare the unrepaired SSTables, it will
>> stream both the unrepaired and repaired SSTables for the inconsistent
>> token ranges. Keep in mind that the source SSTables for streaming are
>> selected based on the token ranges, not the repaired/unrepaired state.
>>
>> Based on the above, I'm unsure that running an incremental repair
>> before a full repair can fully avoid the over-streaming issue.
>>
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> > right? So, when running an incremental repair before the full
>> > repair, the problem that “some unrepaired SSTables are being marked
>> > as repaired on one node but not on another” should not exist any
>> > longer. Now this data should be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> > data should be included on all nodes when calculating the Merkle
>> > trees, and no over-streaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first*,
>> > after marking SSTables as repaired, and only running the full repair
>> > *after* that is critical. I have to admit that previously I wasn’t
>> > fully aware of how critical this step is.
>> >
>> >> On 07.02.2024, at 20:22, Bowen Song via user <user@cassandra.apache.org> wrote:
>> >>
>> >> Unfortunately, repair doesn't compare each partition individually.
>> >> Instead, it groups multiple partitions together, calculates a hash
>> >> of them, stores the hash in a leaf of a Merkle tree, and then
>> >> compares the Merkle trees between replicas during a repair session.
>> >> If any one of the partitions covered by a leaf is inconsistent
>> >> between replicas, the hash values in those leaves will be
>> >> different, and all partitions covered by the same leaf will need to
>> >> be streamed in full.
>> >>
>> >> Knowing that, and also knowing that your approach can create a lot
>> >> of inconsistencies in the repaired SSTables, because some
>> >> unrepaired SSTables are being marked as repaired on one node but
>> >> not on another, you would then understand why over-streaming can
>> >> happen. The over-streaming is only problematic for the repaired
>> >> SSTables, because they are much bigger than the unrepaired ones.
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>> >>>> Caution, using the method you described, the amount of data
>> >>>> streamed at the end with the full repair is not the amount of
>> >>>> data written between stopping the first node and the last node,
>> >>>> but depends on the table size, the number of partitions written,
>> >>>> their distribution in the ring, and the 'repair_session_space'
>> >>>> value.
>> >>>> If the table is large, the writes touch a large number of
>> >>>> partitions scattered across the token ring, and the value of
>> >>>> 'repair_session_space' is small, you may end up with a very
>> >>>> expensive over-streaming.
>> >>>
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> >>> tested it on a test cluster before applying it to the production
>> >>> clusters), but it is good to know that this might not always be
>> >>> the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> >>> 4.x. I would appreciate it if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also
>> >>> using incremental repair pretty much works like on a cluster that
>> >>> is not using incremental repair at all, the only difference being
>> >>> that the sets of repaired and unrepaired data are repaired
>> >>> separately, so the Merkle trees that are calculated for repaired
>> >>> and unrepaired data are completely separate.
>> >>>
>> >>> I also assumed that incremental repair only looks at unrepaired
>> >>> data, which is why it is so fast.
>> >>>
>> >>> Is either of these two assumptions wrong?
>> >>>
>> >>> If not, I do not quite understand how a lot of over-streaming
>> >>> might happen, as long as (I forgot to mention this step in my
>> >>> original e-mail) I run an incremental repair directly after
>> >>> restarting the nodes and marking all data as repaired.
>> >>>
>> >>> I understand that significant over-streaming might happen during
>> >>> this first repair (in the worst case, streaming all the unrepaired
>> >>> data that a node stores), but due to the short amount of time
>> >>> between starting to mark data as repaired and running the
>> >>> incremental repair, the whole set of unrepaired data should be
>> >>> rather small, so this over-streaming should not cause any issues.
>> >>>
>> >>> From this point on, the unrepaired data on the different nodes
>> >>> should be in sync, and discrepancies in the repaired data during
>> >>> the full repair should not be bigger than they would have been if
>> >>> I had run a full repair without marking any data as repaired.
>> >>>
>> >>> I would really appreciate it if you could point out the hole in
>> >>> this reasoning. Maybe I have a fundamentally wrong understanding
>> >>> of the repair process, and if I do, I would like to correct it.
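The Merkle-tree behaviour Bowen describes earlier in the thread (one inconsistent partition forces every partition under the same leaf to stream) can be illustrated with a toy model. This is not Cassandra's actual tree — the leaf bucketing by token modulo and the leaf counts are invented for illustration — but it shows why a coarser tree (i.e. a smaller 'repair_session_space') over-streams more:

```python
# Toy model: partitions are bucketed into Merkle leaves; if any partition
# in a leaf differs between replicas, the whole leaf's contents stream.
import hashlib

def leaf_hashes(partitions, num_leaves):
    """Bucket partitions into num_leaves leaves by token and hash each bucket."""
    leaves = [hashlib.sha256() for _ in range(num_leaves)]
    for token, value in sorted(partitions.items()):
        leaves[token % num_leaves].update(f"{token}:{value}".encode())
    return [h.hexdigest() for h in leaves]

def partitions_to_stream(local, remote, num_leaves):
    """Every partition in a mismatching leaf is streamed, consistent or not."""
    mismatched = {
        i for i, (a, b) in enumerate(
            zip(leaf_hashes(local, num_leaves), leaf_hashes(remote, num_leaves))
        ) if a != b
    }
    return sorted(t for t in local if t % num_leaves in mismatched)

local = {t: "v" for t in range(1000)}
remote = dict(local)
remote[42] = "stale"  # a single inconsistent partition

# A coarse tree (few leaves) over-streams far more than a fine one.
print(len(partitions_to_stream(local, remote, num_leaves=8)))    # 125 partitions
print(len(partitions_to_stream(local, remote, num_leaves=512)))  # 2 partitions
```

With only 8 leaves, the one stale partition drags 124 consistent neighbours along with it; with 512 leaves, only 2 partitions stream. This is also why the over-streaming hurts most on the repaired set, which is the larger one.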