I would recommend adding something to C* to be able to flip the repaired state on all SSTables quickly (with default OSS you can turn nodes off one at a time and use sstablerepairedset). It's a life saver to be able to revert back to non-IR if a migration goes south. The same can be used to quickly switch into IR SSTables, with more caveats. Probably worth a Jira to add a faster solution.
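For the interim, the stock-tooling path can be scripted per node. Below is a dry-run sketch (it prints the commands rather than running them); the data file path, service-manager commands, and exact tool flags are assumptions to verify against your Cassandra version — sstablerepairedset must run while the node is stopped.

```python
# Dry-run sketch of reverting one node from IR: drain, stop, mark every
# SSTable unrepaired with sstablerepairedset, restart. Paths and service
# commands are placeholders; adapt per environment.

def rollback_commands(sstables):
    """Build the per-node command sequence to mark SSTables unrepaired.

    sstablerepairedset rewrites the repairedAt metadata field and must be
    run while Cassandra is down, hence the drain/stop/start bracket.
    """
    cmds = [
        ["nodetool", "drain"],               # flush and stop accepting traffic
        ["systemctl", "stop", "cassandra"],  # tool requires the node offline
    ]
    for path in sstables:
        cmds.append(
            ["sstablerepairedset", "--really-set", "--is-unrepaired", path]
        )
    cmds.append(["systemctl", "start", "cassandra"])
    return cmds

if __name__ == "__main__":
    # Hypothetical SSTable data file; print instead of executing.
    for cmd in rollback_commands(
        ["/var/lib/cassandra/data/ks1/tbl-abc/nb-1-big-Data.db"]
    ):
        print(" ".join(cmd))
```

The inverse direction (`--is-repaired`) is the one with "more caveats" mentioned above: mistakenly marking inconsistent data as repaired is exactly what sets up over-streaming later.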
On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys <kza...@gmail.com> wrote:

> Hi folks,
>
> One last question regarding incremental repair.
>
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g. during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size, as the
> repaired dataset will not get compacted with the unrepaired dataset.
> Similar to Sebastian, we have nodes where the disk usage is multiple
> TiBs, so significant growth can be quite dangerous in our case. Would
> the only safe choice be to mark all SSTables as unrepaired before
> stopping regular incremental repair?
>
> Thanks,
> Kristijonas
>
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <user@cassandra.apache.org> wrote:
>
>> The over-streaming is only problematic for the repaired SSTables, but
>> it can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because although an
>> incremental repair will only compare the unrepaired SSTables, it will
>> stream both the unrepaired and repaired SSTables for the inconsistent
>> token ranges. Keep in mind that the source SSTables for streaming are
>> selected based on the token ranges, not the repaired/unrepaired state.
>>
>> Based on the above, I'm unsure that running an incremental repair
>> before a full repair can fully avoid the over-streaming issue.
>>
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> > right? So, when running an incremental repair before the full
>> > repair, the problem that “some unrepaired SSTables are being marked
>> > as repaired on one node but not on another” should not exist any
>> > longer. Now this data should be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> > data should be included on all nodes when calculating the Merkle
>> > trees, and no over-streaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first*,
>> > after marking SSTables as repaired, and only running the full repair
>> > *after* that is critical. I have to admit that previously I wasn’t
>> > fully aware of how critical this step is.
>> >
>> >> On 07.02.2024, at 20:22, Bowen Song via user <user@cassandra.apache.org> wrote:
>> >>
>> >> Unfortunately, repair doesn't compare each partition individually.
>> >> Instead, it groups multiple partitions together, calculates a hash
>> >> of them, stores the hash in a leaf of a Merkle tree, and then
>> >> compares the Merkle trees between replicas during a repair session.
>> >> If any one of the partitions covered by a leaf is inconsistent
>> >> between replicas, the hash values in those leaves will be
>> >> different, and all partitions covered by the same leaf will need to
>> >> be streamed in full.
>> >>
>> >> Knowing that, and also knowing that your approach can create a lot
>> >> of inconsistencies in the repaired SSTables, because some
>> >> unrepaired SSTables are being marked as repaired on one node but
>> >> not on another, you would then understand why over-streaming can
>> >> happen. The over-streaming is only problematic for the repaired
>> >> SSTables, because they are much bigger than the unrepaired ones.
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>> >>>> Caution, using the method you described, the amount of data
>> >>>> streamed at the end with the full repair is not the amount of
>> >>>> data written between stopping the first node and the last node,
>> >>>> but depends on the table size, the number of partitions written,
>> >>>> their distribution in the ring, and the 'repair_session_space'
>> >>>> value.
>> >>>> If the table is large, the writes touch a large number of
>> >>>> partitions scattered across the token ring, and the value of
>> >>>> 'repair_session_space' is small, you may end up with a very
>> >>>> expensive over-streaming.
>> >>>
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> >>> tested it on a test cluster before applying it to the production
>> >>> clusters), but it is good to know that this might not always be
>> >>> the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> >>> 4.x. I would appreciate it if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also
>> >>> using incremental repair pretty much works like on a cluster that
>> >>> is not using incremental repair at all, the only difference being
>> >>> that the sets of repaired and unrepaired data are repaired
>> >>> separately, so the Merkle trees that are calculated for repaired
>> >>> and unrepaired data are completely separate.
>> >>>
>> >>> I also assumed that incremental repair only looks at unrepaired
>> >>> data, which is why it is so fast.
>> >>>
>> >>> Is either of these two assumptions wrong?
>> >>>
>> >>> If not, I do not quite understand how a lot of over-streaming
>> >>> might happen, as long as (I forgot to mention this step in my
>> >>> original e-mail) I run an incremental repair directly after
>> >>> restarting the nodes and marking all data as repaired.
>> >>>
>> >>> I understand that significant over-streaming might happen during
>> >>> this first repair (in the worst case, streaming all the unrepaired
>> >>> data that a node stores), but due to the short amount of time
>> >>> between starting to mark data as repaired and running the
>> >>> incremental repair, the whole set of unrepaired data should be
>> >>> rather small, so this over-streaming should not cause any issues.
>> >>>
>> >>> From this point on, the unrepaired data on the different nodes
>> >>> should be in sync, and discrepancies in the repaired data during
>> >>> the full repair should not be bigger than they would have been if
>> >>> I had run a full repair without marking any data as repaired.
>> >>>
>> >>> I would really appreciate it if you could point out the hole in
>> >>> this reasoning. Maybe I have a fundamentally wrong understanding
>> >>> of the repair process, and if I do, I would like to correct it.
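The Merkle-tree behaviour Bowen describes earlier in the thread (one inconsistent partition forces every partition under the same leaf to stream) can be illustrated with a toy model. This is not Cassandra's actual tree — the leaf bucketing by token modulo and the leaf counts are invented for illustration — but it shows why a coarser tree (i.e. a smaller 'repair_session_space') over-streams more:

```python
# Toy model: partitions are bucketed into Merkle leaves; if any partition
# in a leaf differs between replicas, the whole leaf's contents stream.
import hashlib

def leaf_hashes(partitions, num_leaves):
    """Bucket partitions into num_leaves leaves by token and hash each bucket."""
    leaves = [hashlib.sha256() for _ in range(num_leaves)]
    for token, value in sorted(partitions.items()):
        leaves[token % num_leaves].update(f"{token}:{value}".encode())
    return [h.hexdigest() for h in leaves]

def partitions_to_stream(local, remote, num_leaves):
    """Every partition in a mismatching leaf is streamed, consistent or not."""
    mismatched = {
        i for i, (a, b) in enumerate(
            zip(leaf_hashes(local, num_leaves), leaf_hashes(remote, num_leaves))
        ) if a != b
    }
    return sorted(t for t in local if t % num_leaves in mismatched)

local = {t: "v" for t in range(1000)}
remote = dict(local)
remote[42] = "stale"  # a single inconsistent partition

# A coarse tree (few leaves) over-streams far more than a fine one.
print(len(partitions_to_stream(local, remote, num_leaves=8)))    # 125 partitions
print(len(partitions_to_stream(local, remote, num_leaves=512)))  # 2 partitions
```

With only 8 leaves, the one stale partition drags 124 consistent neighbours along with it; with 512 leaves, only 2 partitions stream. This is also why the over-streaming hurts most on the repaired set, which is the larger one.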