gc_grace_seconds, which defaults to 10 days, is the maximum safe interval between repairs. How much data gets written during that period of time? Will your nodes run out of disk space because of the new data written during that time? If so, it sounds like your nodes are dangerously close to running out of disk space, and you should address that issue before even considering upgrading Cassandra.
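
For a back-of-the-envelope check, something like this (a sketch only; the write rate and space-amplification figures are assumptions, substitute values measured on your own nodes):

    # Rough estimate of the extra disk space accumulated per node if
    # repairs pause for the full gc_grace_seconds window. All inputs
    # below are assumptions -- measure your own cluster instead.
    gc_grace_seconds = 10 * 24 * 3600   # the 10-day default
    write_rate_mib_s = 5.0              # assumption: per-node write throughput
    space_amplification = 1.5           # assumption: flush/compaction overhead

    extra_tib = write_rate_mib_s * gc_grace_seconds * space_amplification / 1024 / 1024
    print(f"~{extra_tib:.1f} TiB of new data per node over the window")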

On 15/02/2024 18:49, Kristijonas Zalys wrote:
Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental repair on a cluster (e.g., during a Cassandra major version upgrade)? My understanding is that if we simply stop running incremental repair, the cluster's nodes can, in the worst case, double in disk size, because the repaired dataset will not get compacted with the unrepaired dataset. Like Sebastian, we have nodes whose disk usage is multiple TiBs, so significant growth could be quite dangerous in our case. Would the only safe choice be to mark all SSTables as unrepaired before stopping regular incremental repair?
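
For concreteness, I imagine the marking step would look roughly like the sketch below (the data path, keyspace and table names are placeholders, and my understanding is that sstablerepairedset has to run while the node is stopped):

    # Sketch: mark all SSTables of one table as unrepaired.
    # Run per node, while the Cassandra node is DOWN.
    # The glob pattern is a placeholder -- adjust to your data directory.
    import glob
    import subprocess

    sstables = glob.glob("/var/lib/cassandra/data/my_ks/my_table-*/*-Data.db")
    subprocess.run(
        ["sstablerepairedset", "--really-set", "--is-unrepaired", *sstables],
        check=True,
    )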

Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <user@cassandra.apache.org> wrote:

    The over-streaming is only problematic for the repaired SSTables,
    but it can be triggered by inconsistencies within the unrepaired
    SSTables during an incremental repair session. This is because
    although an incremental repair will only compare the unrepaired
    SSTables, it will stream both the unrepaired and repaired SSTables
    for the inconsistent token ranges. Keep in mind that the source
    SSTables for streaming are selected based on the token ranges, not
    the repaired/unrepaired state.

    Based on the above, I'm unsure whether running an incremental
    repair before a full repair can fully avoid the over-streaming
    issue.
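
    To illustrate the selection logic (illustrative pseudo-code only,
    not Cassandra's actual implementation): the streaming source is
    every SSTable overlapping the inconsistent token range, regardless
    of its repaired state.

        # Illustrative only -- not Cassandra's real implementation.
        def sstables_to_stream(sstables, bad_lo, bad_hi):
            # Select by token range overlap; the repaired flag is ignored.
            return [s for s in sstables
                    if s["first_token"] <= bad_hi and s["last_token"] >= bad_lo]

        sstables = [
            {"name": "big-repaired", "repaired": True,
             "first_token": 0, "last_token": 900},
            {"name": "small-unrepaired", "repaired": False,
             "first_token": 400, "last_token": 500},
        ]
        # A mismatch found at tokens 450..460 pulls in both SSTables:
        print([s["name"] for s in sstables_to_stream(sstables, 450, 460)])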

    On 07/02/2024 22:41, Sebastian Marsching wrote:
    > Thank you very much for your explanation.
    >
    > Streaming happens on the token range level, not the SSTable
    level, right? So, when running an incremental repair before the
    full repair, the problem that “some unrepaired SSTables are being
    marked as repaired on one node but not on another” should not
    exist any longer. Now this data should be marked as repaired on
    all nodes.
    >
    > Thus, when repairing the SSTables that are marked as repaired,
    this data should be included on all nodes when calculating the
    Merkle trees and no overstreaming should happen.
    >
    > Of course, this means that it is critical to run an incremental
    repair *first*, right after marking SSTables as repaired, and to
    run the full repair only *after* that. I have to admit that
    previously I wasn’t fully aware of how critical this step is.
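    >
    > In nodetool terms, I believe the ordering would be as follows (a
    sketch; as far as I understand, in Cassandra 4.x a plain "nodetool
    repair" is incremental by default and "--full" requests a full
    repair; the keyspace name is a placeholder):
    >
    >     # Sketch of the critical ordering: incremental repair first,
    >     # then the full repair only after it has completed.
    >     import subprocess
    >
    >     subprocess.run(["nodetool", "repair", "my_ks"], check=True)
    >     subprocess.run(["nodetool", "repair", "--full", "my_ks"], check=True)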
    >
    >> Am 07.02.2024 um 20:22 schrieb Bowen Song via user
    <user@cassandra.apache.org>:
    >>
    >> Unfortunately repair doesn't compare each partition
    individually. Instead, it groups multiple partitions together,
    calculates a hash of them, stores the hash in a leaf of a merkle
    tree, and then compares the merkle trees between replicas during a
    repair session. If any one of the partitions covered by a leaf is
    inconsistent between replicas, the hash values of that leaf will
    differ, and all partitions covered by that leaf will need to be
    streamed in full.
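    >>
    >> As a toy illustration of that leaf granularity (not Cassandra's
    actual hashing, just the principle):
    >>
    >>     # Toy example: many partitions hash into one merkle-tree leaf,
    >>     # so a single mismatched partition forces the whole leaf's
    >>     # range to be streamed. Not Cassandra's actual code.
    >>     import hashlib
    >>
    >>     def leaf_hashes(partitions, per_leaf):
    >>         return [hashlib.md5("".join(partitions[i:i + per_leaf]).encode()).hexdigest()
    >>                 for i in range(0, len(partitions), per_leaf)]
    >>
    >>     replica_a = ["p0=x", "p1=y", "p2=z", "p3=w"]
    >>     replica_b = ["p0=x", "p1=DIFFERENT", "p2=z", "p3=w"]
    >>     for i, (a, b) in enumerate(zip(leaf_hashes(replica_a, 2),
    >>                                    leaf_hashes(replica_b, 2))):
    >>         if a != b:
    >>             print(f"leaf {i} differs -> stream every partition in it")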
    >>
    >> Knowing that, and also knowing that your approach can create a
    lot of inconsistencies in the repaired SSTables because some
    unrepaired SSTables are being marked as repaired on one node but
    not on another, you would then understand why over-streaming can
    happen. The over-streaming is only problematic for the repaired
    SSTables, because they are much bigger than the unrepaired ones.
    >>
    >>
    >> On 07/02/2024 17:00, Sebastian Marsching wrote:
    >>>> Caution, using the method you described, the amount of data
    streamed at the end with the full repair is not the amount of data
    written between stopping the first node and the last node, but
    depends on the table size, the number of partitions written, their
    distribution in the ring and the 'repair_session_space' value. If
    the table is large, the writes touch a large number of partitions
    scattered across the token ring, and the value of
    'repair_session_space' is small, you may end up with a very
    expensive over-streaming.
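    >>>>
    >>>> The rough arithmetic behind this warning (all numbers below
    are made-up assumptions): the merkle tree resolution is capped by
    'repair_session_space', so on a large table each leaf covers many
    partitions, and every mismatched leaf streams all of them.
    >>>>
    >>>>     # Made-up numbers, for illustration only.
    >>>>     total_partitions = 1_000_000_000  # assumed table size
    >>>>     leaves = 1_048_576                # assumed tree size under the cap
    >>>>     touched_leaves = 100_000          # assumed scattered writes
    >>>>     avg_partition_kib = 10            # assumed mean partition size
    >>>>
    >>>>     per_leaf = total_partitions / leaves
    >>>>     streamed_gib = touched_leaves * per_leaf * avg_partition_kib / 1024**2
    >>>>     print(f"{per_leaf:.0f} partitions per leaf -> ~{streamed_gib:.0f} GiB streamed")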
    >>> Thanks for the warning. In our case it worked well (obviously
    we tested it on a test cluster before applying it to the
    production clusters), but it is good to know that this might not
    always be the case.
    >>>
    >>> Maybe I misunderstand how full and incremental repairs work in
    C* 4.x. I would appreciate it if you could clarify this for me.
    >>>
    >>> So far, I assumed that a full repair on a cluster that also
    uses incremental repair works pretty much like on a cluster that
    is not using incremental repair at all, the only difference being
    that the repaired and unrepaired data sets are repaired
    separately, so the Merkle trees that are calculated for repaired
    and unrepaired data are completely separate.
    >>>
    >>> I also assumed that incremental repair only looks at
    unrepaired data, which is why it is so fast.
    >>>
    >>> Is either of these two assumptions wrong?
    >>>
    >>> If not, I do not quite understand how a lot of overstreaming
    might happen, as long as (I forgot to mention this step in my
    original e-mail) I run an incremental repair directly after
    restarting the nodes and marking all data as repaired.
    >>>
    >>> I understand that significant overstreaming might happen
    during this first repair (in the worst case streaming all the
    unrepaired data that a node stores), but due to the short amount
    of time between starting to mark data as repaired and running the
    incremental repair, the whole set of unrepaired data should be
    rather small, so this overstreaming should not cause any issues.
    >>>
    >>> From this point on, the unrepaired data on the different
    nodes should be in sync, and discrepancies in the repaired data
    during the full repair should not be bigger than they would have
    been if I had run a full repair without marking any data as
    repaired.
    >>>
    >>> I would really appreciate it if you could point out the hole
    in this reasoning. Maybe I have a fundamentally wrong
    understanding of the repair process, and if I do, I would like to
    correct it.
    >>>
