+1 from me too. Moreover, this work has started as part of the test efforts
and identifying weak points during the 4.0 testing, if I recall correctly.
5.0 sounds like a good place to land. Thank you David and everyone else
involved for your efforts!

On Thu, 21 Sep 2023 at 1:01, Berenguer Blasi <berenguerbl...@gmail.com>
wrote:

> +1 I agree with Brandon. It's more like a bug imo.
> On 20/9/23 21:42, Caleb Rackliffe wrote:
>
> +1 on a 5.0 backport
>
> On Wed, Sep 20, 2023 at 2:26 PM Brandon Williams <dri...@gmail.com> wrote:
>
>> I think it could be argued that not retrying messages is a bug, I am
>> +1 on including this in 5.0.
>>
>> Kind Regards,
>> Brandon
>>
>> On Tue, Sep 19, 2023 at 1:16 PM David Capwell <dcapw...@apple.com> wrote:
>> >
>> > To try to get repair more stable, I added optional retry logic (patch
>> is still in review) to a handful of critical repair verbs.  This patch is
>> disabled by default but allows you to opt-in to retries so ephemeral issues
>> don’t cause a repair to fail after running for a long time (assuming they
>> resolve within the retry window). There are 2 protocol level changes to
>> enable this: VALIDATION_RSP and SYNC_RSP now send an ACK (if the sender
>> doesn’t attach a callback, these ACKs get ignored in all versions; see
>> org.apache.cassandra.net.ResponseVerbHandler#doVerb and
>> Verb.REPAIR_RSP).  Given that we have already forked, I believe we would
>> need to give a waiver to allow this patch due to this change.
>> >
>> > The patch was written on trunk, but figured back porting 5.0 would be
>> rather trivial and this was brought up during the review, so floating this
>> to a wider audience.
>> >
>> > If you look at the patch you will see that it is very large, but this
>> is only to make testing of repair coordination easier and deterministic,
>> the biggest code changes are:
>> >
>> > 1) Moving from ActiveRepairService.instance to
>> ActiveRepairService.instance() (this is the main reason so many files were
>> touched; this was needed so unit tests don’t load the whole world)
>> > 2) Repair no longer reaches into global space and instead is provided
>> the subsystems needed to perform repair; this change is local to repair code
>> >
>> > Both of these changes were only for testing as they allow us to
>> simulate 1k repairs in around 15 seconds with 100% deterministic execution.
>>
>

Reply via email to