Thanks all for the feedback!  The patch has 2 +1s on trunk and back ported to 
5.0, making sure it’s stable now; I plan to merge early this week.

> On Sep 21, 2023, at 2:07 PM, Ekaterina Dimitrova <e.dimitr...@gmail.com> 
> wrote:
> 
> +1 from me too. Moreover, this work has started as part of the test efforts 
> and identifying weak points during the 4.0 testing, if I recall correctly. 
> 5.0 sounds like a good place to land. Thank you David and everyone else 
> involved for your efforts!
> 
> On Thu, 21 Sep 2023 at 1:01, Berenguer Blasi <berenguerbl...@gmail.com 
> <mailto:berenguerbl...@gmail.com>> wrote:
>> +1 I agree with Brandon. It's more like a bug imo.
>> On 20/9/23 21:42, Caleb Rackliffe wrote:
>>> +1 on a 5.0 backport
>>> 
>>> On Wed, Sep 20, 2023 at 2:26 PM Brandon Williams <dri...@gmail.com 
>>> <mailto:dri...@gmail.com>> wrote:
>>>> I think it could be argued that not retrying messages is a bug, I am
>>>> +1 on including this in 5.0.
>>>> 
>>>> Kind Regards,
>>>> Brandon
>>>> 
>>>> On Tue, Sep 19, 2023 at 1:16 PM David Capwell <dcapw...@apple.com 
>>>> <mailto:dcapw...@apple.com>> wrote:
>>>> >
>>>> > To try to get repair more stable, I added optional retry logic (patch is 
>>>> > still in review) to a handful of critical repair verbs.  This patch is 
>>>> > disabled by default but allows you to opt-in to retries so ephemeral 
>>>> > issues don’t cause a repair to fail after running for a long time 
>>>> > (assuming they resolve within the retry window). There are 2 protocol 
>>>> > level changes to enable this: VALIDATION_RSP and SYNC_RSP now send an 
>>>> > ACK (if the sender doesn’t attach a callback, these ACKs get ignored in 
>>>> > all versions; see org.apache.cassandra.net 
>>>> > <http://org.apache.cassandra.net/>.ResponseVerbHandler#doVerb and 
>>>> > Verb.REPAIR_RSP).  Given that we have already forked, I believe we would 
>>>> > need to give a waiver to allow this patch due to this change.
>>>> >
>>>> > The patch was written on trunk, but figured back porting 5.0 would be 
>>>> > rather trivial and this was brought up during the review, so floating 
>>>> > this to a wider audience.
>>>> >
>>>> > If you look at the patch you will see that it is very large, but this is 
>>>> > only to make testing of repair coordination easier and deterministic, 
>>>> > the biggest code changes are:
>>>> >
>>>> > 1) Moving from ActiveRepairService.instance to 
>>>> > ActiveRepairService.instance() (this is the main reason so many files 
>>>> > were touched; this was needed so unit tests don’t load the whole world)
>>>> > 2) Repair no longer reaches into global space and instead is provided 
>>>> > the subsystems needed to perform repair; this change is local to repair 
>>>> > code
>>>> >
>>>> > Both of these changes were only for testing as they allow us to simulate 
>>>> > 1k repairs in around 15 seconds with 100% deterministic execution.

Reply via email to