Re: [DISCUSS] Deprecate and remove resumable bootstrap and decommission

Josh McKenzie Wed, 03 Aug 2022 15:44:09 -0700

> I had to resume the bootstrap once or twice in order to get these nodes 
> to finish joining the cluster.
Was this before or after zero copy streaming was added? The premise is that 
the much faster bootstrap times it brings mitigate the pain point resumable 
bootstrap targets, without the correctness risks.


On Wed, Aug 3, 2022, at 6:21 PM, Bowen Song via dev wrote:
> That would have to be assessed on a case by case basis.
> 
> * When the code doesn't delete data, which means there's a zero probability 
> of resurrecting deleted data, I will still use resumable bootstrap.
> 
> * When resurrected data doesn't pose a problem to the system, it often can 
> still be an acceptable behaviour to save hours or days of bootstrapping time. 
> I may use resumable bootstrap.
> 
> * In other cases, where data correctness is important and there's a chance 
> of resurrecting deleted data, I would certainly not use it if I knew about 
> the risk in advance (which I don't).
> 
> 
> 
> On 03/08/2022 23:11, Jeff Jirsa wrote:
>> The hypothetical concern described is around potential data resurrection - 
>> would you still use resumable bootstrap if you knew that data deleted during 
>> those STW pauses was improperly resurrected? 
>> 
>> On Wed, Aug 3, 2022 at 2:40 PM Bowen Song via dev <dev@cassandra.apache.org> 
>> wrote:
>>> I have benefited from the resumable bootstrap before, and I'm in favour of 
>>> keeping the feature around.
>>> 
>>> I've had streaming failures due to long STW GC pauses on some bootstrapping 
>>> nodes, and I had to resume the bootstrap once or twice in order to get 
>>> these nodes to finish joining the cluster. They have not experienced any 
>>> more long STW GC pauses since they joined the cluster. I would imagine I 
>>> will spend a lot of time tuning the GC parameters in order to get these 
>>> nodes to join if the resumable bootstrapping feature is removed. Also, I'm 
>>> not concerned about race conditions involving repairs, because we don't 
>>> run repairs while we are adding new nodes (to minimize the additional load 
>>> on the cluster).
>>> 
>>> 
>>> 
>>> On 03/08/2022 19:46, Josh McKenzie wrote:
>>>> Context: https://issues.apache.org/jira/browse/CASSANDRA-17679
>>>> 
>>>> From the .yaml comment on the param I was working on adding:
>>>> In certain environments, operators may want to disable resumable bootstrap 
>>>> in order to avoid potential correctness violations or data loss scenarios. 
>>>> Largely this centers around nodes going down during bootstrap, tombstones 
>>>> being written, and potential races with repair. By default we leave this 
>>>> on as it's been enabled for quite some time, however the option to disable 
>>>> it is more palatable now that we have zero copy streaming as that greatly 
>>>> accelerates
>>>> 
>>>> 
>>>> Given zero copy streaming in the system and the general unexplored 
>>>> correctness concerns of 
>>>> https://issues.apache.org/jira/browse/CASSANDRA-8838, specifically pointed 
>>>> out by Jeff here: 
>>>> https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234,
>>>>  I've been chatting w/Paulo about this and we've both concluded we think 
>>>> the functionality should be made configurable, default off (?), deprecated 
>>>> in 4.2 and then completely removed next.
>>>> 
>>>> - First: anyone have any concerns with the general arc of "remove 
>>>> resumable bootstrap and decommission"?
>>>> - Second: Should we leave them enabled by default in 4.2 or disabled?
>>>> - Third: Should we consider revisiting older branches with this 
>>>> functionality and making it toggle-able?
>>>> 
>>>> ~Josh
