I have benefited from the resumable bootstrap before, and I'm in favour of keeping the feature around.

I've had streaming failures due to long STW GC pauses on some bootstrapping nodes, and I had to resume the bootstrap once or twice in order to get these nodes to finish joining the cluster. They have not experienced any more long STW GC pauses since they joined the cluster. I imagine I would spend a lot of time tuning the GC parameters in order to get these nodes to join if the resumable bootstrapping feature were removed. Also, I'm not concerned about race conditions involving repairs, because we don't run repairs while we are adding new nodes (to minimize the additional load on the cluster).


On 03/08/2022 19:46, Josh McKenzie wrote:
Context: https://issues.apache.org/jira/browse/CASSANDRA-17679

From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap in
order to avoid potential correctness violations or data loss scenarios.
Largely this centers around nodes going down during bootstrap, tombstones being
written, and potential races with repair. By default we leave this on as it's
been enabled for quite some time, however the option to disable it is more
palatable now that we have zero copy streaming as that greatly accelerates

Given zero copy streaming in the system and the general unexplored correctness concerns of https://issues.apache.org/jira/browse/CASSANDRA-8838, specifically pointed out by Jeff here: https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234, I've been chatting w/Paulo about this and we've both concluded we think the functionality should be made configurable, default off (?), deprecated in 4.2 and then completely removed next.

- First: anyone have any concerns with the general arc of "remove resumable bootstrap and decommission"?
- Second: Should we leave them enabled by default in 4.2 or disabled?
- Third: Should we consider revisiting older branches with this functionality and making it toggle-able?

~Josh
