I have benefited from the resumable bootstrap before, and I'm in favour
of keeping the feature around.
I've had streaming failures due to long STW GC pauses on some
bootstrapping nodes, and I had to resume the bootstrap once or twice in
order to get these nodes to finish joining the cluster. They have not
experienced any more long STW GC pauses since they joined the cluster. I
would imagine I would spend a lot of time tuning the GC parameters in
order to get these nodes to join if the resumable bootstrapping feature were
removed. Also, I'm not concerned about race conditions involving
repairs, because we don't run repairs while we are adding new nodes (to
minimize the additional load on the cluster).
On 03/08/2022 19:46, Josh McKenzie wrote:
Context: https://issues.apache.org/jira/browse/CASSANDRA-17679
From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap in
order to avoid potential correctness violations or data loss scenarios.
Largely this centers around nodes going down during bootstrap, tombstones being
written, and potential races with repair. By default we leave this on as it's
been enabled for quite some time, however the option to disable it is more
palatable now that we have zero copy streaming as that greatly accelerates
Given zero copy streaming in the system and the general unexplored
correctness concerns of
https://issues.apache.org/jira/browse/CASSANDRA-8838, specifically
pointed out by Jeff here:
<https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234>, I've
been chatting w/Paulo about this and we've both concluded we think the
functionality should be made configurable, default off (?), deprecated
in 4.2 and then completely removed next.
- First: anyone have any concerns with the general arc of "remove
resumable bootstrap and decommission"?
- Second: Should we leave them enabled by default in 4.2 or disabled?
- Third: Should we consider revisiting older branches with this
functionality and making it toggle-able?
~Josh