chewbranca commented on code in PR #5491:
URL: https://github.com/apache/couchdb/pull/5491#discussion_r2181359146
##########
src/couch_stats/src/couch_stats_sup.erl:
##########
@@ -20,15 +20,25 @@
]).
-define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).
+-define(CSRT_SUP, couch_stats_csrt_sup).
start_link() ->
supervisor:start_link({local, ?MODULE}, ?MODULE, []).
init([]) ->
+ %% Use a dedicated CSRT supervisor with restart strategy set to transient
+ %% so that if a CSRT failure arrises that triggers the sup rate limiter
+ %% thresholds, that shutdown signal will bubble up here and be ignored,
+ %% as the use of transient specifies that `normal` and `shutdown` signals
+ %% are ignored.
+ %% Switch this to `permanent` once CSRT is out of experimental stage.
+ CSRTSup =
+ {?CSRT_SUP, {?CSRT_SUP, start_link, []}, transient, infinity,
supervisor, [?CSRT_SUP]},
Review Comment:
I put some docs about it in the declaration there:
```
%% Use a dedicated CSRT supervisor with restart strategy set to transient
%% so that if a CSRT failure arrises that triggers the sup rate limiter
%% thresholds, that shutdown signal will bubble up here and be ignored,
%% as the use of transient specifies that `normal` and `shutdown` signals
%% are ignored.
%% Switch this to `permanent` once CSRT is out of experimental stage.
```
But over in the Erlang docs
https://www.erlang.org/doc/system/sup_princ.html#child-specification for Child
Specs we have for `restart=transient`:
> A transient child process is restarted only if it terminates abnormally,
that is, with an exit reason other than normal, shutdown, or {shutdown,Term}.
We create a dedicated `couch_stats_csrt_sup` supervisor marked as
`transient` for the CSRT servers so that those gen_server pids are restarted
as you'd expect on error crashes, but in the event we hit the defined "Maximum
Restart Intensity" (
https://www.erlang.org/doc/system/sup_princ.html#maximum-restart-intensity ) on
the CSRT server pids, that triggers a shutdown on the `couch_stats_csrt_sup`
pid as declared in the link by:
> If more than `MaxR` number of restarts occur in the last `MaxT` seconds,
the supervisor terminates all the child processes and then itself. The
termination reason for the supervisor itself in that case will be `shutdown`.
And because the `couch_stats_csrt_sup` supervisor is `transient`, we tell
the primary `couch_stats_sup` to ignore any `shutdown` failures arising from
restart intensity overload.
The key idea here is that in the past we've seen threshold based
overload/error scenarios where the problem only manifests once certain data
based conditions are met, but once those conditions have been met it can result
in a crash loop disabling the system, which can be difficult to unwind from and
often requires a hotpatch to restore proper service. Using a `transient`
supervisor in this manner makes the core CouchDB database operations robust to
isolated failures in experimental systems.
NOTE: I want to highlight that the use of the `transient` supervisor
approach here is not targeted at this particular PR, but rather, IMO, this is
how we should be enabling _all_ new experimental features until they've proven
resilient in a wide variety of user workloads and deployment scenarios. IMO
this `transient` approach on a dedicated supervisor should be the default
approach used when adding new functionality, like CSRT, until the config
toggles are switched to on by default, at which point the use of `transient`
can be removed.
The core goal being to get bug reports of the form "so hey, I noticed this
subsystem stopped generating data/logs/etc, here's a crash report from the
logs", versus "after XYZ happened we started hitting a crash when querying $foo
API operations, PLS FIX ASAP!!".
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]