chewbranca commented on code in PR #5491:
URL: https://github.com/apache/couchdb/pull/5491#discussion_r2181359146


##########
src/couch_stats/src/couch_stats_sup.erl:
##########
@@ -20,15 +20,25 @@
 ]).
 
 -define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).
+-define(CSRT_SUP, couch_stats_csrt_sup).
 
 start_link() ->
     supervisor:start_link({local, ?MODULE}, ?MODULE, []).
 
 init([]) ->
+    %% Use a dedicated CSRT supervisor with restart strategy set to transient
+    %% so that if a CSRT failure arrises that triggers the sup rate limiter
+    %% thresholds, that shutdown signal will bubble up here and be ignored,
+    %% as the use of transient specifies that `normal` and `shutdown` signals
+    %% are ignored.
+    %% Switch this to `permanent` once CSRT is out of experimental stage.
+    CSRTSup =
+        {?CSRT_SUP, {?CSRT_SUP, start_link, []}, transient, infinity, 
supervisor, [?CSRT_SUP]},

Review Comment:
   I put some docs about it in the declaration there:
   
   ```
       %% Use a dedicated CSRT supervisor with restart strategy set to transient
       %% so that if a CSRT failure arrises that triggers the sup rate limiter
       %% thresholds, that shutdown signal will bubble up here and be ignored,
       %% as the use of transient specifies that `normal` and `shutdown` signals
       %% are ignored.
       %% Switch this to `permanent` once CSRT is out of experimental stage.
   ```
   
   But over in the Erlang docs 
https://www.erlang.org/doc/system/sup_princ.html#child-specification for Child 
Specs we have for `restart=transient`:
   
   > A transient child process is restarted only if it terminates abnormally, 
that is, with an exit reason other than normal, shutdown, or {shutdown,Term}.
   
   We create a dedicated `couch_stats_csrt_sup` supervisor marked as 
`transient` for the CSRT servers so that those gen_server pids are restarted  
as you'd expect on error crashes, but in the event we hit the defined "Maximum 
Restart Intensity" ( 
https://www.erlang.org/doc/system/sup_princ.html#maximum-restart-intensity ) on 
the CSRT server pids, that triggers a shutdown on the `couch_stats_csrt_sup` 
pid as declared in the link by:
   
   > If more than `MaxR` number of restarts occur in the last `MaxT` seconds, 
the supervisor terminates all the child processes and then itself. The 
termination reason for the supervisor itself in that case will be `shutdown`.
   
   And because the `couch_stats_csrt_sup` supervisor is `transient`, we tell 
the primary `couch_stats_sup` to ignore any `shutdown` failures arising from 
restart intensity overload.
   
   The key idea here is that in the past we've seen threshold based 
overload/error scenarios where the problem only manifests once certain data 
based conditions are met, but once those conditions have been met it can result 
in a crash loop disabling the system, which can be difficult to unwind from and 
often requires a hotpatch to restore proper service. Using a `transient` 
supervisor in this manner makes the core CouchDB database operations robust to 
isolated failures in experimental systems.
   
   NOTE: I want to highlight that the use of the `transient` supervisor 
approach here is not targeted at this particular PR, but rather, IMO, this is 
how we should be enabling _all_ new experimental features until they've proven 
resilient in a wide variety of user workloads and deployment scenarios. IMO 
this `transient` approach on a dedicated supervisor should be the default 
approach used when adding new functionality, like CSRT, until the config 
toggles are switched to on by default, at which point the use of `transient` 
can be removed.
   
   The core goal being to get bug reports of the form "so hey, I noticed this 
subsystem stopped generating data/logs/etc, here's a crash report from the 
logs", versus "after XYZ happened we started hitting a crash when querying $foo 
API operations, PLS FIX ASAP!!".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to