[GitHub] [flink-kubernetes-operator] mxm commented on a diff in pull request #549: [FLINK-31502] Limit the number of scale operations to reduce cluster churn

via GitHub Mon, 20 Mar 2023 02:37:19 -0700


mxm commented on code in PR #549:
URL: https://github.com/apache/flink-kubernetes-operator/pull/549#discussion_r1141846057



##########
docs/layouts/shortcodes/generated/dynamic_section.html:
##########
@@ -86,6 +86,12 @@
             <td>Duration</td>
             <td>Interval at which periodic savepoints will be triggered. The 
triggering schedule is not guaranteed, savepoints will be triggered as part of 
the regular reconcile loop.</td>
         </tr>
+        <tr>
+            <td><h5>kubernetes.operator.rescaling.cluster-cooldown</h5></td>
+            <td style="word-wrap: break-word;">1 min</td>
+            <td>Duration</td>

Review Comment:
   This one is tricky. I would rather be conservative here, because 
concurrent scaling operations can create a lot of unexpected churn in the 
cluster. Users have seen costs increase from resource spikes, which can 
paralyze the entire cluster and lead to more k8s nodes being added (if the 
cluster autoscaler is active). Users can always change this default. 
   
   Beyond this, we could add something like a concurrency parameter that 
defines how many pipelines are allowed to scale before the cooldown applies. 
Additionally, checking the actual allocatable resources before scaling is 
something we need to look into.
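
   To make the concurrency idea concrete, here is a minimal sketch of a 
cluster-wide gate that lets a bounded number of pipelines scale per cooldown 
window. It is not the code in this PR: the class `ClusterCooldownGate`, the 
method `tryAcquire`, and the `maxScalingsPerWindow` parameter are hypothetical 
names chosen for illustration, and the fixed-window counting is just one 
possible policy.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical helper (not part of the operator): permits at most
 * maxScalingsPerWindow scale operations per cooldown window, cluster-wide.
 */
final class ClusterCooldownGate {
    private final Duration cooldown;
    private final int maxScalingsPerWindow;
    private Instant windowStart;   // null until the first scale operation
    private int scalingsInWindow;

    ClusterCooldownGate(Duration cooldown, int maxScalingsPerWindow) {
        if (cooldown.isNegative() || maxScalingsPerWindow < 1) {
            throw new IllegalArgumentException("invalid cooldown settings");
        }
        this.cooldown = cooldown;
        this.maxScalingsPerWindow = maxScalingsPerWindow;
    }

    /** Returns true if a pipeline may scale at the given time, and records it. */
    synchronized boolean tryAcquire(Instant now) {
        // Open a new window once the cooldown since the window start has elapsed.
        if (windowStart == null || !now.isBefore(windowStart.plus(cooldown))) {
            windowStart = now;
            scalingsInWindow = 0;
        }
        if (scalingsInWindow >= maxScalingsPerWindow) {
            return false; // still cooling down, defer to a later reconcile loop
        }
        scalingsInWindow++;
        return true;
    }
}
```

   With `maxScalingsPerWindow = 1` this behaves like a plain cluster cooldown; 
larger values trade more churn for faster convergence. A rejected pipeline 
would simply retry on a later reconcile loop. The sketch does not check 
allocatable resources, which would need a separate lookup.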



