The GitHub Actions job "CI" on flink-kubernetes-operator.git has failed.
Run started by GitHub user mxm (triggered by mxm).

Head commit for run:
97d7a9175f67972cd11d7d4ac6de9f431fddc1da / Max Michels <max_mich...@apple.com>
[FLINK-33771] Add cluster capacity awareness to autoscaler

To avoid starvation of pipelines when the Kubernetes cluster runs out of
resources, new scaling attempts should be stopped when no more pods can be
scheduled for rescaling.

While Flink's ResourceRequirement API can prevent some of these cases, it
requires using Flink 1.18 and an entirely different Flink scheduler. Extensive
testing still has to be done with the new scheduler and the rescaling
behavior. We would also hand off control over the rescale time to Flink, which
uses various parameters to control the exact scaling behavior.

For the config-based parallelism overrides, we have pretty good heuristics in
the operator to check in Kubernetes for the approximate amount of free cluster
resources, the maximum cluster scale-up allowed by the Cluster Autoscaler, and
the required scaling costs. Having cluster resource information will also make
it possible to implement fairness between all the autoscaled pipelines.

This PR adds ClusterResourceManager, which provides a view of the allocatable
resources within a Kubernetes cluster and allows simulating the scheduling of
pods with a defined set of required resources.
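The core idea can be sketched as a first-fit simulation against a snapshot of
per-node free capacity. This is a minimal illustration only; the class and
method names below are hypothetical and do not reflect the actual
ClusterResourceManager API in the PR.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a simulated pod-scheduling check.
 * Names and structure are illustrative, not the real operator API.
 */
public class SchedulingSimulation {

    /** Free capacity of a single node, tracked for CPU (cores) and memory (bytes). */
    static class Node {
        double freeCpu;
        long freeMemBytes;

        Node(double freeCpu, long freeMemBytes) {
            this.freeCpu = freeCpu;
            this.freeMemBytes = freeMemBytes;
        }
    }

    /**
     * First-fit simulation: try to place numPods pods, each requiring
     * (podCpu, podMemBytes), onto the snapshot of node free resources.
     * Returns false if any pod cannot be placed.
     */
    static boolean canSchedule(List<Node> nodes, int numPods, double podCpu, long podMemBytes) {
        // Work on a copy so the snapshot can be reused for other simulations.
        List<Node> scratch = new ArrayList<>();
        for (Node n : nodes) {
            scratch.add(new Node(n.freeCpu, n.freeMemBytes));
        }
        for (int i = 0; i < numPods; i++) {
            boolean placed = false;
            for (Node n : scratch) {
                if (n.freeCpu >= podCpu && n.freeMemBytes >= podMemBytes) {
                    n.freeCpu -= podCpu;
                    n.freeMemBytes -= podMemBytes;
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                return false; // not enough free capacity; veto the scaling attempt
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node(4.0, 8L << 30)); // 4 cores, 8 GiB free
        nodes.add(new Node(2.0, 4L << 30)); // 2 cores, 4 GiB free
        // Three pods of 2 cores / 3 GiB fit (two on the first node, one on the second)...
        System.out.println(canSchedule(nodes, 3, 2.0, 3L << 30)); // true
        // ...but four do not.
        System.out.println(canSchedule(nodes, 4, 2.0, 3L << 30)); // false
    }
}
```

Note this mirrors the heuristic nature described above: a first-fit pass over a
point-in-time snapshot, not a faithful reimplementation of the Kubernetes
scheduler.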

The goal is to provide a good indicator for whether the resources needed for
autoscaling are going to be available. This is achieved by pulling the node
resource usage from the Kubernetes cluster at a regular, configurable interval,
after which we use this data to simulate adding / removing
resources (pods). Note that this is merely a (pretty good) heuristic because the
Kubernetes scheduler has the final say. However, it prevents 99% of the
scenarios after pipeline outages which can lead to massive scale-up, where all
pipelines may be scaled up at the same time and exhaust the available
resources.

The simulation can run on a fixed set of Kubernetes nodes. Additionally, if we
detect that the cluster is using the Kubernetes Cluster Autoscaler, we will use
this data to extrapolate the number of nodes up to the maximum number of nodes
defined in the autoscaler configuration. We currently track CPU and memory.
Ephemeral storage is missing because there is no easy way to get node
statistics on free storage.
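The extrapolation step amounts to simple arithmetic: nodes the Cluster
Autoscaler could still add count as entirely free capacity. A minimal sketch,
assuming an average per-node capacity (the method and parameter names are
illustrative, not the PR's actual API):

```java
/**
 * Hypothetical sketch of node-count extrapolation when the Kubernetes
 * Cluster Autoscaler is detected. Names are illustrative only.
 */
public class NodeExtrapolation {

    /**
     * Assume the Cluster Autoscaler can add nodes of averageNodeCpu capacity
     * until maxNodes is reached; new nodes are empty, so their full capacity
     * counts as free.
     */
    static double extrapolatedFreeCpu(double currentFreeCpu, int currentNodes,
                                      int maxNodes, double averageNodeCpu) {
        int extraNodes = Math.max(0, maxNodes - currentNodes);
        return currentFreeCpu + extraNodes * averageNodeCpu;
    }

    public static void main(String[] args) {
        // 3 nodes with 2 free cores in total; the autoscaler may grow the
        // cluster to 5 nodes of 4 cores each -> 2 + 2 * 4 = 10 free cores.
        System.out.println(extrapolatedFreeCpu(2.0, 3, 5, 4.0)); // 10.0
    }
}
```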

Report URL: 
https://github.com/apache/flink-kubernetes-operator/actions/runs/7452407351

With regards,
GitHub Actions via GitBox
