Hi Everyone,

We are using the TableStream API of Flink v1.14.3 with Kubernetes HA
enabled, together with the Flink Kubernetes Operator v1.6. One of our jobs,
which had been running stably for a long time, started restarting
frequently. On closer inspection, we observed that its container memory
usage had exceeded 90%. As a rule of thumb, we had allocated 30% more
memory to the JobManager.
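
For context, the JobManager resources are set through the operator's
FlinkDeployment resource, roughly along the lines below (the values are
illustrative, not our exact settings):

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  spec:
    flinkVersion: v1_14
    jobManager:
      resource:
        memory: "2048m"   # container memory, sized ~30% above expected usage
        cpu: 1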

We would like to identify and address these kinds of issues proactively. To
that end, we plan to set up an alert based on Kubernetes metrics that
triggers when JobManager memory usage exceeds 90%.
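
To make the idea concrete, here is a rough sketch of the kind of Prometheus
rule we have in mind; the metric names come from cAdvisor and
kube-state-metrics, while the namespace, pod pattern, and container name
are placeholders for our own deployment:

  - alert: FlinkJobManagerMemoryHigh
    expr: |
      (
        container_memory_working_set_bytes{namespace="flink", pod=~"my-job-.*", container="flink-main-container"}
          / on (namespace, pod, container)
        kube_pod_container_resource_limits{resource="memory"}
      ) > 0.90
    for: 5m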

Our questions are:
a) Which Flink metrics can be used to identify this issue?
b) How can this condition be surfaced in the logs?

Steps we took to understand the issue:

1. The JobManager log never mentioned insufficient memory around the
restarts; all we saw was that the heartbeat response from the
ResourceManager was missing.
2. The pod was not evicted by Kubernetes and did not show an OOMKilled
status.
3. The JobManager shut down all of its components gracefully.
4. Flink Kubernetes Operator logs did not contain any related entries.

We confirmed the diagnosis by setting up a duplicate job with 90% of the
original resources, which caused its JobManager to restart frequently.

If this issue has been fixed in a later version, please point us to the
relevant change.

Thank you.
