Hi everyone,

We are using the TableStream API of Flink v1.14.3 with Kubernetes HA enabled, along with Flink Kubernetes Operator v1.6. One of our jobs, which had been running stably for a long time, started restarting frequently. On closer inspection, we observed that the container's memory usage exceeded 90%. As a rule of thumb, we had allocated 30% extra memory headroom for the JobManager.
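For context, the headroom mentioned above is set through the JobManager memory options in flink-conf.yaml. A minimal sketch, assuming a hypothetical observed peak of roughly 2 GB (the sizes below are illustrative, not our actual values):

```yaml
# Hypothetical sizing: observed peak ~2 GB, plus ~30% headroom.
jobmanager.memory.process.size: 2600m
# Optionally pin metaspace so off-heap growth does not silently
# push total process memory past the container limit.
jobmanager.memory.jvm-metaspace.size: 256m
```

Note that `jobmanager.memory.process.size` caps the *total* process memory (heap plus off-heap), which is what the container limit ultimately enforces.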
We would like to proactively identify and address this kind of issue. To that end, we plan to set up an alert on Kubernetes metrics that triggers when JobManager memory usage exceeds 90%. My questions are:

a) Which Flink metrics can be used to identify this issue?
b) How does this issue present itself in the logs?

Steps we took to understand the issue:

1. Nowhere in the JobManager log was insufficient memory mentioned when it restarted. All we saw was that heartbeat responses from the ResourceManager were missing.
2. Kubernetes did not evict the pod with an OOMKilled error.
3. The JobManager shut down all of its components gracefully.
4. The Flink Kubernetes Operator logs did not contain any related entries.

We confirmed the diagnosis by setting up a duplicate job with 90% of the resources, which caused its JobManager to restart frequently. If this issue has been fixed in a later version, please point it out.

Thank you.
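Regarding question (a), the JVM memory metrics that Flink exposes on the JobManager (e.g. `Status.JVM.Memory.Heap.Used` and `Status.JVM.Memory.Heap.Max`) can be polled via the REST endpoint `/jobmanager/metrics`. A minimal sketch of such a check; the REST address is a placeholder, and `heap_usage_ratio` is a helper name of my own, not a Flink API:

```python
import json
from urllib.request import urlopen

# Placeholder address; point this at your JobManager's REST endpoint.
FLINK_REST = "http://localhost:8081"
METRICS = "Status.JVM.Memory.Heap.Used,Status.JVM.Memory.Heap.Max"


def heap_usage_ratio(metrics):
    """Compute used/max heap from a Flink metrics response: a list of
    {"id": ..., "value": ...} entries, with values encoded as strings."""
    values = {m["id"]: float(m["value"]) for m in metrics}
    return (values["Status.JVM.Memory.Heap.Used"]
            / values["Status.JVM.Memory.Heap.Max"])


def fetch_ratio():
    url = f"{FLINK_REST}/jobmanager/metrics?get={METRICS}"
    with urlopen(url) as resp:
        return heap_usage_ratio(json.load(resp))


if __name__ == "__main__":
    ratio = fetch_ratio()
    if ratio > 0.9:
        print(f"ALERT: JobManager heap at {ratio:.0%}")
```

One caveat worth noting: the heap metric alone may look healthy while the container limit is still breached, because the container counts total process memory. It may therefore be worth also watching `Status.JVM.Memory.Metaspace.Used` and the container-level `container_memory_working_set_bytes` metric from cAdvisor.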