Hi Xintong Song,

Correct, we are using standalone k8s. The task managers are deployed as a StatefulSet, so they have consistent pod names. We tried native k8s (in fact I'd prefer it) but got persistent "io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 242214695 (242413759)" errors, which resulted in jobs being restarted every 30-60 minutes.

We are using Prometheus Node Exporter to capture memory usage. The graph shows the metric:

    sum(container_memory_usage_bytes{container_name="taskmanager",pod_name=~"$flink_task_manager"}) by (pod_name)

I've attached the original <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2869/Screenshot_2021-02-02_at_11.png> so Nabble doesn't shrink it.

Best regards,
Randal.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/