Hi, I am trying to understand the following behavior in our Flink application cluster. Any assistance would be appreciated.
We are running a Flink application cluster with 5 task managers, each with the following configuration:

- jobManagerMemory: 12g
- taskManagerMemory: 20g
- taskManagerMemoryHeapSize: 12g
- taskManagerMemoryNetworkMax: 4g
- taskManagerMemoryNetworkMin: 1g
- taskManagerMemoryManagedSize: 50m
- taskManagerMemoryOffHeapSize: 2g
- taskManagerMemoryNetworkFraction: 0.2
- taskManagerNetworkMemorySegmentSize: 4mb
- taskManagerMemoryFloatingBuffersPerGate: 64
- taskmanager.memory.jvm-overhead.min: 256mb
- taskmanager.memory.jvm-overhead.max: 2g
- taskmanager.memory.jvm-overhead.fraction: 0.1

Our pipeline includes stateful transformations, and we verify that we clear state once it is no longer needed. In the Flink UI we can see the heap usage rise and fall over the job's lifecycle. However, there is a noticeable delay between clearing the state and the corresponding drop in heap usage, which I assume is related to garbage-collection frequency.

What puzzles us is the task manager pod's memory usage: it increases intermittently and is not released. We checked the various state metrics and confirmed they change according to our logic. Moreover, if some state were never released, I would also expect the heap usage to grow steadily, which it does not.

Any insights or ideas?

Thanks,
Sigalit
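P.S. For context: the camelCase names above come from our deployment values. As I understand the mapping (please correct me if any of these are off), the equivalent flink-conf.yaml keys would be roughly:

```yaml
# Assumed flink-conf.yaml equivalents of the deployment values above
jobmanager.memory.process.size: 12g
taskmanager.memory.process.size: 20g
taskmanager.memory.task.heap.size: 12g
taskmanager.memory.network.max: 4g
taskmanager.memory.network.min: 1g
taskmanager.memory.managed.size: 50m
taskmanager.memory.task.off-heap.size: 2g
taskmanager.memory.network.fraction: 0.2
taskmanager.memory.segment-size: 4mb
taskmanager.network.memory.floating-buffers-per-gate: 64
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 2g
taskmanager.memory.jvm-overhead.fraction: 0.1
```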