Hi,

Where are you storing the state? If it is on the heap (the default HashMap state backend), try RocksDB, which keeps working state off-heap and on local disk instead of in the JVM heap.
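A minimal sketch of switching the backend, assuming a Flink 1.13+ DataStream job (the class name and checkpoint path are placeholders; the same can be done via state.backend: rocksdb in flink-conf.yaml):

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Keep working state in RocksDB (native memory + local disk) instead of the JVM heap.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
        // Checkpoints still need a durable location; this path is a placeholder.
        env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
        // ... build the pipeline here, then:
        // env.execute("my-job");
    }
}

One caveat for your configuration: RocksDB takes its memory from Flink's managed memory, so taskManagerMemoryManagedSize: 50m would need to be raised substantially for it to be usable.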
On Thu, 23 May 2024 at 6:19 PM, Sigalit Eliazov <e.siga...@gmail.com> wrote:

> Hi,
>
> I am trying to understand the following behavior in our Flink application
> cluster. Any assistance would be appreciated.
>
> We are running a Flink application cluster with 5 task managers, each with
> the following configuration:
>
> - jobManagerMemory: 12g
> - taskManagerMemory: 20g
> - taskManagerMemoryHeapSize: 12g
> - taskManagerMemoryNetworkMax: 4g
> - taskManagerMemoryNetworkMin: 1g
> - taskManagerMemoryManagedSize: 50m
> - taskManagerMemoryOffHeapSize: 2g
> - taskManagerMemoryNetworkFraction: 0.2
> - taskManagerNetworkMemorySegmentSize: 4mb
> - taskManagerMemoryFloatingBuffersPerGate: 64
> - taskmanager.memory.jvm-overhead.min: 256mb
> - taskmanager.memory.jvm-overhead.max: 2g
> - taskmanager.memory.jvm-overhead.fraction: 0.1
>
> Our pipeline includes stateful transformations, and we are verifying that
> we clear the state once it is no longer needed.
>
> Through the Flink UI, we observe that the heap size increases and
> decreases during the job lifecycle.
>
> However, there is a noticeable delay between clearing the state and the
> reduction in heap size usage, which I assume is related to the garbage
> collector frequency.
>
> What is puzzling is the task manager pod memory usage. It appears that the
> memory usage increases intermittently and is not released. We verified the
> different state metrics and confirmed they are changing according to the
> logic.
>
> Additionally, if we had a state that was never released, I would expect to
> see the heap size increasing constantly as well.
>
> Any insights or ideas?
>
> Thanks,
>
> Sigalit
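On the pod memory question: even after state is cleared, the heap only shrinks when the garbage collector runs, and the JVM rarely returns freed pages to the operating system, which would explain the pod RSS staying high while your state metrics drop. If you stay on the heap backend, state TTL can also bound growth independently of manual clears. A minimal sketch, assuming a keyed operator (the class name, state name, and one-hour TTL are all illustrative):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical dedup-style operator, for illustration only.
public class DedupFunction extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Boolean.class);
        // Let Flink expire entries automatically instead of relying only on manual clear().
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        desc.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(value);
        }
        // seen.clear() can still be called explicitly once a key is finished.
    }
}

Thanks,
Sachin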