Hi Marco,

When you say OOM, I assume you mean the TM pod is being OOMKilled, is that 
correct? If so, this usually means that the TM is using more memory than is 
actually allocated to the pod. First I would check your memory configuration to 
figure out where this extra memory use is coming from. This is a non-trivial 
task, so I’ll list some common situations I’ve seen in the past to get you started.


  *   Misconfigured process memory. The Flink option 
`taskmanager.memory.process.size` sets the total memory of the TM process, which 
Flink then breaks down into smaller buckets. If this is higher than the 
memory resource of the container, the pod will get OOMKilled.
  *   User code has a memory leak (e.g. spins up too many threads). It would be 
useful to run your Flink job on a local cluster and monitor the memory 
use.
  *   State backend (if you use RocksDB) using too much memory.
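
For the first point, a quick sanity check is to compare the configured 
process size against the container memory limit. The following is only a 
minimal sketch: the function names (`flink_bytes`, `k8s_bytes`, 
`headroom_bytes`) are made up for illustration, it handles only whole-number 
sizes with the most common unit suffixes, and the values passed in are 
assumptions rather than your actual settings.

```python
import re

# Flink memory sizes treat every unit as a binary multiple, so "4g" means 4 GiB.
FLINK_UNITS = {"": 1, "b": 1, "k": 1024, "kb": 1024, "m": 1024**2, "mb": 1024**2,
               "g": 1024**3, "gb": 1024**3, "t": 1024**4, "tb": 1024**4}

# Kubernetes quantities are case-sensitive: "Gi" is binary, "G" is decimal.
K8S_UNITS = {"": 1, "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
             "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def _to_bytes(text, units, lowercase_unit):
    """Parse '<integer><unit>' into bytes using the given unit table."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    number, unit = match.groups()
    if lowercase_unit:
        unit = unit.lower()
    if unit not in units:
        raise ValueError(f"unknown unit in {text!r}")
    return int(number) * units[unit]


def flink_bytes(text):
    """Bytes for a Flink size such as taskmanager.memory.process.size."""
    return _to_bytes(text, FLINK_UNITS, lowercase_unit=True)


def k8s_bytes(text):
    """Bytes for a Kubernetes memory quantity such as resources.limits.memory."""
    return _to_bytes(text, K8S_UNITS, lowercase_unit=False)


def headroom_bytes(process_size, container_limit):
    """Container limit minus Flink process size; negative means misconfigured."""
    return k8s_bytes(container_limit) - flink_bytes(process_size)


if __name__ == "__main__":
    print(headroom_bytes("4096m", "4Gi"))  # 0
    print(headroom_bytes("4g", "4G"))      # -294967296
```

The second call shows an easy mistake to make: `4g` in Flink is 4 GiB, while 
`4G` in a Kubernetes limit is 4 * 10^9 bytes, so the TM is allowed to use more 
than the pod can hold.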

You can also look at [1] and [2] for more information.

Regards,
Hong

[1] Talk on Flink memory utilisation https://www.youtube.com/watch?v=F5yKSznkls8
[2] Flink description of TM memory breakdown 
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup_tm/


From: marco andreas <marcoandreas...@gmail.com>
Date: Wednesday, 25 January 2023 at 19:57
To: user <user@flink.apache.org>
Subject: [EXTERNAL] OOM taskmanager


Hello,

We are deploying a Flink application cluster in Kubernetes with 2 pods, one for 
the JM and the other for the TM.

The problem is that when we launch load tests, the task manager memory usage 
increases. After the tests are finished and Flink stops processing data, the 
memory usage never comes back down to where it was before. When we launch 
tests again and again, the memory of the TM continues to grow until it reaches 
the memory resource limit specified in the container templates, and it gets 
killed because of OOM.


Has anyone faced the same issue, and what is the best way to investigate this 
error to find the root cause of why the memory usage of the TM never 
comes down when Flink finishes processing?

Flink version is 1.16.0.
Thanks,
