Hi!

Heap memory usage depends heavily on your job and your state backend. Which
state backend are you using, and if possible, could you share your user code
or explain what operations your job performs?
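If you happen to be on the RocksDB state backend, one thing worth checking is whether native memory spikes beyond the managed-memory budget while a snapshot is being written; a first troubleshooting step could be to give the JVM overhead more headroom. A sketch of the relevant `flink-conf.yaml` options (the values below are illustrative guesses for your 10 GB process size, not a recommendation):

```yaml
# Enlarge the cushion for native allocations outside Flink's accounting
# (RocksDB internals, glibc arenas, JVM threads, etc.).
taskmanager.memory.jvm-overhead.min: 2gb
taskmanager.memory.jvm-overhead.max: 3gb

# Keep RocksDB's block cache / memtables bounded by managed memory
# (this is the default in 1.13, but worth confirming it wasn't disabled).
state.backend.rocksdb.memory.managed: true
```

You could also compare the container's RSS against the configured total process memory around savepoint time (e.g. via `kubectl top pod` or cgroup metrics) to see which component is actually growing.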

Alexey Trenikhun <yen...@msn.com> wrote on Tue, Feb 15, 2022, at 05:17:

> Hello,
> We run a Flink 1.13.5 job in application mode in Kubernetes, 1 JM and 1 TM. We
> also have a Kubernetes cron job which takes a savepoint every 2 hours (14 */2 * * *).
> Once in a while (~1 per 2 days) the TM is OOMKilled; suspiciously, it happens on
> even hours, ~4 minutes after the savepoint starts (e.g. 12:18, 4:18), but I don't
> see failed savepoints, so I assume the OOM happens right after the savepoint is
> taken. However, OOMKilled doesn't happen on every savepoint, so maybe this is
> a random correlation.
> I've reserved 2G for JVM overhead, but somehow it is not enough? Any known
> issues with memory and savepoints? Any suggestions on how to troubleshoot this?
>
>  Final TaskExecutor Memory configuration:
>    Total Process Memory:          10.000gb (10737418240 bytes)
>      Total Flink Memory:          7.547gb (8103395328 bytes)
>        Total JVM Heap Memory:     3.523gb (3783262149 bytes)
>          Framework:               128.000mb (134217728 bytes)
>          Task:                    3.398gb (3649044421 bytes)
>        Total Off-heap Memory:     4.023gb (4320133179 bytes)
>          Managed:                 3.019gb (3241358179 bytes)
>          Total JVM Direct Memory: 1.005gb (1078775000 bytes)
>            Framework:             128.000mb (134217728 bytes)
>            Task:                  128.000mb (134217728 bytes)
>            Network:               772.800mb (810339544 bytes)
>      JVM Metaspace:               256.000mb (268435456 bytes)
>      JVM Overhead:                2.203gb (2365587456 bytes)
>
> Thanks,
> Alexey
>
