[
https://issues.apache.org/jira/browse/FLINK-38212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grzegorz Liter updated FLINK-38212:
-----------------------------------
Description:
I am running a job whose snapshot size is about 17 GB, with snapshot
compression enabled. Savepoints often fail because the TM gets killed by
Kubernetes for exceeding the memory limit of the pod, which had a 30 GB
memory limit assigned.
Neither Flink metrics nor detailed VM metrics taken with
`jcmd <PID> VM.native_memory detail` indicate any unusual memory increase.
The consumed memory is visible only in Kubernetes metrics and RSS.
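One way to put a number on this gap is to subtract the NMT committed total from the process RSS. The following is only a minimal sketch, not the procedure used for this report. It assumes the Linux `/proc/<PID>/status` format (a `VmRSS: ... kB` line) and the `Total: reserved=...KB, committed=...KB` summary line that `jcmd <PID> VM.native_memory summary` prints at its default KB scale. The function names and the sample numbers are hypothetical.

```python
import re


def parse_rss_kb(proc_status_text: str) -> int:
    """Return VmRSS in kB from the text of /proc/<PID>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", proc_status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def parse_nmt_committed_kb(nmt_text: str) -> int:
    """Return the committed total in KB from NMT output (default KB scale assumed)."""
    match = re.search(
        r"^\s*Total: reserved=\d+KB, committed=(\d+)KB", nmt_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("NMT 'Total:' line not found (different scale or format?)")
    return int(match.group(1))


def untracked_kb(proc_status_text: str, nmt_text: str) -> int:
    """Rough estimate of resident memory that the JVM does not account for."""
    return parse_rss_kb(proc_status_text) - parse_nmt_committed_kb(nmt_text)


# Hypothetical sample inputs, not measurements from this issue.
sample_status = "Name:\tjava\nVmRSS:\t20000000 kB\n"
sample_nmt = "Native Memory Tracking:\n\nTotal: reserved=18000000KB, committed=15000000KB\n"
print(untracked_kb(sample_status, sample_nmt), "kB not visible to NMT")
```

The difference is only a rough estimate, because committed memory is not necessarily resident. A value that keeps growing across savepoints would still point at allocations made outside the JVM's own tracking, for example by native libraries through malloc.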
When enough memory is set (potentially together with enough JVM overhead) to
leave some breathing room, one snapshot can be taken, but taking subsequent
full snapshots reliably leads to OOM.
This documentation:
[switching-the-memory-allocator|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/#switching-the-memory-allocator]
led me to try
{code:java}
MALLOC_ARENA_MAX=1
DISABLE_JEMALLOC=true {code}
This configuration made savepoints pass reliably without OOM. I tried setting
only one of the two options at a time, but that did not fix the issue.
I also tried downscaling the pod to 16 GB of memory: with these options the
savepoint was reliably created without any issue, and without them every
savepoint fails.
h3. Setup:
Flink 2.1.0 running in Application mode with Flink Operator 1.12.1.
Memory and savepoint related settings:
{code:java}
env.java.opts.taskmanager: ' -XX:+UnlockExperimentalVMOptions
  -XX:+UseStringDeduplication
  -XX:+AlwaysPreTouch -XX:G1HeapRegionSize=16m
  -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags
  -XX:SurvivorRatio=6 -XX:G1NewSizePercent=40'
execution.checkpointing.max-concurrent-checkpoints: "1"
execution.checkpointing.snapshot-compression: "true"
fs.s3a.aws.credentials.provider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
fs.s3a.block.size:
fs.s3a.experimental.input.fadvise: sequential
fs.s3a.path.style.access: "true"
state.backend.incremental: "true"
state.backend.type: rocksdb
state.checkpoints.dir: s3p://bucket/checkpoints
state.savepoints.dir: s3p://bucket/savepoints
taskmanager.memory.jvm-overhead.fraction: "0.1"
taskmanager.memory.jvm-overhead.max: 6g
taskmanager.memory.managed.fraction: "0.4"
taskmanager.memory.network.fraction: "0.05"
taskmanager.network.memory.buffer-debloat.enabled: "true"
taskmanager.numberOfTaskSlots: "12"
...
resource:
  memory: 16g
{code}
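For orientation, the settings above can be turned into a rough memory budget. This is a sketch under assumptions that are not stated in this report: `memory: 16g` is treated as a 16 GiB total process size, JVM metaspace is taken as 256 MiB and the JVM overhead minimum as 192 MiB (believed to be the Flink defaults), and the min/max clamps on network memory are ignored. The function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskManagerBudget:
    process_mib: float
    jvm_overhead_mib: float
    metaspace_mib: float
    total_flink_mib: float
    managed_mib: float
    network_mib: float


def budget(
    process_mib: float,
    overhead_fraction: float,
    overhead_max_mib: float,
    managed_fraction: float,
    network_fraction: float,
    metaspace_mib: float = 256.0,     # assumed default
    overhead_min_mib: float = 192.0,  # assumed default
) -> TaskManagerBudget:
    # JVM overhead is a fraction of the total process size, clamped to [min, max].
    overhead = min(max(process_mib * overhead_fraction, overhead_min_mib), overhead_max_mib)
    # Total Flink memory is what remains after JVM overhead and metaspace.
    total_flink = process_mib - overhead - metaspace_mib
    return TaskManagerBudget(
        process_mib=process_mib,
        jvm_overhead_mib=overhead,
        metaspace_mib=metaspace_mib,
        total_flink_mib=total_flink,
        # Managed and network memory are fractions of total Flink memory.
        managed_mib=total_flink * managed_fraction,
        network_mib=total_flink * network_fraction,
    )


# Values taken from the settings listed above (16g pod, overhead max 6g).
b = budget(16 * 1024, 0.1, 6 * 1024, 0.4, 0.05)
print(f"JVM overhead: {b.jvm_overhead_mib:.1f} MiB")
print(f"Managed (RocksDB): {b.managed_mib:.1f} MiB")
print(f"Network: {b.network_mib:.1f} MiB")
```

Under these assumptions the JVM overhead comes to about 1.6 GiB of the 16 GiB pod, so native allocations that are not covered by managed memory have roughly that much headroom before the pod limit is reached.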
> OOM during savepoint caused by potential memory leak issue in RocksDB related
> to jemalloc
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-38212
> URL: https://issues.apache.org/jira/browse/FLINK-38212
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.20.2, 2.1.0
> Reporter: Grzegorz Liter
> Priority: Major
> Attachments: image-2025-08-07-17-14-03-023.png,
> image-2025-08-07-17-15-11-647.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)