Hi all,

We're debugging an issue where our jobs get OOM killed shortly after
restoring from a checkpoint. The application runs on Kubernetes and uses
RocksDB as its state backend.
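
For context, this is roughly the shape of the relevant configuration; the
values below are placeholders rather than our exact settings:

  state.backend: rocksdb
  state.backend.rocksdb.memory.managed: true   # RocksDB kept within Flink managed memory
  taskmanager.memory.process.size: 4g          # placeholder value
  taskmanager.memory.managed.fraction: 0.4     # placeholder value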

We reproduced the issue on a small cluster of two task managers. If we kill
a single task manager, then after the job restores from checkpoint, the
untouched task manager shows an elevated memory footprint (see the blue
line for the surviving task manager):

[attached chart: task manager memory usage over time; blue = surviving task manager, yellow = restarted task manager]
If we then kill the newer TM (yellow line) again, the surviving task
manager gets OOM killed after the restore.

We looked at the OOM killer report, and the memory does not appear to be
coming from the JVM, but we're unsure of the source. It looks like
something is allocating native memory that the JVM is not aware of.
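
One thing we may try in order to confirm this is JVM Native Memory
Tracking; sketch below, where <pid> stands for the task manager process and
the config key is just one way of passing the flag:

  # enable NMT on the task manager JVM (flink-conf.yaml)
  env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary

  # take a baseline, then diff after a restore
  jcmd <pid> VM.native_memory baseline
  jcmd <pid> VM.native_memory summary.diff

If the NMT totals stay roughly flat while the container RSS keeps climbing,
the growth is outside what the JVM accounts for, which would be consistent
with RocksDB (or another native library) rather than heap or metaspace.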

We're suspicious of RocksDB. Has anyone seen this kind of issue before? Is
it possible there's memory pressure or a native memory leak from RocksDB
that only shows up when a job is restarted? Perhaps something isn't being
cleaned up?
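
In case it's useful for the discussion, we were also thinking of enabling
some of the RocksDB native metrics to see whether the block cache or
memtables grow across a restore (a sketch; we haven't verified that these
cover the memory in question):

  state.backend.rocksdb.metrics.block-cache-usage: true
  state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
  state.backend.rocksdb.metrics.estimate-table-readers-mem: true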

Any help would be appreciated.
