Hi, Patrick:

We have encountered the same issue: the TaskManager's memory consumption
increases almost monotonically.

I'll describe what we observed and how we eventually solved it, so you can
check whether it applies to your case.

Here is what we observed:
1. Jobs with the RocksDB state backend would fail after a seemingly random
period of time after deployment. Every failure was the TaskManager pod
being OOM-killed by Kubernetes; we never saw JVM exceptions such as
"java.lang.OutOfMemoryError: Java heap space".
2. TaskManager pods were still OOM-killed by Kubernetes after setting
kubernetes.taskmanager.memory.limit-factor to a value larger than 1.0, such
as 1.5 or 2.0. This option controls the ratio between the memory limit and
the memory request submitted to Kubernetes (see the config sketch after
this list).
3. Requesting more memory from Kubernetes only delayed the first OOM kill.
container_memory_working_set_bytes increased almost monotonically, and pods
were OOM-killed once it hit the configured memory limit.
4. We used jeprof to profile RocksDB's native memory allocation and found
no leak; RocksDB's overall native memory consumption stayed below the
configured managed memory.
5. container_memory_rss was far larger than the memory we requested. We
gathered statistics with jcmd and jeprof (commands sketched after this
list), and it turned out that RSS was far larger than the memory accounted
for by the JVM and by jemalloc.
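
For reference, the Kubernetes memory settings we used looked roughly like
the following flink-conf.yaml sketch (the sizes just mirror the 8 GB
request / 16 GB limit mentioned below; adjust them to your own setup):

    # request 8 GB for the TaskManager process
    taskmanager.memory.process.size: 8g
    # let the k8s memory limit be 2x the request, i.e. 16 GB
    kubernetes.taskmanager.memory.limit-factor: 2.0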
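
Also for reference, the kind of commands behind the numbers in 4. and 5.
looked roughly like this (PIDs, paths and the jemalloc profiling options
are examples only, and jemalloc profiling assumes a build with profiling
enabled; adapt everything to your deployment):

    # JVM-side accounting; requires -XX:NativeMemoryTracking=summary
    # on the TaskManager JVM
    jcmd <taskmanager-pid> VM.native_memory summary

    # jemalloc-side accounting: start the TaskManager with profiling on,
    # e.g. MALLOC_CONF=prof:true,lg_prof_interval:30,prof_prefix:/tmp/jeprof
    # and then inspect the periodic heap dumps with
    jeprof --text /path/to/java /tmp/jeprof.*.heap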

So we drew the following conclusions from these observations:
1. The memory issue is not caused by the JVM, since we would have seen JVM
exceptions in the logs if it were.
2. We found reports that Flink cannot control RocksDB's memory consumption
precisely, so some overshoot beyond the requested memory is expected. But
we don't believe that explains our case: we requested 8 GB and limited the
pod to 16 GB, and an overshoot of that size seems implausible.
3. We therefore suspected a memory leak; there are plenty of reports of
memory leaks attributed to RocksDB.
4. But since the jeprof result showed no leak, we ruled out RocksDB as the
cause of the memory issue.
5. That left jemalloc itself as the suspect: jeprof showed no problem, yet
container_memory_rss clearly indicated one, so there had to be a gap
between the allocator- and JVM-level statistics mentioned above and the
container_memory_rss metric.

Some searching showed that jemalloc does not cope well with Transparent
Huge Pages (THP), which was set to "always" by default on our hosts.
Briefly speaking, with THP enabled jemalloc's allocations are backed by
huge pages (2 MB instead of 4 KB). When jemalloc tells the kernel that part
of a 2 MB page is no longer used, the kernel does not split the huge page
into normal 4 KB pages, so it cannot reclaim that part; the huge page is
only freed once the entire 2 MB range is released. The unreclaimed parts
accumulate and show up as ever-growing RSS.
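
You can check the current THP mode on a host with (the active mode is shown
in brackets):

    cat /sys/kernel/mm/transparent_hugepage/enabled
    # e.g.: [always] madvise never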

Many database systems such as Redis, MongoDB and Oracle recommend disabling
Transparent Huge Pages, so we disabled it on our hosts (see below). After
disabling THP, we have not observed the memory issue anymore.
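
At the host level this boils down to something like the following (how you
persist it, e.g. via the transparent_hugepage=never kernel boot parameter,
a tuned profile, or a privileged init container, depends on your
environment):

    # as root on the Kubernetes node
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag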

Hope our experience will help you.


On 2023/10/17 13:41:02 "Eifler, Patrick" wrote:
> Hello,
>
> We are running Flink jobs on K8s and using RocksDB as state backend. It
> is connected to S3 for checkpointing. We have multiple states in the job
> (mapstate and value states). We are seeing a slow but stable increase over
> time on the memory consumption. We only see this in our jobs connected to
> RocksDB.
>
>
> We are currently using the default memory setting
> (state-backend-rocksdb-memory-managed=true). Now we are wondering what a
> good alternative setting would be. We want to try to enable
> the state.backend.rocksdb.memory.partitioned-index-filters but it only takes
> effect if the managed memory is turned off, so we need to figure out what
> would be a good amount for memory.fixed-per-slot.
>
> Any hint what a good indicator for that calculation would be?
> Any other experience if someone has seen similar behavior before would
> also be much appreciated.
> Thanks!
>
> Best Regards
>
> --
> Patrick Eifler
>
>
