Hello,
I am running a streaming job on a small cluster, and after a few hours I
noticed that my TaskManager processes were being killed by the OOM killer
because they were using too much memory. After some monitoring, I have the
following status:
* The maximum heap size (-Xmx) is 4M
* Native Memory Tracking reports that the process has 44180459KB committed,
which is reasonable given the GC and thread overhead (the full summary report
is attached at the end of this message)
* There are 644 threads
* The Status.JVM.Memory.NonHeap.Committed metric is 245563392 (about 234MB)
* The Status.JVM.Memory.Direct.MemoryUsed metric is 354777032 (about 338MB;
see the cross-check snippet below)
* Using pmap, I found that the private committed memory is 54879428K and the
mapped memory is 62237852K
So we have about 10GB of memory (54879428K private committed per pmap minus
44180459KB committed per NMT) that was allocated in the process but is
unknown to the JVM itself.
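
For cross-checking, I believe the Status.JVM.Memory.Direct/Mapped metrics are
read from the JVM's BufferPoolMXBeans, so the same numbers should also be
readable inside the process (or over JMX). A minimal snippet for reading the
same values, assuming that is indeed where the metric comes from:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryCheck {
    public static void main(String[] args) {
        // Prints the "direct" and "mapped" buffer pools; these should line up
        // with the Status.JVM.Memory.Direct/Mapped metrics.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
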
Some more info:
* I am running Flink 1.1.4
* I am using RocksDB for state
* The state is saved to Azure Blob Store, using the NativeAzureFileSystem
HDFS connector over the wasbs protocol
* The cluster is a standalone HA cluster
* The machine is a 64-bit Ubuntu 14.04.5 LTS server
* I have around 2 GB of state per TaskManager
Another thing I noticed: the job sometimes fails (due to external DB
connectivity issues) and is restarted automatically, as expected. But in some
cases the failures also produce one or more of the following errors in the logs:
* Could not close the file system output stream. Trying to delete the
underlying file.
* Could not discard the 1th operator state of checkpoint 93 for operator
Operator2.
java.lang.NullPointerException: null
        at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointThread.run(StreamTask.java:953) ~[flink-dist_2.10-1.1.4.jar:1.1.4]
I have two theories, and I would be glad to hear any ideas from you:
1. RocksDB uses this memory for internal caching. If so, how can this usage
be tracked, and what options can be used to tune and limit it? (I put a small
sketch of the knobs I have in mind right after this list.)
2. Internal RocksDB objects are not being disposed of properly, probably
during the aforementioned job restarts, so we have a growing memory leak.
If so, do you have any idea what could cause this?
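
Regarding theory 1, this is only a sketch of the RocksDB options I would
expect to bound the native memory usage (the block cache and the write
buffers / memtables), with made-up values; I have not yet checked how custom
options are plugged into the RocksDB state backend in 1.1.4, so please
correct me if these are not the right knobs:

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class RocksDbMemoryOptions {
    // Illustrative bounds on the main RocksDB native-memory consumers:
    // the uncompressed block cache and the in-memory write buffers (memtables).
    public static Options boundedOptions(Options current) {
        return current
                .setWriteBufferSize(64 * 1024 * 1024)    // size of a single memtable
                .setMaxWriteBufferNumber(2)              // memtables kept in memory
                .setTableFormatConfig(new BlockBasedTableConfig()
                        .setBlockCacheSize(256 * 1024 * 1024)); // block cache size
    }
}
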
Thank you,
Avihai
Attached Native Memory Tracking summary (jcmd <pid> VM.native_memory summary):
Total: reserved=44399603KB, committed=44180459KB
- Java Heap (reserved=4096KB, committed=4096KB)
    (mmap: reserved=4096KB, committed=4096KB)
- Class (reserved=134031KB, committed=132751KB)
    (classes #22310)
    (malloc=2959KB #43612)
    (mmap: reserved=131072KB, committed=129792KB)
- Thread (reserved=716331KB, committed=716331KB)
    (thread #694)
    (stack: reserved=712404KB, committed=712404KB)
    (malloc=2283KB #3483)
    (arena=1644KB #1387)
- Code (reserved=273273KB, committed=135409KB)
    (malloc=23673KB #30410)
    (mmap: reserved=249600KB, committed=111736KB)
- GC (reserved=1635902KB, committed=1635902KB)
    (malloc=83134KB #70605)
    (mmap: reserved=1552768KB, committed=1552768KB)
- Compiler (reserved=1634KB, committed=1634KB)
    (malloc=1504KB #2062)
    (arena=131KB #3)
- Internal (reserved=575283KB, committed=575283KB)
    (malloc=575251KB #106644)
    (mmap: reserved=32KB, committed=32KB)
- Symbol (reserved=16394KB, committed=16394KB)
    (malloc=14468KB #132075)
    (arena=1926KB #1)
- Native Memory Tracking (reserved=6516KB, committed=6516KB)
    (malloc=338KB #5024)
    (tracking overhead=6178KB)
- Arena Chunk (reserved=237KB, committed=237KB)
    (malloc=237KB)
- Unknown (reserved=8KB, committed=0KB)
    (mmap: reserved=8KB, committed=0KB)