Hi Ori,

It looks like Flink indeed uses more memory than expected. I assume the first item, with PID 20339, is the Flink process, right?
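If so, a quick sanity check (assuming `ps` reports RSS in KiB, which is its default): 126672256 KiB is roughly 120.8 GiB resident, against the 112g configured for the Flink process. That is about 9 GiB over budget, which would point at native memory that Flink's accounting does not cover.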
It would be helpful if you could briefly introduce your workload:
- What kind of workload are you running? Streaming or batch?
- Do you use the RocksDB state backend?
- Are there any UDFs or 3rd-party dependencies that might allocate significant native memory?

Moreover, if the metrics show only 20% heap usage, I would suggest configuring a smaller `task.heap.size`, leaving more memory off-heap. The reduced heap size does not necessarily all go to the managed memory. You can also try increasing the `jvm-overhead`, simply to leave more native memory in the container in case there are other significant native memory usages.
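For reference, your current components already sum to about 108.7g of the 112g process (0.2g + 50g + 54g + 0.5g + 1g + 2g of Flink memory plus 1g metaspace), leaving only ~3.3g of JVM overhead for everything native. A minimal sketch of the rebalancing in flink-conf.yaml, with illustrative sizes you would tune for your workload (and assuming the off-heap and network sizes stay as you listed them):

taskmanager.memory.process.size: 112g      # keep the container size unchanged
taskmanager.memory.task.heap.size: 40g     # reduced from 50g, since heap usage peaks around 20%
taskmanager.memory.managed.size: 54g       # pin managed memory so the freed heap does not flow into it
taskmanager.memory.jvm-overhead.max: 16g   # let the ~13g remainder be derived as JVM overhead

With the process size and all other components explicit, Flink 1.10 derives the JVM overhead as whatever remains, provided it falls within the configured min/max, so raising the max is what actually frees the native headroom here.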
Thank you~

Xintong Song


On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski <ori....@gmail.com> wrote:

> Hi Xintong,
>
> See here:
>
> # Top memory users
> ps auxwww --sort -rss | head -10
> USER     PID   %CPU %MEM  VSZ       RSS       TTY STAT START  TIME    COMMAND
> yarn     20339 35.8 97.0  128600192 126672256 ?   Sl   Oct15 5975:47 /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max
> root     5245   0.1  0.4    5580484    627436 ?   Sl   Jul30  144:39 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
> hadoop   5252   0.1  0.4    7376768    604772 ?   Sl   Jul30  153:22 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
> yarn     26857  0.3  0.2    4214784    341464 ?   Sl   Sep17  198:43 /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf
> root     5519   0.0  0.2    5658624    269344 ?   Sl   Jul30   45:21 /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea
> root     1781   0.0  0.0     172644      8096 ?   Ss   Jul30    2:06 /usr/lib/systemd/systemd-journald
> root     4801   0.0  0.0    2690260      4776 ?   Ssl  Jul30    4:42 /usr/bin/amazon-ssm-agent
> root     6566   0.0  0.0     164672      4116 ?   R    00:30    0:00 ps auxwww --sort -rss
> root     6532   0.0  0.0     183124      3592 ?   S    00:30    0:00 /usr/sbin/CROND -n
>
> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song <tonysong...@gmail.com> wrote:
>
>> Hi Ori,
>>
>> The error message suggests that there's not enough physical memory on the machine to satisfy the allocation. This does not necessarily mean a managed memory leak; a managed memory leak is only one of the possibilities. There are other potential causes, e.g., another process/container on the machine used more memory than expected, the Yarn NM is not configured with enough memory reserved for the system processes, etc.
>>
>> I would suggest first looking into the machine's memory usage, to see whether the Flink process indeed uses more memory than expected. This could be achieved via:
>> - Running the `top` command
>> - Looking into the `/proc/meminfo` file
>> - Any container memory usage metrics that are available to your Yarn cluster
>>
>> Thank you~
>>
>> Xintong Song
>>
>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski <ori....@gmail.com> wrote:
>>
>>> After the job has been running for 10 days in production, TaskManagers start failing with:
>>>
>>> Connection unexpectedly closed by remote task manager
>>>
>>> Looking at the machine logs, I can see the following error:
>>>
>>> ============= Java processes for user hadoop =============
>>> OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb4f4010000, 1006567424, 0) failed; error='Cannot allocate memory' (err
>>> #
>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>> # Native memory allocation (mmap) failed to map 1006567424 bytes for committing reserved memory.
>>> # An error report file with more information is saved as:
>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log
>>> =========== End java processes for user hadoop ===========
>>>
>>> In addition, the metrics for the TaskManager show very low heap memory consumption (20% of Xmx).
>>>
>>> Hence, I suspect there is a memory leak in the TaskManager's managed memory.
>>>
>>> This is my TaskManager's memory detail:
>>> flink process            112g
>>> framework.heap.size      0.2g
>>> task.heap.size           50g
>>> managed.size             54g
>>> framework.off-heap.size  0.5g
>>> task.off-heap.size       1g
>>> network                  2g
>>> XX:MaxMetaspaceSize      1g
>>>
>>> As you can see, the managed memory is already high at 54g (my managed.fraction is set to 0.5).
>>>
>>> I'm running Flink 1.10. Full job details attached.
>>>
>>> Can someone advise what could cause a managed memory leak?