Re: Native memory allocation (mmap) failed to map 1006567424 bytes
crashing after 10 days instead of 7 days). It >>>> looks like anything I'll do on that front will only postpone the problem >>>> but not solve it. >>>> >>>> I am attaching the full job configuration. >>>> >>>> >>>> >>>> On Thu, Oct 29, 2020 at 10:09 AM Xintong Song >>>> wrote: >>>> >>>>> Hi Ori, >>>>> >>>>> It looks like Flink indeed uses more memory than expected. I assume >>>>> the first item with PID 20331 is the flink process, right? >>>>> >>>>> It would be helpful if you can briefly introduce your workload. >>>>> - What kind of workload are you running? Streaming or batch? >>>>> - Do you use RocksDB state backend? >>>>> - Any UDFs or 3rd party dependencies that might allocate significant >>>>> native memory? >>>>> >>>>> Moreover, if the metrics shows only 20% heap usages, I would suggest >>>>> configuring less `task.heap.size`, leaving more memory to off-heap. The >>>>> reduced heap size does not necessarily all go to the managed memory. You >>>>> can also try increasing the `jvm-overhead`, simply to leave more native >>>>> memory in the container in case there are other other significant native >>>>> memory usages. >>>>> >>>>> Thank you~ >>>>> >>>>> Xintong Song >>>>> >>>>> >>>>> >>>>> On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski >>>>> wrote: >>>>> >>>>>> Hi Xintong, >>>>>> >>>>>> See here: >>>>>> >>>>>> # Top memory users >>>>>> ps auxwww --sort -rss | head -10 >>>>>> USER PID %CPU %MEMVSZ RSS TTY STAT START TIME >>>>>> COMMAND >>>>>> yarn 20339 35.8 97.0 128600192 126672256 ? Sl Oct15 5975:47 >>>>>> /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max >>>>>> root 5245 0.1 0.4 5580484 627436 ? Sl Jul30 144:39 >>>>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X >>>>>> hadoop5252 0.1 0.4 7376768 604772 ? Sl Jul30 153:22 >>>>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X >>>>>> yarn 26857 0.3 0.2 4214784 341464 ? Sl Sep17 198:43 >>>>>> /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf >>>>>> root 5519 0.0 0.2 5658624 269344 ? Sl Jul30 45:21 >>>>>> /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea >>>>>> root 1781 0.0 0.0 172644 8096 ?Ss Jul30 2:06 >>>>>> /usr/lib/systemd/systemd-journald >>>>>> root 4801 0.0 0.0 2690260 4776 ?Ssl Jul30 4:42 >>>>>> /usr/bin/amazon-ssm-agent >>>>>> root 6566 0.0 0.0 164672 4116 ?R00:30 0:00 ps >>>>>> auxwww --sort -rss >>>>>> root 6532 0.0 0.0 183124 3592 ? S 00:30 0:00 >>>>>> /usr/sbin/CROND -n >>>>>> >>>>>> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song >>>>>> wrote: >>>>>> >>>>>>> Hi Ori, >>>>>>> >>>>>>> The error message suggests that there's not enough physical memory >>>>>>> on the machine to satisfy the allocation. This does not necessarily >>>>>>> mean a >>>>>>> managed memory leak. Managed memory leak is only one of the >>>>>>> possibilities. >>>>>>> There are other potential reasons, e.g., another process/container on >>>>>>> the >>>>>>> machine used more memory than expected, Yarn NM is not configured with >>>>>>> enough memory reserved for the system processes, etc. >>>>>>> >>>>>>> I would suggest to first look into the machine memory usages, see >>>>>>> whether the Flink process indeed uses more memory than expected. This >>>>>>> could >>>>>>> be achieved via: >>>>>>> - Run the `top` command >>>>>>> - Look into the `/proc/meminfo` file >>>>>>> - Any container memory usage metrics that are available to your Yarn >>>>>>> cluster >>>>>>> >>>>>>> Thank you~ >>>>>>> >>>>>>> Xintong Song >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski >>>>>>> wrote: >>>>>>> >>>>>>>> After the job is running for 10 days in production, TaskManagers >>>>>>>> start failing with: >>>>>>>> >>>>>>>> Connection unexpectedly closed by remote task manager >>>>>>>> >>>>>>>> Looking in the machine logs, I can see the following error: >>>>>>>> >>>>>>>> = Java processes for user hadoop = >>>>>>>> OpenJDK 64-Bit Server VM warning: INFO: >>>>>>>> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; >>>>>>>> error='Cannot >>>>>>>> allocate memory' (err >>>>>>>> # >>>>>>>> # There is insufficient memory for the Java Runtime Environment to >>>>>>>> continue. >>>>>>>> # Native memory allocation (mmap) failed to map 1006567424 bytes >>>>>>>> for committing reserved memory. >>>>>>>> # An error report file with more information is saved as: >>>>>>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >>>>>>>> === End java processes for user hadoop === >>>>>>>> >>>>>>>> In addition, the metrics for the TaskManager show very low Heap >>>>>>>> memory consumption (20% of Xmx). >>>>>>>> >>>>>>>> Hence, I suspect there is a memory leak in the TaskManager's >>>>>>>> Managed Memory. >>>>>>>> >>>>>>>> This my TaskManager's memory detail: >>>>>>>> flink process 112g >>>>>>>> framework.heap.size 0.2g >>>>>>>> task.heap.size 50g >>>>>>>> managed.size 54g >>>>>>>> framework.off-heap.size 0.5g >>>>>>>> task.off-heap.size 1g >>>>>>>> network 2g >>>>>>>> XX:MaxMetaspaceSize 1g >>>>>>>> >>>>>>>> As you can see, the managed memory is 54g, so it's already high (my >>>>>>>> managed.fraction is set to 0.5). >>>>>>>> >>>>>>>> I'm running Flink 1.10. Full job details attached. >>>>>>>> >>>>>>>> Can someone advise what would cause a managed memory leak? >>>>>>>> >>>>>>>> >>>>>>>>
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
nd of workload are you running? Streaming or batch? >>>> - Do you use RocksDB state backend? >>>> - Any UDFs or 3rd party dependencies that might allocate significant >>>> native memory? >>>> >>>> Moreover, if the metrics shows only 20% heap usages, I would suggest >>>> configuring less `task.heap.size`, leaving more memory to off-heap. The >>>> reduced heap size does not necessarily all go to the managed memory. You >>>> can also try increasing the `jvm-overhead`, simply to leave more native >>>> memory in the container in case there are other other significant native >>>> memory usages. >>>> >>>> Thank you~ >>>> >>>> Xintong Song >>>> >>>> >>>> >>>> On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski wrote: >>>> >>>>> Hi Xintong, >>>>> >>>>> See here: >>>>> >>>>> # Top memory users >>>>> ps auxwww --sort -rss | head -10 >>>>> USER PID %CPU %MEMVSZ RSS TTY STAT START TIME >>>>> COMMAND >>>>> yarn 20339 35.8 97.0 128600192 126672256 ? Sl Oct15 5975:47 >>>>> /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max >>>>> root 5245 0.1 0.4 5580484 627436 ? Sl Jul30 144:39 >>>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X >>>>> hadoop5252 0.1 0.4 7376768 604772 ? Sl Jul30 153:22 >>>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X >>>>> yarn 26857 0.3 0.2 4214784 341464 ? Sl Sep17 198:43 >>>>> /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf >>>>> root 5519 0.0 0.2 5658624 269344 ? Sl Jul30 45:21 >>>>> /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea >>>>> root 1781 0.0 0.0 172644 8096 ?Ss Jul30 2:06 >>>>> /usr/lib/systemd/systemd-journald >>>>> root 4801 0.0 0.0 2690260 4776 ?Ssl Jul30 4:42 >>>>> /usr/bin/amazon-ssm-agent >>>>> root 6566 0.0 0.0 164672 4116 ?R00:30 0:00 ps >>>>> auxwww --sort -rss >>>>> root 6532 0.0 0.0 183124 3592 ?S00:30 0:00 >>>>> /usr/sbin/CROND -n >>>>> >>>>> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song >>>>> wrote: >>>>> >>>>>> Hi Ori, >>>>>> >>>>>> The error message suggests that there's not enough physical memory on >>>>>> the machine to satisfy the allocation. This does not necessarily mean a >>>>>> managed memory leak. Managed memory leak is only one of the >>>>>> possibilities. >>>>>> There are other potential reasons, e.g., another process/container on the >>>>>> machine used more memory than expected, Yarn NM is not configured with >>>>>> enough memory reserved for the system processes, etc. >>>>>> >>>>>> I would suggest to first look into the machine memory usages, see >>>>>> whether the Flink process indeed uses more memory than expected. This >>>>>> could >>>>>> be achieved via: >>>>>> - Run the `top` command >>>>>> - Look into the `/proc/meminfo` file >>>>>> - Any container memory usage metrics that are available to your Yarn >>>>>> cluster >>>>>> >>>>>> Thank you~ >>>>>> >>>>>> Xintong Song >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski >>>>>> wrote: >>>>>> >>>>>>> After the job is running for 10 days in production, TaskManagers >>>>>>> start failing with: >>>>>>> >>>>>>> Connection unexpectedly closed by remote task manager >>>>>>> >>>>>>> Looking in the machine logs, I can see the following error: >>>>>>> >>>>>>> = Java processes for user hadoop = >>>>>>> OpenJDK 64-Bit Server VM warning: INFO: >>>>>>> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; >>>>>>> error='Cannot >>>>>>> allocate memory' (err >>>>>>> # >>>>>>> # There is insufficient memory for the Java Runtime Environment to >>>>>>> continue. >>>>>>> # Native memory allocation (mmap) failed to map 1006567424 bytes for >>>>>>> committing reserved memory. >>>>>>> # An error report file with more information is saved as: >>>>>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >>>>>>> === End java processes for user hadoop === >>>>>>> >>>>>>> In addition, the metrics for the TaskManager show very low Heap >>>>>>> memory consumption (20% of Xmx). >>>>>>> >>>>>>> Hence, I suspect there is a memory leak in the TaskManager's Managed >>>>>>> Memory. >>>>>>> >>>>>>> This my TaskManager's memory detail: >>>>>>> flink process 112g >>>>>>> framework.heap.size 0.2g >>>>>>> task.heap.size 50g >>>>>>> managed.size 54g >>>>>>> framework.off-heap.size 0.5g >>>>>>> task.off-heap.size 1g >>>>>>> network 2g >>>>>>> XX:MaxMetaspaceSize 1g >>>>>>> >>>>>>> As you can see, the managed memory is 54g, so it's already high (my >>>>>>> managed.fraction is set to 0.5). >>>>>>> >>>>>>> I'm running Flink 1.10. Full job details attached. >>>>>>> >>>>>>> Can someone advise what would cause a managed memory leak? >>>>>>> >>>>>>> >>>>>>>
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
3:22 >>>> /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X >>>> yarn 26857 0.3 0.2 4214784 341464 ? Sl Sep17 198:43 >>>> /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf >>>> root 5519 0.0 0.2 5658624 269344 ? Sl Jul30 45:21 >>>> /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea >>>> root 1781 0.0 0.0 172644 8096 ?Ss Jul30 2:06 >>>> /usr/lib/systemd/systemd-journald >>>> root 4801 0.0 0.0 2690260 4776 ?Ssl Jul30 4:42 >>>> /usr/bin/amazon-ssm-agent >>>> root 6566 0.0 0.0 164672 4116 ?R00:30 0:00 ps >>>> auxwww --sort -rss >>>> root 6532 0.0 0.0 183124 3592 ?S00:30 0:00 >>>> /usr/sbin/CROND -n >>>> >>>> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song >>>> wrote: >>>> >>>>> Hi Ori, >>>>> >>>>> The error message suggests that there's not enough physical memory on >>>>> the machine to satisfy the allocation. This does not necessarily mean a >>>>> managed memory leak. Managed memory leak is only one of the possibilities. >>>>> There are other potential reasons, e.g., another process/container on the >>>>> machine used more memory than expected, Yarn NM is not configured with >>>>> enough memory reserved for the system processes, etc. >>>>> >>>>> I would suggest to first look into the machine memory usages, see >>>>> whether the Flink process indeed uses more memory than expected. This >>>>> could >>>>> be achieved via: >>>>> - Run the `top` command >>>>> - Look into the `/proc/meminfo` file >>>>> - Any container memory usage metrics that are available to your Yarn >>>>> cluster >>>>> >>>>> Thank you~ >>>>> >>>>> Xintong Song >>>>> >>>>> >>>>> >>>>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski >>>>> wrote: >>>>> >>>>>> After the job is running for 10 days in production, TaskManagers >>>>>> start failing with: >>>>>> >>>>>> Connection unexpectedly closed by remote task manager >>>>>> >>>>>> Looking in the machine logs, I can see the following error: >>>>>> >>>>>> = Java processes for user hadoop = >>>>>> OpenJDK 64-Bit Server VM warning: INFO: >>>>>> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; >>>>>> error='Cannot >>>>>> allocate memory' (err >>>>>> # >>>>>> # There is insufficient memory for the Java Runtime Environment to >>>>>> continue. >>>>>> # Native memory allocation (mmap) failed to map 1006567424 bytes for >>>>>> committing reserved memory. >>>>>> # An error report file with more information is saved as: >>>>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >>>>>> === End java processes for user hadoop === >>>>>> >>>>>> In addition, the metrics for the TaskManager show very low Heap >>>>>> memory consumption (20% of Xmx). >>>>>> >>>>>> Hence, I suspect there is a memory leak in the TaskManager's Managed >>>>>> Memory. >>>>>> >>>>>> This my TaskManager's memory detail: >>>>>> flink process 112g >>>>>> framework.heap.size 0.2g >>>>>> task.heap.size 50g >>>>>> managed.size 54g >>>>>> framework.off-heap.size 0.5g >>>>>> task.off-heap.size 1g >>>>>> network 2g >>>>>> XX:MaxMetaspaceSize 1g >>>>>> >>>>>> As you can see, the managed memory is 54g, so it's already high (my >>>>>> managed.fraction is set to 0.5). >>>>>> >>>>>> I'm running Flink 1.10. Full job details attached. >>>>>> >>>>>> Can someone advise what would cause a managed memory leak? >>>>>> >>>>>> >>>>>>
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
0:00 >>> /usr/sbin/CROND -n >>> >>> On Wed, Oct 28, 2020 at 11:34 AM Xintong Song >>> wrote: >>> >>>> Hi Ori, >>>> >>>> The error message suggests that there's not enough physical memory on >>>> the machine to satisfy the allocation. This does not necessarily mean a >>>> managed memory leak. Managed memory leak is only one of the possibilities. >>>> There are other potential reasons, e.g., another process/container on the >>>> machine used more memory than expected, Yarn NM is not configured with >>>> enough memory reserved for the system processes, etc. >>>> >>>> I would suggest to first look into the machine memory usages, see >>>> whether the Flink process indeed uses more memory than expected. This could >>>> be achieved via: >>>> - Run the `top` command >>>> - Look into the `/proc/meminfo` file >>>> - Any container memory usage metrics that are available to your Yarn >>>> cluster >>>> >>>> Thank you~ >>>> >>>> Xintong Song >>>> >>>> >>>> >>>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: >>>> >>>>> After the job is running for 10 days in production, TaskManagers start >>>>> failing with: >>>>> >>>>> Connection unexpectedly closed by remote task manager >>>>> >>>>> Looking in the machine logs, I can see the following error: >>>>> >>>>> = Java processes for user hadoop = >>>>> OpenJDK 64-Bit Server VM warning: INFO: >>>>> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; error='Cannot >>>>> allocate memory' (err >>>>> # >>>>> # There is insufficient memory for the Java Runtime Environment to >>>>> continue. >>>>> # Native memory allocation (mmap) failed to map 1006567424 bytes for >>>>> committing reserved memory. >>>>> # An error report file with more information is saved as: >>>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >>>>> === End java processes for user hadoop === >>>>> >>>>> In addition, the metrics for the TaskManager show very low Heap memory >>>>> consumption (20% of Xmx). >>>>> >>>>> Hence, I suspect there is a memory leak in the TaskManager's Managed >>>>> Memory. >>>>> >>>>> This my TaskManager's memory detail: >>>>> flink process 112g >>>>> framework.heap.size 0.2g >>>>> task.heap.size 50g >>>>> managed.size 54g >>>>> framework.off-heap.size 0.5g >>>>> task.off-heap.size 1g >>>>> network 2g >>>>> XX:MaxMetaspaceSize 1g >>>>> >>>>> As you can see, the managed memory is 54g, so it's already high (my >>>>> managed.fraction is set to 0.5). >>>>> >>>>> I'm running Flink 1.10. Full job details attached. >>>>> >>>>> Can someone advise what would cause a managed memory leak? >>>>> >>>>> >>>>>
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
>> >>> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: >>> >>>> After the job is running for 10 days in production, TaskManagers start >>>> failing with: >>>> >>>> Connection unexpectedly closed by remote task manager >>>> >>>> Looking in the machine logs, I can see the following error: >>>> >>>> = Java processes for user hadoop = >>>> OpenJDK 64-Bit Server VM warning: INFO: >>>> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; error='Cannot >>>> allocate memory' (err >>>> # >>>> # There is insufficient memory for the Java Runtime Environment to >>>> continue. >>>> # Native memory allocation (mmap) failed to map 1006567424 bytes for >>>> committing reserved memory. >>>> # An error report file with more information is saved as: >>>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >>>> === End java processes for user hadoop === >>>> >>>> In addition, the metrics for the TaskManager show very low Heap memory >>>> consumption (20% of Xmx). >>>> >>>> Hence, I suspect there is a memory leak in the TaskManager's Managed >>>> Memory. >>>> >>>> This my TaskManager's memory detail: >>>> flink process 112g >>>> framework.heap.size 0.2g >>>> task.heap.size 50g >>>> managed.size 54g >>>> framework.off-heap.size 0.5g >>>> task.off-heap.size 1g >>>> network 2g >>>> XX:MaxMetaspaceSize 1g >>>> >>>> As you can see, the managed memory is 54g, so it's already high (my >>>> managed.fraction is set to 0.5). >>>> >>>> I'm running Flink 1.10. Full job details attached. >>>> >>>> Can someone advise what would cause a managed memory leak? >>>> >>>> >>>> akka.ask.timeout, 1 min env.hadoop.conf.dir, /etc/hadoop/conf env.java.opts.jobmanager, -XX:+UseG1GC env.java.opts.taskmanager, -XX:+UseG1GC env.yarn.conf.dir, /etc/hadoop/conf execution.attached, false execution.checkpointing.externalized-checkpoint-retention, RETAIN_ON_CANCELLATION execution.checkpointing.interval, 10 min execution.checkpointing.min-pause, 5 min execution.checkpointing.mode, AT_LEAST_ONCE execution.checkpointing.timeout, 10 min execution.savepoint.ignore-unclaimed-state, false execution.savepoint.path, s3://***/savepoints/savepoint-c9d3de-dd7d4bef809f execution.shutdown-on-attached-exit, false execution.target, yarn-per-job flink.partition-discovery.interval-millis, 6 heartbeat.timeout, 12 high-availability.cluster-id, application_1600334141629_0017 internal.cluster.execution-mode, DETACHED jobmanager.heap.size, 24g metrics.reporter.prom.class, org.apache.flink.metrics.prometheus.PrometheusReporter parallelism.default, 216 pipeline.default-kryo-serializers, class:com.fasterxml.jackson.databind.JsonNode,serializer:walkme.flink.JsonNodeKryoSerializer pipeline.jars, [file:/home/hadoop/flink-***.jar] pipeline.registered-kryo-types, java.lang.Number;java.lang.Object pipeline.time-characteristic, EventTime restart-strategy, failure-rate restart-strategy.failure-rate.delay, 1 min restart-strategy.failure-rate.failure-rate-interval, 60 min restart-strategy.failure-rate.max-failures-per-interval, 70 state.backend, rocksdb state.backend.async, true state.backend.incremental, true state.backend.local-recovery, true state.checkpoints.dir, s3://***/***/checkpoints state.checkpoints.num-retained, 2 state.savepoints.dir, s3://***/***/savepoints taskmanager.cpu.cores, 7 taskmanager.memory.framework.off-heap.size, 512 mb taskmanager.memory.jvm-metaspace.size, 1 gb taskmanager.memory.jvm-overhead.max, 2 gb taskmanager.memory.managed.fraction, 0.5 taskmanager.memory.network.max, 2 gb taskmanager.memory.process.size, 112 gb taskmanager.memory.task.off-heap.size, 1 gb taskmanager.numberOfTaskSlots, 2 yarn.application-attempts, 1
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
Hi Ori, It looks like Flink indeed uses more memory than expected. I assume the first item with PID 20331 is the flink process, right? It would be helpful if you can briefly introduce your workload. - What kind of workload are you running? Streaming or batch? - Do you use RocksDB state backend? - Any UDFs or 3rd party dependencies that might allocate significant native memory? Moreover, if the metrics shows only 20% heap usages, I would suggest configuring less `task.heap.size`, leaving more memory to off-heap. The reduced heap size does not necessarily all go to the managed memory. You can also try increasing the `jvm-overhead`, simply to leave more native memory in the container in case there are other other significant native memory usages. Thank you~ Xintong Song On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski wrote: > Hi Xintong, > > See here: > > # Top memory users > ps auxwww --sort -rss | head -10 > USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > yarn 20339 35.8 97.0 128600192 126672256 ? Sl Oct15 5975:47 > /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max > root 5245 0.1 0.4 5580484 627436 ? Sl Jul30 144:39 > /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X > hadoop5252 0.1 0.4 7376768 604772 ? Sl Jul30 153:22 > /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X > yarn 26857 0.3 0.2 4214784 341464 ? Sl Sep17 198:43 > /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf > root 5519 0.0 0.2 5658624 269344 ? Sl Jul30 45:21 > /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea > root 1781 0.0 0.0 172644 8096 ?Ss Jul30 2:06 > /usr/lib/systemd/systemd-journald > root 4801 0.0 0.0 2690260 4776 ?Ssl Jul30 4:42 > /usr/bin/amazon-ssm-agent > root 6566 0.0 0.0 164672 4116 ?R00:30 0:00 ps auxwww > --sort -rss > root 6532 0.0 0.0 183124 3592 ?S00:30 0:00 > /usr/sbin/CROND -n > > On Wed, Oct 28, 2020 at 11:34 AM Xintong Song > wrote: > >> Hi Ori, >> >> The error message suggests that there's not enough physical memory on the >> machine to satisfy the allocation. This does not necessarily mean a managed >> memory leak. Managed memory leak is only one of the possibilities. There >> are other potential reasons, e.g., another process/container on the machine >> used more memory than expected, Yarn NM is not configured with enough >> memory reserved for the system processes, etc. >> >> I would suggest to first look into the machine memory usages, see whether >> the Flink process indeed uses more memory than expected. This could be >> achieved via: >> - Run the `top` command >> - Look into the `/proc/meminfo` file >> - Any container memory usage metrics that are available to your Yarn >> cluster >> >> Thank you~ >> >> Xintong Song >> >> >> >> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: >> >>> After the job is running for 10 days in production, TaskManagers start >>> failing with: >>> >>> Connection unexpectedly closed by remote task manager >>> >>> Looking in the machine logs, I can see the following error: >>> >>> = Java processes for user hadoop ===== >>> OpenJDK 64-Bit Server VM warning: INFO: >>> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; error='Cannot >>> allocate memory' (err >>> # >>> # There is insufficient memory for the Java Runtime Environment to >>> continue. >>> # Native memory allocation (mmap) failed to map 1006567424 bytes for >>> committing reserved memory. >>> # An error report file with more information is saved as: >>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >>> === End java processes for user hadoop === >>> >>> In addition, the metrics for the TaskManager show very low Heap memory >>> consumption (20% of Xmx). >>> >>> Hence, I suspect there is a memory leak in the TaskManager's Managed >>> Memory. >>> >>> This my TaskManager's memory detail: >>> flink process 112g >>> framework.heap.size 0.2g >>> task.heap.size 50g >>> managed.size 54g >>> framework.off-heap.size 0.5g >>> task.off-heap.size 1g >>> network 2g >>> XX:MaxMetaspaceSize 1g >>> >>> As you can see, the managed memory is 54g, so it's already high (my >>> managed.fraction is set to 0.5). >>> >>> I'm running Flink 1.10. Full job details attached. >>> >>> Can someone advise what would cause a managed memory leak? >>> >>> >>>
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
Hi Xintong, See here: # Top memory users ps auxwww --sort -rss | head -10 USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND yarn 20339 35.8 97.0 128600192 126672256 ? Sl Oct15 5975:47 /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max root 5245 0.1 0.4 5580484 627436 ? Sl Jul30 144:39 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X hadoop5252 0.1 0.4 7376768 604772 ? Sl Jul30 153:22 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X yarn 26857 0.3 0.2 4214784 341464 ? Sl Sep17 198:43 /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf root 5519 0.0 0.2 5658624 269344 ? Sl Jul30 45:21 /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea root 1781 0.0 0.0 172644 8096 ?Ss Jul30 2:06 /usr/lib/systemd/systemd-journald root 4801 0.0 0.0 2690260 4776 ?Ssl Jul30 4:42 /usr/bin/amazon-ssm-agent root 6566 0.0 0.0 164672 4116 ?R00:30 0:00 ps auxwww --sort -rss root 6532 0.0 0.0 183124 3592 ?S00:30 0:00 /usr/sbin/CROND -n On Wed, Oct 28, 2020 at 11:34 AM Xintong Song wrote: > Hi Ori, > > The error message suggests that there's not enough physical memory on the > machine to satisfy the allocation. This does not necessarily mean a managed > memory leak. Managed memory leak is only one of the possibilities. There > are other potential reasons, e.g., another process/container on the machine > used more memory than expected, Yarn NM is not configured with enough > memory reserved for the system processes, etc. > > I would suggest to first look into the machine memory usages, see whether > the Flink process indeed uses more memory than expected. This could be > achieved via: > - Run the `top` command > - Look into the `/proc/meminfo` file > - Any container memory usage metrics that are available to your Yarn > cluster > > Thank you~ > > Xintong Song > > > > On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: > >> After the job is running for 10 days in production, TaskManagers start >> failing with: >> >> Connection unexpectedly closed by remote task manager >> >> Looking in the machine logs, I can see the following error: >> >> = Java processes for user hadoop = >> OpenJDK 64-Bit Server VM warning: INFO: >> os::commit_memory(0x7fb4f401, 1006567424, 0) failed; error='Cannot >> allocate memory' (err >> # >> # There is insufficient memory for the Java Runtime Environment to >> continue. >> # Native memory allocation (mmap) failed to map 1006567424 bytes for >> committing reserved memory. >> # An error report file with more information is saved as: >> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log >> === End java processes for user hadoop === >> >> In addition, the metrics for the TaskManager show very low Heap memory >> consumption (20% of Xmx). >> >> Hence, I suspect there is a memory leak in the TaskManager's Managed >> Memory. >> >> This my TaskManager's memory detail: >> flink process 112g >> framework.heap.size 0.2g >> task.heap.size 50g >> managed.size 54g >> framework.off-heap.size 0.5g >> task.off-heap.size 1g >> network 2g >> XX:MaxMetaspaceSize 1g >> >> As you can see, the managed memory is 54g, so it's already high (my >> managed.fraction is set to 0.5). >> >> I'm running Flink 1.10. Full job details attached. >> >> Can someone advise what would cause a managed memory leak? >> >> >>
Re: Native memory allocation (mmap) failed to map 1006567424 bytes
Hi Ori, The error message suggests that there's not enough physical memory on the machine to satisfy the allocation. This does not necessarily mean a managed memory leak. Managed memory leak is only one of the possibilities. There are other potential reasons, e.g., another process/container on the machine used more memory than expected, Yarn NM is not configured with enough memory reserved for the system processes, etc. I would suggest to first look into the machine memory usages, see whether the Flink process indeed uses more memory than expected. This could be achieved via: - Run the `top` command - Look into the `/proc/meminfo` file - Any container memory usage metrics that are available to your Yarn cluster Thank you~ Xintong Song On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski wrote: > After the job is running for 10 days in production, TaskManagers start > failing with: > > Connection unexpectedly closed by remote task manager > > Looking in the machine logs, I can see the following error: > > = Java processes for user hadoop = > OpenJDK 64-Bit Server VM warning: INFO: > os::commit_memory(0x7fb4f401, 1006567424, 0) failed; error='Cannot > allocate memory' (err > # > # There is insufficient memory for the Java Runtime Environment to > continue. > # Native memory allocation (mmap) failed to map 1006567424 bytes for > committing reserved memory. > # An error report file with more information is saved as: > # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log > === End java processes for user hadoop === > > In addition, the metrics for the TaskManager show very low Heap memory > consumption (20% of Xmx). > > Hence, I suspect there is a memory leak in the TaskManager's Managed > Memory. > > This my TaskManager's memory detail: > flink process 112g > framework.heap.size 0.2g > task.heap.size 50g > managed.size 54g > framework.off-heap.size 0.5g > task.off-heap.size 1g > network 2g > XX:MaxMetaspaceSize 1g > > As you can see, the managed memory is 54g, so it's already high (my > managed.fraction is set to 0.5). > > I'm running Flink 1.10. Full job details attached. > > Can someone advise what would cause a managed memory leak? > > >
Native memory allocation (mmap) failed to map 1006567424 bytes
After the job is running for 10 days in production, TaskManagers start failing with: Connection unexpectedly closed by remote task manager Looking in the machine logs, I can see the following error: = Java processes for user hadoop = OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7fb4f401, 1006567424, 0) failed; error='Cannot allocate memory' (err # # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (mmap) failed to map 1006567424 bytes for committing reserved memory. # An error report file with more information is saved as: # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log === End java processes for user hadoop === In addition, the metrics for the TaskManager show very low Heap memory consumption (20% of Xmx). Hence, I suspect there is a memory leak in the TaskManager's Managed Memory. This my TaskManager's memory detail: flink process 112g framework.heap.size 0.2g task.heap.size 50g managed.size 54g framework.off-heap.size 0.5g task.off-heap.size 1g network 2g XX:MaxMetaspaceSize 1g As you can see, the managed memory is 54g, so it's already high (my managed.fraction is set to 0.5). I'm running Flink 1.10. Full job details attached. Can someone advise what would cause a managed memory leak? Starting YARN TaskExecutor runner (Version: 1.10.0, Rev:, Date:) OS current user: yarn Current Hadoop/Kerberos user: hadoop JVM: OpenJDK 64-Bit Server VM - Amazon.com Inc. - 1.8/25.252-b09 Maximum heap size: 52224 MiBytes JAVA_HOME: /etc/alternatives/jre Hadoop version: 2.8.5-amzn-6 JVM Options: -Xmx54760833024 -Xms54760833024 -XX:MaxDirectMemorySize=3758096384 -XX:MaxMetaspaceSize=1073741824 -XX:+UseG1GC -Dlog.file=/var/log/hadoop-yarn/containers/application_1600334141629_0011/container_1600334141629_0011_01_02/taskmanager.log -Dlog4j.configuration=file:./log4j.properties Program Arguments: -D taskmanager.memory.framework.off-heap.size=536870912b -D taskmanager.memory.network.max=2147483648b -D taskmanager.memory.network.min=2147483648b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=58518929408b -D taskmanager.cpu.cores=7.0 -D taskmanager.memory.task.heap.size=54626615296b -D taskmanager.memory.task.off-heap.size=1073741824b --configDir . -Djobmanager.rpc.address=ip-***.us-west-2.compute.internal -Dweb.port=0 -Dweb.tmpdir=/tmp/flink-web-ad601f25-685f-42e5-aa93-9658233031e4 -Djobmanager.rpc.port=35435 -Drest.address=ip-***.us-west-2.compute.internal