Hi Xintong,

See here:

# Top memory users
ps auxwww --sort -rss | head -10
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yarn     20339 35.8 97.0 128600192 126672256 ? Sl   Oct15 5975:47 /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max
root      5245  0.1  0.4 5580484 627436 ?      Sl   Jul30 144:39 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
hadoop    5252  0.1  0.4 7376768 604772 ?      Sl   Jul30 153:22 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
yarn     26857  0.3  0.2 4214784 341464 ?      Sl   Sep17 198:43 /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf
root      5519  0.0  0.2 5658624 269344 ?      Sl   Jul30  45:21 /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea
root      1781  0.0  0.0 172644  8096 ?        Ss   Jul30   2:06 /usr/lib/systemd/systemd-journald
root      4801  0.0  0.0 2690260 4776 ?        Ssl  Jul30   4:42 /usr/bin/amazon-ssm-agent
root      6566  0.0  0.0 164672  4116 ?        R    00:30   0:00 ps auxwww --sort -rss
root      6532  0.0  0.0 183124  3592 ?        S    00:30   0:00 /usr/sbin/CROND -n
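
For reference, a quick way to cross-check those numbers (a minimal sketch; PID 20339 is the TaskManager from the ps output above, everything else is standard procps/Linux tooling):

# The TaskManager's resident set size in GiB (ps reports RSS in KiB)
ps -o rss= -p 20339 | awk '{printf "TaskManager RSS: %.1f GiB\n", $1 / 1024 / 1024}'

# Overall memory still available on the machine
free -g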

On Wed, Oct 28, 2020 at 11:34 AM Xintong Song <tonysong...@gmail.com> wrote:

> Hi Ori,
>
> The error message suggests that there's not enough physical memory on the
> machine to satisfy the allocation. This does not necessarily mean a managed
> memory leak; a managed memory leak is only one of the possibilities. There
> are other potential causes, e.g., another process/container on the machine
> using more memory than expected, or the Yarn NM not being configured with
> enough memory reserved for the system processes.
>
> I would suggest first looking into the machine's memory usage to see whether
> the Flink process indeed uses more memory than expected. This could be
> checked via the following (example commands are sketched below):
> - Run the `top` command
> - Look into the `/proc/meminfo` file
> - Any container memory usage metrics that are available to your Yarn
> cluster
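>
> A minimal sketch of the first two checks (plain procps-ng / Linux commands,
> nothing Flink-specific assumed):
>
> # Processes sorted by resident memory, in batch mode (or press Shift+M in an interactive top)
> top -b -n 1 -o %MEM | head -20
>
> # System-wide totals, including how much memory is still available
> grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo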
>
> Thank you~
>
> Xintong Song
>
>
>
> On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski <ori....@gmail.com> wrote:
>
>> After the job has been running in production for about 10 days, TaskManagers
>> start failing with:
>>
>> Connection unexpectedly closed by remote task manager
>>
>> Looking in the machine logs, I can see the following error:
>>
>> ============= Java processes for user hadoop =============
>> OpenJDK 64-Bit Server VM warning: INFO:
>> os::commit_memory(0x00007fb4f4010000, 1006567424, 0) failed; error='Cannot
>> allocate memory' (err
>> #
>> # There is insufficient memory for the Java Runtime Environment to
>> continue.
>> # Native memory allocation (mmap) failed to map 1006567424 bytes for
>> committing reserved memory.
>> # An error report file with more information is saved as:
>> # /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log
>> =========== End java processes for user hadoop ===========
>>
>> In addition, the metrics for the TaskManager show very low Heap memory
>> consumption (20% of Xmx).
>>
>> Hence, I suspect there is a memory leak in the TaskManager's Managed
>> Memory.
>>
>> This is my TaskManager's memory breakdown:
>> flink process 112g
>> framework.heap.size 0.2g
>> task.heap.size 50g
>> managed.size 54g
>> framework.off-heap.size 0.5g
>> task.off-heap.size 1g
>> network 2g
>> XX:MaxMetaspaceSize 1g
>>
>> As you can see, the managed memory is 54g, so it's already high (my
>> managed.fraction is set to 0.5).
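>>
>> For reference, this breakdown corresponds to the following Flink 1.10
>> configuration options (a sketch of the relevant flink-conf.yaml entries;
>> values are rounded to mirror the numbers above):
>>
>> taskmanager.memory.process.size: 112g
>> taskmanager.memory.framework.heap.size: 200m
>> taskmanager.memory.task.heap.size: 50g
>> taskmanager.memory.managed.fraction: 0.5
>> taskmanager.memory.framework.off-heap.size: 512m
>> taskmanager.memory.task.off-heap.size: 1g
>> taskmanager.memory.jvm-metaspace.size: 1g
>> # network memory is controlled by taskmanager.memory.network.{fraction,min,max}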
>>
>> I'm running Flink 1.10. Full job details attached.
>>
>> Can someone advise what would cause a managed memory leak?
>>
>>
>>
