[ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983106#comment-16983106
 ] 

Yang Wang commented on FLINK-14952:
-----------------------------------

[~pnowojski]

FYI, all the memory used by the Flink TaskManager should be counted and 
allocated from the Yarn ResourceManager. Otherwise, the container may be killed 
by the Yarn NodeManager. The NodeManager tracks the memory usage of the 
TaskManager process tree, usually as RSS memory read from `/proc/{pid}/stat`. 
The Yarn NodeManager does not use cgroups to enforce the memory limit; instead 
it starts a `ContainersMonitor` thread that monitors the memory usage of each 
container. If a container exceeds its limit, it is killed by the NodeManager 
with the log "is running beyond physical memory limits".
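
To make that concrete, below is a minimal, illustrative Java sketch of reading 
a process's RSS from `/proc/{pid}/stat` (field 24, in pages), which is roughly 
what the NodeManager's procfs-based process-tree tracking does for every 
process in a container. The class name and the hard-coded 4 KiB page size are 
assumptions for the example, not actual Yarn code.

{noformat}
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative only: reads the resident set size of a single process the same
// way the NodeManager's procfs-based monitoring does (Yarn additionally sums
// the values over the whole process tree of the container).
public class ProcRss {

    static long rssBytes(long pid) throws Exception {
        String stat = new String(Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")));
        // Field 2 (the command name) is wrapped in parentheses and may contain
        // spaces, so skip past the closing ')' before splitting the rest.
        String[] fields = stat.substring(stat.lastIndexOf(')') + 2).split("\\s+");
        long rssPages = Long.parseLong(fields[21]); // overall field 24 = RSS in pages
        long pageSizeBytes = 4096L;                 // assumed 4 KiB pages (typical on Linux x86_64)
        return rssPages * pageSizeBytes;
    }

    public static void main(String[] args) throws Exception {
        long pid = Long.parseLong(args[0]);
        System.out.printf("pid %d rss = %.1f MB%n", pid, rssBytes(pid) / (1024.0 * 1024.0));
    }
}
{noformat}

If the summed RSS of the container's process tree stays above the requested 
container size, the NodeManager kills the container with the "is running beyond 
physical memory limits" message quoted below.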

> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-14952
>                 URL: https://issues.apache.org/jira/browse/FLINK-14952
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / Network
>    Affects Versions: 1.9.1
>            Reporter: Piotr Nowojski
>            Priority: Blocker
>             Fix For: 1.10.0
>
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  a combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause the yarn container to exceed its memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_000004 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_000004] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory used by mmap is neither capped 
> nor accounted for by the configured memory limits; however, yarn does track 
> this memory usage, and once Flink exceeds the threshold, the container is 
> killed.
> A workaround is to override the default value and force Flink to not use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
