[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17336318#comment-17336318 ] Flink Jira Bot commented on FLINK-15906: This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

physical memory exceeded causing being killed by yarn
-----------------------------------------------------

                Key: FLINK-15906
                URL: https://issues.apache.org/jira/browse/FLINK-15906
            Project: Flink
         Issue Type: Bug
         Components: Deployment / YARN
           Reporter: liupengcheng
           Priority: Major
             Labels: stale-major

Recently, we encountered this issue when testing TPCDS queries with 100g of data. I first hit this issue when I had only set `taskmanager.memory.total-process.size` to `4g` via the `-tm` option. I then tried to increase the JVM overhead with the following settings, but it still failed.

{code:java}
taskmanager.memory.jvm-overhead.min: 640m
taskmanager.memory.jvm-metaspace: 128m
taskmanager.memory.task.heap.size: 1408m
taskmanager.memory.framework.heap.size: 128m
taskmanager.memory.framework.off-heap.size: 128m
taskmanager.memory.managed.size: 1408m
taskmanager.memory.shuffle.max: 256m
{code}
{code:java}
java.lang.Exception: [2020-02-05 11:31:32.345]Container [pid=101677,containerID=container_e08_1578903621081_4785_01_51] is running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing container.
Dump of the process-tree for container_e08_1578903621081_4785_01_51 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 101938 101677 101677 101677 (java) 25762 3571 18867417088 1059157 /opt/soft/openjdk1.8.0/bin/java -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_51/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.shuffle.max=268435456b -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1476395008b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.shuffle.min=268435456b --configDir . -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b -Drest.address=zjy-hadoop-prc-st2805.bj
|- 101677 101671 101677 101677 (bash) 1 1 118030336 733 /bin/bash -c /opt/soft/openjdk1.8.0/bin/java -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_51/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.shuffle.max=268435456b -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D taskmanager.memory.task.heap.size=1476395008b -D taskmanager.memory.task.off-heap.size=0b -D taskmanager.memory.shuffle.min=268435456b --configDir . -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b -Drest.address=zjy-hadoop-prc-st2805.bj 1> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_51/taskmanager.out 2> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_51/taskmanager.err
{code}
I suspect there are some leaks or unexpected offheap memory usage.
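As a quick sanity check (editor's note, not part of the original report): the configured components above sum exactly to the 4g process size, and they match the JVM flags visible in the launch command:

{code}
  640m  jvm-overhead.min
+ 128m  jvm-metaspace            -> -XX:MaxMetaspaceSize=134217728 (128 MiB)
+ 1408m task.heap.size           \
+ 128m  framework.heap.size      /  -Xmx = -Xms = 1610612736 (1536 MiB)
+ 128m  framework.off-heap.size  \
+ 256m  shuffle.max              /  -XX:MaxDirectMemorySize=402653184 (384 MiB; task off-heap is 0)
+ 1408m managed.size             -> native memory, no JVM flag
= 4096m = 4g                     -> the YARN container limit that was exceeded
{code}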
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328066#comment-17328066 ] Flink Jira Bot commented on FLINK-15906: This major issue is unassigned, and it and all of its Sub-Tasks have not been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update; afterwards, please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243033#comment-17243033 ] yang gang commented on FLINK-15906: --- Hi [~xintongsong], thank you very much. After increasing the JVM overhead (taskmanager.memory.jvm-overhead.fraction: 0.3), we have not observed the physical-memory-exceeded exception again.
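For readers hitting the same symptom: the fix above is a one-line change to flink-conf.yaml. A minimal sketch (0.3 is the value reported to work here; Flink's default fraction is 0.1):

{code}
# flink-conf.yaml: reserve a larger share of the process size as JVM overhead (native memory slack)
taskmanager.memory.jvm-overhead.fraction: 0.3
{code}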
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242974#comment-17242974 ] Xintong Song commented on FLINK-15906: -- The exception suggests that the task manager is consuming more memory than expected. A Java program may consume several types of memory: heap, direct, native, and metaspace. For all of these types except native memory, Flink sets explicit upper limits via JVM parameters, so that an `OutOfMemoryError` is thrown if the process tries to use more memory than the limit. Since no OOM was thrown, the only possibility is that Flink uses more native memory than planned. By increasing the JVM overhead, Flink will reserve more native memory in the container. The extra memory may not actually be used by the JVM as its overhead, but it should help with your problem. BTW, did it solve your problem?
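To make the mapping concrete, the TaskManager launch command quoted in this ticket shows how the memory components translate into JVM limits. The sketch below is an annotated summary of those flags (an illustration, not an exhaustive model of the memory layout):

{code}
-Xmx / -Xms              = framework.heap + task.heap                     # capped -> "OutOfMemoryError: Java heap space"
-XX:MaxDirectMemorySize  = framework.off-heap + task.off-heap + shuffle   # capped -> "OutOfMemoryError: Direct buffer memory"
-XX:MaxMetaspaceSize     = jvm-metaspace                                  # capped -> "OutOfMemoryError: Metaspace"
managed memory, jvm-overhead = native memory                              # no JVM cap; only YARN's container limit applies
{code}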
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242968#comment-17242968 ] yang gang commented on FLINK-15906: --- Hi [~xintongsong], could you explain the relationship between this exception and that configuration option? Thanks.
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17230365#comment-17230365 ] Xintong Song commented on FLINK-15906: -- Hi [~清月], if the problem does not happen frequently, I would suggest first trying to configure a larger JVM overhead memory size. The configuration options are `taskmanager.memory.jvm-overhead.[min|max|fraction]`. https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup.html#capped-fractionated-components
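As a sketch of how the three options interact (per the capped-fractionated-components doc linked above): the overhead is derived as fraction × total process size and then clamped into [min, max]. The values below are, to my understanding, the defaults, shown only for illustration:

{code}
# flink-conf.yaml
taskmanager.memory.jvm-overhead.min: 192m       # lower bound
taskmanager.memory.jvm-overhead.max: 1g         # upper bound
taskmanager.memory.jvm-overhead.fraction: 0.1   # target = fraction * total process size, clamped to [min, max]
{code}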
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229817#comment-17229817 ] yang gang commented on FLINK-15906: ---
{code:java}
Closing TaskExecutor connection container_1597847003686_0079_01_000121. Because: Container [pid=4269,containerID=container_1597847003686_0079_01_000121] is running beyond physical memory limits. Current usage: 20.0 GB of 20 GB physical memory used; 24.9 GB of 100 GB virtual memory used. Killing container.
Dump of the process-tree for container_1597847003686_0079_01_000121 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 4298 4269 4269 4269 (java) 104835705 33430931 26634625024 5242644 /usr/local/jdk1.8/bin/java -Xmx10871635848 -Xms10871635848 -XX:MaxDirectMemorySize=1207959552 -XX:MaxMetaspaceSize=268435456 -server -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -XX:ParallelGCThreads=4 -XX:+AlwaysPreTouch -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=3 -DjobName=ck_local_growthline_new-10 -Dlog.file=/data2/yarn/containers/application_1597847003686_0079/container_1597847003686_0079_01_000121/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=1073741824b -D taskmanager.memory.network.min=1073741824b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=8053063800b -D taskmanager.cpu.cores=10.0 -D taskmanager.memory.task.heap.size=10737418120b -D taskmanager.memory.task.off-heap.size=0b --configDir . -Djobmanager.rpc.address={address} -Dweb.port=0 -Dweb.tmpdir=/tmp/flink-web-0874be2a-720d-443c-a069-0bb1fad69433 -Djobmanager.rpc.port=36047 -Drest.address={address}
|- 4269 4267 4269 4269 (bash) 0 0 115904512 359 /bin/bash -c /usr/local/jdk1.8/bin/java -Xmx10871635848 -Xms10871635848 -XX:MaxDirectMemorySize=1207959552 -XX:MaxMetaspaceSize=268435456 -server -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 -XX:ParallelGCThreads=4 -XX:+AlwaysPreTouch -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=3 -DjobName=ck_local_growthline_new-10 -Dlog.file=/data2/yarn/containers/application_1597847003686_0079/container_1597847003686_0079_01_000121/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=1073741824b -D taskmanager.memory.network.min=1073741824b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=8053063800b -D taskmanager.cpu.cores=10.0 -D taskmanager.memory.task.heap.size=10737418120b -D taskmanager.memory.task.off-heap.size=0b --configDir . -Djobmanager.rpc.address={address} -Dweb.port='0' -Dweb.tmpdir='/tmp/flink-web-0874be2a-720d-443c-a069-0bb1fad69433' -Djobmanager.rpc.port='36047' -Drest.address={address} 1> /data2/yarn/containers/application_1597847003686_0079/container_1597847003686_0079_01_000121/taskmanager.out 2> /data2/yarn/containers/application_1597847003686_0079/container_1597847003686_0079_01_000121/taskmanager.err
{code}
[~xintongsong] I have also encountered this kind of problem. It is a job computing DAU metrics, and the exception does not happen frequently. I have watched the memory metrics and logs of this job but have not found anything useful, so I would like to ask how you would approach solving this problem.
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186406#comment-17186406 ] Till Rohrmann commented on FLINK-15906: --- [~liupengcheng] did you find anything? Or has the problem been resolved in the meantime?
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039661#comment-17039661 ] liupengcheng commented on FLINK-15906: -- [~fly_in_gis] Thanks for the link; I've planned to debug this with native memory tracking or a heap dump.
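For reference, grabbing the heap dump mentioned above needs only standard JDK tooling; a minimal sketch (the pid being the TaskManager JVM inside the YARN container):

{code}
# heap dump of the TaskManager JVM, for leak analysis in e.g. Eclipse MAT
jmap -dump:live,format=b,file=/tmp/taskmanager-heap.hprof <pid>
# quick class histogram without writing a full dump
jcmd <pid> GC.class_histogram
{code}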
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039657#comment-17039657 ] liupengcheng commented on FLINK-15906: -- [~fly_in_gis] [~xintongsong] This issue happens quite often in our TPC-DS tests; the memory config is already described in the JIRA description. I don't think this is caused by insufficient JVM overhead: as I already noted in the JIRA, we raised the overhead even to 1 GB, but it may still fail occasionally.
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031569#comment-17031569 ] Yang Wang commented on FLINK-15906: --- Even if the taskmanager is killed by Yarn for running beyond the limit, it does not mean that we have a memory leak here, since the heap memory and direct memory are capped by the JVM. Maybe you have not set aside enough memory for native memory. First, I suggest increasing the jvm-overhead enough to make sure the taskmanager is not killed. Then use native memory tracking [1] to debug the memory usage; I think you should find something.
[1] https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
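Concretely, enabling native memory tracking for the TaskManagers might look like the sketch below (assumptions: `env.java.opts.taskmanager` is used to pass the extra JVM flag, and `jcmd` is run on the YARN node against the TaskManager pid; note NMT itself adds some overhead):

{code}
# flink-conf.yaml: start TaskManager JVMs with NMT enabled
env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary

# on the YARN node, inspect the running TaskManager:
jcmd <pid> VM.native_memory baseline        # record a baseline
jcmd <pid> VM.native_memory summary.diff    # later, see which category grew
{code}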
[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn
[ https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031251#comment-17031251 ] Xintong Song commented on FLINK-15906: -- The launching command shows that the calculated memory sizes and JVM arguments align with the configuration you provided. [~lzljs3620320], could you share some experience here? Have you ever run into such problems when running TPC-DS, and what are your memory configurations for it? [~liupengcheng], I have a few further questions.
- How often do you run into this issue? Always, frequently, occasionally, or just once?
- When exactly does it happen? Right after the cluster is started, right after starting to process TPC-DS queries, after running the queries for a while, or always when executing some specific queries?