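@tonysong...@gmail.com Thanks for the reply.

I looked up what the parameter means:

taskmanager.memory.off-heap: if set to true, the memory the TaskManager allocates for sorting, hash tables, and caching of intermediate results is located outside the JVM heap. For setups with large memory, this can improve the efficiency of operations performed on that memory (default: false).

Memory used inside the JVM heap is limited by YARN, while memory outside the JVM heap is not limited by YARN. If that is the case, it would indeed explain my problem. I have already changed the configuration and am testing it now. Many thanks, tonysong...@gmail.com.

Does Flink have any best-practice sample projects that show some of these finer details? They would make Flink easier for everyone to use and show how powerful it is.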
On 2019-12-17 18:16:02, "Xintong Song" <tonysong...@gmail.com> wrote:
>This is not an OOM. The container exceeded its memory limit and was killed by YARN.
>The JVM itself cannot exceed its memory limits; it would report an OOM instead. So the more likely cause is that growing RocksDB memory usage pushed the container over the limit.
>
>Suggestions:
>
>1. Add the following configuration:
>taskmanager.memory.off-heap: true
>taskmanager.memory.preallocate: false
>
>2. If you already use the configuration above, or the problem persists after changing it, try increasing the following option. Its default value when unset is 0.25:
>containerized.heap-cutoff-ratio
>
>Thank you~
>
>Xintong Song
>
>
>
>On Tue, Dec 17, 2019 at 5:49 PM USERNAME <oracle...@126.com> wrote:
>
>> Version: flink 1.9.1
>> --Run command
>> flink run -d -m yarn-cluster -yn 40 -ys 2 ****
>>
>>
>> --Partial code
>> env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
>> RocksDBStateBackend backend = new RocksDBStateBackend(CHECKPOINT_PATH,
>> true);
>>
>>
>> .keyBy("imei") //100,000+ keys
>> .window(EventTimeSessionWindows.withGap(Time.hours(1))) //a device with no data point for over 1 hour counts as offline
>> .trigger(new Trigger())
>> .aggregate(new AggregateFunction(), new ProcessWindowFunction())
>>
>>
>> --Data
>> 100,000+ devices in total, each sending one record every 30 seconds, about 200,000 records per minute.
>>
>>
>> --Symptom
>> After running for a while (a few days), the taskmanager dies.
>>
>>
>> --Questions
>> 1. Why does Flink's memory keep growing?
>> The data volume is fairly large and the window retention may be long, but each record is only needed for one computation, and I have also set up StateTtlConfig. I do not know
>> where or what keeps holding on to the memory. My usage may be wrong; I hope an expert can point me in the right direction.
>> 2. Can the flink / yarn parameters be tuned?
>> Are there best practices for configuring flink on yarn?
>>
>>
>> This problem has bothered me for a long time, from 1.7 through 1.8 to 1.9. I hope someone who knows the internals or has hit a similar problem can give me some pointers.
>>
>>
>>
>>
>> --Error message --> nodemanager .log
>>
>>
>> 2019-12-17 16:55:16,545 WARN
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>> Process tree for container: container_e16_1575354121024_0050_01_000008 has
>> processes older than 1 iteration running over the configured limit.
>> Limit=3221225472, current usage = 3222126592
>> 2019-12-17 16:55:16,546 WARN
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>> Container
>> [pid=184523,containerID=container_e16_1575354121024_0050_01_000008] is
>> running 901120B beyond the 'PHYSICAL' memory limit. Current usage: 3.0 GB
>> of 3 GB physical memory used; 4.9 GB of 30 GB virtual memory used.
Killing
>> container.
>> Dump of the process-tree for container_e16_1575354121024_0050_01_000008 :
>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> |- 184701 184523 184523 184523 (java) 21977 4845 5166649344 786279
>> /usr/local/jdk1.8.0_171/bin/java -Xms2224m -Xmx2224m
>> -XX:MaxDirectMemorySize=848m -XX:NewRatio=2 -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+AlwaysPreTouch -server -XX:+HeapDumpOnOutOfMemoryError
>> -Dlog.file=/opt/hadoop/logs/userlogs/application_1575354121024_0050/container_e16_1575354121024_0050_01_000008/taskmanager.log
>> -Dlogback.configurationFile=file:./logback.xml
>> -Dlog4j.configuration=file:./log4j.properties
>> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
>> |- 184523 184521 184523 184523 (bash) 2 3 118067200 373 /bin/bash -c
>> /usr/local/jdk1.8.0_171/bin/java -Xms2224m -Xmx2224m
>> -XX:MaxDirectMemorySize=848m -XX:NewRatio=2 -XX:+UseConcMarkSweepGC
>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>> -XX:+AlwaysPreTouch -server -XX:+HeapDumpOnOutOfMemoryError
>> -Dlog.file=/opt/hadoop/logs/userlogs/application_1575354121024_0050/container_e16_1575354121024_0050_01_000008/taskmanager.log
>> -Dlogback.configurationFile=file:./logback.xml
>> -Dlog4j.configuration=file:./log4j.properties
>> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
1>
>> /opt/hadoop/logs/userlogs/application_1575354121024_0050/container_e16_1575354121024_0050_01_000008/taskmanager.out
>> 2>
>> /opt/hadoop/logs/userlogs/application_1575354121024_0050/container_e16_1575354121024_0050_01_000008/taskmanager.err
>>
>>
>>
>> 2019-12-17 16:55:16,546 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>> Removed ProcessTree with root 184523
>> 2019-12-17 16:55:16,547 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
>> Container container_e16_1575354121024_0050_01_000008 transitioned from
>> RUNNING to KILLING
>> 2019-12-17 16:55:16,549 INFO
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
>> Cleaning up container container_e16_1575354121024_0050_01_000008
>> 2019-12-17 16:55:16,579 WARN
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit
>> code from container container_e16_1575354121024_0050_01_000008 is : 143
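
For anyone reading the NodeManager log above, the numbers in it can be cross-checked with a few lines of arithmetic. The following is only a rough sketch: every value is copied from the quoted log, the 4096-byte page size is an assumption (the log does not state it), and the names `summarize`, `RSS_PAGES`, and so on are made up for illustration.

```python
# Values copied from the quoted NodeManager log.
LIMIT_BYTES = 3221225472                   # "Limit=3221225472"
USAGE_BYTES = 3222126592                   # "current usage = 3222126592"
RSS_PAGES = {"java": 786279, "bash": 373}  # RSSMEM_USAGE(PAGES) column
XMX_MB = 2224                              # -Xmx2224m
MAX_DIRECT_MB = 848                        # -XX:MaxDirectMemorySize=848m

PAGE_SIZE = 4096   # assumption: 4 KiB memory pages on the NodeManager host
MIB = 1024 * 1024


def summarize():
    """Recompute the figures YARN reports from the raw log values."""
    rss_bytes = sum(RSS_PAGES.values()) * PAGE_SIZE
    return {
        "limit_mib": LIMIT_BYTES // MIB,
        "rss_bytes": rss_bytes,
        "over_limit_bytes": USAGE_BYTES - LIMIT_BYTES,
        "jvm_budget_mib": XMX_MB + MAX_DIRECT_MB,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(name, value)
```

Under the page-size assumption, the resident pages of the process tree add up to exactly the reported usage, and the overage matches the 901120B in the kill message. The heap size plus the direct-memory cap also equals the full 3072 MiB container limit, so the JVM flags leave no headroom for native allocations made outside those two budgets. That is consistent with the RocksDB explanation in the reply, although the log alone does not show which component used the extra memory.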