[ https://issues.apache.org/jira/browse/HIVE-27695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773984#comment-17773984 ]
Stamatis Zampetakis commented on HIVE-27695:
--------------------------------------------

The OOM raised from the AM shows that GC is consuming almost the entire CPU time while failing to free up any memory. The problem also shows up in CPU flamegraphs obtained via async-profiler and is also visible when enabling verbose GC logs: there are many "Full GC" cycles that do not really free up any memory.

{noformat}
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
{noformat}

The heap dump shows that a big amount of memory is occupied by configuration objects:
* org.apache.hadoop.conf.Configuration (32% of the heap)
* org.apache.hadoop.mapred.JobConf (17% of the heap)

The biggest configuration objects retain roughly ~1MB of memory each and are used extensively in the application. By analyzing the heap dump we can observe that each HeldContainer holds around 10 configuration objects (among other things), so we can roughly estimate that each container requires 10MB of heap memory in the application master.

{noformat}
Class Name                                                                 | Objects | Referenced Objects | Ref. Shallow Heap
------------------------------------------------------------------------------------------------------------------------------
org.apache.tez.dag.app.rm.YarnTaskSchedulerService$DelayedContainerManager |       1 |                 40 |             1,920
'- java.util.concurrent.PriorityBlockingQueue                              |       1 |                 30 |             1,440
   '- java.lang.Object[]                                                   |       1 |                 30 |             1,440
      '- org.apache.tez.dag.app.rm.YarnTaskSchedulerService$HeldContainer  |       3 |                 30 |             1,440
         '- org.apache.tez.dag.app.ContainerContext                        |       3 |                 30 |             1,440
            '- org.apache.tez.dag.app.dag.impl.VertexImpl                  |       3 |                 30 |             1,440
               |- org.apache.tez.dag.app.dag.impl.DAGImpl                  |       2 |                 16 |               768
               |- java.util.HashMap                                        |       2 |                  6 |               288
               |- org.apache.hadoop.conf.Configuration                     |       5 |                  5 |               240
               '- org.apache.tez.dag.app.dag.impl.VertexManager            |       3 |                  3 |               144
------------------------------------------------------------------------------------------------------------------------------
{noformat}

In the heap dump we can see four containers held by the AM, so this can easily mean ~40MB of constant memory consumption. Depending on the environment we can have more Tez containers up and running, which further increases the memory demands of the AM.

All the elements used in the analysis above are attached inside the [^dag_am_debug_bundles.tar.xz] file. Notably, the heap dump from the OOM, the GC logs, and the allocation profiling for the misbehaving AM can be found under the dag_am_128xmx_oom directory.

In order to avoid the OOM we can do one of the following:
# limit container reuse and thus reduce the number of active containers tracked by the AM (tez.am.container.reuse.enabled=false)
# increase the maximum heap space (Xmx) for the AM to alleviate the GC pressure caused by the number of active containers (tez.am.resource.memory.mb=256/512)
# limit the maximum number of containers by tuning the appropriate YARN properties

I've tested two of the three options above, monitoring GC activity, CPU usage, and total test execution time; a sketch of the corresponding configuration overrides follows below. The best option is to bump the maximum heap for the AM to 512MB: with this heap size GC activity remains pretty low, tests run in 8:13 minutes, and obviously there is no OOM. Bumping the heap to 256MB also works pretty well, although tests are slightly slower than with 512MB, taking ~8:27 minutes. Disabling container reuse while keeping the max heap constant at 128MB also solves the OOM, but the duration of the tests is very long (~42 minutes) due to the overhead introduced by creating new containers all the time.
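For reference, here is a minimal sketch of how the two tested options could be wired in, assuming the overrides are placed in a tez-site.xml (or equivalent configuration override) visible to the test harness; the exact plumbing in the qtest setup may differ. The property names are the ones already cited in the options above; the values come from the experiments.

{noformat}
<!-- tez-site.xml (sketch): pick ONE of the two overrides; they were evaluated independently -->

<!-- Option 2 (preferred): request a 512MB AM container -->
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>512</value>
</property>

<!-- Option 1 (also avoids the OOM, but ~5x slower test runs): disable container reuse -->
<property>
  <name>tez.am.container.reuse.enabled</name>
  <value>false</value>
</property>
{noformat}

Note that, unless the AM Xmx is pinned explicitly through tez.am.launch.cmd-opts, Tez derives it as a fraction of tez.am.resource.memory.mb (tez.container.max.java.heap.fraction, 0.8 by default, if I read the Tez defaults correctly), so the 512MB setting should translate to roughly a 400MB heap.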
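For completeness, verbose GC logging of the kind referenced above can be enabled on the AM JVM (assuming a JDK 8 era runtime) with flags along these lines, e.g. appended to tez.am.launch.cmd-opts; the exact flags used to produce the attached logs are not listed here.

{noformat}
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log
{noformat}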
The GC logs as well as the CPU profiling from the three solutions outlined above are also present inside the [^dag_am_debug_bundles.tar.xz] file.

> Intermittent OOM when running TestMiniTezCliDriver
> --------------------------------------------------
>
>                 Key: HIVE-27695
>                 URL: https://issues.apache.org/jira/browse/HIVE-27695
>             Project: Hive
>          Issue Type: Bug
>          Components: Test
>    Affects Versions: 4.0.0-beta-1
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: am_heap_dumps.tar.xz, dag_am_debug_bundles.tar.xz, leak_suspect_1.png
>
> Running all the tests under TestMiniTezCliDriver very frequently (but still intermittently) leads to OutOfMemory errors.
> {noformat}
> cd itests/qtest && mvn test -Dtest=TestMiniTezCliDriver
> {noformat}
> I set {{-XX:+HeapDumpOnOutOfMemoryError}} and the respective heap dumps are attached to this ticket.
> The OOM is thrown from the application master and a quick inspection of the dumps shows that it comes mainly from the accumulation of Configuration objects (~1MB each) by various classes.
> The max heap size for the application master is pretty low (~100MB) so it is quite easy to reach. The heap size is explicitly very low for testing purposes, but maybe we should re-evaluate the current configurations for the tests.