[ https://issues.apache.org/jira/browse/HIVE-26404?focusedWorklogId=810371&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-810371 ]
ASF GitHub Bot logged work on HIVE-26404: ----------------------------------------- Author: ASF GitHub Bot Created on: 20/Sep/22 12:32 Start Date: 20/Sep/22 12:32 Worklog Time Spent: 10m Work Description: sonarcloud[bot] commented on PR #3514: URL: https://github.com/apache/hive/pull/3514#issuecomment-1252287854 Kudos, SonarCloud Quality Gate passed! [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=3514) [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=BUG) [![C](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/C-16px.png 'C')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=BUG) [2 Bugs](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=BUG) [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=VULNERABILITY) [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3514&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3514&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3514&resolved=false&types=SECURITY_HOTSPOT) [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=CODE_SMELL) [10 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3514&resolved=false&types=CODE_SMELL) [![No Coverage information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/NoCoverageInfo-16px.png 'No Coverage information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3514&metric=coverage&view=list) No Coverage information [![No Duplication information](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/NoDuplicationInfo-16px.png 'No Duplication information')](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=3514&metric=duplicated_lines_density&view=list) No Duplication information Issue Time Tracking ------------------- Worklog Id: (was: 810371) Time Spent: 1h 20m (was: 1h 10m) > HMS memory leak when compaction cleaner fails to remove obsolete files > ---------------------------------------------------------------------- > > Key: HIVE-26404 > URL: https://issues.apache.org/jira/browse/HIVE-26404 > Project: Hive > Issue Type: Bug > Components: Metastore > Affects Versions: 4.0.0-alpha-1 > Reporter: Stamatis Zampetakis > Assignee: Stamatis Zampetakis > Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > While investigating an issue where HMS becomes unresponsive we noticed a lot > of failed attempts from the compaction Cleaner thread to remove obsolete > directories with exceptions similar to the one below. > {noformat} > 2022-06-16 05:48:24,819 ERROR > org.apache.hadoop.hive.ql.txn.compactor.Cleaner: [Cleaner-executor-thread-0]: > Caught exception when cleaning, unable to complete cleaning of > id:4410976,dbname:my_database,tableName:my_table,partName:day=20220502,state:,type:MAJOR,enqueueTime:0,start:0,properties:null,runAs:some_user,tooManyAborts:false,hasOldAbort:false,highestWriteId:187502,errorMessage:null > java.io.IOException: Not enough history available for (187502,x). Oldest > available base: > hdfs://nameservice1/warehouse/tablespace/managed/hive/my_database.db/my_table/day=20220502/base_0188687_v4297872 > at > org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1432) > at > org.apache.hadoop.hive.ql.txn.compactor.Cleaner.removeFiles(Cleaner.java:261) > at > org.apache.hadoop.hive.ql.txn.compactor.Cleaner.access$000(Cleaner.java:71) > at > org.apache.hadoop.hive.ql.txn.compactor.Cleaner$1.run(Cleaner.java:203) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at > org.apache.hadoop.hive.ql.txn.compactor.Cleaner.clean(Cleaner.java:200) > at > org.apache.hadoop.hive.ql.txn.compactor.Cleaner.lambda$run$0(Cleaner.java:105) > at > org.apache.hadoop.hive.ql.txn.compactor.CompactorUtil$ThrowingRunnable.lambda$unchecked$0(CompactorUtil.java:54) > at > java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} > In addition the logs contained a large number of long JVM pauses as shown > below and the HMS (RSZ) memory kept increasing at rate of 90MB per hour. > {noformat} > 2022-06-16 16:17:17,805 WARN > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 34346ms > 2022-06-16 16:17:21,497 INFO > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 1690ms > 2022-06-16 16:17:57,696 WARN > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 34697ms > 2022-06-16 16:18:01,326 INFO > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 1628ms > 2022-06-16 16:18:37,280 WARN > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 34453ms > 2022-06-16 16:18:40,927 INFO > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 1646ms > 2022-06-16 16:19:16,929 WARN > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 33997ms > 2022-06-16 16:19:20,572 INFO > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 1637ms > 2022-06-16 16:20:01,643 WARN > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 39329ms > 2022-06-16 16:20:05,572 INFO > org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor: > [org.apache.hadoop.hive.metastore.metrics.JvmPauseMonitor$Monitor@5b022296]: > Detected pause in JVM or host machine (eg GC): pause of approximately 1927ms > {noformat} > We took a heapdump of the HMS around the time that it becomes unresponsive > and we have seen many Configuration objects (~40K) occupying more than 90% of > the current heap (~9GB). > {noformat} > Class Name | Objects | Shallow Heap | > Retained Heap > ---------------------------------------------------------------------------------------------- > org.apache.hadoop.conf.Configuration | 39,452 | 1,893,696 | > >= 8,560,573,960 > java.util.concurrent.ConcurrentHashMap | 155,863 | 9,975,232 | > >= 4,696,003,968 > java.util.concurrent.ConcurrentHashMap$Node[]| 139,348 | 1,312,967,944 | > >= 4,686,230,296 > java.util.Properties | 87,119 | 4,181,712 | > >= 4,193,638,904 > java.util.Hashtable$Entry[] | 87,840 | 987,968,472 | > >= 4,189,518,928 > java.util.concurrent.ConcurrentHashMap$Node | 99,097,078 | 3,171,106,496 | > >= 3,375,319,552 > java.util.Hashtable$Entry | 100,047,081 | 3,201,506,592 | > >= 3,201,551,936 > org.postgresql.jdbc.PgConnection | 6,488 | 830,464 | > >= 551,442,952 > ---------------------------------------------------------------------------------------------- > {noformat} > It turns out that these Configuration objects are all referenced by CACHE > entries in org.apache.hadoop.fs.FileSystem$Cache. > {noformat} > Class Name > | Shallow Heap | Retained Heap > ---------------------------------------------------------------------------------------------------------------------- > org.apache.hadoop.fs.FileSystem$Cache @ 0x45403fe70 > | 32 | 108,671,824 > |- <class> class org.apache.hadoop.fs.FileSystem$Cache @ 0x45410c3e0 > | 8 | 544 > '- map java.util.HashMap @ 0x453ffb598 > | 48 | 92,777,232 > |- <class> class java.util.HashMap @ 0x4520382c8 System Class > | 40 | 168 > |- entrySet java.util.HashMap$EntrySet @ 0x454077848 > | 16 | 16 > '- table java.util.HashMap$Node[32768] @ 0x463585b68 > | 131,088 | 92,777,168 > |- class java.util.HashMap$Node[] @ 0x4520b7790 > | 0 | 0 > '- [1786] java.util.HashMap$Node @ 0x451998ce0 > | 32 | 9,968 > |- <class> class java.util.HashMap$Node @ 0x4520b7728 System Class > | 8 | 32 > '- value org.apache.hadoop.hdfs.DistributedFileSystem @ 0x452990178 > | 56 | 4,976 > |- <class> class org.apache.hadoop.hdfs.DistributedFileSystem @ > 0x45402e290| 8 | 4,664 > |- uri java.net.URI @ 0x451a05cd0 hdfs://nameservice1 > | 80 | 432 > |- dfs org.apache.hadoop.hdfs.DFSClient @ 0x451f5d9b8 > | 128 | 3,824 > '- conf org.apache.hadoop.hive.conf.HiveConf @ 0x453a34b38 > | 80 | 250,160 > ---------------------------------------------------------------------------------------------------------------------- > {noformat} > As long as they are in the CACHE they cannot be garbage collected so this > leads to a memory leak. > The memory leak seems to come from the fact the compaction Cleaner attempts > to > [remove|https://github.com/apache/hive/blob/69e6a5a4151100849d2b03b6b14b1605c3abc3f1/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java#L266] > the obsolete files and fails. The exception does not allow the [filesystem > cleanup|https://github.com/apache/hive/blob/69e6a5a4151100849d2b03b6b14b1605c3abc3f1/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java#L270] > to take place so we are leaving filesystem entries in the CACHE and > subsequently configuration objects. > Although, the HMS unresponsiveness in this use-case may not be due to lack of > memory the leak needs to be addressed to avoid hitting OOM. -- This message was sent by Atlassian Jira (v8.20.10#820010)