[ https://issues.apache.org/jira/browse/HDDS-11240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871276#comment-17871276 ]
Ivan Andika edited comment on HDDS-11240 at 8/6/24 9:47 AM:
------------------------------------------------------------

[~weiming] Our cluster hit the same ThreadLocal-related CPU issue after we upgraded our GC from CMS to G1 (due to high GC time caused by a large number of user listKeys operations). We are still running on JDK 1.8.

The most likely cause is that the WeakReference entries in ThreadLocalMap are not being cleaned up fast enough, or at all, by the GC. Since the ThreadLocalMap implementation seems to use linear probing, the larger the ThreadLocalMap, the longer ThreadLocal operations take, which is why CPU usage increased. This seems to be confirmed by [https://mail.openjdk.org/pipermail/zgc-dev/2020-November/000984.html].

Currently, we are trying to raise the G1 soft pause-time limit "-XX:MaxGCPauseMillis" (from 200 to 400), in the hope that this gives G1 more time to clean up the WeakReference entries in ThreadLocalMap, but I'm not sure whether this will work.

I saw that in [https://docs.google.com/document/d/1g1h-63fvA-be-clvyVRAHLWehoadCnjNRWyX8Bp-UIU/edit] the problem was fixed by moving to a JDK version that incorporates [https://bugs.openjdk.org/browse/JDK-8188055] (and [https://bugs.openjdk.org/browse/JDK-8256167]).

Could you help check whether your JDK has integrated this change and whether the problem was resolved? Also, which GC implementation are the OMs currently using?

cc: [~XiChen] [~whbing]

> High cpu usage on ReadWrite locks in JDK17
> ------------------------------------------
>
>                 Key: HDDS-11240
>                 URL: https://issues.apache.org/jira/browse/HDDS-11240
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>         Environment: JDK:
> openjdk 17.0.2 2022-01-18
> OpenJDK Runtime Environment (build 17.0.2+8-86)
> OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
> Ozone:
> 1.4.0
>            Reporter: weiming
>            Assignee: Tanvi Penumudy
>            Priority: Major
>         Attachments: flamegraph.profile.html, image-2024-07-28-20-17-58-466.png, image-2024-07-30-09-32-16-320.png
>
>
> That will cause threads on the following stack trace to consume a lot of CPU:
> "IPC Server handler 7 on default port 9862" #3994 daemon prio=5 os_prio=0 cpu=5403833.36ms elapsed=653145.54s tid=0x00007fa03fdd2a00 nid=0x921f9 runnable [0x00007fa0ca3fd000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(java.base@17.0.2/ThreadLocal.java:632)
>         at java.lang.ThreadLocal$ThreadLocalMap.remove(java.base@17.0.2/ThreadLocal.java:516)
>         at java.lang.ThreadLocal.remove(java.base@17.0.2/ThreadLocal.java:242)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(java.base@17.0.2/ReentrantReadWriteLock.java:430)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(java.base@17.0.2/AbstractQueuedSynchronizer.java:1094)
>         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(java.base@17.0.2/ReentrantReadWriteLock.java:897)
>         at org.apache.hadoop.ozone.upgrade.AbstractLayoutVersionManager.needsFinalization(AbstractLayoutVersionManager.java:182)
>         at org.apache.hadoop.ozone.om.request.validation.ValidationCondition$1.shouldApply(ValidationCondition.java:39)
>         at org.apache.hadoop.ozone.om.request.validation.RequestValidations.lambda$0(RequestValidations.java:110)
>         at org.apache.hadoop.ozone.om.request.validation.RequestValidations$$Lambda$839/0x00000008013cda80.test(Unknown Source)
>
> [^flamegraph.profile.html]
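For the two questions asked in the comment (which JDK build and which GC implementation a JVM is running with), a minimal sketch using only the standard java.lang.management API might look like the following. The class name JvmGcReport and the helper findFlag are hypothetical names chosen for illustration and are not part of Ozone. The program reports on the JVM it runs in, so it only helps if it is launched with the same JDK and options as the OM. It does not by itself tell you whether a build contains JDK-8188055; the reported version still has to be compared against the fix versions listed on the bug pages.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Ozone: prints the JDK version, the active
// collectors, and an explicitly configured -XX:MaxGCPauseMillis value.
class JvmGcReport {

    /** Returns the value of -XX:<name>=<value> if it appears in the given JVM arguments. */
    static Optional<String> findFlag(List<String> jvmArgs, String name) {
        String prefix = "-XX:" + name + "=";
        Optional<String> value = Optional.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                // Keep the last occurrence if the flag is repeated.
                value = Optional.of(arg.substring(prefix.length()));
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));

        // One MXBean per collector; the bean names identify the GC implementation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("collector: " + gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }

        // Only flags passed on the command line show up here, not JVM defaults.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("MaxGCPauseMillis (explicit) = "
                + findFlag(jvmArgs, "MaxGCPauseMillis").orElse("not set on the command line"));
    }
}
```

For an OM process that is already running, the same information should be obtainable without extra code via `jcmd <pid> VM.version` and `jcmd <pid> VM.flags`.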