[ https://issues.apache.org/jira/browse/HDDS-11240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871276#comment-17871276 ]

Ivan Andika edited comment on HDDS-11240 at 8/6/24 10:06 AM:
-------------------------------------------------------------

[~weiming] Our cluster also hit the same ThreadLocal-related CPU issue after 
switching the GC from CMS to G1 (the switch was prompted by high GC times 
caused by a large number of user listKeys operations). We are still running 
on JDK 1.8.

A possible cause is that WeakReference entries in the ThreadLocalMap are not 
being cleared quickly enough (or at all) by the GC. Since the ThreadLocalMap 
implementation uses linear probing, the larger the map grows, the longer each 
ThreadLocal operation takes, which would explain the increased CPU usage. 
This seems to be confirmed by 
[https://mail.openjdk.org/pipermail/zgc-dev/2020-November/000984.html]. 
We are currently trying to raise the G1 soft pause-time goal 
"-XX:MaxGCPauseMillis" from 200 to 400, in the hope that the longer pauses 
give G1 enough time to clean up the WeakReferences in the ThreadLocalMap, but 
I'm not sure whether this will work.
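
For reference, below is a minimal, self-contained sketch (a hypothetical reproducer, not Ozone code) of the hot path visible in the attached stack trace: releasing a ReentrantReadWriteLock read lock goes through Sync.tryReleaseShared(), and a thread that is not the lock's "first reader" tracks its hold count in a ThreadLocal, so handler threads that acquire and release the read lock in a tight loop keep probing their ThreadLocalMap and end up in expungeStaleEntry() once the map accumulates stale entries.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical reproducer, not Ozone code: several threads hammer the same
// read lock the way IPC handlers calling needsFinalization() do. A thread
// that is not the lock's "first reader" keeps its hold count in a ThreadLocal,
// so its unlock() path ends in ThreadLocal.remove().
public class ReadLockThreadLocalHotPath {
    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        Runnable handler = () -> {
            for (long i = 0; i < 50_000_000L; i++) {
                LOCK.readLock().lock();
                try {
                    // empty critical section; the cost is in lock()/unlock() themselves
                } finally {
                    // unlock() -> Sync.tryReleaseShared() -> ThreadLocal.remove(),
                    // which is where the profile shows expungeStaleEntry()
                    LOCK.readLock().unlock();
                }
            }
        };
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(handler, "handler-" + t);
            threads[t].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}
{code}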

I saw that in 
[https://docs.google.com/document/d/1g1h-63fvA-be-clvyVRAHLWehoadCnjNRWyX8Bp-UIU/edit] 
the issue was fixed by bumping to a JDK version that incorporates 
[https://bugs.openjdk.org/browse/JDK-8188055] (and 
[https://bugs.openjdk.org/browse/JDK-8256167]). Could you help check whether 
your JDK has integrated these changes and whether that resolved the problem?

Also, which GC implementation are the OMs currently using?
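
In case it helps cross-check, one way to see which collector a JVM is actually running (assuming you can launch a small probe with the same JVM options as the OM, or query the OM over JMX) is to list the garbage collector MXBeans; G1 reports names like "G1 Young Generation" / "G1 Old Generation", while CMS reports "ParNew" / "ConcurrentMarkSweep":

{code:java}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe class; run it with the same GC flags as the OM process.
public class PrintActiveGc {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // e.g. "G1 Young Generation" / "G1 Old Generation" when G1 is active
            System.out.println(gc.getName() + " -> " + gc.getCollectionCount() + " collections");
        }
    }
}
{code}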

cc: [~XiChen] [~whbing]  


> High cpu usage on ReadWrite locks in JDK17
> ------------------------------------------
>
>                 Key: HDDS-11240
>                 URL: https://issues.apache.org/jira/browse/HDDS-11240
>             Project: Apache Ozone
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>         Environment: JDK:
> openjdk 17.0.2 2022-01-18
> OpenJDK Runtime Environment (build 17.0.2+8-86)
> OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)
> Ozone:
> 1.4.0
>  
>            Reporter: weiming
>            Assignee: Tanvi Penumudy
>            Priority: Major
>         Attachments: flamegraph.profile.html, 
> image-2024-07-28-20-17-58-466.png, image-2024-07-30-09-32-16-320.png
>
>
> This causes threads with the following stack trace to consume a lot of CPU:
> "IPC Server handler 7 on default port 9862" #3994 daemon prio=5 os_prio=0 
> cpu=5403833.36ms elapsed=653145.54s tid=0x00007fa03fdd2a00 nid=0x921f9 
> runnable  [0x00007fa0ca3fd000]
>    java.lang.Thread.State: RUNNABLE
>         at 
> java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(java.base@17.0.2/ThreadLocal.java:632)
>         at 
> java.lang.ThreadLocal$ThreadLocalMap.remove(java.base@17.0.2/ThreadLocal.java:516)
>         at java.lang.ThreadLocal.remove(java.base@17.0.2/ThreadLocal.java:242)
>         at 
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(java.base@17.0.2/ReentrantReadWriteLock.java:430)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(java.base@17.0.2/AbstractQueuedSynchronizer.java:1094)
>         at 
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(java.base@17.0.2/ReentrantReadWriteLock.java:897)
>         at 
> org.apache.hadoop.ozone.upgrade.AbstractLayoutVersionManager.needsFinalization(AbstractLayoutVersionManager.java:182)
>         at 
> org.apache.hadoop.ozone.om.request.validation.ValidationCondition$1.shouldApply(ValidationCondition.java:39)
>         at 
> org.apache.hadoop.ozone.om.request.validation.RequestValidations.lambda$0(RequestValidations.java:110)
>         at 
> org.apache.hadoop.ozone.om.request.validation.RequestValidations$$Lambda$839/0x00000008013cda80.test(Unknown Source)
>  
> [^flamegraph.profile.html]


