[ https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Boicehuang updated LUCENE-10519:
--------------------------------

Description:
h2. Problem
----
{*}org.apache.lucene.util.CloseableThreadLocal{*} (which uses {*}ThreadLocal<WeakReference<T>>{*} internally) still has a flaw under G1GC. Each thread holds a single ThreadLocalMap, shared by all ThreadLocals, and that master map only purges stale entries periodically. When we close a CloseableThreadLocal, we only clear the entry for the current thread; entries belonging to other threads are left to be reclaimed through their WeakReferences. Under G1GC, those WeakReferences may survive several rounds of mixed GC, so the ThreadLocalMap can grow very large, and iterating over its stale entries can consume an arbitrarily large amount of CPU and time.

Hot thread of Elasticsearch:
{code:java}
::: {xxxxxxxxx}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{xxxxxxxx}{xxxxxxxx}
Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

105.3% (526.5ms out of 500ms) cpu usage by thread 'elasticsearch[xxxxxxxx][bulk][T#31]'
  10/10 snapshots sharing following 34 elements
    java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
    java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
    java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
    java.lang.ThreadLocal.remove(ThreadLocal.java:224)
    java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
    org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
    org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
    org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
    org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
    org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
    org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
    org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
    org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
{code}

h2. Solution
----
This bug does not reproduce under CMS, but it reproduces consistently under G1GC. We do not actually need to store each entry twice, once in hardRefs and once in a ThreadLocal. Removing the ThreadLocal from CloseableThreadLocal frees it from this serious flaw of Java's built-in ThreadLocal. I would like to provide a patch to fix it if needed.

h2. See also
----
[https://github.com/elastic/elasticsearch/issues/56766]
[https://bugs.openjdk.java.net/browse/JDK-8182982]
[https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]
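For illustration, here is a minimal sketch (not the actual patch; class and method names are hypothetical) of the direction described above: per-thread values live only in a hard-reference map keyed by Thread, with no inner java.lang.ThreadLocal, so close() drops every thread's value at once and nothing is left behind in any thread's ThreadLocalMap for expungeStaleEntry to crawl.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a CloseableThreadLocal that drops the inner
 * java.lang.ThreadLocal and keeps per-thread values only in a
 * hard-reference map keyed by Thread.
 */
class NoThreadLocalCloseable<T> implements AutoCloseable {

  // Single source of truth: one hard reference per thread, no duplicate
  // entry inside each thread's ThreadLocalMap.
  private volatile Map<Thread, T> values = new ConcurrentHashMap<>();

  /** Subclasses may override to lazily create the per-thread value. */
  protected T initialValue() {
    return null;
  }

  public T get() {
    Map<Thread, T> map = values;
    if (map == null) {
      throw new IllegalStateException("already closed");
    }
    return map.computeIfAbsent(Thread.currentThread(), t -> initialValue());
  }

  public void set(T value) {
    values.put(Thread.currentThread(), value);
  }

  @Override
  public void close() {
    // Drop every thread's value in one step; no per-thread cleanup is
    // needed, and no stale ThreadLocalMap entries are left for GC.
    values = null;
  }
}
{code}

A real implementation would additionally need to purge entries for threads that have died before close() is called (for example by periodically checking Thread.isAlive() during get/set), since the map intentionally holds hard references.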
> ThreadLocal.remove under G1GC takes 100% CPU
> --------------------------------------------
>
>                 Key: LUCENE-10519
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10519
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 8.9, 8.10.1, 8.11.1
>         Environment: Elasticsearch v7.16.0
>                      OpenJDK v11
>            Reporter: Boicehuang
>            Priority: Critical
>              Labels: CloseableThreadLocal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)