[ https://issues.apache.org/jira/browse/CASSANDRA-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135593#comment-16135593 ]
Hannu Kröger commented on CASSANDRA-13756:
------------------------------------------

Linking to the SSTable corruption ticket, which this same bug seems to cause.

> StreamingHistogram is not thread safe
> -------------------------------------
>
>                 Key: CASSANDRA-13756
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13756
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: xiangzhou xia
>            Assignee: Jeff Jirsa
>             Fix For: 3.0.x, 3.11.x
>
>
> While testing C* 3 in a shadow cluster, we noticed that after a period of
> time several data nodes suddenly hit 100% CPU and stopped processing queries.
> After investigation, we found that threads were stuck in sum() in the
> StreamingHistogram class. These are JMX threads that expose the
> getTombStoneRatio metric (since JMX is kicked off every 3 seconds, there is
> a chance that multiple JMX threads access the StreamingHistogram at the same
> time).
> After further investigation, we found that the optimization in
> CASSANDRA-13038 causes a spool flush every time sum() is called. Since
> TreeMap is not thread safe, threads get stuck when several of them call
> sum() at the same time.
> There are two approaches to solving this issue.
> The first is to add a lock around the flush in sum(), which introduces some
> extra overhead to StreamingHistogram.
> The second is to prevent StreamingHistogram from being accessed by multiple
> threads. In our specific case, that means removing the metrics we added.
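
The first approach in the description (a lock around the flush in sum()) can be sketched roughly as follows. This is a minimal illustration, not Cassandra's actual StreamingHistogram: the class name LockedHistogramSketch, its fields, and its method signatures are all hypothetical, and sum() here returns an exact count rather than the interpolated estimate a real streaming histogram computes. The assumption is only that sum() drains a spool into a TreeMap, so both the drain and the read must happen under one lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a spooling histogram; not Cassandra's class. */
final class LockedHistogramSketch {
    // Sorted bins: point -> count. TreeMap is not thread safe on its own.
    private final TreeMap<Long, Long> bins = new TreeMap<>();
    // Unsorted buffer of recent updates, merged into bins lazily.
    private final Map<Long, Long> spool = new HashMap<>();

    /** Buffers a point in the spool; writers take the same lock as readers. */
    public synchronized void update(long point, long count) {
        spool.merge(point, count, Long::sum);
    }

    /** Drains the spool into the sorted bins. Callers must hold the lock. */
    private void flushSpool() {
        for (Map.Entry<Long, Long> e : spool.entrySet()) {
            bins.merge(e.getKey(), e.getValue(), Long::sum);
        }
        spool.clear();
    }

    /**
     * Returns the total count of points less than or equal to bound.
     * The flush mutates the TreeMap, so the whole method is synchronized:
     * concurrent callers are serialized instead of racing on the map.
     */
    public synchronized long sum(long bound) {
        flushSpool();
        long total = 0;
        for (long c : bins.headMap(bound, true).values()) {
            total += c;
        }
        return total;
    }
}
```

With this shape, several metric threads calling sum() at once would queue on the object's monitor rather than modify the TreeMap concurrently. The cost is the extra locking overhead on every update() and sum() call that the description mentions.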