[ https://issues.apache.org/jira/browse/CASSANDRA-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13865589#comment-13865589 ]
Aleksey Yeschenko commented on CASSANDRA-6405: ---------------------------------------------- [~alienth] No built-in metrics that comes to mind. [~slebresne] any ideas? > When making heavy use of counters, neighbor nodes occasionally enter spiral > of constant memory consumpion > --------------------------------------------------------------------------------------------------------- > > Key: CASSANDRA-6405 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6405 > Project: Cassandra > Issue Type: Bug > Environment: RF of 3, 15 nodes. > Sun Java 7 (also occurred in OpenJDK 6, and Sun Java 6). > Xmx of 8G. > No row cache. > Reporter: Jason Harvey > Attachments: threaddump.txt > > > We're randomly running into an interesting issue on our ring. When making use > of counters, we'll occasionally have 3 nodes (always neighbors) suddenly > start immediately filling up memory, CMSing, fill up again, repeat. This > pattern goes on for 5-20 minutes. Nearly all requests to the nodes time out > during this period. Restarting one, two, or all three of the nodes does not > resolve the spiral; after a restart the three nodes immediately start hogging > up memory again and CMSing constantly. > When the issue resolves itself, all 3 nodes immediately get better. Sometimes > it reoccurs in bursts, where it will be trashed for 20 minutes, fine for 5, > trashed for 20, and repeat that cycle a few times. > There are no unusual logs provided by cassandra during this period of time, > other than recording of the constant dropped read requests and the constant > CMS runs. I have analyzed the log files prior to multiple distinct instances > of this issue and have found no preceding events which are associated with > this issue. > I have verified that our apps are not performing any unusual number or type > of requests during this time. > This behaviour occurred on 1.0.12, 1.1.7, and now on 1.2.11. > The way I've narrowed this down to counters is a bit naive. It started > happening when we started making use of counter columns, went away after we > rolled back use of counter columns. I've repeated this attempted rollout on > each version now, and it consistently rears its head every time. I should > note this incident does _seem_ to happen more rarely on 1.2.11 compared to > the previous versions. > This incident has been consistent across multiple different types of > hardware, as well as major kernel version changes (2.6 all the way to 3.2). > The OS is operating normally during the event. > I managed to get an hprof dump when the issue was happening in the wild. > Something notable in the class instance counts as reported by jhat. Here are > the top 5 counts for this one node: > {code} > 5967846 instances of class org.apache.cassandra.db.CounterColumn > 1247525 instances of class > com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue > 1247310 instances of class org.apache.cassandra.cache.KeyCacheKey > 1246648 instances of class > com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node > 1237526 instances of class org.apache.cassandra.db.RowIndexEntry > {code} > Is it normal or expected for CounterColumn to have that number of instances? > The data model for how we use counters is as follows: between 50-20000 > counter columns per key. We currently have around 3 million keys total, but > this issue also replicated when we only had a few thousand keys total. > Average column count is around 1k, and 90th is 18k. New columns are added > regularly, and columns are incremented regularly. No column or key deletions > occur. We probably have 1-5k "hot" keys at any given time, spread across the > entire ring. R:W ratio is typically around 50:1. This is the only CF we're > using counters on, at this time. CF details are as follows: > {code} > ColumnFamily: CommentTree > Key Validation Class: org.apache.cassandra.db.marshal.AsciiType > Default column value validator: > org.apache.cassandra.db.marshal.CounterColumnType > Cells sorted by: > org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.LongType) > GC grace seconds: 864000 > Compaction min/max thresholds: 4/32 > Read repair chance: 0.01 > DC Local Read repair chance: 0.0 > Populate IO Cache on flush: false > Replicate on write: true > Caching: KEYS_ONLY > Bloom Filter FP chance: default > Built indexes: [] > Compaction Strategy: > org.apache.cassandra.db.compaction.LeveledCompactionStrategy > Compaction Strategy Options: > sstable_size_in_mb: 160 > Column Family: CommentTree > SSTable count: 30 > SSTables in each level: [1, 10, 19, 0, 0, 0, 0, 0, 0] > Space used (live): 4656930594 > Space used (total): 4677221791 > SSTable Compression Ratio: 0.0 > Number of Keys (estimate): 679680 > Memtable Columns Count: 8289 > Memtable Data Size: 2639908 > Memtable Switch Count: 5769 > Read Count: 185479324 > Read Latency: 1.786 ms. > Write Count: 5377562 > Write Latency: 0.026 ms. > Pending Tasks: 0 > Bloom Filter False Positives: 2914204 > Bloom Filter False Ratio: 0.56403 > Bloom Filter Space Used: 523952 > Compacted row minimum size: 30 > Compacted row maximum size: 4866323 > Compacted row mean size: 7742 > Average live cells per slice (last five minutes): 39.0 > Average tombstones per slice (last five minutes): 0.0 > {code} > Please let me know if I can provide any further information. I can provide > the hprof if desired, however it is 3GB so I'll need to provide it outside of > JIRA. -- This message was sent by Atlassian JIRA (v6.1.5#6160)