Re: Diagnosing memory issues

Mike Tue, 11 Dec 2012 07:29:29 -0800

Thank you for the response. Since the time of this question, we'veidentified a number of areas that needed improving and have helpedthings along quite a bit. To answer your question, we were seeing bothParNew and CMS. There were no errors in the log, and all the nodes havebeen up.

However, we are seeing one interesting issue. We are running a 6 nodecluster with a Replication Factor of 3. The nodes are pretty evenlybalanced. All reads and writes to Cassandra uses LOCAL_QUORUMconsistency. We are seeing a very interesting problem from the JMXstatistics. We discovered we had one column family with an extremelyhigh and unexpected write count. The writes to this column family aredone in conjunction with other writes to other column families such thattheir numbers should be roughly equivalent, but they are off by a factorof 10. We have yet to find anything in our code that could cause thisdiscrepancy in numbers.

What really interesting is that we see this behavior on only 5 of the 6nodes in our cluster. On 5 of the 6 nodes, we see statistics indicatingwe are writing two fast and this specific memtable is exceeding itsmemtable 128M limit, while this one other node seems to be handling theload ok (Memtables stay within their limites). Given our replicationfactor, I'm not sure how this is possible.

Any hints on what might be causing this additional load? Are thereother activities in Cassandra might account for this increased load on asingle column family?


Any insights would be appreciated,
-Mike


On 12/4/2012 3:33 PM, aaron morton wrote:

For background, a discussion on estimating working sethttp://www.mail-archive.com/user@cassandra.apache.org/msg25762.html .You can also just look at the size of tenured heap after a CMS.
Are you seeing lots of ParNew or CMS ?
GC activity is a result of configuration *and* workload. Look in yourdata model for wide rows, or long lived rows that get a lot ofdeletes, and look in your code for large reads / writes (e.g.sometimes we read 100,000 columns from a row).
The number that really jumps out at me below is the number of Pendingrequests for the Message Service. 24,000+ pending requests.INFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java(line 89) MessagingService n/a 24,229
Technically speaking that ain't right.
The whole server looks unhappy.

Are there any errors in the logs ?
Are all the nodes up ?
A very blunt approach is to reduce the in_memory_compaction_limit andthe concurrent_compactors or compaction_throughput_mb_per_sec. Thisreduces the impact compaction and repair have on the system and maygive you breathing space to look at other causes. Once you have a feelfor what's going on you can turn them up.
Hope that helps.
A

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com
On 5/12/2012, at 7:04 AM, Mike <mthero...@yahoo.com<mailto:mthero...@yahoo.com>> wrote:
Hello,
Our Cassandra cluster has, relatively recently, started experiencingmemory pressure that I am in the midsts of diagnosing. Our systemhas uneven levels of traffic, relatively light during the day, butextremely heavy during some overnight processing. We have startedgetting a message:
WARN [ScheduledTasks:1] 2012-12-04 09:08:58,579 GCInspector.java(line 145) Heap is 0.7520105072262254 full. You may need to reducememtable and/or cache sizes. Cassandra will now flush up to the twolargest memtables to free up memory. Adjustflush_largest_memtables_at threshold in cassandra.yaml if you don'twant Cassandra to do this automatically
I've started implementing some instrumentation to gather stats fromJMX to determine what is happening. However, last night, theGCInspector was kind enough to log the information below. Couple ofthings jumped out at me.
The maximum heap for the Cassandra is 4GB. We are running Cassandra1.1.2, on a 6 node cluster, with a replication factor of 3. All ourqueries use LOCAL_QUORUM consistency.
Adding up the caches + the memtable "data" in the trace below, comesto under 600MB
The number that really jumps out at me below is the number of Pendingrequests for the Message Service. 24,000+ pending requests.
Does this number represent the number of outstanding client requeststhat this node is processing? If so, does this mean we potentiallyhave 24,000 responses being pulled into memory, thereby causing thismemory issue? What else should I look at?
INFO [ScheduledTasks:1] 2012-12-04 09:00:37,585 StatusLogger.java(line 57) Pool Name Active Pending BlockedINFO [ScheduledTasks:1] 2012-12-04 09:00:37,695 StatusLogger.java(line 72) ReadStage 32 66 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,696 StatusLogger.java(line 72) RequestResponseStage 0 193 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,696 StatusLogger.java(line 72) ReadRepairStage 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,696 StatusLogger.java(line 72) MutationStage 2 2 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,697 StatusLogger.java(line 72) ReplicateOnWriteStage 5 5 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,698 StatusLogger.java(line 72) GossipStage 0 13 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,698 StatusLogger.java(line 72) AntiEntropyStage 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,698 StatusLogger.java(line 72) MigrationStage 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,699 StatusLogger.java(line 72) StreamStage 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,699 StatusLogger.java(line 72) MemtablePostFlusher 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,699 StatusLogger.java(line 72) FlushWriter 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,700 StatusLogger.java(line 72) MiscStage 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,700 StatusLogger.java(line 72) commitlog_archiver 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,700 StatusLogger.java(line 72) InternalResponseStage 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,701 StatusLogger.java(line 72) AntiEntropySessions 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,701 StatusLogger.java(line 72) HintedHandoff 0 0 0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java(line 77) CompactionManager 2 4INFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java(line 89) MessagingService n/a 24,229INFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java(line 99) Cache Type Size Capacity KeysToSaveProviderINFO [ScheduledTasks:1] 2012-12-04 09:00:37,702 StatusLogger.java(line 100) KeyCache 2184533 2184533 allINFO [ScheduledTasks:1] 2012-12-04 09:00:37,703 StatusLogger.java(line 106) RowCache 52385581 52428800allorg.apache.cassandra.cache.SerializingCacheProviderINFO [ScheduledTasks:1] 2012-12-04 09:00:37,703 StatusLogger.java(line 113) ColumnFamily Memtable ops,dataINFO [ScheduledTasks:1] 2012-12-04 09:00:37,703 StatusLogger.java(line 116) system.NodeIdInfo 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,704 StatusLogger.java(line 116) system.IndexInfo 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java(line 116) system.LocationInfo 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java(line 116) system.Versions 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java(line 116) system.schema_keyspaces 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,705 StatusLogger.java(line 116) system.Migrations 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java(line 116) system.schema_columnfamilies 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java(line 116) system.schema_columns 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java(line 116) system.HintsColumnFamily 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,706 StatusLogger.java(line 116) system.Schema 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java(line 116) open.comp 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java(line 116) open.bp 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java(line 116) open.bn 312832,47184787INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java(line 116) open.p 711,193201INFO [ScheduledTasks:1] 2012-12-04 09:00:37,707 StatusLogger.java(line 116) open.bid 273064,46316018INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java(line 116) open.rel 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java(line 116) open.images 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java(line 116) open.users 62287,86665510INFO [ScheduledTasks:1] 2012-12-04 09:00:37,708 StatusLogger.java(line 116) open.sessions 4710,13153051INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java(line 116) open.userIndices 4,1960INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java(line 116) open.caches 50,4813457INFO [ScheduledTasks:1] 2012-12-04 09:00:37,709 StatusLogger.java(line 116) open.content 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,710 StatusLogger.java(line 116) open.enrich 30,20793INFO [ScheduledTasks:1] 2012-12-04 09:00:37,744 StatusLogger.java(line 116) open.bt 1133,776831INFO [ScheduledTasks:1] 2012-12-04 09:00:37,863 StatusLogger.java(line 116) open.alias 253,163933INFO [ScheduledTasks:1] 2012-12-04 09:00:37,864 StatusLogger.java(line 116) open.bymsgid 249610,73075517INFO [ScheduledTasks:1] 2012-12-04 09:00:37,864 StatusLogger.java(line 116) open.rank 319956,70898417INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java(line 116) open.cmap 448,406193INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java(line 116) open.pmap 659,566220INFO [ScheduledTasks:1] 2012-12-04 09:00:37,865 StatusLogger.java(line 116) open.pict 50944,58659596INFO [ScheduledTasks:1] 2012-12-04 09:00:37,878 StatusLogger.java(line 116) open.w 0,0INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java(line 116) open.s 92395,46160381INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java(line 116) open.bymrel 136607,57780555INFO [ScheduledTasks:1] 2012-12-04 09:00:37,879 StatusLogger.java(line 116) open.m 26720,51150067
It's appreciated,
Thanks,
-Mike

Re: Diagnosing memory issues

Reply via email to