Re: Many ParNew collections
Thanks Aaron, I found the problem. It is described in this thread: "minor compaction and delete expired column-tombstones". I have one big row, called Index, which contains many tombstones. Reading all these tombstones caused the memory issues. I think nodes 1 and 3 have had enough minor compactions that the tombstones were removed. The second node still contains several old SSTables, and it takes some time before the whole thing is compacted again.

Thanks,

Rene

2012/9/17 aaron morton aa...@thelastpickle.com:

> The second node (the one suffering from many GCs) has a high read latency compared to the others. Another thing is that the compacted row maximum size is bigger than on the other nodes.
>
> Node 2 also:
> * has about 220MB of data, while the others have about 45MB
> * has about 1 million keys, while the others have about 0.3 million
>
> - Should the other nodes also have that wide row?
> Yes. Are you running repair? What CL are you using?
>
> - Could repeatedly reading a wide row cause ParNew problems?
> Maybe. Are you reading the whole thing? It's only 22MB; big, but not huge.
>
> I would:
> * ensure repair is running and completing; this may even out the data load.
> * determine if GC is associated with compactions, repair, or general activity.
> * if GC is associated with compactions, the simple thing is to reduce concurrent_compactions and in_memory_compaction_limit in the yaml. Note this is often a simple/quick fix that can increase IO load and slow down compaction. The harder thing is to tune the JVM memory settings (the defaults often do a good job).
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/09/2012, at 10:41 PM, Rene Kochen rene.koc...@schange.com wrote:
>
>> Thanks Aaron,
>>
>> At another production site the exact same problems occur (also after ~6 months). [...]
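Rene's diagnosis (tombstones surviving in old SSTables on node 2) follows from the rule Cassandra uses to decide whether a compaction may drop a tombstone. A minimal sketch of that rule, with the default gc_grace_seconds; the function and parameter names here are mine, not Cassandra's:

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra default gc_grace_seconds: 10 days

def tombstone_purgeable(tombstone_written_at, now, row_in_non_compacting_sstables):
    """A minor compaction may drop a tombstone only if (a) it is older than
    gc_grace_seconds, so every replica has had a chance to see the delete,
    and (b) no SSTable outside this compaction still holds a fragment of
    the row -- otherwise older data could "resurrect" after the purge."""
    aged_out = (now - tombstone_written_at) > GC_GRACE_SECONDS
    return aged_out and not row_in_non_compacting_sstables

# Node 2's situation: the tombstones are old enough, but the Index row still
# has fragments in old SSTables that no minor compaction has yet picked up,
# so the tombstones survive and every read of the row must skip past them.
now = 1_350_000_000
old_delete = now - 30 * 24 * 3600  # a delete from ~30 days ago
assert tombstone_purgeable(old_delete, now, row_in_non_compacting_sstables=False)
assert not tombstone_purgeable(old_delete, now, row_in_non_compacting_sstables=True)
```

Once the remaining old SSTables on node 2 take part in a compaction, condition (b) is met and the tombstones can finally be dropped, which matches Rene's observation that nodes 1 and 3 recovered after enough minor compactions.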
Re: Many ParNew collections
Thanks Aaron,

At another production site the exact same problems occur (also after ~6 months). Here I have a very small cluster of three nodes with replication factor = 3. One of the three nodes begins to have many long ParNews and high CPU load. I upgraded to Cassandra 1.0.11, but the GC problem still continues on that node. If I look at the cfstats of the three nodes, there is one CF which is different:

Node 1:

    Column Family: Logs
    SSTable count: 1
    Space used (live): 47606705
    Space used (total): 47606705
    Number of Keys (estimate): 338176
    Memtable Columns Count: 22297
    Memtable Data Size: 51542275
    Memtable Switch Count: 1
    Read Count: 189441
    Read Latency: 0,768 ms.
    Write Count: 123411
    Write Latency: 0,035 ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 0
    Bloom Filter False Ratio: 0,0
    Bloom Filter Space Used: 721456
    Key cache capacity: 20
    Key cache size: 56685
    Key cache hit rate: 0.9132482658217008
    Row cache: disabled
    Compacted row minimum size: 73
    Compacted row maximum size: 263210
    Compacted row mean size: 94

Node 2:

    Column Family: Logs
    SSTable count: 3
    Space used (live): 233688199
    Space used (total): 233688199
    Number of Keys (estimate): 1191936
    Memtable Columns Count: 20147
    Memtable Data Size: 47067518
    Memtable Switch Count: 1
    Read Count: 188473
    Read Latency: 4031,791 ms.
    Write Count: 120412
    Write Latency: 0,042 ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 234
    Bloom Filter False Ratio: 0,0
    Bloom Filter Space Used: 2603808
    Key cache capacity: 20
    Key cache size: 5153
    Key cache hit rate: 1.0
    Row cache: disabled
    Compacted row minimum size: 73
    Compacted row maximum size: 25109160
    Compacted row mean size: 156

Node 3:

    Column Family: Logs
    SSTable count: 1
    Space used (live): 47714798
    Space used (total): 47714798
    Number of Keys (estimate): 338176
    Memtable Columns Count: 29046
    Memtable Data Size: 66585390
    Memtable Switch Count: 1
    Read Count: 196048
    Read Latency: 1,466 ms.
    Write Count: 127709
    Write Latency: 0,034 ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 8
    Bloom Filter False Ratio: 0,00847
    Bloom Filter Space Used: 720496
    Key cache capacity: 20
    Key cache size: 54166
    Key cache hit rate: 0.9833443960960739
    Row cache: disabled
    Compacted row minimum size: 73
    Compacted row maximum size: 263210
    Compacted row mean size: 95

The second node (the one suffering from many GCs) has a high read latency compared to the others. Another thing is that the compacted row maximum size is bigger than on the other nodes.

What puzzles me:

- Should the other nodes also have that wide row, because the replication factor is three and I only have three nodes? I must say that the wide row is probably the index row, which has columns added/removed continuously. Maybe the other nodes lost much data because of compactions?
- Could repeatedly reading a wide row cause ParNew problems?

Thanks!

Rene

2012/8/17 aaron morton aa...@thelastpickle.com:

>> - Cassandra 0.7.10
> You _really_ should look at getting up to 1.1 :) Memory management is much better and the JVM heap requirements are less.
>
>> However, there is one node with high read latency and far too many ParNew collections (compared to other nodes).
> Check for long running compactions or repairs. Check for unexpected row or key cache settings. Look at the connections on the box and see if it is getting more client load than others. Restart the node and see if the situation returns. If it does, try to correlate the start of the GC issues with the cassandra log.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2012, at 2:58 AM, Rene Kochen rene.koc...@schange.com wrote:
>
>> Hi,
>>
>> I have a cluster of 7 nodes: [...]
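The imbalance Rene describes can be read straight off the cfstats above. A quick sanity check, with the numbers copied from the quoted output (nodes in the order listed):

```python
# "Space used (live)", key estimates, and max compacted row sizes,
# copied from the cfstats quoted above (nodes in the order listed).
nodes = [
    {"space": 47_606_705,  "keys": 338_176,   "max_row": 263_210},     # node 1
    {"space": 233_688_199, "keys": 1_191_936, "max_row": 25_109_160},  # node 2
    {"space": 47_714_798,  "keys": 338_176,   "max_row": 263_210},     # node 3
]

space_ratio = nodes[1]["space"] / nodes[0]["space"]  # ~4.9x the data of a peer
key_ratio = nodes[1]["keys"] / nodes[0]["keys"]      # ~3.5x the key estimate
row_mib = nodes[1]["max_row"] / (1024 * 1024)        # widest row ~24 MiB

print(f"{space_ratio:.1f}x space, {key_ratio:.1f}x keys, {row_mib:.1f} MiB max row")
```

With RF = 3 on a 3-node cluster, every node should hold the same rows, so node 2's inflated key estimate likely reflects that cfstats sums per-SSTable estimates: keys present in all three of its uncompacted SSTables get counted more than once, rather than data being genuinely missing elsewhere.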
Many ParNew collections
Hi,

I have a cluster of 7 nodes:

- Windows Server 2008
- Cassandra 0.7.10
- The nodes are identical (hardware, configuration and client request load)
- Standard batch file with 8GB heap
- I use disk_access_mode = standard
- Random partitioner
- TP stats shows no problems
- Ring command shows no problems (data is balanced)

However, there is one node with high read latency and far too many ParNew collections (compared to other nodes). It also suffers from a high CPU load (I guess due to the ParNew collections). What can be the source of so many ParNew collections? The other identical nodes do not have this behavior.

Logging:

2012-08-16 15:58:46,436 906.072: [GC 906.072: [ParNew: 345022K->38336K(345024K), 0.2375976 secs] 3630599K->3516249K(8350272K), 0.2377296 secs] [Times: user=3.21 sys=0.00, real=0.23 secs]
2012-08-16 15:58:46,888 906.517: [GC 906.517: [ParNew: 345024K->38336K(345024K), 0.2400594 secs] 3822937K->3743690K(8350272K), 0.2401802 secs] [Times: user=3.48 sys=0.03, real=0.25 secs]
2012-08-16 15:58:46,888 INFO 15:58:46,888 GC for ParNew: 478 ms for 2 collections, 3837792904 used; max is 8550678528
2012-08-16 15:58:47,372 907.003: [GC 907.003: [ParNew: 345024K->38336K(345024K), 0.2405363 secs] 4050378K->3971544K(8350272K), 0.2406553 secs] [Times: user=3.34 sys=0.01, real=0.23 secs]
2012-08-16 15:58:47,918 907.544: [GC 907.544: [ParNew: 345024K->38336K(345024K), 0.2404339 secs] 4278232K->4193789K(8350272K), 0.2405540 secs] [Times: user=3.31 sys=0.00, real=0.25 secs]
2012-08-16 15:58:47,918 INFO 15:58:47,918 GC for ParNew: 481 ms for 2 collections, 4300939752 used; max is 8550678528
2012-08-16 15:58:48,464 908.079: [GC 908.079: [ParNew: 345024K->38336K(345024K), 0.2621174 secs] 4500477K->4434112K(8350272K), 0.2622375 secs] [Times: user=3.64 sys=0.00, real=0.25 secs]
2012-08-16 15:58:48,932 INFO 15:58:48,932 GC for ParNew: 262 ms for 1 collections, 4763583200 used; max is 8550678528
2012-08-16 15:58:49,384 909.050: [GC 909.051: [ParNew: 344972K->38336K(345024K), 0.2033453 secs] 4740748K->4563252K(8350272K), 0.2034588 secs] [Times: user=2.89 sys=0.01, real=0.20 secs]
2012-08-16 15:58:49,946 INFO 15:58:49,946 GC for ParNew: 203 ms for 1 collections, 4885945792 used; max is 8550678528
2012-08-16 15:58:50,383 909.998: [GC 909.998: [ParNew: 345024K->38336K(345024K), 0.2567542 secs] 4869940K->4740489K(8350272K), 0.2568804 secs] [Times: user=3.60 sys=0.00, real=0.25 secs]
2012-08-16 15:58:50,882 910.474: [GC 910.474: [ParNew: 345024K->38336K(345024K), 0.2786205 secs] 5047177K->4962531K(8350272K), 0.2787668 secs] [Times: user=3.48 sys=0.00, real=0.28 secs]
2012-08-16 15:58:50,960 INFO 15:58:50,960 GC for ParNew: 536 ms for 2 collections, 5143423816 used; max is 8550678528
2012-08-16 15:58:51,584 911.192: [GC 911.192: [ParNew: 344963K->38334K(345024K), 0.2664316 secs] 5269158K->5196444K(8350272K), 0.2665544 secs] [Times: user=3.74 sys=0.00, real=0.27 secs]
2012-08-16 15:58:52,130 911.767: [GC 911.767: [ParNew: 345022K->38336K(345024K), 0.2327209 secs] 5503132K->5406771K(8350272K), 0.2328457 secs] [Times: user=3.35 sys=0.00, real=0.23 secs]
2012-08-16 15:58:52,130 INFO 15:58:52,130 GC for ParNew: 499 ms for 2 collections, 5541845264 used; max is 8550678528
2012-08-16 15:58:52,816 912.460: [GC 912.460: [ParNew: 345024K->38334K(345024K), 0.2198399 secs] 5713459K->5605670K(8350272K), 0.2199669 secs] [Times: user=3.29 sys=0.00, real=0.23 secs]
2012-08-16 15:58:53,144 INFO 15:58:53,144 GC for ParNew: 220 ms for 1 collections, 5805870608 used; max is 8550678528
2012-08-16 15:58:55,546 915.173: [GC 915.173: [ParNew: 345022K->38334K(345024K), 0.2369585 secs] 5912358K->5702871K(8350272K), 0.2371098 secs] [Times: user=3.18 sys=0.00, real=0.25 secs]
2012-08-16 15:58:56,186 INFO 15:58:56,186 GC for ParNew: 237 ms for 1 collections, 6089002480 used; max is 8550678528
2012-08-16 15:58:56,591 916.232: [GC 916.232: [ParNew: 345022K->38336K(345024K), 0.2364850 secs] 6009559K->5914142K(8350272K), 0.2366075 secs] [Times: user=3.32 sys=0.00, real=0.23 secs]
2012-08-16 15:58:57,340 916.989: [GC 916.989: [ParNew: 345024K->38334K(345024K), 0.2191751 secs] 6220830K->6107217K(8350272K), 0.2192894 secs] [Times: user=2.92 sys=0.00, real=0.22 secs]
2012-08-16 15:58:57,371 917.209: [GC [1 CMS-initial-mark: 6068883K(8005248K)] 6108716K(8350272K), 0.0272472 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2012-08-16 15:58:57,371 917.236: [CMS-concurrent-mark-start]
2012-08-16 15:58:57,371 INFO 15:58:57,371 GC for ParNew: 456 ms for 2 collections, 6255865264 used; max is 8550678528
2012-08-16 15:58:57,574 917.444: [CMS-concurrent-mark: 0.208/0.208 secs] [Times: user=1.48 sys=0.06, real=0.20 secs]
2012-08-16 15:58:57,574 917.444: [CMS-concurrent-preclean-start]
2012-08-16 15:58:57,637 917.501: [CMS-concurrent-preclean: 0.057/0.057 secs] [Times: user=0.19 sys=0.00, real=0.06 secs]
2012-08-16 15:58:57,637 917.501: [CMS-concurrent-abortable-preclean-start]
2012-08-16 15:58:58,775 918.552: [GC 918.552: [ParNew: 345022K->38334K(345024K), 0.0948325 secs] 6413905K->6163063K(8350272K), 0.0949661
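The striking thing in the log above is not just the back-to-back ParNew pauses but how much survives each one: the old generation grows by roughly 190MB per collection, which is what eventually drives the CMS cycle at the end. That can be pulled out mechanically; a rough parser, assuming the standard HotSpot ParNew line format with `->` arrows:

```python
import re

# Matches the heap figures in standard HotSpot ParNew lines such as:
#   [GC 906.072: [ParNew: 345022K->38336K(345024K), 0.2375976 secs]
#   3630599K->3516249K(8350272K), 0.2377296 secs]
PAT = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\(\d+K\), ([\d.]+) secs\] "
    r"(\d+)K->(\d+)K\(\d+K\)"
)

def promotion_per_gc(lines):
    """For each ParNew line, estimate how much data was promoted to the
    old generation: the young gen shrank by (young_before - young_after),
    but the whole heap only shrank by (heap_before - heap_after); the
    difference is what survived into the old gen (in KB)."""
    results = []
    for line in lines:
        m = PAT.search(line)
        if not m:
            continue
        young_before, young_after = int(m.group(1)), int(m.group(2))
        pause = float(m.group(3))
        heap_before, heap_after = int(m.group(4)), int(m.group(5))
        promoted_k = (young_before - young_after) - (heap_before - heap_after)
        results.append((pause, promoted_k))
    return results
```

Applied to the first line of the log, this gives a ~0.24 s pause with 192336K (~188MB) promoted in a single collection; at that rate the 8GB heap fills in well under a minute, which fits the thread's eventual diagnosis of reads churning through a tombstone-laden wide row.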