Re: Many ParNew collections

2012-09-17 Thread Rene Kochen
Thanks Aaron,

I found the problem. It is described in the thread "minor compaction and
delete expired column-tombstones".

The problem was that I have one big row, called Index, which contains
many tombstones. Reading all these tombstones caused the memory issues.

I think nodes 1 and 3 have had enough minor compactions that the
tombstones were removed. The second node still contains several old
SSTables, and it takes some time before the whole thing is compacted
again.
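
For anyone who hits the same thing, here is a small, purely illustrative
Python sketch of the mechanism described above (this is not Cassandra code;
all class names and numbers are invented). A read of a wide row has to merge
every SSTable that holds a piece of the row and carry all of its tombstones,
so a row with hundreds of thousands of tombstones produces a lot of
short-lived garbage even though the live data is tiny. Once the tombstones
are older than gc_grace_seconds and the SSTables containing them are
compacted together, they are dropped and the read becomes cheap again.

# Toy model (not Cassandra code): why a wide row full of tombstones makes
# reads expensive, and why a compaction after gc_grace_seconds makes the
# problem disappear on some nodes before others.
import time

GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra's default gc_grace_seconds

class SSTable:
    def __init__(self, live_columns, tombstones):
        self.live_columns = live_columns   # {column_name: value}
        self.tombstones = tombstones       # {column_name: deletion_timestamp}

def read_row(sstables):
    """A read merges every SSTable that contains the row and has to carry
    all tombstones along so that deleted columns can be filtered out.
    'cells_touched' is what ends up as short-lived garbage on the heap."""
    merged, deleted = {}, set()
    cells_touched = 0
    for sst in sstables:
        for name in sst.tombstones:
            deleted.add(name)
            cells_touched += 1
        for name, value in sst.live_columns.items():
            merged[name] = value
            cells_touched += 1
    live = {n: v for n, v in merged.items() if n not in deleted}
    return live, cells_touched

def compact(sstables, now):
    """Merge the SSTables into one. Tombstones older than gc_grace_seconds
    are dropped, which is what eventually happened on nodes 1 and 3."""
    live, _ = read_row(sstables)
    kept = {}
    for sst in sstables:
        for name, ts in sst.tombstones.items():
            if now - ts < GC_GRACE_SECONDS:
                kept[name] = ts
    return [SSTable(live, kept)]

# One wide "Index"-style row: 1,000 live columns but 400,000 old deletions
# spread over two SSTables.
old = time.time() - 30 * 24 * 3600
sstables = [
    SSTable({"live-%d" % i: "x" for i in range(1000)},
            {"dead-%d" % i: old for i in range(200000)}),
    SSTable({}, {"dead-%d" % i: old for i in range(200000, 400000)}),
]
_, before = read_row(sstables)
sstables = compact(sstables, time.time())
_, after = read_row(sstables)
print("cells touched per read: before compaction %d, after %d" % (before, after))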

Thanks,

Rene

2012/9/17 aaron morton aa...@thelastpickle.com:
 The second node (the one suffering from many GCs) has a high read
 latency compared to the others. Another difference is that its compacted
 row maximum size is much bigger than on the other nodes.

 Node 2 also:
 * has about 220MB of data, while the others have about 45MB
 * has about 1 Million keys while the others have about 0.3 Million

 - Should the other nodes also have that wide row,

 Yes. Are you running repair? What CL are you using?

 - Could repeatedly reading a wide row cause ParNew problems?

 Maybe. Are you reading the whole thing?
 It's only about 25MB; that is big, but not huge.

 I would:

 * Ensure repair is running and completing; this may even out the data load.
 * Determine whether the GC is associated with compactions, repair or general
 activity (a rough way to check this from the GC log is sketched below).
 * If GC is associated with compactions, the simple thing is to reduce
 concurrent_compactors and in_memory_compaction_limit_in_mb in the yaml. Note
 that this is a simple / quick fix, but it can increase IO load and slow down
 compaction. The harder thing is to tune the JVM memory settings (the
 defaults often do a good job).
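
To make the second point a bit more concrete, below is a rough Python sketch
that pulls the ParNew entries out of a GC log written in the format pasted
later in this thread, and prints the total pause time, the fraction of
wall-clock time spent in ParNew, and the timestamps of the longest pauses so
they can be matched by hand against compaction and repair entries in the
Cassandra system log. The file name and the exact log-line layout are
assumptions on my part, so the regular expression may need adjusting for your
GC log settings.

# Rough sketch: summarise ParNew activity from a GC log so the pause
# timestamps can be lined up against compaction / repair entries in the
# Cassandra log. Assumes one GC entry per line, prefixed with a date stamp
# like "2012-08-16 15:58:46,436 906.072: [GC 906.072: [ParNew: ...".
import re
import sys

PARNEW = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s*"
    r"(?P<uptime>\d+\.\d+): \[GC .*?\[ParNew: .*?, (?P<pause>\d+\.\d+) secs\]"
)

def summarise(path):
    pauses = []  # (wall-clock timestamp, JVM uptime in seconds, pause in seconds)
    with open(path) as gc_log:
        for line in gc_log:
            m = PARNEW.search(line)
            if m:
                pauses.append((m.group("date"),
                               float(m.group("uptime")),
                               float(m.group("pause"))))
    if len(pauses) < 2:
        print("not enough ParNew entries found")
        return
    total_pause = sum(p for _, _, p in pauses)
    elapsed = pauses[-1][1] - pauses[0][1]
    print("%d ParNew collections over %.1f s of uptime" % (len(pauses), elapsed))
    print("total pause %.1f s (%.1f%% of wall-clock time)"
          % (total_pause, 100.0 * total_pause / elapsed))
    # Longest pauses first, with timestamps to match against the Cassandra log.
    for date, _, pause in sorted(pauses, key=lambda t: -t[2])[:10]:
        print("%s  ParNew pause %.0f ms" % (date, pause * 1000))

if __name__ == "__main__":
    summarise(sys.argv[1] if len(sys.argv) > 1 else "gc.log")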

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 14/09/2012, at 10:41 PM, Rene Kochen rene.koc...@schange.com wrote:

 Thanks Aaron,

 At another production site the exact same problems occur (also after
 ~6 months). Here I have a very small cluster of three nodes with
 replication factor = 3.
 One of the three nodes begins to have many long ParNew collections and
 high CPU load. I upgraded to Cassandra 1.0.11, but the GC problem still
 continues on that node.

 If I look at the CFStats of the three nodes, there is one CF which is
 different:

 Column Family: Logs
 SSTable count: 1
 Space used (live): 47606705
 Space used (total): 47606705
 Number of Keys (estimate): 338176
 Memtable Columns Count: 22297
 Memtable Data Size: 51542275
 Memtable Switch Count: 1
 Read Count: 189441
 Read Latency: 0,768 ms.
 Write Count: 123411
 Write Latency: 0,035 ms.
 Pending Tasks: 0
 Bloom Filter False Postives: 0
 Bloom Filter False Ratio: 0,0
 Bloom Filter Space Used: 721456
 Key cache capacity: 20
 Key cache size: 56685
 Key cache hit rate: 0.9132482658217008
 Row cache: disabled
 Compacted row minimum size: 73
 Compacted row maximum size: 263210
 Compacted row mean size: 94

 Column Family: Logs
 SSTable count: 3
 Space used (live): 233688199
 Space used (total): 233688199
 Number of Keys (estimate): 1191936
 Memtable Columns Count: 20147
 Memtable Data Size: 47067518
 Memtable Switch Count: 1
 Read Count: 188473
 Read Latency: 4031,791 ms.
 Write Count: 120412
 Write Latency: 0,042 ms.
 Pending Tasks: 0
 Bloom Filter False Postives: 234
 Bloom Filter False Ratio: 0,0
 Bloom Filter Space Used: 2603808
 Key cache capacity: 20
 Key cache size: 5153
 Key cache hit rate: 1.0
 Row cache: disabled
 Compacted row minimum size: 73
 Compacted row maximum size: 25109160
 Compacted row mean size: 156

 Column Family: Logs
 SSTable count: 1
 Space used (live): 47714798
 Space used (total): 47714798
 Number of Keys (estimate): 338176
 Memtable Columns Count: 29046
 Memtable Data Size: 66585390
 Memtable Switch Count: 1
 Read Count: 196048
 Read Latency: 1,466 ms.
 Write Count: 127709
 Write Latency: 0,034 ms.
 Pending Tasks: 0
 Bloom Filter False Postives: 8
 Bloom Filter False Ratio: 0,00847
 Bloom Filter Space Used: 720496
 Key cache capacity: 20
 Key cache size: 54166
 Key cache hit rate: 0.9833443960960739
 Row cache: disabled
 Compacted row minimum size: 73
 Compacted row maximum size: 263210
 Compacted row mean size: 95

 The second node (the one suffering from many GCs) has a high read
 latency compared to the others. Another difference is that its compacted
 row maximum size is much bigger than on the other nodes.

 What puzzles me:

 - Should the other nodes also have that wide row, because the
 replication factor is three and I only have three nodes? I must say
 that the wide row is probably the index row, which has columns
 added/removed continuously. Maybe the other nodes removed much of that
 data through compactions?
 - Could repeatedly reading a wide row cause ParNew problems?

 Thanks!

 Rene

 2012/8/17 aaron morton aa...@thelastpickle.com:

 - Cassandra 0.7.10

 You _really_ should look at getting up to 1.1 :) Memory management is much
 better and the JVM heap requirements are less.

 However, there is one node with high read latency and far too many
 ParNew collections (compared to other nodes).

Re: Many ParNew collections

2012-09-14 Thread Rene Kochen
Thanks Aaron,

At another production site the exact same problems occur (also after
~6 months). Here I have a very small cluster of three nodes with
replication factor = 3.
One of the three nodes begins to have many long ParNew collections and
high CPU load. I upgraded to Cassandra 1.0.11, but the GC problem still
continues on that node.

If I look at the CFStats of the three nodes, there is one CF which is different:

Column Family: Logs
SSTable count: 1
Space used (live): 47606705
Space used (total): 47606705
Number of Keys (estimate): 338176
Memtable Columns Count: 22297
Memtable Data Size: 51542275
Memtable Switch Count: 1
Read Count: 189441
Read Latency: 0,768 ms.
Write Count: 123411
Write Latency: 0,035 ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0,0
Bloom Filter Space Used: 721456
Key cache capacity: 20
Key cache size: 56685
Key cache hit rate: 0.9132482658217008
Row cache: disabled
Compacted row minimum size: 73
Compacted row maximum size: 263210
Compacted row mean size: 94

Column Family: Logs
SSTable count: 3
Space used (live): 233688199
Space used (total): 233688199
Number of Keys (estimate): 1191936
Memtable Columns Count: 20147
Memtable Data Size: 47067518
Memtable Switch Count: 1
Read Count: 188473
Read Latency: 4031,791 ms.
Write Count: 120412
Write Latency: 0,042 ms.
Pending Tasks: 0
Bloom Filter False Postives: 234
Bloom Filter False Ratio: 0,0
Bloom Filter Space Used: 2603808
Key cache capacity: 20
Key cache size: 5153
Key cache hit rate: 1.0
Row cache: disabled
Compacted row minimum size: 73
Compacted row maximum size: 25109160
Compacted row mean size: 156

Column Family: Logs
SSTable count: 1
Space used (live): 47714798
Space used (total): 47714798
Number of Keys (estimate): 338176
Memtable Columns Count: 29046
Memtable Data Size: 66585390
Memtable Switch Count: 1
Read Count: 196048
Read Latency: 1,466 ms.
Write Count: 127709
Write Latency: 0,034 ms.
Pending Tasks: 0
Bloom Filter False Postives: 8
Bloom Filter False Ratio: 0,00847
Bloom Filter Space Used: 720496
Key cache capacity: 20
Key cache size: 54166
Key cache hit rate: 0.9833443960960739
Row cache: disabled
Compacted row minimum size: 73
Compacted row maximum size: 263210
Compacted row mean size: 95

The second node (the one suffering from many GCs) has a high read
latency compared to the others. Another difference is that its compacted
row maximum size is much bigger than on the other nodes.

What puzzles me:

- Should the other nodes also have that wide row, because the
replication factor is three and I only have three nodes? I must say
that the wide row is probably the index row, which has columns
added/removed continuously. Maybe the other nodes removed much of that
data through compactions?
- Could repeatedly reading a wide row cause ParNew problems?

Thanks!

Rene

2012/8/17 aaron morton aa...@thelastpickle.com:
 - Cassandra 0.7.10

 You _really_ should look at getting up to 1.1 :) Memory management is much
 better and the JVM heap requirements are less.

 However, there is one node with high read latency and far too many
 ParNew collections (compared to other nodes).

 Check for long running compactions or repairs.
 Check for unexpected row or key cache settings.
 Look at the connections on the box and see if it is getting more client load
 than others.

 Restart the node and see if the situation returns. If it does, try to
 correlate the start of the GC issues with the Cassandra log.

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 17/08/2012, at 2:58 AM, Rene Kochen rene.koc...@schange.com wrote:

 Hi

 I have a cluster of 7 nodes:

 - Windows Server 2008
 - Cassandra 0.7.10
 - The nodes are identical (hardware, configuration and client request load)
 - Standard batch file with 8GB heap
 - I use disk_access_mode = standard
 - Random partitioner
 - TP stats shows no problems
 - Ring command shows no problems (data is balanced)

 However, there is one node with high read latency and far too many
 ParNew collections (compared to other nodes). It also suffers from a
 high CPU load (I guess due to the ParNew collections).

 What can be the source of so many ParNew collections? The other
 identical nodes do not have this behavior.

 Logging:

 2012-08-16 15:58:46,436 906.072: [GC 906.072: [ParNew:
 345022K->38336K(345024K), 0.2375976 secs]
 3630599K->3516249K(8350272K), 0.2377296 secs] [Times: user=3.21
 sys=0.00, real=0.23 secs]
 2012-08-16 15:58:46,888 906.517: [GC 906.517: [ParNew:
 345024K->38336K(345024K), 0.2400594 secs]
 3822937K->3743690K(8350272K), 0.2401802 secs] [Times: user=3.48
 sys=0.03, real=0.25 secs]
 2012-08-16 15:58:46,888 INFO 15:58:46,888 GC for ParNew: 478 ms for 2
 collections, 3837792904 used; max is 8550678528
 2012-08-16 15:58:47,372 907.003: [GC 907.003: [ParNew:
 345024K->38336K(345024K), 0.2405363 secs]
 4050378K->3971544K(8350272K), 0.2406553 secs] [Times: user=3.34
 sys=0.01, real=0.23 secs]
 2012-08-16 15:58

Many ParNew collections

2012-08-16 Thread Rene Kochen
Hi

I have a cluster of 7 nodes:

- Windows Server 2008
- Cassandra 0.7.10
- The nodes are identical (hardware, configuration and client request load)
- Standard batch file with 8GB heap
- I use disk_access_mode = standard
- Random partitioner
- TP stats shows no problems
- Ring command shows no problems (data is balanced)

However, there is one node with high read latency and far too many
ParNew collections (compared to other nodes). It also suffers from a
high CPU load (I guess due to the ParNew collections).

What can be the source of so many ParNew collections? The other
identical nodes do not have this behavior.

Logging:

2012-08-16 15:58:46,436 906.072: [GC 906.072: [ParNew:
345022K->38336K(345024K), 0.2375976 secs]
3630599K->3516249K(8350272K), 0.2377296 secs] [Times: user=3.21
sys=0.00, real=0.23 secs]
2012-08-16 15:58:46,888 906.517: [GC 906.517: [ParNew:
345024K->38336K(345024K), 0.2400594 secs]
3822937K->3743690K(8350272K), 0.2401802 secs] [Times: user=3.48
sys=0.03, real=0.25 secs]
2012-08-16 15:58:46,888 INFO 15:58:46,888 GC for ParNew: 478 ms for 2
collections, 3837792904 used; max is 8550678528
2012-08-16 15:58:47,372 907.003: [GC 907.003: [ParNew:
345024K->38336K(345024K), 0.2405363 secs]
4050378K->3971544K(8350272K), 0.2406553 secs] [Times: user=3.34
sys=0.01, real=0.23 secs]
2012-08-16 15:58:47,918 907.544: [GC 907.544: [ParNew:
345024K->38336K(345024K), 0.2404339 secs]
4278232K->4193789K(8350272K), 0.2405540 secs] [Times: user=3.31
sys=0.00, real=0.25 secs]
2012-08-16 15:58:47,918 INFO 15:58:47,918 GC for ParNew: 481 ms for 2
collections, 4300939752 used; max is 8550678528
2012-08-16 15:58:48,464 908.079: [GC 908.079: [ParNew:
345024K->38336K(345024K), 0.2621174 secs]
4500477K->4434112K(8350272K), 0.2622375 secs] [Times: user=3.64
sys=0.00, real=0.25 secs]
2012-08-16 15:58:48,932 INFO 15:58:48,932 GC for ParNew: 262 ms for 1
collections, 4763583200 used; max is 8550678528
2012-08-16 15:58:49,384 909.050: [GC 909.051: [ParNew:
344972K->38336K(345024K), 0.2033453 secs]
4740748K->4563252K(8350272K), 0.2034588 secs] [Times: user=2.89
sys=0.01, real=0.20 secs]
2012-08-16 15:58:49,946 INFO 15:58:49,946 GC for ParNew: 203 ms for 1
collections, 4885945792 used; max is 8550678528
2012-08-16 15:58:50,383 909.998: [GC 909.998: [ParNew:
345024K->38336K(345024K), 0.2567542 secs]
4869940K->4740489K(8350272K), 0.2568804 secs] [Times: user=3.60
sys=0.00, real=0.25 secs]
2012-08-16 15:58:50,882 910.474: [GC 910.474: [ParNew:
345024K->38336K(345024K), 0.2786205 secs]
5047177K->4962531K(8350272K), 0.2787668 secs] [Times: user=3.48
sys=0.00, real=0.28 secs]
2012-08-16 15:58:50,960 INFO 15:58:50,960 GC for ParNew: 536 ms for 2
collections, 5143423816 used; max is 8550678528
2012-08-16 15:58:51,584 911.192: [GC 911.192: [ParNew:
344963K->38334K(345024K), 0.2664316 secs]
5269158K->5196444K(8350272K), 0.2665544 secs] [Times: user=3.74
sys=0.00, real=0.27 secs]
2012-08-16 15:58:52,130 911.767: [GC 911.767: [ParNew:
345022K->38336K(345024K), 0.2327209 secs]
5503132K->5406771K(8350272K), 0.2328457 secs] [Times: user=3.35
sys=0.00, real=0.23 secs]
2012-08-16 15:58:52,130 INFO 15:58:52,130 GC for ParNew: 499 ms for 2
collections, 5541845264 used; max is 8550678528
2012-08-16 15:58:52,816 912.460: [GC 912.460: [ParNew:
345024K->38334K(345024K), 0.2198399 secs]
5713459K->5605670K(8350272K), 0.2199669 secs] [Times: user=3.29
sys=0.00, real=0.23 secs]
2012-08-16 15:58:53,144 INFO 15:58:53,144 GC for ParNew: 220 ms for 1
collections, 5805870608 used; max is 8550678528
2012-08-16 15:58:55,546 915.173: [GC 915.173: [ParNew:
345022K->38334K(345024K), 0.2369585 secs]
5912358K->5702871K(8350272K), 0.2371098 secs] [Times: user=3.18
sys=0.00, real=0.25 secs]
2012-08-16 15:58:56,186 INFO 15:58:56,186 GC for ParNew: 237 ms for 1
collections, 6089002480 used; max is 8550678528
2012-08-16 15:58:56,591 916.232: [GC 916.232: [ParNew:
345022K->38336K(345024K), 0.2364850 secs]
6009559K->5914142K(8350272K), 0.2366075 secs] [Times: user=3.32
sys=0.00, real=0.23 secs]
2012-08-16 15:58:57,340 916.989: [GC 916.989: [ParNew:
345024K->38334K(345024K), 0.2191751 secs]
6220830K->6107217K(8350272K), 0.2192894 secs] [Times: user=2.92
sys=0.00, real=0.22 secs]
2012-08-16 15:58:57,371 917.209: [GC [1 CMS-initial-mark:
6068883K(8005248K)] 6108716K(8350272K), 0.0272472 secs] [Times:
user=0.03 sys=0.00, real=0.03 secs]
2012-08-16 15:58:57,371 917.236: [CMS-concurrent-mark-start]
2012-08-16 15:58:57,371 INFO 15:58:57,371 GC for ParNew: 456 ms for 2
collections, 6255865264 used; max is 8550678528
2012-08-16 15:58:57,574 917.444: [CMS-concurrent-mark: 0.208/0.208
secs] [Times: user=1.48 sys=0.06, real=0.20 secs]
2012-08-16 15:58:57,574 917.444: [CMS-concurrent-preclean-start]
2012-08-16 15:58:57,637 917.501: [CMS-concurrent-preclean: 0.057/0.057
secs] [Times: user=0.19 sys=0.00, real=0.06 secs]
2012-08-16 15:58:57,637 917.501: [CMS-concurrent-abortable-preclean-start]
2012-08-16 15:58:58,775 918.552: [GC 918.552: [ParNew:
345022K->38334K(345024K), 0.0948325 secs]
6413905K->6163063K(8350272K), 0.0949661