I'm still struggling to find the root cause of these CPU utilisation patterns.
http://i58.tinypic.com/24pifcy.jpg

Three weeks after a C* restart, CPU utilisation goes through the roof; this doesn't happen shortly after the restart (which is visible on the graph). C* is running on machines with 16 cores, and HEAP_NEWSIZE was set to 1600MB. Looking at the gc logs I found:

2015-07-27T03:27:41.891+0000: 3354764.688: [ParNew: 1362069K->51348K(1474560K), 0.0942700 secs] 3268635K->1957914K(8224768K) After GC:
2015-07-27T03:27:41.999+0000: 3354764.797: [ParNew: 55327K->48503K(1474560K), 0.0935170 secs] 1961894K->1955069K(8224768K) After GC:
2015-07-27T03:27:43.464+0000: 3354766.261: [ParNew: 1359224K->54214K(1474560K), 0.0922640 secs] 3265790K->1960780K(8224768K) After GC:
2015-07-27T03:27:43.570+0000: 3354766.368: [ParNew: 56165K->45672K(1474560K), 0.0948220 secs] 1962732K->1952239K(8224768K) After GC:
2015-07-27T03:27:45.016+0000: 3354767.814: [ParNew: 1356393K->54245K(1474560K), 0.0922290 secs] 3262959K->1960811K(8224768K) After GC:
2015-07-27T03:27:45.121+0000: 3354767.919: [ParNew: 57866K->48553K(1474560K), 0.0928670 secs] 1964433K->1955119K(8224768K) After GC:
2015-07-27T03:27:46.590+0000: 3354769.387: [ParNew: 1359420K->48560K(1474560K), 0.0913750 secs] 3265986K->1955126K(8224768K) After GC:

It's clear to me that ParNew needs to run when 1359420K out of 1474560K is taken. But why is it also executed when only 57866K/1474560K (~4% of young) is occupied? SurvivorRatio is set to 8 in our case.

What's also bothering me is the CMS concurrent sweep time:

gc.log.9:2015-07-27T07:54:36.561+0000: 3370772.285: [CMS-concurrent-sweep: 19.710/20.600 secs] [Times: user=0.00 sys=317.60, real=20.60 secs]

Isn't it quite high? Do you guys have any other thoughts? Which part of the gc logs deserves more attention?
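For what it's worth, this is how I'm pulling the occupancy-at-trigger out of those ParNew lines; a quick Python sketch of mine, where the regex and the 50% threshold are just assumptions based on the log format above:

import re
import sys

# Matches entries like:
#   [ParNew: 1362069K->51348K(1474560K), 0.0942700 secs]
PARNEW_RE = re.compile(r"\[ParNew: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]")

def parnew_events(path):
    """Yield (used_before_kb, used_after_kb, capacity_kb, pause_s) per collection."""
    with open(path) as f:
        for line in f:
            m = PARNEW_RE.search(line)
            if m:
                before, after, cap, pause = m.groups()
                yield int(before), int(after), int(cap), float(pause)

if __name__ == "__main__":
    for before, after, cap, pause in parnew_events(sys.argv[1]):
        pct = 100.0 * before / cap
        # A ParNew far below young-gen capacity usually isn't eden
        # exhaustion; with CMS one known cause is the scavenge forced
        # ahead of the remark phase by -XX:+CMSScavengeBeforeRemark.
        # That's a guess on my side, it depends on which flags the
        # cluster actually runs with.
        flag = "  <-- low-occupancy trigger" if pct < 50.0 else ""
        print("%9sK -> %8sK of %sK (%5.1f%% full), pause %.4fs%s"
              % (before, after, cap, pct, pause, flag))

On the excerpt above every second collection lands around 4%, so the question is really what schedules those extra scavenges in between the normal eden-full ones.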
On Tue, Jul 21, 2015 at 4:22 PM, Marcin Pietraszek <mpietras...@opera.com> wrote:
> Yup... it seems like it's the gc's fault.
>
> gc logs:
>
> 2015-07-21T14:19:54.336+0000: 2876133.270: Total time for which application threads were stopped: 0.0832030 seconds
> 2015-07-21T14:19:55.739+0000: 2876134.673: Total time for which application threads were stopped: 0.0806960 seconds
> 2015-07-21T14:19:57.149+0000: 2876136.083: Total time for which application threads were stopped: 0.0806890 seconds
> 2015-07-21T14:19:58.550+0000: 2876137.484: Total time for which application threads were stopped: 0.0821070 seconds
> 2015-07-21T14:19:59.941+0000: 2876138.875: Total time for which application threads were stopped: 0.0802640 seconds
> 2015-07-21T14:20:01.340+0000: 2876140.274: Total time for which application threads were stopped: 0.0835670 seconds
> 2015-07-21T14:20:02.744+0000: 2876141.678: Total time for which application threads were stopped: 0.0842440 seconds
> 2015-07-21T14:20:04.143+0000: 2876143.077: Total time for which application threads were stopped: 0.0841630 seconds
> 2015-07-21T14:20:05.541+0000: 2876144.475: Total time for which application threads were stopped: 0.0839850 seconds
>
> Heap after GC invocations=2273737 (full 101):
>  par new generation total 1474560K, used 106131K [0x00000005fae00000, 0x000000065ee00000, 0x000000065ee00000)
>   eden space 1310720K, 0% used [0x00000005fae00000, 0x00000005fae00000, 0x000000064ae00000)
>   from space 163840K, 64% used [0x000000064ae00000, 0x00000006515a4ee0, 0x0000000654e00000)
>   to space 163840K, 0% used [0x0000000654e00000, 0x0000000654e00000, 0x000000065ee00000)
>  concurrent mark-sweep generation total 6750208K, used 1316691K [0x000000065ee00000, 0x00000007fae00000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 49336K, used 29520K [0x00000007fae00000, 0x00000007fde2e000, 0x0000000800000000)
> }
> 2015-07-21T14:12:05.683+0000: 2875664.617: Total time for which application threads were stopped: 0.0830280 seconds
> {Heap before GC invocations=2273737 (full 101):
>  par new generation total 1474560K, used 1416851K [0x00000005fae00000, 0x000000065ee00000, 0x000000065ee00000)
>   eden space 1310720K, 100% used [0x00000005fae00000, 0x000000064ae00000, 0x000000064ae00000)
>   from space 163840K, 64% used [0x000000064ae00000, 0x00000006515a4ee0, 0x0000000654e00000)
>   to space 163840K, 0% used [0x0000000654e00000, 0x0000000654e00000, 0x000000065ee00000)
>  concurrent mark-sweep generation total 6750208K, used 1316691K [0x000000065ee00000, 0x00000007fae00000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 49336K, used 29520K [0x00000007fae00000, 0x00000007fde2e000, 0x0000000800000000)
>
> It seems like eden heap space is being constantly occupied by something which is later removed by gc...
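To put rough numbers on that eden churn: a small sketch of mine, which assumes that every "Total time ... stopped" entry in this log is a young GC emptying the 1310720K eden (the dumps above suggest that, but it's an assumption):

import re
import sys

STOPPED_RE = re.compile(
    r": ([\d.]+): Total time for which application threads "
    r"were stopped: ([\d.]+) seconds")

EDEN_KB = 1310720  # eden size taken from the heap dumps above

def main(path):
    events = []  # (jvm_uptime_s, stopped_s)
    with open(path) as f:
        for line in f:
            m = STOPPED_RE.search(line)
            if m:
                events.append((float(m.group(1)), float(m.group(2))))
    if len(events) < 2:
        print("not enough safepoint entries found")
        return
    events.sort()  # excerpts are not always in timestamp order
    span = events[-1][0] - events[0][0]
    stopped = sum(s for _, s in events)
    interval = span / (len(events) - 1)
    print("stopped %.2fs out of %.2fs (%.1f%% of wall time)"
          % (stopped, span, 100.0 * stopped / span))
    # If eden really empties on every young GC, the gap between
    # safepoints approximates the eden fill time, giving a rough
    # allocation rate.
    print("approx allocation rate: %.0f MB/s" % (EDEN_KB / 1024.0 / interval))

if __name__ == "__main__":
    main(sys.argv[1])

On the nine entries above that works out to roughly 6-7% of wall time spent in safepoints and an allocation rate on the order of 900 MB/s, while the old generation stays flat at 1316691K, so it looks like allocation churn rather than anything accumulating.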
> On Mon, Jul 20, 2015 at 9:18 AM, Jason Wee <peich...@gmail.com> wrote:
>> just a guess, gc?
>>
>> On Mon, Jul 20, 2015 at 3:15 PM, Marcin Pietraszek <mpietras...@opera.com> wrote:
>>>
>>> Hello!
>>>
>>> I've noticed strange CPU utilisation patterns on the machines in our cluster. After a C* daemon restart a node behaves normally; a few weeks after the restart, CPU usage starts to rise. Currently on one of the nodes (screenshots attached) the cpu load is ~4. Shortly before a restart the load rises to ~15 (our cassandra machines have 16 cpus).
>>>
>>> In that cluster we're using bulkloading from a hadoop cluster with 1400 reducers (200 parallel bulkloading tasks). After such a session of heavy bulkloading the number of pending compactions is quite high, but the cluster is able to clear them before the next "bulkloading session". We're also tracking the number of pending compactions, and most of the time it's 0.
>>>
>>> On our machines we have a few gigs of free memory, ~7GB (17GB used), and it also seems we aren't IO bound.
>>>
>>> Screenshots from our zabbix with CPU utilisation graphs:
>>>
>>> http://i60.tinypic.com/xas8q8.jpg
>>> http://i58.tinypic.com/24pifcy.jpg
>>>
>>> Do you guys know what could be causing such high load?
>>>
>>> --
>>> mp