>> What does the TPStats look like on the nodes under pressure ? And how many 
>> nodes are delivering hints to the nodes when they restart?

$ nodetool -h 127.0.0.1 tpstats
Pool Name                    Active   Pending      Completed
ReadStage                         1         1        1992475
RequestResponseStage              0         0        2247486
MutationStage                     0         0        1631349
ReadRepairStage                   0         0         583432
GossipStage                       0         0         241324
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              0
MemtablePostFlusher               0         0             46
StreamStage                       0         0              0
FlushWriter                       0         0             46
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              0
HintedHandoff                     1         5            152
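
For what it's worth, something like the loop below (just a sketch, using the same host/port as above) makes it easy to watch the HintedHandoff pool over time while a node is catching up:

# Sample the HintedHandoff line from tpstats every 5 seconds
while true; do
    date
    nodetool -h 127.0.0.1 tpstats | grep HintedHandoff
    sleep 5
done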


dstat -cmdln during the event:

----total-cpu-usage---- ------memory-usage----- ---load-avg--- -dsk/total- -net/total-
usr sys idl wai hiq siq| used  buff  cach  free| 1m   5m  15m | read  writ| recv  send
 87   6   6   0   0   1|6890M 32.1M 1001M 42.8M|2.36 2.87 1.73|   0     0 |  75k  144k
 88  10   2   0   0   0|6889M 32.2M 1002M 41.6M|3.05 3.00 1.78|   0     0 |  60k  102k
 89   9   2   0   0   0|6890M 32.2M 1003M 41.0M|3.05 3.00 1.78|   0     0 |  38k   70k
 89  10   1   0   0   0|6890M 32.2M 1003M 40.7M|3.05 3.00 1.78|   0     0 |  26k   24k
 93   6   2   0   0   0|6890M 32.2M 1003M 40.9M|3.05 3.00 1.78|   0     0 |  37k   31k
 90   8   2   0   0   0|6890M 32.2M 1003M 39.9M|3.05 3.00 1.78|   0     0 |  67k   69k
 87   8   4   0   0   1|6890M 32.2M 1004M 38.7M|4.09 3.22 1.85|   0     0 | 123k  262k
 83  13   2   0   0   2|6890M 32.2M 1004M 38.3M|4.09 3.22 1.85|   0     0 | 445k   18M
 90   6   3   0   0   0|6890M 32.2M 1005M 38.2M|4.09 3.22 1.85|   0     0 |  72k   91k
 40   7  25  27   0   0|6890M 32.2M 1005M 37.8M|4.09 3.22 1.85|   0     0 | 246k 8034k
  0   0  59  41   0   0|6890M 32.2M 1005M 37.7M|4.09 3.22 1.85|   0     0 |  19k 6490B
  1   2  45  52   0   0|6891M 32.2M  999M 43.1M|4.00 3.21 1.86|   0     0 |  29k   18k
 72   8  15   3   0   1|6892M 32.2M  999M 41.6M|4.00 3.21 1.86|   0     0 | 431k   11M
 88   9   2   0   0   1|6907M 32.0M  985M 41.1M|4.00 3.21 1.86|   0     0 |  99k   77k
 88  10   1   0   0   1|6913M 31.9M  977M 44.1M|4.00 3.21 1.86|   0     0 | 112k  619k
 89   9   1   0   0   1|6892M 31.9M  977M 64.4M|4.00 3.21 1.86|   0     0 | 109k  369k
 90   8   1   0   0   0|6892M 31.9M  979M 62.5M|4.80 3.39 1.92|   0     0 | 130k   97k
 83  13   1   0   0   3|6893M 32.0M  981M 59.8M|4.80 3.39 1.92|   0     0 | 503k   18M
 78  11  10   0   0   0|6893M 32.0M  981M 59.5M|4.80 3.39 1.92|   0     0 | 102k  110k


The low-CPU periods are due to major GC pauses (the JVM is frozen during those stop-the-world collections).
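
To confirm it really is the collector, GC logging can be enabled for the Cassandra JVM; the options below are only a sketch (the log path and the exact set of flags are assumptions, adjust for your setup):

# Hypothetical additions to the Cassandra JVM options (e.g. in cassandra-env.sh)
JVM_OPTS="$JVM_OPTS -verbose:gc"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"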

> 
> TPStats do show activity on the HH. I'll have some examples later if
> the nodes decide to do this again.
> 
>>
>> Finally hinted_handoff_throttle_delay_in_ms in conf/cassandra.yaml will let 
>> you slow down the delivery rate if HH is indeed the problem. 
> 
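
For reference, that setting sits in conf/cassandra.yaml; something like the excerpt below would slow hint delivery down (the 50 ms value is only an example, not a recommendation):

# conf/cassandra.yaml -- example value only, tune for your cluster
# Milliseconds to sleep between delivered hints, so replaying them does not
# overwhelm the node that just came back up.
hinted_handoff_throttle_delay_in_ms: 50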


Best,

Gabriel
