The number of Completed HH tasks is interesting. AFAIK a task is started when 
the node detects that another node in the cluster has returned. Were you doing 
any other restarts around the cluster?

I don't want to divert from the GC issue, just wondering if something else is 
going on as well. Like the node being asked to record a lot of hints. 
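
If you want a quick check, IIRC the HintedHandOffManager logs when it starts 
and finishes delivering hints, so something like this on the node should show 
how often that's happening (default packaged log path assumed):

$ grep -i "hinted handoff" /var/log/cassandra/system.log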

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13 May 2011, at 03:51, Gabriel Tataranu wrote:

>>> What does the TPStats look like on the nodes under pressure? And how many 
>>> nodes are delivering hints to the nodes when they restart?
> 
> $nodetool -h 127.0.0.1 tpstats
> Pool Name                    Active   Pending      Completed
> ReadStage                         1         1        1992475
> RequestResponseStage              0         0        2247486
> MutationStage                     0         0        1631349
> ReadRepairStage                   0         0         583432
> GossipStage                       0         0         241324
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0              0
> MemtablePostFlusher               0         0             46
> StreamStage                       0         0              0
> FlushWriter                       0         0             46
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              0
> HintedHandoff                     1         5            152
> 
> 
> dstat -cmdln during the event:
> 
> ----total-cpu-usage---- ------memory-usage----- ---load-avg--- -dsk/total- -net/total-
> usr sys idl wai hiq siq| used  buff  cach  free| 1m   5m  15m | read  writ| recv  send
>  87   6   6   0   0   1|6890M 32.1M 1001M 42.8M|2.36 2.87 1.73|   0     0 |  75k  144k
>  88  10   2   0   0   0|6889M 32.2M 1002M 41.6M|3.05 3.00 1.78|   0     0 |  60k  102k
>  89   9   2   0   0   0|6890M 32.2M 1003M 41.0M|3.05 3.00 1.78|   0     0 |  38k   70k
>  89  10   1   0   0   0|6890M 32.2M 1003M 40.7M|3.05 3.00 1.78|   0     0 |  26k   24k
>  93   6   2   0   0   0|6890M 32.2M 1003M 40.9M|3.05 3.00 1.78|   0     0 |  37k   31k
>  90   8   2   0   0   0|6890M 32.2M 1003M 39.9M|3.05 3.00 1.78|   0     0 |  67k   69k
>  87   8   4   0   0   1|6890M 32.2M 1004M 38.7M|4.09 3.22 1.85|   0     0 | 123k  262k
>  83  13   2   0   0   2|6890M 32.2M 1004M 38.3M|4.09 3.22 1.85|   0     0 | 445k   18M
>  90   6   3   0   0   0|6890M 32.2M 1005M 38.2M|4.09 3.22 1.85|   0     0 |  72k   91k
>  40   7  25  27   0   0|6890M 32.2M 1005M 37.8M|4.09 3.22 1.85|   0     0 | 246k 8034k
>   0   0  59  41   0   0|6890M 32.2M 1005M 37.7M|4.09 3.22 1.85|   0     0 |  19k 6490B
>   1   2  45  52   0   0|6891M 32.2M  999M 43.1M|4.00 3.21 1.86|   0     0 |  29k   18k
>  72   8  15   3   0   1|6892M 32.2M  999M 41.6M|4.00 3.21 1.86|   0     0 | 431k   11M
>  88   9   2   0   0   1|6907M 32.0M  985M 41.1M|4.00 3.21 1.86|   0     0 |  99k   77k
>  88  10   1   0   0   1|6913M 31.9M  977M 44.1M|4.00 3.21 1.86|   0     0 | 112k  619k
>  89   9   1   0   0   1|6892M 31.9M  977M 64.4M|4.00 3.21 1.86|   0     0 | 109k  369k
>  90   8   1   0   0   0|6892M 31.9M  979M 62.5M|4.80 3.39 1.92|   0     0 | 130k   97k
>  83  13   1   0   0   3|6893M 32.0M  981M 59.8M|4.80 3.39 1.92|   0     0 | 503k   18M
>  78  11  10   0   0   0|6893M 32.0M  981M 59.5M|4.80 3.39 1.92|   0     0 | 102k  110k
> 
> 
> The low-CPU periods are due to major GC pauses (the JVM is frozen).
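> 
> To double-check the pauses I can enable GC logging (assuming the stock 
> Sun/HotSpot JVM) by adding something like this to conf/cassandra-env.sh; 
> the log path is just an example:
> 
> JVM_OPTS="$JVM_OPTS -verbose:gc"
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
> JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
> JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
> 
> The "Full GC" entries in that log should line up with the idle/wait 
> spikes in the dstat output above.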
> 
>> 
>> TPStats do show activity on the HH. I'll have some examples later if
>> the nodes decide to do this again.
>> 
>>> 
>>> Finally hinted_handoff_throttle_delay_in_ms in conf/cassandra.yaml will let 
>>> you slow down the delivery rate if HH is indeed the problem. 
>> 
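> 
> For reference, I believe that knob lives in conf/cassandra.yaml and looks 
> like this (the value here is illustrative, not a recommendation):
> 
>   hinted_handoff_throttle_delay_in_ms: 50
> 
> We'll try raising it if the hints turn out to be the trigger.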
> 
> 
> Best,
> 
> Gabriel
> 
