Re: Dropped messages on random nodes.

2017-01-22 Thread Dikang Gu
Btw, the C* version is 2.2.5, with several backported patches.

On Sun, Jan 22, 2017 at 10:36 PM, Dikang Gu  wrote:

> Hello there,
>
> We have a 100 nodes ish cluster, I find that there are dropped messages on
> random nodes in the cluster, which caused error spikes and P99 latency
> spikes as well.
>
> I tried to figure out the cause. I do not see any obvious bottleneck in
> the cluster, the C* nodes still have plenty of cpu idle/disk io. But I do
> see some suspicious gossip events around that time, not sure if it's
> related.
>
> 2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking
> nodes down due to local pause of 13079498815 > 50
> 2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
> messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> for cross node timeout
> 2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ messages
> were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> node timeout
> 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name
>Active   Pending  Completed   Blocked  All Time Blocked
> 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: MutationStage
>   128 47794 1015525068 0 0
> 2017-01-21_16:43:56.85535
> 2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage
>64 20202  450508940 0 0
>
> Any suggestions?
>
> Thanks!
>
> --
> Dikang
>
>


-- 
Dikang


Dropped messages on random nodes.

2017-01-22 Thread Dikang Gu
Hello there,

We have a 100 nodes ish cluster, I find that there are dropped messages on
random nodes in the cluster, which caused error spikes and P99 latency
spikes as well.

I tried to figure out the cause. I do not see any obvious bottleneck in the
cluster, the C* nodes still have plenty of cpu idle/disk io. But I do see
some suspicious gossip events around that time, not sure if it's related.

2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking nodes
down due to local pause of 13079498815 > 50
2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
messages were dropped in last 5000 ms: 65 for internal timeout and 10895
for cross node timeout
2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ messages
were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
node timeout
2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name
   Active   Pending  Completed   Blocked  All Time Blocked
2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: MutationStage
  128 47794 1015525068 0 0
2017-01-21_16:43:56.85535
2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage
   64 20202  450508940 0 0

Any suggestions?

Thanks!

-- 
Dikang