Dikang,

Did you take a look at the heap health on those nodes? A quick heap
histogram or dump would help you figure out if it is related to data
issue(wide rows, or bad model)  where few nodes may be coming under heap
pressure and dropping messages.

Thanks,
Roopa



*Regards,*

*Roopa Tangirala*

Engineering Manager CDE

*(408) 438-3156 - mobile*





On Mon, Jan 23, 2017 at 4:55 PM, Blake Eggleston <beggles...@apple.com>
wrote:

> Hi Dikang,
>
> Do you have any GC logging or metrics you can correlate with the dropped
> messages? A 13 second pause sounds like a bad GC pause.
>
> Thanks,
>
> Blake
>
>
> On January 22, 2017 at 10:37:22 PM, Dikang Gu (dikan...@gmail.com) wrote:
>
> Btw, the C* version is 2.2.5, with several backported patches.
>
> On Sun, Jan 22, 2017 at 10:36 PM, Dikang Gu <dikan...@gmail.com> wrote:
>
> > Hello there,
> >
> > We have a 100 nodes ish cluster, I find that there are dropped messages
> on
> > random nodes in the cluster, which caused error spikes and P99 latency
> > spikes as well.
> >
> > I tried to figure out the cause. I do not see any obvious bottleneck in
> > the cluster, the C* nodes still have plenty of cpu idle/disk io. But I do
> > see some suspicious gossip events around that time, not sure if it's
> > related.
> >
> > 2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking
> > nodes down due to local pause of 13079498815 > 5000000000
> > 2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION
> > messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> > for cross node timeout
> > 2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages
> > were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> > node timeout
> > 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: Pool Name
> > Active Pending Completed Blocked All Time Blocked
> > 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: MutationStage
> > 128 47794 1015525068 0 0
> > 2017-01-21_16:43:56.85535
> > 2017-01-21_16:43:56.85535 INFO 16:43:56 [ScheduledTasks:1]: ReadStage
> > 64 20202 450508940 0 0
> >
> > Any suggestions?
> >
> > Thanks!
> >
> > --
> > Dikang
> >
> >
>
>
> --
> Dikang
>

Reply via email to