Thanks guys! Jeff Jirsa helped me take a look, and I found a ~10-second
young GC pause in the GC log.

3071128K->282000K(3495296K), 0.1144648 secs]
25943529K->23186623K(66409856K), 9.8971781 secs]
[Times: user=2.33 sys=0.00, real=9.89 secs]
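For anyone else chasing this, a quick way to scan a HotSpot GC log for long stop-the-world pauses is to filter on the `real=` wall-clock field. This is just a sketch; the log path and the 1-second threshold are assumptions to adjust for your setup:

```shell
# Scan a HotSpot GC log for stop-the-world pauses longer than 1 second
# by extracting the "real=N.NN" wall-clock field from each pause record.
# GC_LOG is an assumed path; point it at your node's gc.log.
GC_LOG=${GC_LOG:-/var/log/cassandra/gc.log}
if [ -f "$GC_LOG" ]; then
  grep -oE 'real=[0-9]+\.[0-9]+' "$GC_LOG" | awk -F= '$2 > 1.0'
fi
```

Note that `user=2.33` against `real=9.89` (low user/real ratio for a parallel collector) often points at the GC threads being starved by something outside the JVM, e.g. swapping or disk I/O stalls, rather than pure heap pressure.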

I'm trying to get a histogram or heap dump.
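For the histogram/heap dump step, the standard JDK `jmap` tool works; here is a sketch (the `CassandraDaemon` process-name match and the output paths are assumptions):

```shell
# Grab a per-class histogram and a full heap dump from the Cassandra JVM.
# Assumes the process name matches "CassandraDaemon" and jmap is on PATH.
# Caution: the ":live" option forces a full GC before collecting,
# which itself causes a long pause on a large heap.
PID=$(pgrep -f CassandraDaemon || true)
if [ -n "$PID" ]; then
  jmap -histo:live "$PID" > /tmp/cassandra-histo.txt
  jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof "$PID"
fi
```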


Thanks!



On Mon, Jan 23, 2017 at 7:08 PM, Brandon Williams <dri...@gmail.com> wrote:

> The lion's share of your drops are from cross-node timeouts, which require
> clock synchronization, so check that first.  If your clocks are synced,
> that means not only are you showing eager dropping based on time, but
> despite the eager dropping you are still facing overload.
>
> That local, non-gc pause is also troubling. (I assume non-gc since there
> wasn't anything logged by the GC inspector.)
>
> On Mon, Jan 23, 2017 at 12:36 AM, Dikang Gu <dikan...@gmail.com> wrote:
>
> > Hello there,
> >
> > We have a cluster of roughly 100 nodes. I'm seeing dropped messages on
> > random nodes in the cluster, which cause error spikes and P99 latency
> > spikes as well.
> >
> > I tried to figure out the cause. I don't see any obvious bottleneck in
> > the cluster; the C* nodes still have plenty of idle CPU and disk I/O
> > headroom. But I do see some suspicious gossip events around that time,
> > and I'm not sure if they're related.
> >
> > 2017-01-21_16:43:56.71033 WARN  16:43:56 [GossipTasks:1]: Not marking
> > nodes down due to local pause of 13079498815 > 5000000000
> > 2017-01-21_16:43:56.85532 INFO  16:43:56 [ScheduledTasks:1]: MUTATION
> > messages were dropped in last 5000 ms: 65 for internal timeout and 10895
> > for cross node timeout
> > 2017-01-21_16:43:56.85533 INFO  16:43:56 [ScheduledTasks:1]: READ
> messages
> > were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross
> > node timeout
> > 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: Pool Name        Active   Pending      Completed   Blocked  All Time Blocked
> > 2017-01-21_16:43:56.85534 INFO  16:43:56 [ScheduledTasks:1]: MutationStage       128     47794     1015525068         0                 0
> > 2017-01-21_16:43:56.85535 INFO  16:43:56 [ScheduledTasks:1]: ReadStage            64     20202      450508940         0                 0
> >
> > Any suggestions?
> >
> > Thanks!
> >
> > --
> > Dikang
> >
> >
>



-- 
Dikang
