Thanks, guys! Jeff Jirsa helped me take a look, and I found a 10-second young GC pause in the GC log:
3071128K->282000K(3495296K), 0.1144648 secs] 25943529K->23186623K(66409856K), 9.8971781 secs] [Times: user=2.33 sys=0.00, real=9.89 secs]

I'm trying to get a histogram or heap dump.

Thanks!

On Mon, Jan 23, 2017 at 7:08 PM, Brandon Williams <dri...@gmail.com> wrote:

> The lion's share of your drops are from cross-node timeouts, which require
> clock synchronization, so check that first. If your clocks are synced,
> that means not only are you showing eager dropping based on time, but
> despite the eager dropping you are still facing overload.
>
> That local, non-gc pause is also troubling. (I assume non-gc since there
> wasn't anything logged by the GC inspector.)
>
> On Mon, Jan 23, 2017 at 12:36 AM, Dikang Gu <dikan...@gmail.com> wrote:
>
> > Hello there,
> >
> > We have a 100 nodes ish cluster, I find that there are dropped messages on
> > random nodes in the cluster, which caused error spikes and P99 latency
> > spikes as well.
> >
> > I tried to figure out the cause. I do not see any obvious bottleneck in
> > the cluster, the C* nodes still have plenty of cpu idle/disk io. But I do
> > see some suspicious gossip events around that time, not sure if it's
> > related.
> >
> > 2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking nodes down due to local pause of 13079498815 > 5000000000
> > 2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION messages were dropped in last 5000 ms: 65 for internal timeout and 10895 for cross node timeout
> > 2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross node timeout
> > 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: Pool Name        Active   Pending   Completed    Blocked   All Time Blocked
> > 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: MutationStage    128      47794     1015525068   0         0
> > 2017-01-21_16:43:56.85535
> > 2017-01-21_16:43:56.85535 INFO 16:43:56 [ScheduledTasks:1]: ReadStage        64       20202     450508940    0         0
> >
> > Any suggestions?
> >
> > Thanks!
> >
> > --
> > Dikang
>

--
Dikang
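
P.S. In case it helps anyone hunting for similar pauses, here is a rough sketch of how the GC log could be scanned for long stop-the-world events. It assumes every event ends with the `[Times: user=... sys=..., real=... secs]` suffix in exactly the format of the line quoted above; the function name `find_long_pauses` and the 1-second default threshold are made up for illustration, not part of any Cassandra or JVM tooling.

```python
import re

# Assumption: GC events end with a suffix shaped like
# "[Times: user=2.33 sys=0.00, real=9.89 secs]", as in the line quoted above.
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)


def find_long_pauses(lines, threshold_secs=1.0):
    """Return (line_number, user, sys, real) for events whose wall-clock time exceeds the threshold.

    Hypothetical helper: lines without a [Times: ...] suffix are skipped.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue
        user, sys_time, real = (float(match.group(k)) for k in ("user", "sys", "real"))
        if real > threshold_secs:
            hits.append((lineno, user, sys_time, real))
    return hits


# The GC log line from this thread, used as sample input.
sample = [
    "3071128K->282000K(3495296K), 0.1144648 secs] "
    "25943529K->23186623K(66409856K), 9.8971781 secs] "
    "[Times: user=2.33 sys=0.00, real=9.89 secs]"
]

for lineno, user, sys_time, real in find_long_pauses(sample):
    print(f"line {lineno}: real={real}s user={user}s sys={sys_time}s")
```

When `real` is far larger than `user` plus `sys`, as it is here, the JVM was usually waiting on something rather than burning CPU on the collection, so that gap is worth checking alongside the histogram.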