Thanks, Roshan, you hit the nail on the head as to what I think the issue is. I set TOPOLOGY_DISRUPTOR_WAIT_TIMEOUT_MILLIS (topology.disruptor.wait.timeout.millis) to -1 on all my processes. Unfortunately, it seems LiteBlockingWaitStrategy doesn't help much. Here is a system-wide perf recording (perf record -a -F 1000 sleep 15) of an affected system with all topologies using LiteBlockingWaitStrategy:

[image: Inline image 3]

You can see that while most of the actual time is spent context-switching, basically everything else showing up near the top is futex-related, which is probably what triggers the context switches. I suspect that having this many threads will simply keep being an issue.
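For concreteness, here's roughly how I'm applying that override per topology before submitting (a minimal sketch rather than our actual code; the class and helper names are invented, and I'm just using the raw config key Roshan mentioned):

    import java.util.Map;

    import org.apache.storm.Config;

    public class DisruptorWaitOverride {

        // Hypothetical helper: the per-topology overrides we merge into the conf
        // that gets passed to StormSubmitter.submitTopology(...).
        static Map<String, Object> waitStrategyOverride() {
            Config conf = new Config();
            // A timeout <= 0 selects LiteBlockingWaitStrategy on the consumer side;
            // a positive value selects TimeoutBlockingWaitStrategy (per Roshan's note).
            conf.put("topology.disruptor.wait.timeout.millis", -1);
            return conf;
        }

        public static void main(String[] args) {
            System.out.println(waitStrategyOverride());
        }
    }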
I noticed that the system's load average is higher than the number of physical cores, even though the CPUs aren't 100% occupied, which implies threads are waiting on the scheduler. I also recorded a single second of context switching using perf sched record -- sleep 1, and the results are troubling (reported with perf sched latency):

 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
  ...
  perf:22515            |    764.640 ms |       20 | avg:    0.125 ms | max:    0.776 ms | max at: 1851448.107336 s
  kworker/u30:0:10511   |    606.032 ms |       45 | avg:    0.028 ms | max:    0.548 ms | max at: 1851448.138279 s
  java:(1595)           |   5818.811 ms |   430314 | avg:    0.022 ms | max:   46.624 ms | max at: 1851447.880011 s
  ...
 -----------------------------------------------------------------------------------------------------------------
  TOTAL:                |   7289.697 ms |   431795 |
 ---------------------------------------------------

Granted, "java" here aggregates around 4000 threads, but (and this is a big but) that still works out to more than 100 context switches per thread per second. Insanity! (Note: the 7,290ms total is because it's CPU time, not real time.) This is with LiteBlockingWaitStrategy.

Finally, I noticed the related setting, topology.disruptor.batch.timeout.millis, which controls the send side of the wake-ups. I'm perfectly willing to trade a few hundred milliseconds of latency for throughput here, so I raised it to 20ms, and that made all the problems go away.
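In case it's useful to anyone else hitting this, the fix boils down to a single per-topology override (again just a sketch, assuming you merge this into the conf you submit with; the class name is invented):

    import org.apache.storm.Config;

    public class DisruptorBatchOverride {

        public static void main(String[] args) {
            Config conf = new Config();
            // Producer-side flush interval: as I understand it, senders now buffer
            // tuples for up to 20ms before waking the downstream consumer thread,
            // so there are far fewer wake-ups at the cost of up to ~20ms of added
            // latency per hop.
            conf.put("topology.disruptor.batch.timeout.millis", 20);
            System.out.println(conf);
        }
    }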
On Tue, Jun 13, 2017 at 2:32 AM, Roshan Naik <[email protected]> wrote:

> Let's see if I can help…
>
> I do see that the ability to customize the wait strategy via the config
> file was removed sometime after v0.10.2. So there is no way to do that
> anymore, but you can try to tweak topology.disruptor.wait.timeout.millis
> and see if that helps your situation.
>
> The TimeoutBlockingWaitStrategy is what is currently being used for the
> Disruptor if the timeout > 0, or LiteBlockingWaitStrategy if the timeout = 0.
>
> Is your primary concern that too much kernel CPU is being used in idle
> mode (in the wait strategy) and consequently it is affecting your ability
> to run more topos on the same box?
>
> -Roshan
>
> From: Roee Shenberg <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, June 12, 2017 at 10:06 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: High kernel cpu usage when running 10 topologies on the
> same worker
>
> Some better diagnostics: I got a 30-second perf recording of an idle
> topology on a system with 10 idle topologies, using perf-map-agent to
> create a .map file for the JIT-compiled code. The flame graph is attached,
> and it strengthens my desire to use a wait strategy that isn't based on
> condition variables, seeing as the vast majority of the time was spent in
> futexes.
>
> On Sun, Jun 11, 2017 at 6:39 PM, Roee Shenberg <[email protected]> wrote:
>
> A more practical question is: is there a way to replace the disruptor
> wait strategy? I suspect the issue is the ginormous number of threads
> vying to be woken up. Running the disruptor queue with SleepingWaitStrategy
> seems like it should alleviate this pain, but it seems the ability to set
> the wait strategy was removed in the transition from Clojure to Java in
> versions 0.10.x -> 1.0.x.
>
> On Sun, Jun 11, 2017 at 3:32 PM, Roee Shenberg <[email protected]> wrote:
>
> Our Storm cluster (1.0.2) is running many Trident topologies, each one
> local to a single worker, with each supervisor having 10 worker slots.
> Every slot runs a copy of the same topology with a different configuration,
> the topology being a fairly fat Trident topology (~300 threads per
> topology, totalling >3000 threads on the machine).
>
> A quick htop showed a grim picture of most CPU time being spent in the
> kernel:
>
> [image: Inline image 1]
>
> (note: running as root inside a Docker container)
>
> Here's an example top summary line:
>
> %Cpu(s): 39.4 us, 51.1 sy, 0.0 ni, 8.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.8 st
>
> This suggests actual kernel time waste, not I/O, IRQs, etc., so I ran
> sudo strace -cf -p 2466 to get a feel for what's going on:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  86.84 3489.872073       14442    241646     27003 futex
>  10.69  429.437949      271453      1582           epoll_wait
>   1.88   75.608000      361761       209       108 restart_syscall
>   0.29   11.722287       46889       250           recvfrom
>   0.12    4.736911          92     51379           gettimeofday
>   0.08    3.173336          12    254162           clock_gettime
>   0.06    2.234660        4373       511           poll
> ...
>
> I don't understand whether threads that are simply blocking are counted
> (in which case this is a worthless measure) or not.
>
> I ran jvmtop to get some runtime data out of one of the topologies (well,
> a few, but they were all roughly the same):
>
>   TID NAME                           STATE          CPU    TOTALCPU  BLOCKEDBY
>   203 Thread-27-$mastercoord-bg0-exe RUNNABLE     57.55%       8.54%
>   414 RMI TCP Connection(3)-172.17.0 RUNNABLE      8.03%       0.01%
>    22 disruptor-flush-trigger        TIMED_WAITING  3.79%       4.49%
>    51 heartbeat-timer                TIMED_WAITING  0.80%       1.66%
>   328 disruptor-flush-task-pool      TIMED_WAITING  0.61%       0.84%
> ...
>
> So just about all of the time is spent inside the Trident master
> coordinator.
>
> My theory is that the ridiculous thread count is causing the kernel to
> work extra hard on all those futex calls (e.g. when waking a thread
> blocked on a futex).
>
> I'm very uncertain about this; the best I can say is that the overhead
> is related more to the number of topologies than to what the topology is
> doing (when running 1 topology on the same worker, CPU use is less than
> 1/10th of what it is with 10 topologies), and there are a *lot* of
> threads on the system (>3000).
>
> Any advice, suggestions for additional diagnostics, or ideas as to the
> cause? Operationally, we're planning to move to smaller instances with
> fewer slots per worker to work around this issue, but I'd rather resolve
> it without changing our cluster's makeup entirely.
>
> Thanks,
> Roee
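P.S. For anyone curious what these wait strategies look like at the Disruptor level, outside of Storm's wiring, here's a standalone sketch against the LMAX Disruptor API (assuming a recent 3.x disruptor jar; the Event class and the demo flow are invented for illustration):

    import java.util.concurrent.Executors;

    import com.lmax.disruptor.SleepingWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;

    public class WaitStrategyDemo {

        // Invented event type: a mutable slot reused by the ring buffer.
        static class Event {
            long value;
        }

        public static void main(String[] args) throws Exception {
            Disruptor<Event> disruptor = new Disruptor<>(
                    Event::new,                       // event factory
                    1024,                             // ring buffer size (power of two)
                    Executors.defaultThreadFactory(),
                    ProducerType.MULTI,
                    // Idle consumers spin, then yield, then park briefly, so producers
                    // never have to signal a condition variable (no futex wake-up per
                    // published event). Swap in new LiteBlockingWaitStrategy() or
                    // new TimeoutBlockingWaitStrategy(10, TimeUnit.MILLISECONDS) to
                    // compare the strategies discussed above.
                    new SleepingWaitStrategy());

            // Consumer: print each event as it is handled.
            disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                    System.out.println("consumed " + event.value));
            disruptor.start();

            // Producer: publish a single event.
            disruptor.getRingBuffer().publishEvent((event, sequence) -> event.value = 42);

            Thread.sleep(200);   // give the consumer thread time to drain
            disruptor.shutdown();
        }
    }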
