Our Storm cluster (1.0.2) runs many Trident topologies, each confined to a
single worker, with each supervisor exposing 10 worker slots. Every slot runs
a copy of the same topology with a different configuration, and the topology
itself is fairly fat (roughly 300 threads per topology, so more than 3000
threads on the machine).
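For reference, this is roughly how each copy is submitted - a simplified
sketch, where the class and method names are illustrative rather than our
actual code:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.trident.TridentTopology;

public class SubmitOneCopy {
    public static void main(String[] args) throws Exception {
        String instanceId = args[0];          // each copy gets its own configuration

        // buildTopology() stands in for our real wiring (spouts, each(),
        // persistentAggregate(), ...); its parallelism hints are what add up
        // to the ~300 executor threads per topology mentioned above.
        TridentTopology trident = buildTopology(instanceId);

        Config conf = new Config();
        conf.setNumWorkers(1);                // each topology is local to a single worker slot

        StormSubmitter.submitTopology("topo-" + instanceId, conf, trident.build());
    }

    private static TridentTopology buildTopology(String instanceId) {
        return new TridentTopology();         // real spout/stream wiring elided
    }
}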
A quick htop showed a grim picture of most CPU time being spent in the
kernel:
[image: htop screenshot]
(note: running as root inside a Docker container)
Here's an example top summary line:
%Cpu(s): 39.4 us, 51.1 sy, 0.0 ni, 8.6 id, 0.0 wa, 0.0 hi, 0.1 si, 0.8 st
This suggests time genuinely wasted in the kernel, not I/O, IRQs, etc., so I
ran sudo strace -cf -p 2466 to get a feel for what's going on:
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
  86.84 3489.872073       14442    241646     27003 futex
  10.69  429.437949      271453      1582           epoll_wait
   1.88   75.608000      361761       209       108 restart_syscall
   0.29   11.722287       46889       250           recvfrom
   0.12    4.736911          92     51379           gettimeofday
   0.08    3.173336          12    254162           clock_gettime
   0.06    2.234660        4373       511           poll
 ...
I'm not sure whether time spent by threads that are simply blocked (e.g.
sitting in FUTEX_WAIT) is counted here, in which case this summary would be a
fairly worthless measure.
I ran jvmtop to get some runtime data out of one of the topologies (well, a
few, but they were all roughly the same):
 TID NAME                           STATE            CPU TOTALCPU BLOCKEDBY
 203 Thread-27-$mastercoord-bg0-exe RUNNABLE      57.55%    8.54%
 414 RMI TCP Connection(3)-172.17.0 RUNNABLE       8.03%    0.01%
  22 disruptor-flush-trigger        TIMED_WAITING  3.79%    4.49%
  51 heartbeat-timer                TIMED_WAITING  0.80%    1.66%
 328 disruptor-flush-task-pool      TIMED_WAITING  0.61%    0.84%
 ...
So just about all of the CPU time is spent inside the Trident master
coordinator.
My theory is that the ridiculous thread count is making the kernel work
extra hard on all those futex calls (e.g. when waking a thread blocked on
a futex).
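To illustrate the theory (a toy sketch, not our topology code): on Linux,
LockSupport.park()/unpark() - which java.util.concurrent locks and condition
waits boil down to - are implemented on top of futex, so thousands of
mostly-parked threads being woken constantly should produce exactly this kind
of futex-dominated strace profile:

import java.util.concurrent.locks.LockSupport;

public class FutexChurn {
    public static void main(String[] args) throws Exception {
        final int threadCount = 3000;                 // roughly our per-machine thread count
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    LockSupport.park();               // blocks in FUTEX_WAIT
                }
            }, "parked-" + i);
            workers[i].setDaemon(true);
            workers[i].start();
        }
        // Keep waking every thread; each unpark of a parked thread is a
        // FUTEX_WAKE the kernel has to handle, so strace -cf on this process
        // should show a similarly futex-heavy summary.
        while (true) {
            for (Thread t : workers) {
                LockSupport.unpark(t);
            }
        }
    }
}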
I'm very uncertain about this; the best I can say is that the overhead seems
related more to the number of topologies than to what each topology is doing
(running just 1 topology, i.e. one worker, on the machine uses less than
1/10th of the CPU that 10 topologies do), and there are a *lot* of threads
on the system (>3000).
Any advice, suggestions for additional diagnostics, or ideas as to the
cause? Operationally, we're planning to move to smaller instances with fewer
slots per supervisor to work around this, but I'd rather resolve it without
changing our cluster's makeup entirely.
Thanks,
Roee