Our Storm cluster (1.0.2) is running many Trident topologies, each local to a
single worker, with each supervisor having 10 worker slots. Every slot runs a
copy of the same topology with a different configuration. The topology itself
is a fairly fat Trident topology (~300 threads per topology, totalling >3000
threads on the machine).

A quick htop showed a grim picture of most CPU time being spent in the
kernel:
[htop screenshot omitted]
(note: running as root inside a Docker container)

Here's an example top summary line:

%Cpu(s): 39.4 us, 51.1 sy,  0.0 ni,  8.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.8 st
This suggests time really is being wasted in the kernel rather than on I/O,
IRQs, etc., so I ran sudo strace -cf -p 2466 to get a feel for what's going on:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 86.84 3489.872073       14442    241646     27003 futex
 10.69  429.437949      271453      1582           epoll_wait
  1.88   75.608000      361761       209       108 restart_syscall
  0.29   11.722287       46889       250           recvfrom
  0.12    4.736911          92     51379           gettimeofday
  0.08    3.173336          12    254162           clock_gettime
  0.06    2.234660        4373       511           poll
...
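
To figure out which kernel paths that futex time actually goes into, my next
step is probably a perf profile of one of the workers (2466 is the same pid as
in the strace run above; the exact options are just a sketch on my part):

sudo perf top -g -p 2466                                       # live view; kernel symbols like futex_wait/futex_wake should show up if that's where the time goes
sudo perf record -g -p 2466 -- sleep 30 && sudo perf report    # or record 30s and inspect offline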

I don't understand whether threads that are simply blocked inside a syscall are
counted towards these times (in which case this is a worthless measure) or not.
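
To cross-check that without relying on strace's accounting, I'm planning to
look at per-thread user vs. system CPU directly; something like (same worker
pid, sysstat's pidstat assumed to be installed):

pidstat -u -t -p 2466 5 3     # per-thread %usr / %system over three 5-second samples
top -H -p 2466                # same idea interactively, threads sorted by CPU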

I ran jvmtop to get some runtime data out of one of the topologies (well, a
few, but they were all roughly the same):

  TID  NAME                             STATE          CPU     TOTALCPU  BLOCKEDBY
  203  Thread-27-$mastercoord-bg0-exe   RUNNABLE       57.55%     8.54%
  414  RMI TCP Connection(3)-172.17.0   RUNNABLE        8.03%     0.01%
   22  disruptor-flush-trigger          TIMED_WAITING   3.79%     4.49%
   51  heartbeat-timer                  TIMED_WAITING   0.80%     1.66%
  328  disruptor-flush-task-pool        TIMED_WAITING   0.61%     0.84%
...

So just about all of the CPU time is spent inside the Trident master
coordinator.
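
To see what that coordinator thread is actually doing I'll probably grab a few
stack dumps next; roughly something like this (the worker pid and the grep
window are placeholders):

jstack 2466 | grep -A 20 'mastercoord'   # stack of the Thread-27-$mastercoord-bg0-exe thread
# taking several dumps a few seconds apart and comparing them should show whether it's always in the same spot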

My theory is that the ridiculous thread count is causing the kernel to work
extra-hard on all those futex calls (e.g. when waking a thread blocking on
a futex).
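
If that's right, the raw futex/wakeup rate should scale with the number of
running topologies, which seems measurable with something like this (event
names may vary by kernel, so treat it as a sketch):

sudo perf stat -e 'syscalls:sys_enter_futex' -a -- sleep 10   # system-wide futex syscall count over 10s
sudo perf stat -e 'sched:sched_wakeup' -a -- sleep 10         # thread wakeups over the same window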

I'm very uncertain about this; the best I can say is that the overhead is
related more to the number of topologies than to what each topology is doing
(when running 1 topology on the same worker, CPU use is less than 1/10th of
what it is with 10 topologies), and that there are a *lot* of threads on the
system (>3000).
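
For what it's worth, the thread counts above can be sanity-checked with
something like:

ps -eLf | wc -l              # total threads on the box (plus a header line)
ls /proc/2466/task | wc -l   # threads in a single worker JVM; pid is a placeholder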

Any advice, suggestions for additional diagnostics, or ideas as to the cause?
Operationally, we're planning to move to smaller instances with fewer slots per
supervisor to work around this issue, but I'd rather resolve it without
changing our cluster's makeup entirely.

Thanks,

Roee
