Thanks, Roshan, you hit the nail on the head in terms of what I think the issue
is. I set TOPOLOGY_DISRUPTOR_WAIT_TIMEOUT_MILLIS to -1 on all my
processes. Unfortunately, it seems LiteBlockingWaitStrategy doesn't help
much. A system-wide perf recording (perf record -a -F 1000 sleep 15) of an
affected system with all topologies using LiteBlockingWaitStrategy:
[image: Inline image 3]
You can see that while most of the actual time is spent context switching,
basically everything else near the top of the profile is futex-related, which
is perhaps what triggers the context switches. I suspect that having this many
threads will just keep on being an issue.
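
For anyone following along, this is roughly how I'm pushing that value down per
topology. It's only a sketch: the config key is the one named in this thread,
while the class/method names and the submit wiring are just placeholders.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.generated.StormTopology;

    public class SubmitWithWaitTimeout {
        public static void submit(String name, StormTopology topology) throws Exception {
            Config conf = new Config();
            // Per Roshan's description below: timeout > 0 selects
            // TimeoutBlockingWaitStrategy, otherwise LiteBlockingWaitStrategy.
            // -1 is the value I used on these workers.
            conf.put("topology.disruptor.wait.timeout.millis", -1);
            StormSubmitter.submitTopology(name, conf, topology);
        }
    }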

I noticed that the system's load average is higher than the number of
physical cores on the system, even though CPU utilization is not at 100%.
This implies scheduler waits. I also recorded a single second of context
switching using perf sched record -- sleep 1, and the results were troubling
(reported via perf sched latency):


 -----------------------------------------------------------------------------------------------------------------
  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
 -----------------------------------------------------------------------------------------------------------------
 ...
  perf:22515            |    764.640 ms |       20 | avg:    0.125 ms | max:    0.776 ms | max at: 1851448.107336 s
  kworker/u30:0:10511   |    606.032 ms |       45 | avg:    0.028 ms | max:    0.548 ms | max at: 1851448.138279 s
  java:(1595)           |   5818.811 ms |   430314 | avg:    0.022 ms | max:   46.624 ms | max at: 1851447.880011 s
 ...
 -----------------------------------------------------------------------------------------------------------------
  TOTAL:                |   7289.697 ms |   431795 |
 ---------------------------------------------------
Granted, "java" means around 4000 threads aggregated, but, and this is a
big but, it means >100 context switches per thread per second! Insanity!
(note: 7290ms total is because it's cpu time, not real time). This is with
LiteBlockingWaitStrategy.

Finally, I noticed the related setting, topology.disruptor.batch.timeout.millis,
which controls the send side of the wake-ups (how long a partial batch is held
before being flushed to the consumer). I'm perfectly willing to trade a few
hundred milliseconds of latency for throughput here, so I raised it to 20ms,
and that made all the problems go away.
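
For concreteness, the change boils down to the following (same placeholder
wiring as the earlier sketch; 20ms is simply the value that worked for us):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.generated.StormTopology;

    public class SubmitWithBatchTimeout {
        public static void submit(String name, StormTopology topology) throws Exception {
            Config conf = new Config();
            // Hold partial batches on the send side for up to 20ms before
            // flushing, trading a little latency for far fewer wake-ups.
            conf.put("topology.disruptor.batch.timeout.millis", 20);
            StormSubmitter.submitTopology(name, conf, topology);
        }
    }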


On Tue, Jun 13, 2017 at 2:32 AM, Roshan Naik <[email protected]> wrote:

> Let’s see if I can help…
>
>   I do see that the ability to customize the wait strategy via the config
> file was removed sometime after v0.10.2. So there is no way to do that
> anymore, but you can try tweaking topology.disruptor.wait.timeout.millis
> and see if that helps your situation.
>
>
>
> The Disruptor currently uses TimeoutBlockingWaitStrategy if the timeout > 0,
> or LiteBlockingWaitStrategy if the timeout = 0.
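>
> In code terms the rule is roughly the following (an illustrative sketch only,
> not Storm's actual source; the two classes are the LMAX Disruptor ones named
> above):
>
>     import com.lmax.disruptor.LiteBlockingWaitStrategy;
>     import com.lmax.disruptor.TimeoutBlockingWaitStrategy;
>     import com.lmax.disruptor.WaitStrategy;
>     import java.util.concurrent.TimeUnit;
>
>     final class WaitStrategySelection {
>         // timeoutMillis here stands for topology.disruptor.wait.timeout.millis
>         static WaitStrategy forTimeout(long timeoutMillis) {
>             return timeoutMillis > 0
>                     // parks consumers on a lock/condition, but wakes them after the timeout
>                     ? new TimeoutBlockingWaitStrategy(timeoutMillis, TimeUnit.MILLISECONDS)
>                     // also lock/condition based, but skips the wake-up signal
>                     // when no consumer is actually waiting
>                     : new LiteBlockingWaitStrategy();
>         }
>     }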
>
>
>
>   Is your primary concern that too much kernel CPU is being used while idle
> (in the wait strategy), and that this is consequently affecting your ability
> to run more topos on the same box?
>
>
>
> -Roshan
>
>
>
> From: Roee Shenberg <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, June 12, 2017 at 10:06 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: High kernel cpu usage when running 10 topologies on the same worker
>
>
>
> Some better diagnostics: I got a perf recording (30 seconds) of an idle
> topology on a system with 10 idle topologies, using perf-map-agent to
> create a .map file for the JIT-compiled code. The flame graph is attached,
> and it strengthens my desire to use a waiting strategy that isn't based on
> condition variables, since the vast majority of the time was spent in
> futexes.
>
>
>
>
>
> On Sun, Jun 11, 2017 at 6:39 PM, Roee Shenberg <[email protected]> wrote:
>
> A more practical question: is there a way to replace the disruptor
> waiting strategy? I suspect the issue is the ginormous number of threads
> vying to be woken up. Running the disruptor queue with SleepingWaitStrategy
> seems like it should alleviate this pain, but it seems the ability to set
> the wait strategy was removed in the transition from Clojure to Java in
> versions 0.10.x -> 1.0.x.
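>
> (For context, with the raw LMAX API the swap I have in mind would look roughly
> like the sketch below. It's illustrative only -- Storm builds its queues
> internally, so this can't actually be injected without patching -- and the
> event type and class name are placeholders.)
>
>     import com.lmax.disruptor.SleepingWaitStrategy;
>     import com.lmax.disruptor.dsl.Disruptor;
>     import com.lmax.disruptor.dsl.ProducerType;
>     import com.lmax.disruptor.util.DaemonThreadFactory;
>
>     final class SleepingStrategySketch {
>         // placeholder event type, just to make the sketch compile
>         static final class Event { Object payload; }
>
>         static Disruptor<Event> build() {
>             return new Disruptor<>(
>                     Event::new,
>                     1024,                          // ring buffer size (power of two)
>                     DaemonThreadFactory.INSTANCE,
>                     ProducerType.MULTI,
>                     // consumers spin, yield, then briefly park instead of blocking
>                     // on a condition variable, so publishers never have to signal them
>                     new SleepingWaitStrategy());
>         }
>     }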
>
>
>
> On Sun, Jun 11, 2017 at 3:32 PM, Roee Shenberg <[email protected]> wrote:
>
> Our storm cluster (1.0.2) is running many trident topologies, each one
> local to a single worker, with each supervisor having 10 worker slots.
> Every slot runs a copy of the same topology with a different configuration;
> the topology is a fairly fat trident topology (~300 threads per
> topology, totalling >3000 threads on the machine).
>
> A quick htop showed a grim picture of most CPU time being spent in the
> kernel:
> [image: Inline image 1]
>
> (note: running as root inside a Docker container)
>
> Here's an example top summary line:
>
> %Cpu(s): 39.4 us, 51.1 sy,  0.0 ni,  8.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.8 st
>
> This suggests time genuinely wasted in the kernel, not I/O, IRQs, etc., so I
> ran sudo strace -cf -p 2466 to get a feel for what's going on:
>
>
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  86.84 3489.872073       14442    241646     27003 futex
>  10.69  429.437949      271453      1582           epoll_wait
>   1.88   75.608000      361761       209       108 restart_syscall
>   0.29   11.722287       46889       250           recvfrom
>   0.12    4.736911          92     51379           gettimeofday
>   0.08    3.173336          12    254162           clock_gettime
>   0.06    2.234660        4373       511           poll
> ...
>
>
>
> I don't know whether time spent in threads that are simply blocked is counted
> here (in which case this is a worthless measure) or not.
>
>
>
> I ran jvmtop to get some runtime data out of one of the topologies (well,
> a few, but they were all roughly the same):
>
>
>
>   TID  NAME                              STATE          CPU     TOTALCPU  BLOCKEDBY
>
>   203  Thread-27-$mastercoord-bg0-exe    RUNNABLE       57.55%    8.54%
>   414  RMI TCP Connection(3)-172.17.0    RUNNABLE        8.03%    0.01%
>    22  disruptor-flush-trigger           TIMED_WAITING   3.79%    4.49%
>    51  heartbeat-timer                   TIMED_WAITING   0.80%    1.66%
>   328  disruptor-flush-task-pool         TIMED_WAITING   0.61%    0.84%
> ...
>
> So just about all of the time is spent inside the trident master
> coordinator.
>
> My theory is that the ridiculous thread count is causing the kernel to
> work extra hard on all those futex calls (e.g. when waking a thread
> blocked on a futex).
>
> I'm very uncertain about this; the best I can say is that the overhead
> is related more to the number of topologies than to what the topology is
> doing (when running 1 topology on the same worker, CPU use is less than
> 1/10th of what it is with 10 topologies), and there are a *lot* of
> threads on the system (>3000).
>
> Any advice, suggestions for additional diagnostics, or ideas as to the
> cause? Operationally, we're planning to move to smaller instances with fewer
> slots per supervisor to work around this issue, but I'd rather resolve it
> without changing our cluster's makeup entirely.
>
> Thanks,
>
> Roee
>
>
>
>
>
