[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587624#comment-16587624 ]
Benedict commented on CASSANDRA-13651:
--------------------------------------

So, I assume we're now defaulting to epoll in most cases, and this behaviour comes from a period when that wasn't the default (and it was probably poorly justified even at the time - AFAICR we used to benchmark with only a single connection, where this behaviour would be more beneficial).

It's a shame we no longer have any standard benchmarking tools for the project, but it seems we have multiple data points demonstrating a win (or no loss), and the code is simpler after the patch. So I'm +1 on the patch. I will get a CircleCI run going shortly.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Corentin Chary
>            Assignee: Corentin Chary
>            Priority: Major
>             Fix For: 4.x
>
>         Attachments: cpu-usage.png
>
>
> I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE  CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At first I thought that the profiler might not be able to profile native code properly, but I went further and realized that most of the CPU was used by {{epoll_wait()}} calls with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the overhead comes from calls with timeout == 0.
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output
>   90.06%  epfd: 0x00000047, events: 0x7f5588c0c000, maxevents: 0x00002000, timeout: 0x00000000
>    5.77%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x00000000
>    1.98%  epfd: 0x000000b5, events: 0x7fca419ef000, maxevents: 0x00001000, timeout: 0x000003e8
>    0.04%  epfd: 0x00000003, events: 0x2f6af77b9c00, maxevents: 0x00000020, timeout: 0x00000000
>    0.04%  epfd: 0x0000002b, events: 0x121ebf63ac00, maxevents: 0x00000040, timeout: 0x00000000
>    0.03%  epfd: 0x00000026, events: 0x7f51f80019c0, maxevents: 0x00000020, timeout: 0x00000000
>    0.02%  epfd: 0x00000003, events: 0x7fe4d80019d0, maxevents: 0x00000020, timeout: 0x00000000
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children      Self       sys       usr  Trace output
> # ........  ........  ........  ........  ....................................................................................
> #
>      8.61%     8.61%     0.00%     8.61%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x00000000
>             |
>             ---0x1000200af313
>                |
>                 --8.61%--0x7fca6117bdac
>                          0x7fca60459804
>                          epoll_wait
>      2.98%     2.98%     0.00%     2.98%  epfd: 0x000000a7, events: 0x7fca452d6000, maxevents: 0x00001000, timeout: 0x000003e8
>             |
>             ---0x1000200af313
>                0x7fca6117b830
>                0x7fca60459804
>                epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure whether perf reports a per-CPU percentage or a per-system percentage, but that would still be at least 10% of Cassandra's total CPU usage.
> I went further and found the code responsible for all of this: we schedule a lot of {{Message::Flusher}} tasks with a deadline of 10 usec (5 per message, I think), but netty+epoll only supports millisecond-granularity timeouts and will convert everything below that to 0.
> I added some traces to netty (4.1):
> {code}
> diff --git a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> index 909088fde..8734bbfd4 100644
> --- a/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> +++ b/transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
> @@ -208,10 +208,15 @@ final class EpollEventLoop extends SingleThreadEventLoop {
>              long currentTimeNanos = System.nanoTime();
>              long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
>              for (;;) {
> -                long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
> +                long timeoutNanos = selectDeadLineNanos - currentTimeNanos + 500000L;
> +                long timeoutMillis = timeoutNanos / 1000000L;
> +                System.out.printf("timeoutNanos: %d, timeoutMillis: %d | deadline: %d - now: %d | hastask: %d\n",
> +                                  timeoutNanos, timeoutMillis,
> +                                  selectDeadLineNanos, currentTimeNanos, hasTasks() ? 1 : 0);
>                  if (timeoutMillis <= 0) {
>                      if (selectCnt == 0) {
>                          int ready = Native.epollWait(epollFd.intValue(), events, 0);
> +                        System.out.printf("ready: %d\n", ready);
>                          if (ready > 0) {
>                              return ready;
>                          }
> {code}
> And this gives:
> {code}
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782341816510 - now: 2001781341816510 | hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342087239 - now: 2001781342087239 | hastask: 0
> timeoutNanos: 1000500000, timeoutMillis: 1000 | deadline: 2001782342166947 - now: 2001781342166947 | hastask: 0
> timeoutNanos: 508459, timeoutMillis: 0 | deadline: 2001781342297987 - now: 2001781342289528 | hastask: 0
> ready: 0
> timeoutNanos: 508475, timeoutMillis: 0 | deadline: 2001781342357719 - now: 2001781342349244 | hastask: 0
> ready: 0
> timeoutNanos: 509327, timeoutMillis: 0 | deadline: 2001781342394822 - now: 2001781342385495 | hastask: 0
> ready: 0
> timeoutNanos: 509339, timeoutMillis: 0 | deadline: 2001781342430192 - now: 2001781342420853 | hastask: 0
> ready: 0
> timeoutNanos: 509510, timeoutMillis: 0 | deadline: 2001781342461588 - now: 2001781342452078 | hastask: 0
> ready: 0
> timeoutNanos: 509493, timeoutMillis: 0 | deadline: 2001781342495044 - now: 2001781342485551 | hastask: 0
> ready: 0
> {code}
> The nanosecond timeouts all come from {{eventLoop.schedule(this, 10000, TimeUnit.NANOSECONDS);}} in {{Message::Flusher}}.
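As a minimal, self-contained sketch of the rounding quoted above (plain Java; the class and method names here are illustrative, not netty's API): a 10 usec deadline always rounds to a 0 ms timeout, so the loop falls through to a non-blocking {{epollWait}}, while an ordinary 1 s deadline becomes a real blocking wait.

{code}
public class TimeoutRoundingSketch {
    // Mirrors the arithmetic in the netty 4.1 EpollEventLoop code quoted above:
    // the nanosecond delay until the next scheduled task is rounded to whole
    // milliseconds before being passed to epoll_wait().
    static long timeoutMillis(long delayNanos) {
        return (delayNanos + 500000L) / 1000000L;
    }

    public static void main(String[] args) {
        long flusherDelayNanos = 10000L;        // the 10 usec Flusher reschedule
        long ordinaryDelayNanos = 1000000000L;  // a "normal" 1 s scheduled task

        // 10000 ns -> (10000 + 500000) / 1000000 = 0 ms: non-blocking epoll_wait(..., 0)
        System.out.println("flusher deadline -> timeoutMillis = " + timeoutMillis(flusherDelayNanos));
        // 1000000000 ns -> 1000 ms: a real blocking wait
        System.out.println("1s deadline      -> timeoutMillis = " + timeoutMillis(ordinaryDelayNanos));
    }
}
{code}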
> Knowing that, I'm not sure what would be best to do, and I have a hard time understanding Message::Flusher, but to me it looks like scheduling fewer tasks would probably help, and I didn't see anything obvious that could be done on the netty side.
> Changing {{if (++runsWithNoWork > 5)}} to 2 seems to help a little bit, but it isn't really significant.
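To make the {{runsWithNoWork}} remark concrete, here is a very simplified, hypothetical model of the flusher pattern, using plain JDK scheduling rather than the actual Cassandra/netty code: the task drains a queue, reschedules itself every 10 usec, and only stops after several consecutive idle runs. Lowering the threshold from 5 to 2 simply makes it stop generating sub-millisecond wakeups sooner.

{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical simplification of the Message::Flusher idea, not the real class.
class FlusherSketch implements Runnable {
    private final ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean();
    private int runsWithNoWork = 0;

    void enqueue(String payload) {
        pending.add(payload);
        // Start the flush loop lazily, only when there is something to flush.
        if (scheduled.compareAndSet(false, true)) {
            loop.execute(this);
        }
    }

    @Override
    public void run() {
        boolean didWork = false;
        for (String payload; (payload = pending.poll()) != null; ) {
            didWork = true;  // "flush" the payload (elided in this sketch)
        }

        if (didWork) {
            runsWithNoWork = 0;
        } else if (++runsWithNoWork > 5) {   // the threshold the reporter tried lowering to 2
            runsWithNoWork = 0;
            scheduled.set(false);            // stop; the next enqueue() restarts the loop
            return;
        }
        // Reschedule with a 10 usec deadline; on the real netty event loop this
        // rounds to a 0 ms timeout and shows up as the epoll_wait(..., 0) calls above.
        loop.schedule(this, 10000, TimeUnit.NANOSECONDS);
    }
}
{code}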