[
https://issues.apache.org/jira/browse/KAFKA-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649546#comment-16649546
]
Jun Rao commented on KAFKA-7504:
[~kawamuray], thanks for the jira. Great find. Simple, but yet effective
solution. Very useful.
> Broker performance degradation caused by call of sendfile reading disk in
> network thread
>
>
> Key: KAFKA-7504
> URL: https://issues.apache.org/jira/browse/KAFKA-7504
> Project: Kafka
> Issue Type: Improvement
> Components: core
>Affects Versions: 0.10.2.1
>Reporter: Yuto Kawamura
>Assignee: Yuto Kawamura
>Priority: Major
> Labels: latency, performance
> Attachments: image-2018-10-14-14-18-38-149.png,
> image-2018-10-14-14-18-57-429.png, image-2018-10-14-14-19-17-395.png,
> image-2018-10-14-14-19-27-059.png, image-2018-10-14-14-19-41-397.png,
> image-2018-10-14-14-19-51-823.png, image-2018-10-14-14-20-09-822.png,
> image-2018-10-14-14-20-19-217.png, image-2018-10-14-14-20-33-500.png,
> image-2018-10-14-14-20-46-566.png, image-2018-10-14-14-20-57-233.png
>
>
> h2. Environment
> OS: CentOS6
> Kernel version: 2.6.32-XX
> Kafka version: 0.10.2.1, 0.11.1.2 (but reproduces with latest build from
> trunk (2.2.0-SNAPSHOT)
> h2. Phenomenon
> Response time of Produce request (99th ~ 99.9th %ile) degrading to 50x ~ 100x
> more than usual.
> Normally 99th %ile is lower than 20ms, but when this issue occurs it marks
> 50ms to 200ms.
> At the same time we could see two more things in metrics:
> 1. Disk read coincidence from the volume assigned to log.dirs.
> 2. Raise in network threads utilization (by
> `kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`)
> As we didn't see increase of requests in metrics, we suspected blocking in
> event loop ran by network thread as the cause of raising network thread
> utilization.
> Reading through Kafka broker source code, we understand that the only disk
> IO performed in network thread is reading log data through calling
> sendfile(2) (via FileChannel#transferTo).
> To probe that the calls of sendfile(2) are blocking network thread for some
> moments, I ran following SystemTap script to inspect duration of sendfile
> syscalls.
> {code:java}
> # Systemtap script to measure syscall duration
> global s
> global records
> probe syscall.$1 {
> s[tid()] = gettimeofday_us()
> }
> probe syscall.$1.return {
> elapsed = gettimeofday_us() - s[tid()]
> delete s[tid()]
> records <<< elapsed
> }
> probe end {
> print(@hist_log(records))
> }{code}
> {code:java}
> $ stap -v syscall-duration.stp sendfile
> # value (us)
> value | count
> 0 | 0
> 1 |71
> 2 |@@@ 6171
>16 |@@@ 29472
>32 |@@@ 3418
> 2048 | 0
> ...
> 8192 | 3{code}
> As you can see there were some cases taking more than few milliseconds,
> implies that it blocks network thread for that long and applying the same
> latency for all other request/response processing.
> h2. Hypothesis
> Gathering the above observations, I made the following hypothesis.
> Let's say network-thread-1 multiplexing 3 connections.
> - producer-A
> - follower-B (broker replica fetch)
> - consumer-C
> Broker receives requests from each of those clients, [Produce, FetchFollower,
> FetchConsumer].
> They are processed well by request handler threads, and now the response
> queue of the network-thread contains 3 responses in following order:
> [FetchConsumer, Produce, FetchFollower].
> network-thread-1 takes 3 responses and processes them sequentially
> ([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/network/SocketServer.scala#L632]).
> Ideally processing of these 3 responses completes in microseconds as in it
> just copies ready responses into client socket's buffer with non-blocking
> manner.
> However, Kafka uses sendfile(2) for transferring log data to client sockets.
> The target data might be in page cache, but old data which has written a bit
> far ago and never read since then, are likely not.
> If the target data isn't in page cache, kernel first needs to load the
> target page into cache. This takes more than few milliseconds to tens of
> milliseconds depending on disk hardware and current load being applied to it.
> Linux kernel doesn't considers the moment loading data from disk into page
> cache as "blocked", hence it awaits completion of target data loading rather
> than